Open mmaidana24318 opened 2 years ago
We need to select 4 items that we fixed and provide a link to the commit that fixed each one. Here is the first one:
[from TA feedback] The licence should be copyrighted to your names, not MDS (it is your work): https://github.com/UBC-MDS/Bank_Marketing_Prediction/commit/14124fb9f29dc2ae82e538f6a224b962eaa83244
Here is the issue for these fixes: https://github.com/UBC-MDS/data-analysis-review-2021/issues/12
These are the fixes from PR #60 (commit 9ea598d4ea2e8f56bc8b54dd4c7ada0f62236b17):
From Peer Review 1 (Milestone 2)
FIXES: An attribute table was added to the final report, including a description and data type for each attribute.
From Peer Review 2 (Milestone 2)
FIXES: Updated the link to the data to exclude the path to the ZIP file (GitHub does not like that). Also added a link to the data source page.
From Peer Review 3 (Milestone 2)
FIXES: In the hyper-parameter tuning section, added an explanation of why we chose F1 (it optimizes both recall and precision) as the scoring metric.
FIXES: The project objective is restated to emphasize that the prediction model and the analysis do not conflict: the model can be used as a tool to predict and select individual customers, while the analysis helps the bank understand which groups to prioritize as targets.
FIXES: Added a section for the bottom 10 coefficients and discussed what they mean in the final report.
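The bottom-10-coefficients fix above can be sketched roughly as follows. This is a minimal illustration only: the feature names and data here are made up, not the project's actual preprocessed columns or fitted pipeline.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy stand-in for the preprocessed bank-marketing features
# (hypothetical names, not the project's real columns).
feature_names = np.array([f"feat_{i}" for i in range(20)])
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))
y = (X[:, 0] - X[:, 1] + rng.normal(size=200) > 0).astype(int)

model = LogisticRegression(max_iter=1000).fit(X, y)

# Sort coefficients ascending: the first 10 are the most negative,
# i.e. the features most associated with a "no" outcome.
order = np.argsort(model.coef_[0])
bottom_10 = list(zip(feature_names[order[:10]], model.coef_[0][order[:10]]))
for name, coef in bottom_10:
    print(f"{name}: {coef:.3f}")
```

The same `order` array flipped (`order[::-1][:10]`) gives the top-10 coefficients, so both report sections can share one sorting step.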
Hi - I compiled a list of the changes we made after the feedback we received. Please review it and add anything that's missing. Thanks!
Feedback from TA (Milestone 1)
[x] missing a data directory
[x] the licence should be copyrighted to your names not MDS (it is your work)
[x] Rephrase into a question, something like: Will a customer subscribe to a new product if contacted?
[ ] Be specific about what kind of visualization you plan to make
[x] Regression isn't the appropriate analysis for a classification question (with the exception of logistic regression)
[x] comment to explain additional functions (RE: downloader.py)
[x] Caption figures with some description
[x] What is your interpretation of the preliminary analysis? Do any predictors stand out as useful?
[x] Will you address the class imbalance?
[ ] a lot of commits just say "update file", be descriptive
[ ] include a link to the tag 0.0.1
Feedback from Peer Review 1 (Milestone 2)
[x] Introduction: "The objective of this project is to identify which customers are more likely to respond positively to a telemarketing campaign and subscribe to a new product (a long-term deposit). To address [this] predictive question..." I'm not so sure this is in fact a predictive question. After reading the analysis, this question - phrased another way - sounds to me like "The objective of this project is to identify customer demographic profiles/attributes which have high association with long-term bank deposits being made when prompted via telemarketing campaign". Overall the introduction could be a little more concise and less vague in its explanation of the project.
[x] Data: The report shows a Table 1 containing Yes/No values, presumably answering the question "did the contacted customer make a long-term deposit as a result of a call that was part of the subject telemarketing campaign?". It should be made a little clearer what this 'yes/no' value actually refers to; was there a minimum deposit amount considered by the study/data collectors? The visualization - a box plot combined with point size to represent record counts - could probably be more easily understood as simple histogram(s). Also - it looks like the distribution of respondent ages is distinctly bimodal; is it worth treating these groups separately?
Model building and selection, Analysis and Results Discussion, Limitations, Conclusion:
[x] Figure 4. X and Y axes labels are flipped - but I like this plot.
[x] This section emphasizes something I mentioned earlier, which is that we aren't really dealing with a 'predictive' question; this feels more like an exploratory analysis of which customer profile attributes seem to be most associated with a 'positive outcome'.
[x] Not sure I understand why the following is the case; aren't the records in the dataset associated with a call/telemarketing interaction with the bank? " Many of the most important features identified by the model, such as the month of contact or the duration of the call, are unknown for customers that had no record of previous interactions with the bank."
[x] I think a more explicit description of the features used/available in this dataset would allow for reviewers to offer more comments/criticism on the conclusions drawn from the analysis. Perhaps a different way of representing the information at the beginning of the Model building and selection section would be a table of features, their descriptions, and 'type'/classification of feature along with their associated treatment/transformation. This could then be supplemented by the points which describe global/cross-feature treatments/transformations.
[ ] Perhaps you ought to include an environment yaml file to make it easier to install project dependencies. The format in which your dependencies are displayed is not conducive to setting up a local project environment.
[x] No tests of methods/functions are available. Then again, I don't know that it was clearly indicated with a reasonable amount of time that this was an expectation of the project.
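For the environment-file suggestion above, a minimal `environment.yml` sketch might look like the following. The environment name, package list, and Python version here are illustrative guesses, not the project's actual dependency pins:

```yaml
name: bank_marketing_prediction
channels:
  - conda-forge
dependencies:
  - python=3.9
  - pandas
  - numpy
  - scikit-learn
  - altair
  - jupyter
```

Reviewers could then set up with `conda env create -f environment.yml` followed by `conda activate bank_marketing_prediction`.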
Feedback from Peer Review 2 (Milestone 2)
[x] The link to the data in the final report and in the README does not work.
[ ] Although the preprocessing script worked, I got a warning "A value is trying to be set on a copy of a slice from a DataFrame". You might want to address that.
[ ] An environment.yml file might help create an environment. I did not have some of the packages installed and had to go through the installation myself.
[ ] I think it is a good idea to have a plot in your report showing the distribution of features coloured by the target or a plot showing the correlation between the features. You can show if the most important features based on the model match the important features you found in your initial EDA.
[x] You might want to justify more why you picked logistic regression over random forest classifier. From my point of view, recall is more important here, and random forest classifier performed better in terms of recall score.
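The SettingWithCopyWarning mentioned above typically comes from assigning into a DataFrame slice that may be a view. A common fix is an explicit `.copy()` or assigning through `.loc` on the original frame; the column names and values below are invented for illustration, not the project's actual preprocessing code:

```python
import pandas as pd

df = pd.DataFrame({"age": [25, 40, 61], "y": ["no", "yes", "no"]})

# Problematic pattern: `subset` may be a view of `df`, so assigning into it
# triggers "A value is trying to be set on a copy of a slice from a DataFrame".
# subset = df[df["age"] > 30]
# subset["y"] = subset["y"].map({"yes": 1, "no": 0})

# Fix 1: make the intent explicit with .copy()
subset = df[df["age"] > 30].copy()
subset["y"] = subset["y"].map({"yes": 1, "no": 0})

# Fix 2: assign back into the original frame via .loc
df.loc[df["age"] > 30, "y"] = df.loc[df["age"] > 30, "y"].map({"yes": 1, "no": 0})
```

Either form silences the warning because pandas no longer has to guess whether the write should propagate to the original frame.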
Feedback from Peer Review 3 (Milestone 2)
[x] I believe that the problem can be re-phrased to focus more on exploratory/causal analysis rather than prediction. The phrase "lead to more effective strategies" in your README introduction leads me to believe that the focus of the study should be on obtaining the significant features which lead to a sale, rather than predicting whether a certain customer will purchase an item. This is because prediction is not really actionable - the company would already know whether a customer purchased an item once the sale is made - while explanatory analysis is more actionable (i.e. the company can tell its employees to focus on call length). You have the description of the explanatory question there in the README, but the focus should just be shifted away from prediction.
[x] It seems like the link to the dataset isn't working in the "Data" section; this may be something to look at.
[ ] An environment file may be helpful (although we will likely incorporate a docker image in the next milestones so this may not be necessary)
[x] Perhaps more explanation can be given to the metrics that are valued. Since this is a telemarketing problem, you can probably add a small blurb suggesting that you want to maximize the f1 score, since both false positives and false negatives will ultimately cost the company in lost time and money.
[x] It might be helpful to also include a graph of the most negative coefficients in the logistic regression model. This will provide the company information on what call features to avoid when contacting a customer.
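On the metric point above, the trade-off that the F1 score captures can be shown with a small scikit-learn sketch. The labels below are toy values, not results from the project's model:

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Toy predictions: F1 is the harmonic mean of precision and recall,
# so it penalizes a model that sacrifices one for the other - fitting
# here, since both false positives and false negatives cost the bank
# time and money.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]

p = precision_score(y_true, y_pred)  # TP=2, FP=1 -> 2/3
r = recall_score(y_true, y_pred)     # TP=2, FN=1 -> 2/3
f1 = f1_score(y_true, y_pred)        # 2*p*r / (p + r)
print(p, r, f1)
```

Passing `scoring="f1"` to scikit-learn's `GridSearchCV` or `RandomizedSearchCV` applies this same metric during hyper-parameter tuning.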