stevenlio88 opened this issue 2 years ago
Time spent reviewing: 1.5 hours
The objective of this project is to identify which customers are more likely to respond positively to a telemarketing campaign and subscribe to a new product (a long-term deposit). To address [this] predictive question...
I'm not so sure this is in fact a predictive question. After reading the analysis, this question - phrased another way - sounds to me like "The objective of this project is to identify customer demographic profiles/attributes which have high association with long-term bank deposits being made when prompted via telemarketing campaign". Overall the introduction could be a little more concise and less vague in its explanation of the project.
Data: The report shows a Table 1 containing Yes/No values, presumably answering the question 'did the contacted customer make a long-term deposit as a result of a call that was part of the subject telemarketing campaign?'. It should be made clearer what this 'yes/no' value actually refers to; was there a minimum deposit-amount threshold considered by the study/data collectors? The visualization - a box plot combined with point size representing the count of records - could probably be more easily understood as one or more simple histograms. Also - the distribution of respondent age looks distinctly bimodal; is it worth treating these groups separately?
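A stacked histogram is straightforward to produce. Below is a minimal sketch, assuming the cleaned data lives in a pandas DataFrame with an `age` column and the yes/no target `y`; the file path, column names, and the use of Altair are assumptions for illustration, not the project's actual code:

```python
import altair as alt
import pandas as pd

# Hypothetical path/columns: a cleaned training split with `age` and the
# yes/no response column `y`.
bank = pd.read_csv("data/processed/bank_train.csv")

# Histogram of customer age, coloured by whether the customer subscribed.
age_hist = (
    alt.Chart(bank)
    .mark_bar(opacity=0.7)
    .encode(
        alt.X("age:Q", bin=alt.Bin(maxbins=40), title="Customer age"),
        alt.Y("count()", title="Number of records"),
        alt.Color("y:N", title="Subscribed to long-term deposit"),
    )
    .properties(width=400, height=250)
)
age_hist.save("results/age_distribution.html")
```

A plot like this would also make the suspected bimodality of the age distribution easy to see at a glance.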
Model building and selection, Analysis and Results Discussion, Limitations, Conclusion
Many of the most important features identified by the model, such as the month of contact or the duration of the call, are unknown for customers that had no record of previous interactions with the bank.
I think a more explicit description of the features used/available in this dataset would allow reviewers to offer more comments/criticism on the conclusions drawn from the analysis. Perhaps a different way of representing the information at the beginning of the Model building and selection section would be a table of features, their descriptions, and the 'type'/classification of each feature along with its associated treatment/transformation. This could then be supplemented by the points which describe global/cross-feature treatments/transformations.
Perhaps you ought to include an environment yaml file to make it easier to install project dependencies; the format in which your dependencies are currently displayed is not conducive to setting up a local project environment.
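For example, a minimal conda environment file might look like the sketch below; the package list is purely illustrative (guessed from the tools mentioned in the report), not the project's actual dependency set:

```yaml
# environment.yml -- illustrative sketch only; pin versions to match the project
name: bank_marketing
channels:
  - conda-forge
dependencies:
  - python=3.9
  - pandas
  - scikit-learn
  - altair
  - jupyterlab
```

Readers could then recreate the environment with `conda env create -f environment.yml` followed by `conda activate bank_marketing`.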
Are there automated tests or manual steps described so that the function of the software can be verified? Are they of sufficient quality to ensure software robustness?
No tests of methods/functions are available. Then again, I don't know that it was clearly indicated, with a reasonable amount of time, that this was an expectation of the project.
This was derived from the JOSE review checklist and the ROpenSci review checklist.
Time spent reviewing: 1.5 hours
Please provide more detailed feedback here on what was done particularly well, and what could be improved. It is especially important to elaborate on items that you were not able to check off in the list above.
This was derived from the JOSE review checklist and the ROpenSci review checklist.
What I found interesting:
What might need improvement:
Positives
Potential Improvements
I believe that the problem can be re-phrased to focus more on exploratory/causal analysis rather than prediction. The phrase "lead to more effective strategies" in your README introduction leads me to believe that the focus of the study should be on identifying the significant features which lead to a sale, rather than predicting whether a certain customer will purchase an item. Prediction is not really actionable, since the company would already know whether a customer purchased by the time the sale is made, while explanatory analysis is more actionable (i.e. the company can tell its employees to focus on call length). You already describe the explanatory question in the README, so the focus just needs to shift away from prediction.
It seems like the link to the dataset in the "Data" section isn't working; this may be something to look at.
An environment file may be helpful (although we will likely incorporate a Docker image in the next milestones, so this may not be necessary).
Perhaps more explanation can be given to the metrics that are valued. Since this is a telemarketing problem, you could add a small blurb explaining that you want to maximize the F1 score, since both false positives and false negatives will ultimately cost the company lost time and money (see the scoring sketch after this list).
It might be helpful to also include a graph of the most negative coefficients in the logistic regression model; this would tell the company which call features to avoid when contacting a customer (a coefficient-extraction sketch also follows this list).
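On the F1 point above, here is a minimal scikit-learn sketch of cross-validating with several scorers at once; the synthetic data merely stands in for the real preprocessed features and mimics the roughly 11% positive rate described in the report:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

# Synthetic stand-in for the preprocessed campaign data (~11% positives).
X, y = make_classification(
    n_samples=5000, n_features=20, weights=[0.89], random_state=123
)

# Score accuracy, precision, recall, and F1 in one cross-validation run.
scores = cross_validate(
    LogisticRegression(max_iter=1000, class_weight="balanced"),
    X,
    y,
    scoring=["accuracy", "precision", "recall", "f1"],
    cv=5,
)
print(f"Mean cross-validated F1: {scores['test_f1'].mean():.3f}")
```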
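And for the negative-coefficient suggestion, a self-contained sketch of pulling the most negative logistic regression coefficients out of a fitted pipeline; the toy DataFrame and column names are placeholders, not the project's actual features:

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy stand-in for the campaign data; the real features differ.
df = pd.DataFrame(
    {
        "age": [25, 40, 61, 33, 52, 47, 29, 58],
        "month": ["mar", "may", "oct", "may", "aug", "may", "jun", "mar"],
        "y": [1, 0, 1, 0, 1, 0, 0, 1],
    }
)

preprocessor = make_column_transformer(
    (StandardScaler(), ["age"]),
    (OneHotEncoder(handle_unknown="ignore"), ["month"]),
)
pipe = make_pipeline(preprocessor, LogisticRegression(max_iter=1000))
pipe.fit(df.drop(columns=["y"]), df["y"])

# Pair each transformed feature name with its coefficient and sort ascending,
# so the most negative (response-suppressing) features come first.
coefs = pd.DataFrame(
    {
        "feature": preprocessor.get_feature_names_out(),
        "coefficient": pipe.named_steps["logisticregression"].coef_[0],
    }
).sort_values("coefficient")

print(coefs.head(10))
```

The bottom rows of `coefs` could then be plotted as a horizontal bar chart alongside the existing top-10 positive-coefficient figure.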
This was derived from the JOSE review checklist and the ROpenSci review checklist.
Thanks everyone for your valuable input so that we can improve our project. Please find below some of the changes we have implemented based on your feedback:
1) The license should be copyrighted to your names, not MDS {feedback from TA} FIX: License file updated. https://github.com/UBC-MDS/Bank_Marketing_Prediction/commit/14124fb9f29dc2ae82e538f6a224b962eaa83244
2) Rephrase into a question, something like: Will a customer subscribe to a new product if contacted? {feedback from TA and peers} FIX: Updated report to make question explicit. https://github.com/UBC-MDS/Bank_Marketing_Prediction/commit/7a6528a9aa1dcb742fca9e832bae8ee05df74a73
3) Data: The report shows a Table 1 containing Yes/No values, presumably answering the question of 'did the contacted customer make a long-term deposit as a result of a call that was a part of the subject telemarketing campaign?'. It should be made a little more clear as to what this 'yes/no' value is actually referring; was there a minimum threshold of deposit-amount that was considered by the study/data-collectors? The visualization - box plot in combination with size-of-points to represent count of records - could probably be more easily understood as a simple histogram(s). Also - it looks like the distribution of age of respondents is distinctly bimodal, is it worth treating these groups separately? {feedback from peer review} FIX: Replaced Table 1 with a simple histogram and removed original boxplot. https://github.com/UBC-MDS/Bank_Marketing_Prediction/commit/d83bc76c25d83903215962250e2ebbb63e27894b
4) I think a more explicit description of the features used/available in this dataset would allow for reviewers to offer more comments/criticism on the conclusions drawn from the analysis. Perhaps a different way of representing the information at the beginning of the Model building and selection section would be a table of features, their descriptions, and 'type'/classification of feature along with their associated treatment/transformation. This could then be supplemented by the points which describe global/cross-feature treatments/transformations. {feedback from peer review} FIX: An attribute table was added to the final report, including a description and data type for each attribute. https://github.com/UBC-MDS/Bank_Marketing_Prediction/commit/9ea598d4ea2e8f56bc8b54dd4c7ada0f62236b17
5) It might be helpful to also include a graph of the most negative coefficients in the logistic regression model. This will provide the company information on what call features to avoid when contacting a customer. {feedback from peer review} FIX: Added a section for the bottom 10 coefficients to the final report and discussed what they mean. https://github.com/UBC-MDS/Bank_Marketing_Prediction/commit/9ea598d4ea2e8f56bc8b54dd4c7ada0f62236b17
Submitting authors: @mmaidana24318, @stevenlio88, @ZherenXu
Repository: https://github.com/UBC-MDS/Bank_Marketing_Prediction
Report link: https://htmlpreview.github.io/?https://github.com/UBC-MDS/Bank_Marketing_Prediction/blob/main/doc/bank_marketing_prediction_report.html
Abstract/executive summary: In this project, we attempt to build a classification model, comparing Random Forest and Logistic Regression, to predict whether a banking customer will respond positively to a telemarketing campaign if contacted by the bank through phone calls, with the goal of improving the campaign's response rate. The final model, chosen after hyper-parameter tuning and cross-validation on the training data, is Logistic Regression; it was selected for both its performance and the interpretability of the regression model in further analyses. The final model performed well on unseen test data, achieving an overall accuracy of 86.1% and recalling 90.3% of the positive responses, although it incorrectly predicted 12.1% of cases as false positives. While the number of false-positive cases is not ideal, when running a telemarketing campaign on a limited budget we can prioritize customers from the highest to the lowest predicted probability; given the model's high recall, we are confident that the customers most likely to respond will be contacted first, and the false-positive cases will simply require more persuasion from the cold callers later on. A further benefit of using Logistic Regression is that it tells us more precisely which types of customers to approach first (such as previous campaign responders) and when to contact them (such as during March, August, and October).
The data set used in this project is related to direct marketing campaigns (phone calls) of a Portuguese banking institution (Moro, Cortez, and Rita 2014). The data set contains 20 features plus the desired target. Each row contains information about one client, including personal and banking attributes as well as data on past interactions with the telemarketer. The data set exhibits class imbalance, since only about 11% of the records are labelled as positive (meaning that the customer responded to the telemarketing offer). If possible, future studies will include new information, such as the reason for the customer's last contact, the customer's tenure with the bank, or the customer's overall value (in terms of revenue) to the bank, to further improve the ROI of the telemarketing campaign.
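The headline numbers in the abstract (accuracy, recall, and the false-positive share) can all be read off a single confusion matrix computed on the test split. Below is a minimal sketch with made-up predictions, assuming scikit-learn; none of the values come from the actual model:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Made-up labels and predictions, purely to illustrate the metric definitions.
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
y_pred = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 0])

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
accuracy = (tp + tn) / (tp + tn + fp + fn)   # overall correctness
recall = tp / (tp + fn)                      # share of true positives recovered
fpr = fp / (fp + tn)                         # share of negatives flagged positive
print(accuracy, recall, fpr)
```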
Editor: @mmaidana24318, @stevenlio88, @ZherenXu
Reviewers: Paniz Fazlali, Luke Collins, Andy Yang