UBC-MDS / data-analysis-review-2021


Submission: Group 23: Bank Marketing Prediction #12

Open stevenlio88 opened 2 years ago

stevenlio88 commented 2 years ago

Submitting authors: @mmaidana24318, @stevenlio88, @ZherenXu

Repository: https://github.com/UBC-MDS/Bank_Marketing_Prediction

Report link: https://htmlpreview.github.io/?https://github.com/UBC-MDS/Bank_Marketing_Prediction/blob/main/doc/bank_marketing_prediction_report.html

Abstract/executive summary: In this project, we build a classification model, comparing Random Forest and Logistic Regression, to predict whether a banking customer will respond positively to a telemarketing campaign if contacted by the bank by phone, with the goal of improving the campaign's response rate. The final model, chosen after hyperparameter tuning and cross-validation on the training data, is Logistic Regression, selected for both its performance and the interpretability of the regression coefficients for analysis. The final model performed well on unseen test data: it achieved an overall accuracy of 86.1% and recalled 90.3% of the positive responses, although it incorrectly predicted 12.1% of cases as false positives. While the number of false positives is not ideal, if we run the telemarketing campaign by prioritizing customers from the highest to the lowest predicted probability under a limited budget, the model's high recall gives us confidence that the customers most likely to respond will be contacted first; the false-positive cases will simply need more persuasion from cold callers later on. A further benefit of Logistic Regression is that it allows us to be more precise about which types of customers to pursue first (for example, responders to previous campaigns) and when to contact them (for example, during March, August, and October).
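As a rough illustration of the comparison described above, the sketch below tunes both candidate models with cross-validation scored on recall. It is a minimal sketch only: the synthetic data, variable names, and hyperparameter grids are placeholders standing in for the project's actual preprocessed features and search space, not the code used in the repository.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# Toy imbalanced data standing in for the preprocessed bank-marketing features
# (~11% positive class, as in the real data set).
X, y = make_classification(n_samples=4000, n_features=20,
                           weights=[0.89, 0.11], random_state=123)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=123)

# Candidate models and illustrative hyperparameter grids.
candidates = {
    "logistic_regression": (LogisticRegression(max_iter=1000),
                            {"C": [0.01, 0.1, 1, 10]}),
    "random_forest": (RandomForestClassifier(random_state=123),
                      {"n_estimators": [100, 300], "max_depth": [None, 10]}),
}

for name, (estimator, grid) in candidates.items():
    search = GridSearchCV(estimator, grid, scoring="recall", cv=5)
    search.fit(X_train, y_train)
    print(name, search.best_params_, f"CV recall: {search.best_score_:.3f}")
```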

The data set used in this project is related to direct marketing campaigns (phone calls) of a Portuguese banking institution (Moro, Cortez, and Rita 2014). The data set contains 20 features plus the desired target. Each row contains information about one client, including personal and banking attributes, as well as data on past interactions with the telemarketer. The data set presents class imbalance, since only about 11% of the records are labelled as positive (meaning that the customer responded to the telemarketing offer). If possible, future studies will include new information, such as the reason for the customer's last contact, the customer's tenure with the bank, or the customer's overall value (in terms of revenue) to the bank, to further improve the ROI of the telemarketing campaign.
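For reference, a minimal sketch of loading the data and confirming the class imbalance is shown below; the file name and the target column "y" follow the UCI "bank-additional" documentation and may differ from the path and naming used in the project's scripts.

```python
import pandas as pd

# The UCI "bank-additional-full" file is semicolon-separated; adjust the path
# to wherever the project stores its copy of the data.
bank = pd.read_csv("bank-additional-full.csv", sep=";")

print(bank.shape)                              # 20 input features plus the target "y"
print(bank["y"].value_counts(normalize=True))  # roughly 89% "no" vs. 11% "yes"
```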

Editor: @mmaidana24318, @stevenlio88, @ZherenXu
Reviewers: Paniz Fazlali, Luke Collins, Andy Yang

LukeAC commented 2 years ago

Data analysis review checklist

Reviewer: LukeAC (Luke Collins)

Conflict of interest

Code of Conduct

General checks

Documentation

Code quality

Reproducibility

Analysis report

Estimated hours spent reviewing:

1.5

Review Comments:

  1. Introduction:

    The objective of this project is to identify which customers are more likely to respond positively to a telemarketing campaign and subscribe to a new product (a long-term deposit). To address [this] predictive question...

I'm not so sure this is in fact a predictive question. After reading the analysis, this question, phrased another way, sounds to me like "The objective of this project is to identify customer demographic profiles/attributes which have a high association with long-term bank deposits being made when prompted via a telemarketing campaign". Overall, the introduction could be a little more concise and less vague in its explanation of the project.

  2. Data: The report shows a Table 1 containing Yes/No values, presumably answering the question "did the contacted customer make a long-term deposit as a result of a call that was part of the subject telemarketing campaign?". It should be made clearer what this 'yes/no' value actually refers to; was there a minimum threshold of deposit amount considered by the study/data collectors? The visualization (a box plot combined with point size to represent the count of records) could probably be more easily understood as one or more simple histograms (see the plotting sketch at the end of these comments). Also, it looks like the distribution of the age of respondents is distinctly bimodal; is it worth treating these groups separately?

  3. Model building and selection, Analysis and Results Discussion, Limitations, Conclusion

    • Figure 4: the X and Y axis labels are flipped, but I like this plot.
    • This section emphasizes something I mentioned earlier, which is that we aren't really dealing with a 'predictive' question; this feels more like an exploratory analysis of which customer profile attributes seem to be most associated with a 'positive outcome'.
    • Not sure I understand why the following is the case; aren't the records in the dataset associated with a call/telemarketing interaction with the bank?

      Many of the most important features identified by the model, such as the month of contact or the duration of the call, are unknown for customers that had no record of previous interactions with the bank.

    • I think a more explicit description of the features used/available in this dataset would allow reviewers to offer more comments and criticism on the conclusions drawn from the analysis. Perhaps a different way of representing the information at the beginning of the Model building and selection section would be a table of features, their descriptions, and the 'type'/classification of each feature along with its associated treatment/transformation. This could then be supplemented by the points that describe global/cross-feature treatments/transformations.
  4. Perhaps you ought to include an environment.yml file to make it easier to install project dependencies. The format in which your dependencies are currently displayed is not conducive to setting up a local project environment.

  5. Are there automated tests or manual steps described so that the function of the software can be verified? Are they of sufficient quality to ensure software robustness?

No tests of methods/functions are available. Then again, I'm not sure this expectation was clearly communicated with a reasonable amount of lead time for the project.
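Regarding the plotting suggestion in point 2, below is a minimal sketch of the kind of histogram I have in mind, split by the target. It uses matplotlib and the UCI file naming as assumptions; the project may use a different plotting library and data path.

```python
import matplotlib.pyplot as plt
import pandas as pd

# File name and separator follow the UCI "bank-additional" documentation;
# adjust the path to wherever the project stores its copy of the data.
bank = pd.read_csv("bank-additional-full.csv", sep=";")

# Overlaid histograms of age, one per target class, which also makes the
# bimodal shape of the age distribution easy to see.
fig, ax = plt.subplots(figsize=(8, 4))
for label, group in bank.groupby("y"):
    ax.hist(group["age"], bins=40, alpha=0.6, label=f"subscribed = {label}")
ax.set_xlabel("Age")
ax.set_ylabel("Number of clients")
ax.legend()
fig.savefig("age_distribution.png", dpi=150)
```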

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

paradise1260 commented 2 years ago

Data analysis review checklist

Reviewer: paradise1260

Conflict of interest

Code of Conduct

General checks

Documentation

Code quality

Reproducibility

Analysis report

Estimated hours spent reviewing:

1.5 h

Review Comments:


Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

What I found interesting:

  1. It was nice that you have the flowchart of the pipeline. It helped me understand the project better.
  2. The question was very clear, and you answered it clearly in the project.
  3. It was nice that your scripts print useful messages in the terminal.
  4. It is helpful that you have a section named What's Changed in your Releases.
  5. I enjoyed the fact that you explicitly explained what kind of transformations you applied to your data.

What might need improvement:

  1. The link to the data in the final report and in the README does not work.
  2. Although the preprocessing script worked, I got a warning: "A value is trying to be set on a copy of a slice from a DataFrame". You might want to address that (a minimal sketch of the usual cause and fix follows this list).
  3. An environment.yml file would make it easier to create the environment. I did not have some of the required packages installed and had to install them manually.
  4. I think it would be a good idea to include a plot in your report showing the distribution of the features coloured by the target, or a plot showing the correlation between the features. You could then show whether the most important features according to the model match the important features you found in your initial EDA.
  5. You might want to justify further why you picked logistic regression over the random forest classifier. From my point of view, recall is more important here, and the random forest classifier performed better in terms of recall.
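Regarding point 2 above, the warning usually comes from chained indexing. The sketch below is not taken from the project's preprocessing script; it only shows the typical cause and two common fixes (in older pandas versions this pattern raises SettingWithCopyWarning).

```python
import pandas as pd

df = pd.DataFrame({"age": [25, 40, 60], "y": ["no", "yes", "no"]})

# Typical trigger: assigning through a filtered view of the original frame
# (chained indexing) raises SettingWithCopyWarning.
subset = df[df["age"] > 30]
subset["y"] = "unknown"

# Fix 1: take an explicit copy before modifying the subset.
subset = df[df["age"] > 30].copy()
subset["y"] = "unknown"

# Fix 2: assign through .loc on the original frame.
df.loc[df["age"] > 30, "y"] = "unknown"
```
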
AndyYang80 commented 2 years ago

Data analysis review checklist

Reviewer: AndyYang80 (Andy Yang)

Conflict of interest

Code of Conduct

General checks

Documentation

Code quality

Reproducibility

Analysis report

Estimated hours spent reviewing: 1.5 hours

Review Comments:

Positives

Potential Improvements

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

mmaidana24318 commented 2 years ago

Thanks, everyone, for your valuable input to help us improve our project. Please find below some of the changes we have implemented based on your feedback:

1) The license should be copyrighted to your names, not MDS. {feedback from TA} FIX: License file updated. https://github.com/UBC-MDS/Bank_Marketing_Prediction/commit/14124fb9f29dc2ae82e538f6a224b962eaa83244

2) Rephrase into a question, something like: Will a customer subscribe to a new product if contacted? {feedback from TA and peers} FIX: Updated report to make question explicit. https://github.com/UBC-MDS/Bank_Marketing_Prediction/commit/7a6528a9aa1dcb742fca9e832bae8ee05df74a73

3) Data: The report shows a Table 1 containing Yes/No values, presumably answering the question "did the contacted customer make a long-term deposit as a result of a call that was part of the subject telemarketing campaign?". It should be made clearer what this 'yes/no' value actually refers to; was there a minimum threshold of deposit amount considered by the study/data collectors? The visualization (a box plot combined with point size to represent the count of records) could probably be more easily understood as one or more simple histograms. Also, it looks like the distribution of the age of respondents is distinctly bimodal; is it worth treating these groups separately? {feedback from peer review} FIX: Replaced Table 1 with a simple histogram and removed the original box plot. https://github.com/UBC-MDS/Bank_Marketing_Prediction/commit/d83bc76c25d83903215962250e2ebbb63e27894b

4) I think a more explicit description of the features used/available in this dataset would allow reviewers to offer more comments and criticism on the conclusions drawn from the analysis. Perhaps a different way of representing the information at the beginning of the Model building and selection section would be a table of features, their descriptions, and the 'type'/classification of each feature along with its associated treatment/transformation. This could then be supplemented by the points that describe global/cross-feature treatments/transformations. {feedback from peer review} FIX: An attribute table was added to the final report, including a description and data type for each attribute. https://github.com/UBC-MDS/Bank_Marketing_Prediction/commit/9ea598d4ea2e8f56bc8b54dd4c7ada0f62236b17

5) It might be helpful to also include a graph of the most negative coefficients in the logistic regression model. This will provide the company with information on which call features to avoid when contacting a customer. {feedback from peer review} FIX: Added a section for the bottom 10 coefficients and discussed what they mean in the final report. https://github.com/UBC-MDS/Bank_Marketing_Prediction/commit/9ea598d4ea2e8f56bc8b54dd4c7ada0f62236b17
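For readers following along, here is a minimal sketch of how the most negative coefficients can be pulled out of a fitted logistic regression. The synthetic data and placeholder feature names are assumptions standing in for the project's actual transformed feature matrix.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Toy data standing in for the transformed bank-marketing features; the
# feature names below are placeholders, not the project's actual columns.
X, y = make_classification(n_samples=2000, n_features=20, random_state=123)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]

model = LogisticRegression(max_iter=1000).fit(X, y)
coefs = pd.Series(model.coef_[0], index=feature_names)

print(coefs.nsmallest(10))  # 10 most negative coefficients (features to avoid)
print(coefs.nlargest(10))   # 10 most positive coefficients, for comparison
```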