stevenlio88 opened this issue 2 years ago
Time spent reviewing: 1.5 hours
The objective of this project is to identify which customers are more likely to respond positively to a telemarketing campaign and subscribe to a new product (a long-term deposit). To address [this] predictive question...
I'm not so sure this is in fact a predictive question. After reading the analysis, this question - phrased another way - sounds to me like "The objective of this project is to identify customer demographic profiles/attributes which have high association with long-term bank deposits being made when prompted via telemarketing campaign". Overall the introduction could be a little more concise and less vague in its explanation of the project.
Data: The report shows a Table 1 containing Yes/No values, presumably answering the question 'did the contacted customer make a long-term deposit as a result of a call that was part of the subject telemarketing campaign?'. It should be made clearer what this 'yes/no' value actually refers to; was there a minimum deposit-amount threshold considered by the study/data collectors? The visualization - a box plot combined with point size representing the count of records - could probably be more easily understood as one or more simple histograms. Also - the distribution of respondent age looks distinctly bimodal; is it worth treating these groups separately?
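A stacked histogram is straightforward to produce. Below is a minimal sketch, assuming the cleaned data lives in a pandas DataFrame with an `age` column and the yes/no target `y`; the file path, column names, and the use of Altair are assumptions for illustration, not the project's actual code:

```python
import altair as alt
import pandas as pd

# Hypothetical path/columns: a cleaned training split with `age` and the
# yes/no response column `y`.
bank = pd.read_csv("data/processed/bank_train.csv")

# Histogram of customer age, coloured by whether the customer subscribed.
age_hist = (
    alt.Chart(bank)
    .mark_bar(opacity=0.7)
    .encode(
        alt.X("age:Q", bin=alt.Bin(maxbins=40), title="Customer age"),
        alt.Y("count()", title="Number of records"),
        alt.Color("y:N", title="Subscribed to long-term deposit"),
    )
    .properties(width=400, height=250)
)
age_hist.save("results/age_distribution.html")
```

A plot like this would also make the suspected bimodality of the age distribution easy to see at a glance.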
Model building and selection, Analysis and Results Discussion, Limitations, Conclusion
Many of the most important features identified by the model, such as the month of contact or the duration of the call, are unknown for customers that had no record of previous interactions with the bank.
I think a more explicit description of the features used/available in this dataset would allow reviewers to offer more comments/criticism on the conclusions drawn from the analysis. Perhaps a different way of representing the information at the beginning of the Model building and selection section would be a table of features, their descriptions, and the 'type'/classification of each feature along with its associated treatment/transformation. This could then be supplemented by the points which describe global/cross-feature treatments/transformations.
Perhaps you ought to include an environment yaml file to make it easier to install project dependencies; the format in which your dependencies are currently displayed is not conducive to setting up a local project environment.
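For example, a minimal conda environment file might look like the sketch below; the package list is purely illustrative (guessed from the tools mentioned in the report), not the project's actual dependency set:

```yaml
# environment.yml -- illustrative sketch only; pin versions to match the project
name: bank_marketing
channels:
  - conda-forge
dependencies:
  - python=3.9
  - pandas
  - scikit-learn
  - altair
  - jupyterlab
```

Readers could then recreate the environment with `conda env create -f environment.yml` followed by `conda activate bank_marketing`.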
Are there automated tests or manual steps described so that the function of the software can be verified? Are they of sufficient quality to ensure software robustness?
No tests of methods/functions are available. Then again, I don't know that it was clearly indicated, with a reasonable amount of time, that this was an expectation of the project.
This was derived from the JOSE review checklist and the ROpenSci review checklist.
Time spent reviewing: 1.5 hours
Please provide more detailed feedback here on what was done particularly well, and what could be improved. It is especially important to elaborate on items that you were not able to check off in the list above.
This was derived from the JOSE review checklist and the ROpenSci review checklist.
What I found interesting:
What might need improvement:
Positives
Potential Improvements
I believe that the problem can be re-phrased to focus more on exploratory/causal analysis rather than prediction. The phrase "lead to more effective strategies" in your README introduction leads me to believe that the focus of the study should be on identifying the significant features which lead to a sale, rather than predicting whether a certain customer will purchase an item. Prediction is not really actionable, since the company would already know whether a customer purchased by the time the sale is made, while explanatory analysis is more actionable (i.e. the company can tell its employees to focus on call length). You already describe the explanatory question in the README, so the focus just needs to shift away from prediction.
It seems like the link to the dataset in the "Data" section isn't working; this may be something to look at.
An environment file may be helpful (although we will likely incorporate a Docker image in the next milestones, so this may not be necessary).
Perhaps more explanation can be given to the metrics that are valued. Since this is a telemarketing problem, you could add a small blurb explaining that you want to maximize the F1 score, since both false positives and false negatives will ultimately cost the company lost time and money (see the scoring sketch after this list).
It might be helpful to also include a graph of the most negative coefficients in the logistic regression model; this would tell the company which call features to avoid when contacting a customer (a coefficient-extraction sketch also follows this list).
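On the F1 point above, here is a minimal scikit-learn sketch of cross-validating with several scorers at once; the synthetic data merely stands in for the real preprocessed features and mimics the roughly 11% positive rate described in the report:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

# Synthetic stand-in for the preprocessed campaign data (~11% positives).
X, y = make_classification(
    n_samples=5000, n_features=20, weights=[0.89], random_state=123
)

# Score accuracy, precision, recall, and F1 in one cross-validation run.
scores = cross_validate(
    LogisticRegression(max_iter=1000, class_weight="balanced"),
    X,
    y,
    scoring=["accuracy", "precision", "recall", "f1"],
    cv=5,
)
print(f"Mean cross-validated F1: {scores['test_f1'].mean():.3f}")
```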
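And for the negative-coefficient suggestion, a self-contained sketch of pulling the most negative logistic regression coefficients out of a fitted pipeline; the toy DataFrame and column names are placeholders, not the project's actual features:

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy stand-in for the campaign data; the real features differ.
df = pd.DataFrame(
    {
        "age": [25, 40, 61, 33, 52, 47, 29, 58],
        "month": ["mar", "may", "oct", "may", "aug", "may", "jun", "mar"],
        "y": [1, 0, 1, 0, 1, 0, 0, 1],
    }
)

preprocessor = make_column_transformer(
    (StandardScaler(), ["age"]),
    (OneHotEncoder(handle_unknown="ignore"), ["month"]),
)
pipe = make_pipeline(preprocessor, LogisticRegression(max_iter=1000))
pipe.fit(df.drop(columns=["y"]), df["y"])

# Pair each transformed feature name with its coefficient and sort ascending,
# so the most negative (response-suppressing) features come first.
coefs = pd.DataFrame(
    {
        "feature": preprocessor.get_feature_names_out(),
        "coefficient": pipe.named_steps["logisticregression"].coef_[0],
    }
).sort_values("coefficient")

print(coefs.head(10))
```

The bottom rows of `coefs` could then be plotted as a horizontal bar chart alongside the existing top-10 positive-coefficient figure.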
This was derived from the JOSE review checklist and the ROpenSci review checklist.
Thanks everyone for your valuable input so that we can improve our project. Please find below some of the changes we have implemented based on your feedback:
1) The license should be copyrighted to your names, not MDS {feedback from TA} FIX: License file updated. https://github.com/UBC-MDS/Bank_Marketing_Prediction/commit/14124fb9f29dc2ae82e538f6a224b962eaa83244
2) Rephrase into a question, something like: Will a customer subscribe to a new product if contacted? {feedback from TA and peers} FIX: Updated report to make question explicit. https://github.com/UBC-MDS/Bank_Marketing_Prediction/commit/7a6528a9aa1dcb742fca9e832bae8ee05df74a73
3) Data: The report shows a Table 1 containing Yes/No values, presumably answering the question of 'did the contacted customer make a long-term deposit as a result of a call that was a part of the subject telemarketing campaign?'. It should be made a little more clear as to what this 'yes/no' value is actually referring; was there a minimum threshold of deposit-amount that was considered by the study/data-collectors? The visualization - box plot in combination with size-of-points to represent count of records - could probably be more easily understood as a simple histogram(s). Also - it looks like the distribution of age of respondents is distinctly bimodal, is it worth treating these groups separately? {feedback from peer review} FIX: Replaced Table 1 with a simple histogram and removed original boxplot. https://github.com/UBC-MDS/Bank_Marketing_Prediction/commit/d83bc76c25d83903215962250e2ebbb63e27894b
4) I think a more explicit description of the features used/available in this dataset would allow for reviewers to offer more comments/criticism on the conclusions drawn from the analysis. Perhaps a different way of representing the information at the beginning of the Model building and selection section would be a table of features, their descriptions, and 'type'/classification of feature along with their associated treatment/transformation. This could then be supplemented by the points which describe global/cross-feature treatments/transformations. {feedback from peer review} FIX: An attribute table was added to the final report, including a description and data type for each attribute. https://github.com/UBC-MDS/Bank_Marketing_Prediction/commit/9ea598d4ea2e8f56bc8b54dd4c7ada0f62236b17
5) It might be helpful to also include a graph of the most negative coefficients in the logistic regression model. This will provide the company information on what call features to avoid when contacting a customer. {feedback from peer review} FIX: Added a section for the bottom 10 coefficients to the final report and discussed what they mean. https://github.com/UBC-MDS/Bank_Marketing_Prediction/commit/9ea598d4ea2e8f56bc8b54dd4c7ada0f62236b17
Submitting authors: @mmaidana24318, @stevenlio88, @ZherenXu
Repository: https://github.com/UBC-MDS/Bank_Marketing_Prediction
Report link: https://htmlpreview.github.io/?https://github.com/UBC-MDS/Bank_Marketing_Prediction/blob/main/doc/bank_marketing_prediction_report.html
Abstract/executive summary: In this project, we attempt to build a classification model, comparing Random Forest and Logistic Regression, to predict whether a banking customer will respond positively to a telemarketing campaign if contacted by the bank through phone calls, with the goal of improving the campaign's response rate. The final model, chosen after hyper-parameter tuning and cross-validation on the training data, is Logistic Regression; it was selected for both its performance and the interpretability of the regression model in further analyses. The final model performed well on unseen test data, achieving an overall accuracy of 86.1% and recalling 90.3% of the positive responses, although it incorrectly predicted 12.1% of cases as false positives. While the number of false-positive cases is not ideal, when running a telemarketing campaign on a limited budget we can prioritize customers from the highest to the lowest predicted probability; given the model's high recall, we are confident that the customers most likely to respond will be contacted first, and the false-positive cases will simply require more persuasion from the cold callers later on. A further benefit of using Logistic Regression is that it tells us more precisely which types of customers to approach first (such as previous campaign responders) and when to contact them (such as during March, August, and October).
The data set used in this project is related to direct marketing campaigns (phone calls) of a Portuguese banking institution (Moro, Cortez, and Rita 2014). The data set contains 20 features plus the desired target. Each row contains information about one client, including personal and banking attributes as well as data on past interactions with the telemarketer. The data set exhibits class imbalance, since only about 11% of the records are labelled as positive (meaning that the customer responded to the telemarketing offer). If possible, future studies will include new information, such as the reason for the customer's last contact, the customer's tenure with the bank, or the customer's overall value (in terms of revenue) to the bank, to further improve the ROI of the telemarketing campaign.
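The headline numbers in the abstract (accuracy, recall, and the false-positive share) can all be read off a single confusion matrix computed on the test split. Below is a minimal sketch with made-up predictions, assuming scikit-learn; none of the values come from the actual model:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Made-up labels and predictions, purely to illustrate the metric definitions.
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
y_pred = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 0])

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
accuracy = (tp + tn) / (tp + tn + fp + fn)   # overall correctness
recall = tp / (tp + fn)                      # share of true positives recovered
fpr = fp / (fp + tn)                         # share of negatives flagged positive
print(accuracy, recall, fpr)
```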
Editor: @mmaidana24318, @stevenlio88, @ZherenXu
Reviewers: Paniz Fazlali, Luke Collins, Andy Yang