Submission: GROUP 21: Identifying the Top Three Predictors of Term Deposit Subscriptions

Submitting authors: @jy1909 @JohnShiuMK @zth96 @zgarciaj

Repository: https://github.com/UBC-MDS/group21_top-three-predictors-of-term-deposit-subscriptions

Report link: https://ubc-mds.github.io/group21_top-three-predictors-of-term-deposit-subscriptions/term_deposit_report.html

Abstract/executive summary: This report presents an analysis of the factors influencing client subscriptions to term deposits at a Portuguese banking institution. Utilizing a dataset comprising 45,211 client interactions with a target variable and 16 input features, we apply logistic regression and decision tree classifiers to identify the top three predictors of term deposit subscriptions. The data preprocessing involves handling missing values, encoding categorical variables, and standardizing numerical variables. Our exploratory data analysis leverages visualizations to understand feature distributions and correlations, while model evaluation focuses on precision and recall due to the dataset’s imbalance. Logistic regression is likely to prove slightly superior in precision to the decision tree classifier. The analysis identifies the outcome of previous campaigns, the month of contact, and the call duration as the most significant predictors. These findings offer valuable insights into the decision-making process of clients regarding term deposit subscriptions and suggest areas for future research.

Editor: @ttimbers

Reviewers: Ben Chen, Waleed Mahmood, Aishwarya Nadimpally, Nasim Ghazanfari Nasrabadi

[x] I agree to abide by MDS's Code of Conduct during the review process and in maintaining my package should it be accepted.

Data analysis review checklist

Reviewer: @phchen5 Ben Chen

Conflict of interest

[x] As the reviewer I confirm that I have no conflicts of interest for me to review this work.

Code of Conduct

[x] I confirm that I read and will adhere to the MDS code of conduct.

General checks

[x] Repository: Is the source code for this data analysis available? Is the repository well organized and easy to navigate?
[x] License: Does the repository contain a plain-text LICENSE file with the contents of an OSI approved software license?

Documentation

[x] Installation instructions: Is there a clearly stated list of dependencies?
[x] Example usage: Do the authors include examples of how to use the software to reproduce the data analysis?
[x] Functionality documentation: Is the core functionality of the data analysis software documented to a satisfactory level?
[x] Community guidelines: Are there clear guidelines for third parties wishing to 1) Contribute to the software 2) Report issues or problems with the software 3) Seek support

Code quality

[x] Readability: Are scripts, functions, objects, etc., well named? Is it relatively easy to understand the code?
[x] Style guidelines: Does the code adhere to well known language style guides?
[x] Modularity: Is the code suitably abstracted into scripts and functions?
[x] Tests: Are there automated tests or manual steps described so that the function of the software can be verified? Are they of sufficient quality to ensure software robustness?

Reproducibility

[x] Data: Is the raw data archived somewhere? Is it accessible?
[x] Computational methods: Is all the source code required for the data analysis available?
[x] Conditions: Is there a record of the necessary conditions (software dependencies) needed to reproduce the analysis? Does there exist an easy way to obtain the computational environment needed to reproduce the analysis?
[x] Automation: Can someone other than the authors easily reproduce the entire data analysis?

Analysis report

[x] Authors: Does the report include a list of authors with their affiliations?
[x] What is the question: Do the authors clearly state the research question being asked?
[x] Importance: Do the authors clearly state the importance for this research question?
[x] Background: Do the authors provide sufficient background information so that readers can understand the report?
[x] Methods: Do the authors clearly describe and justify the methodology used in the data analysis? Do the authors communicate any assumptions or limitations of their methodologies?
[x] Results: Do the authors clearly communicate their findings through writing, tables and figures?
[x] Conclusions: Are the conclusions presented by the authors correct?
[ ] References: Do all archival references that should have a DOI list one (e.g., papers, datasets, software)?
[x] Writing quality: Is the writing of good quality, concise, engaging?

Estimated hours spent reviewing: 2

Review Comments:

Overall, excellent work! The file organization is well-structured, making navigation effortless. The codebase is both readable and concise, contributing to its clarity. Additionally, the report is polished, presenting information in a clean and articulate manner. Still, here are a few suggestions that might further elevate your project:

Regarding the two .ipynb notebooks and their corresponding .html files, clarifying the distinction between them in the README.md could enhance clarity. It might be beneficial to explain their purposes or which one represents the final deliverable. Additionally, reconsider rendering term_deposit_full_analysis.ipynb into an .html file if its contents are already encompassed in the other HTML file. Moreover, there seems to be an error within the term_deposit_full_analysis.html file under docs/ that might need attention.
Datasets such as bank-additional-full.csv and bank-additional.csv could benefit from documentation outlining their contents and how they differ. This would help users understand which dataset is used where.
In the report, delving a bit deeper into the rationale behind choosing logistic regression and decision trees—whether it's due to their interpretability, simplicity, or other factors—would provide valuable insight.
Discussing the limitations of using Logistic Regression to analyze feature importance in the report would be beneficial. Factors like its assumptions of linearity and feature independence might be worth mentioning.
Improving the visualization of the Job Type bar graph by sorting the bars would enhance its aesthetic appeal and make it more intuitive for readers.
Renaming Unnamed: 0 to a more descriptive name and excluding it from the heatmap might prevent it from overshadowing other correlations, particularly with pdays and previous.
Lastly, double-checking the references, particularly the first one, for missing DOI information would ensure consistency and completeness in the reference section.
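
To illustrate the Unnamed: 0 and bar-ordering suggestions above, here is a minimal sketch using pandas. The inline CSV is a made-up miniature, not the project's data, and the column subset is chosen only for illustration. It assumes the Unnamed: 0 column comes from a row index that was written into the CSV, which is the usual cause but has not been confirmed against the repository.

```python
import io

import pandas as pd

# Hypothetical miniature of a CSV that was saved with its row index;
# the empty first header cell is what pandas reports as "Unnamed: 0".
csv_text = """,job,pdays,previous
0,admin.,-1,0
1,technician,120,2
2,admin.,-1,0
3,services,90,1
4,admin.,200,3
5,technician,-1,0
"""

# Default read: the saved index shows up as a feature named "Unnamed: 0".
raw = pd.read_csv(io.StringIO(csv_text))

# Reading the first column as the index keeps it out of the feature set.
df = pd.read_csv(io.StringIO(csv_text), index_col=0)

# Correlations are now computed only over real numeric features.
corr = df[["pdays", "previous"]].corr()

# Job types ordered by frequency, most common first; this list can be
# handed to a plotting library as the explicit bar order.
job_order = df["job"].value_counts().index.tolist()
```

Since the report uses Altair, the same ordering could likely be obtained there by setting sort="-y" on the x encoding of the bar chart instead of passing an explicit list.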

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

Data analysis review checklist

Reviewer: @Aishwarya120111

Conflict of interest

[x] As the reviewer I confirm that I have no conflicts of interest for me to review this work.

Code of Conduct

[x] I confirm that I read and will adhere to the MDS code of conduct.

General checks

[x] Repository: Is the source code for this data analysis available? Is the repository well organized and easy to navigate?
[x] License: Does the repository contain a plain-text LICENSE file with the contents of an OSI approved software license?

Documentation

[x] Installation instructions: Is there a clearly stated list of dependencies?
[x] Example usage: Do the authors include examples of how to use the software to reproduce the data analysis?
[x] Functionality documentation: Is the core functionality of the data analysis software documented to a satisfactory level?
[x] Community guidelines: Are there clear guidelines for third parties wishing to 1) Contribute to the software 2) Report issues or problems with the software 3) Seek support

Code quality

[x] Readability: Are scripts, functions, objects, etc., well named? Is it relatively easy to understand the code?
[x] Style guidelines: Does the code adhere to well known language style guides?
[x] Modularity: Is the code suitably abstracted into scripts and functions?
[x] Tests: Are there automated tests or manual steps described so that the function of the software can be verified? Are they of sufficient quality to ensure software robustness?

Reproducibility

[x] Data: Is the raw data archived somewhere? Is it accessible?
[x] Computational methods: Is all the source code required for the data analysis available?
[x] Conditions: Is there a record of the necessary conditions (software dependencies) needed to reproduce the analysis? Does there exist an easy way to obtain the computational environment needed to reproduce the analysis?
[x] Automation: Can someone other than the authors easily reproduce the entire data analysis?

Analysis report

[x] Authors: Does the report include a list of authors with their affiliations?
[x] What is the question: Do the authors clearly state the research question being asked?
[x] Importance: Do the authors clearly state the importance for this research question?
[x] Background: Do the authors provide sufficient background information so that readers can understand the report?
[x] Methods: Do the authors clearly describe and justify the methodology used in the data analysis? Do the authors communicate any assumptions or limitations of their methodologies?
[x] Results: Do the authors clearly communicate their findings through writing, tables and figures?
[x] Conclusions: Are the conclusions presented by the authors correct?
[x] References: Do all archival references that should have a DOI list one (e.g., papers, datasets, software)?
[x] Writing quality: Is the writing of good quality, concise, engaging?

Estimated hours spent reviewing: 1 hr

Review Comments:

The report is impressively well-written and organized, which made it quite easy for me to understand and follow. There isn't much to improve, but a few minor adjustments could be made.

In the repository, I can see multiple .ipynb and .html files. I think keeping the final report in a single folder would suffice. If your use case requires multiple notebooks, it would be clearer to describe the purpose of each file.
In the .html report, formatting the tables would improve the page's appearance. The Unnamed: 0 column could be dropped or renamed to something meaningful.

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

Data analysis review checklist

Reviewer: @WaleedMahmood1

Conflict of interest

[x] As the reviewer I confirm that I have no conflicts of interest for me to review this work.

Code of Conduct

[x] I confirm that I read and will adhere to the MDS code of conduct.

General checks

[x] Repository: Is the source code for this data analysis available? Is the repository well organized and easy to navigate?
[x] License: Does the repository contain a plain-text LICENSE file with the contents of an OSI approved software license?

Documentation

[x] Installation instructions: Is there a clearly stated list of dependencies?
[x] Example usage: Do the authors include examples of how to use the software to reproduce the data analysis?
[x] Functionality documentation: Is the core functionality of the data analysis software documented to a satisfactory level?
[x] Community guidelines: Are there clear guidelines for third parties wishing to 1) Contribute to the software 2) Report issues or problems with the software 3) Seek support

Code quality

[x] Readability: Are scripts, functions, objects, etc., well named? Is it relatively easy to understand the code?
[x] Style guidelines: Does the code adhere to well known language style guides?
[x] Modularity: Is the code suitably abstracted into scripts and functions?
[x] Tests: Are there automated tests or manual steps described so that the function of the software can be verified? Are they of sufficient quality to ensure software robustness?

Reproducibility

[x] Data: Is the raw data archived somewhere? Is it accessible?
[x] Computational methods: Is all the source code required for the data analysis available?
[x] Conditions: Is there a record of the necessary conditions (software dependencies) needed to reproduce the analysis? Does there exist an easy way to obtain the computational environment needed to reproduce the analysis?
[x] Automation: Can someone other than the authors easily reproduce the entire data analysis?

Analysis report

[x] Authors: Does the report include a list of authors with their affiliations?
[x] What is the question: Do the authors clearly state the research question being asked?
[x] Importance: Do the authors clearly state the importance for this research question?
[x] Background: Do the authors provide sufficient background information so that readers can understand the report?
[x] Methods: Do the authors clearly describe and justify the methodology used in the data analysis? Do the authors communicate any assumptions or limitations of their methodologies?
[x] Results: Do the authors clearly communicate their findings through writing, tables and figures?
[x] Conclusions: Are the conclusions presented by the authors correct?
[x] References: Do all archival references that should have a DOI list one (e.g., papers, datasets, software)?
[x] Writing quality: Is the writing of good quality, concise, engaging?

Estimated hours spent reviewing: 2.5 hrs

Review Comments:

Overall: Good job on the project! It was a great read, and it is very interesting to see how insights can be drawn from predictions on subscriptions to term deposits. The project repository is very well structured, and it was easy to navigate to find what I was looking for. In addition, the report provides ample background information, which made it easy to understand why each method was chosen.

Constructive Feedback:

I believe that matplotlib is not being used in any of the scripts or analysis code as you are using altair. It might be better to remove it so that people replicating your analysis are not installing any libraries that might not be used.
There are .html and .ipynb files placed in the src folder and in the report folder. Referencing Tiffany’s example repository, having duplicates placed in the src folder is not necessary. Perhaps you might have been running the code there earlier and might have missed removing them. Just highlighting this so that your files are not repeated, and the file placement in your project repository is perfect to the dot.
There is a lot of technical terminology being used in the report. A suggestion is to explain all of the technical terms being used, or perhaps minimize the use of technical terms in the final report.
In the notebook src/term_deposit_report.ipynb, when I select “Restart Kernel and Run All Cells...” from the “Kernel” menu, the second code block (where the data is loaded) fails with the error No such file or directory: '../data/bank-full.csv'. I believe this can be resolved by changing the path in the code to ../data/raw/bank-full.csv.
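
As a sketch of one way to make the data path in the last point less fragile, the loader could search upward from the notebook's folder for data/raw/bank-full.csv rather than hard-coding a relative path. The helper name find_data_file and the temporary folder layout below are hypothetical and exist only to make the example self-contained; the data/raw location is taken from the suggested fix above.

```python
import tempfile
from pathlib import Path


def find_data_file(start: Path, relative: str = "data/raw/bank-full.csv") -> Path:
    """Walk upward from `start` until `relative` exists under a folder."""
    for folder in [start, *start.parents]:
        candidate = folder / relative
        if candidate.is_file():
            return candidate
    raise FileNotFoundError(f"{relative} not found at or above {start}")


# Demo on a throwaway project layout (hypothetical, mirrors src/ and data/raw/).
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "data" / "raw").mkdir(parents=True)
    (root / "src").mkdir()
    (root / "data" / "raw" / "bank-full.csv").write_text("age;job\n")

    # Starting from src/, the helper climbs one level and finds the file.
    found = find_data_file(root / "src")
    relative_to_root = found.relative_to(root).as_posix()
```

In a notebook, the call would typically be find_data_file(Path.cwd()), so the same cell works whether it is run from src/ or from the repository root.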

Once again, these are quite minor issues. Great job on the project!

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

Data analysis review checklist

Reviewer:

Conflict of interest

[X] As the reviewer I confirm that I have no conflicts of interest for me to review this work.

Code of Conduct

[X] I confirm that I read and will adhere to the MDS code of conduct.

General checks

[X] Repository: Is the source code for this data analysis available? Is the repository well organized and easy to navigate?
[X] License: Does the repository contain a plain-text LICENSE file with the contents of an OSI approved software license?

Documentation

[X] Installation instructions: Is there a clearly stated list of dependencies?
[X] Example usage: Do the authors include examples of how to use the software to reproduce the data analysis?
[X] Functionality documentation: Is the core functionality of the data analysis software documented to a satisfactory level?
[X] Community guidelines: Are there clear guidelines for third parties wishing to 1) Contribute to the software 2) Report issues or problems with the software 3) Seek support

Code quality

[X] Readability: Are scripts, functions, objects, etc., well named? Is it relatively easy to understand the code?
[X] Style guidelines: Does the code adhere to well known language style guides?
[X] Modularity: Is the code suitably abstracted into scripts and functions?
[X] Tests: Are there automated tests or manual steps described so that the function of the software can be verified? Are they of sufficient quality to ensure software robustness?

Reproducibility

[X] Data: Is the raw data archived somewhere? Is it accessible?
[X] Computational methods: Is all the source code required for the data analysis available?
[X] Conditions: Is there a record of the necessary conditions (software dependencies) needed to reproduce the analysis? Does there exist an easy way to obtain the computational environment needed to reproduce the analysis?
[X] Automation: Can someone other than the authors easily reproduce the entire data analysis?

Analysis report

[X] Authors: Does the report include a list of authors with their affiliations?
[X] What is the question: Do the authors clearly state the research question being asked?
[X] Importance: Do the authors clearly state the importance for this research question?
[X] Background: Do the authors provide sufficient background information so that readers can understand the report?
[X] Methods: Do the authors clearly describe and justify the methodology used in the data analysis? Do the authors communicate any assumptions or limitations of their methodologies?
[X] Results: Do the authors clearly communicate their findings through writing, tables and figures?
[X] Conclusions: Are the conclusions presented by the authors correct?
[X] References: Do all archival references that should have a DOI list one (e.g., papers, datasets, software)?
[X] Writing quality: Is the writing of good quality, concise, engaging?

Estimated hours spent reviewing:

Review Comments:

The analysis provided is generally well-structured and informative. The use of logistic regression and a decision tree classifier for model development allows for a comparison between linear and non-linear modeling techniques. On model evaluation, the focus on accuracy, precision, and recall, with a particular emphasis on precision, aligns with the study's objective of minimizing Type I errors. However, here are a few points where improvements or clarifications could be made:

Variable Transformation Description and Metrics Description: When describing the preprocessing phase, it would be helpful to provide more context or reasoning behind the choice of transformations. Likewise, when describing the scoring metrics, it would help to define each one and explain how they differ, for readers who are not familiar with these terms.

Limitations Section: While the limitations section is comprehensive, it might be beneficial to provide potential solutions or considerations for addressing some of the identified limitations.
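
For the scoring-metric definitions requested above, a short worked example may help readers. The sketch below computes precision and recall directly from their definitions on a made-up set of labels; the labels and the function name precision_recall are illustrative assumptions, not the report's data or code.

```python
def precision_recall(y_true, y_pred, positive=1):
    """Return (precision, recall) for the class labelled `positive`."""
    pairs = list(zip(y_true, y_pred))
    # True positives: predicted positive and actually positive.
    tp = sum(t == positive and p == positive for t, p in pairs)
    # False positives (Type I errors): predicted positive, actually negative.
    fp = sum(t != positive and p == positive for t, p in pairs)
    # False negatives (Type II errors): predicted negative, actually positive.
    fn = sum(t == positive and p != positive for t, p in pairs)

    # Precision: of all predicted subscribers, what share really subscribed?
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: of all real subscribers, what share did the model find?
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Made-up labels: 1 = subscribed, 0 = did not subscribe.
y_true = [1, 0, 0, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

precision, recall = precision_recall(y_true, y_pred)
```

Here two of the four predicted subscribers are real (precision 0.5) and two of the three real subscribers are found (recall about 0.67), which shows why a model tuned to avoid Type I errors is judged mainly on precision.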

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.