Hello!
Group 8 members, the project overall looks very good to me, and I especially like the way you rendered the GitHub page; it is nice and neat. Good job! To be extra critical, I would like to share some findings that might help improve your project as a whole. Please read the points below and let me know if you have any follow-ups:
EDA: The EDA section looks nice and effectively communicates the question being answered, but to show the relationships more effectively I would recommend adding a correlation matrix to better illustrate how the features correlate with one another (a rough sketch is below).
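If helpful, here is a rough, hypothetical sketch of the kind of correlation matrix I have in mind. The CSV path, output location, and use of pandas/seaborn are my own assumptions for illustration, not your project's actual file layout or plotting library:

```python
# Hypothetical sketch of a correlation matrix over the numeric features.
# The file paths below are placeholders, not the repo's real paths.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

bank_df = pd.read_csv("data/processed/bank_train.csv")  # assumed path

# Pairwise correlations among numeric columns only.
corr = bank_df.corr(numeric_only=True)

# Annotated heatmap so readers can scan the relationships at a glance.
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm", vmin=-1, vmax=1)
plt.title("Correlation matrix of numeric features")
plt.tight_layout()
plt.savefig("results/figures/correlation_matrix.png")
```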
REPORT: The final report is very descriptive and fully covers the content and the questions asked. I would suggest using more consistent, automated tooling, such as the glue mechanism we were taught in class, to inject computed values into the report (a minimal sketch is below).
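If the tool meant here is myst-nb's glue, a minimal sketch could look like this (the variable name and value are placeholders, and I'm assuming the report is rendered with Jupyter Book / myst-nb):

```python
# Minimal sketch of gluing a computed value into the report with myst-nb.
# "best_recall" is a placeholder name; in practice the value would come from
# the model results rather than being hard-coded.
from myst_nb import glue

best_recall = 0.82
glue("best_recall", best_recall)
```

The glued value can then be referenced in the report's markdown with a role such as {glue:text}`best_recall`, so the numbers in the prose stay in sync with the analysis.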
TEST: I also found that there is an error when I try to run pytest tests/* to test the functions.
SCRIPT: The issue when running the Python command python scripts/optimization.py ... is happening for me too.
Other than that, it looks pretty decent to me; again, the topic is very valuable and the analysis is substantial enough to answer the questions asked. Well done overall!
Cheers, Angela Chen
This was derived from the JOSE review checklist and the ROpenSci review checklist.
Hello! Your analysis was engaging to review. I found the report to be thorough and clear: it provided enough information to understand the analysis, along with clear justifications for your methodologies. In terms of your repo, I found the organization to be really good, especially the data directory. Most of your scripts were also detailed, with clear documentation and useful comments, which was helpful for understanding what each script does. Overall, the README is well structured, with instructions that can be followed.
Here are my suggestions/areas of improvement. Note: some of the points are minor and more on the optional side, but I thought I'd include them in case you find them beneficial.
SUGGESTIONS/IMPROVEMENTS:
- optimization.py: it's a bit unclear if that was intended; I'm not sure if both are being used.
- docker pull ...
- cd work: consider adding this for clarity of where the root of the directory is (when using the container).
- # Optimization and Accuracy/Recall Scores: this step was throwing an error. I tried it both in the container and in the virtual environment, and both gave errors. Perhaps see if the other reviewers were able to get this working, in case it's on my end.

This was derived from the JOSE review checklist and the ROpenSci review checklist.
[ ] Installation instructions: Is there a clearly stated list of dependencies? Note: If the installation guide is to use Docker, dependencies should include Docker and the relevant packages in the Dockerfile instead of pointing towards environment.yml
[x] Example usage: Do the authors include examples of how to use the software to reproduce the data analysis?
[x] Functionality documentation: Is the core functionality of the data analysis software documented to a satisfactory level?
[x] Community guidelines: Are there clear guidelines for third parties wishing to 1) Contribute to the software 2) Report issues or problems with the software 3) Seek support Note: there is a document "Contributing.md." It makes clear that a third party who wants to contribute can create a pull request, which requires two existing team members to merge. A suggestion would be to also list the names of the original team members in this document.
Please provide more detailed feedback here on what was done particularly well, and what could be improved. It is especially important to elaborate on items that you were not able to check off in the list above.
The analysis has an excellent explanation of the background, and the logical flow of the whole report is sound. I really appreciate the details explaining the model selection and why recall is prioritized in this business context. In general, the report is well written. A small note on the EDA portion: for the 'previous' feature, there seems to be only one bar, and the reader might not be able to infer much from this chart.
On the model selection and optimization, the logic is reasonable. Based on the scores shown, I can follow and understand why svc_bal is the final model chosen. Not sure if I missed this somewhere, but it would be good to know the distribution of the target variable in the training dataset (a quick sketch of how this could be reported is below).
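As a rough, hypothetical example of what I mean (the path and the target column name "y" are assumptions on my part), reporting the class balance could be as simple as:

```python
# Quick check of the target's class balance in the training split.
# The CSV path and the column name "y" are assumptions for illustration.
import pandas as pd

train_df = pd.read_csv("data/processed/bank_train.csv")

print(train_df["y"].value_counts())                # absolute counts per class
print(train_df["y"].value_counts(normalize=True))  # proportions per class
```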
The code is well written. A small suggestion would be to automate the code in model_selection so that it chooses the best model by itself and saves it as 'model_pipeline.pickle'. Alternatively, it could emit a click.echo warning if svc_bal turns out not to be the best model on recall (see the sketch below).
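A hypothetical sketch of that idea follows; the dictionary names, output path, and helper function are all made up for illustration and would need to be adapted to the actual model_selection script:

```python
# Sketch: pick the pipeline with the best cross-validated recall, warn via
# click.echo if it is not svc_bal, and save it as model_pipeline.pickle.
# All names here are hypothetical, not taken from the project's code.
import pickle
import click

def save_best_model(cv_recall, fitted_pipelines,
                    out_path="results/models/model_pipeline.pickle"):
    """cv_recall: dict mapping model name -> mean CV recall.
    fitted_pipelines: dict mapping model name -> fitted pipeline."""
    best_name = max(cv_recall, key=cv_recall.get)

    if best_name != "svc_bal":
        click.echo(
            f"Note: '{best_name}' beat 'svc_bal' on recall "
            f"({cv_recall[best_name]:.3f}); saving it instead."
        )

    with open(out_path, "wb") as f:
        pickle.dump(fitted_pipelines[best_name], f)

    return best_name
```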
On the software side, the Docker image can be pulled as intended following the instructions.
However, there is an error when I try to run pytest tests/* to test the functions. This is probably something that could be reviewed further.
And similar to Nicole's review above, there is an issue when running the Python command python scripts/optimization.py ... It would probably be good to check the whitespace and also use "\" to separate the lines.
It has been a great read! Thank you!
This was derived from the JOSE review checklist and the ROpenSci review checklist.
Dear Team, I wanted to take a moment to express my sincere appreciation for the outstanding work that Group 8 has done on this project. I must commend the team for the impeccable execution. Great job, Group 8! Your hard work and collaboration have certainly paid off, and it's a pleasure to acknowledge your efforts. While the project is impressive, I'd like to offer some constructive feedback that might help elevate it even further. Please review the following points and feel free to discuss any questions or clarifications:
In conclusion, excellent work! I eagerly anticipate engaging in discussions and implementing the provided suggestions to further enhance the project's quality.
This was derived from the JOSE review checklist and the ROpenSci review checklist.
Submitting authors: @Rachel0619 @rafecchang @AnuBanga @killerninja8 Sid Grover
Repository: https://github.com/UBC-MDS/dsci_522_group_8_bank_marketing_project
Report link: https://ubc-mds.github.io/dsci_522_group_8_bank_marketing_project/bank_analysis.html
Abstract/executive summary: Here we build a balanced SVC model to try to predict whether a new client will subscribe to a term deposit. We tested five different classification models, including a dummy classifier, unbalanced/balanced logistic regression, and unbalanced/balanced SVC, and chose the balanced SVC as the optimal model based on how the models scored on the test data; it has the highest test recall score of 0.82, which indicates that it makes the fewest false negative predictions among all five models.
The balanced support vector machine model considers 13 different numerical/categorical features of customers. After hyperparameter optimization, the model's test recall increased from 0.82 to 0.875. The results were somewhat expected, given SVC's known efficacy in classification tasks, particularly when there is a clear margin of separation. The high recall score of 0.875 indicates that the model is particularly adept at identifying clients likely to subscribe, which was the primary goal. It's noteworthy that such a high recall was achieved, as it suggests the model is highly sensitive to true positive cases.
Editor: @Rachel0619 Reviewers: Angela Chen, Oak Chongfeungprinya, Iris Luo, Nicole Bidwell