UBC-MDS / data-analysis-review-2022


Submission: GROUP 20: Credit Card Default Predictor #4

Open mozhao0331 opened 1 year ago

mozhao0331 commented 1 year ago

Submitting authors: @mozhao0331 @kenuiuc @Althrun-sun @rkrishnan-arjun

Repository: https://github.com/UBC-MDS/credit_default_prediction_group_20
Report link: https://github.com/UBC-MDS/credit_default_prediction_group_20/blob/main/doc/credit_default_analysis_report.md
Abstract/executive summary: For this project we are trying to answer the question:

Given a credit card customer's payment history and demographic information such as gender, age, and education level, would the customer default on the next bill payment?

Answering this question is important because, with an effective predictive model, financial institutions can evaluate a customer's credit level and grant appropriate credit amount limits. This analysis would be crucial in credit score calculation and risk management.

Editor: @flor14 Reviewers: Li Sam, Ganacheva Elena, Feng Yurui, Wijngaarden Renzo

elenagan commented 1 year ago

Data analysis review checklist

Reviewer: @elenagan

Conflict of interest

Code of Conduct

General checks

Documentation

Code quality

Reproducibility

Analysis report

Estimated hours spent reviewing: 30 minutes

Review Comments:

  1. I liked how you considered a wide variety of models. You might want to explain in a little more detail why you chose these particular models.
  2. The authors are listed clearly in the README, but it might be a good idea to include your affiliations as well.
  3. The introduction is clear and easy to understand, but there are some typos and unclear grammar in the more detailed Data and Results & Discussion sections that may cause confusion.
  4. The code was organized clearly into functions with tests, but it might be useful to define some of the functions outside of the main function so they can be reused elsewhere.
  5. It might not be the best idea to report scores for all the models you explored. You might want to focus on just the final model you selected.
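Point 4 above can be sketched briefly. The function and rule below are hypothetical (not from the project's actual code); the point is only that helpers defined at module level can be imported and tested independently of `main()`:

```python
# Hypothetical refactor: the helper lives at module level rather than
# nested inside main(), so other scripts and tests can import it.

def clip_ages(ages):
    """Clip implausible ages to an illustrative 18-100 range."""
    return [min(max(a, 18), 100) for a in ages]


def main():
    # main() now just orchestrates the reusable pieces.
    return clip_ages([15, 42, 130])


if __name__ == "__main__":
    print(main())
```

Another script (or a test file) could then do `from preprocess import clip_ages` without triggering the whole pipeline.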

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

Yurui-Feng commented 1 year ago

Data analysis review checklist

Reviewer: @Yurui-Feng

Conflict of interest

Code of Conduct

General checks

Documentation

Code quality

Reproducibility

Analysis report

Estimated hours spent reviewing: 1.5 hr

Review Comments:

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

Hongjian-Sam-Li commented 1 year ago

Data analysis review checklist

Reviewer:

Conflict of interest

Code of Conduct

General checks

Documentation

Code quality

Reproducibility

Analysis report

Estimated hours spent reviewing: 1hr

Review Comments:

  1. Your EDA report is well organized. It introduces the variables and the project's purpose very clearly through the introduction and the various plots.

  2. The scripts in the src directory cover the whole analysis process, from data download and EDA through model training to the model summary. All the results can be located easily in the results directory.

  3. The analysis pipeline is well organized, with various fitting methods, models, and validation scores. It would be even better if you added the pros and cons of each method (based on overfitting, CV scores, etc.) and explained why you used it by relating your research question to the characteristics of each model.

  4. In the analysis report, you clearly explained the main target of your analysis (lowering Type I and Type II errors) and gave a clear rationale for your choice of scoring metric, which suits the precision/recall trade-off inherent in this real-life question.

  5. One small suggestion: you might consider some feature engineering and feature selection to discover more potentially useful features and combinations, which could improve your models' overall scores.
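As a plain-Python illustration of the precision/recall trade-off mentioned in point 4 (the counts below are made up, not the project's results), an F-beta score with beta > 1 weights recall, i.e. missed defaulters (Type II errors), more heavily:

```python
# Illustrative computation of precision, recall, and F-beta from
# confusion-matrix counts. tp/fp/fn values are invented for the example.

def precision_recall_fbeta(tp, fp, fn, beta=1.0):
    precision = tp / (tp + fp)       # fewer Type I errors -> higher precision
    recall = tp / (tp + fn)          # fewer Type II errors -> higher recall
    b2 = beta ** 2
    fbeta = (1 + b2) * precision * recall / (b2 * precision + recall)
    return precision, recall, fbeta


# beta=2 favors recall over precision.
p, r, f2 = precision_recall_fbeta(tp=80, fp=20, fn=40, beta=2.0)
```

With these counts, precision is 0.8 and recall is 2/3, so the F2 score sits closer to the (lower) recall, reflecting the penalty on missed defaulters.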

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

RenzoWijn commented 1 year ago

Data analysis review checklist

Reviewer: @Hawknum

Conflict of interest

Code of Conduct

General checks

Documentation

Code quality

Reproducibility

Analysis report

Estimated hours spent reviewing: 1hr

Review Comments:

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

rkrishnan-arjun commented 1 year ago

Thank you for taking the time to review our project. Based on the feedback received, we've tried to improve the overall presentation of the report and the workflow used to generate it.

Some of the key pieces of feedback, and the commits that resolved them, are:

  1. Feedback from Aditi (TA): Some packages listed in the environment.yaml file aren't available on conda (checked with conda search): version numbers vary and some packages are simply not found, causing errors when trying to create the environment. Issue: https://github.com/UBC-MDS/credit_default_prediction_group_20/issues/39

Commits that fix the environment.yaml:
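For context, this is the shape of file the fix targets. The package names and version pins below are illustrative only, not the project's actual environment.yaml; the point is that each pin should be resolvable with `conda search <package>=<version>` before committing:

```yaml
# Illustrative environment file: every pin here should be verifiable
# with `conda search` against the listed channels.
name: credit_default
channels:
  - conda-forge
dependencies:
  - python=3.10
  - pandas=1.4.3
  - scikit-learn=1.1.1
```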

  2. Feedback from Peer Review: For the usage section, you might want to put each step's command into an individual code cell for easy copy-pasting in case one step fails. Also, you might want to remove the square brackets around the optional argument (or include the argument without brackets).

Commit that removed the old code cells:

  3. Feedback from Peer Review: Your report includes the author but not the contributors (the other people in the group), and both the README and the report are missing affiliations. Adding these would increase transparency.

Commits that add contributors and affiliations in both the readme and the final report:

  4. Feedback from Florencia: Activate GitHub Pages for the report. Issue: https://github.ubc.ca/MDS-2022-23/DSCI_522_dsci-workflows_students/issues/5

Commits that specify a change in the final report:

  5. Feedback from Aditi (TA): Figure 1 is a little blurry when the RMD is rendered, and the y-axis label gets cut off in Figure 4. Issue: https://github.com/UBC-MDS/credit_default_prediction_group_20/issues/39

Commits that fixed this: