UBC-MDS / data-analysis-review-2022


Submission: Group 13: Credit Card Default Prediction #11

Open renee-kwon opened 1 year ago

renee-kwon commented 1 year ago

Submitting authors: @ChesterAiGo @hcwang24 @qurat-azim @renee-kwon

Repository: https://github.com/UBC-MDS/Credit_Card_Default_Prediction_Group13

Report link: https://github.com/UBC-MDS/Credit_Card_Default_Prediction_Group13/blob/main/doc/report.html

Abstract/executive summary: In the field of risk management, one of the most common problems is default prediction. Predicting defaults allows companies to assess each client's credibility, analyze risk levels, and optimize decisions for better business economics. In this project, we aim to learn and predict a credit card holder's credibility based on basic personal information (gender, education, age, past payment history, etc.).

Our final classifier, using the Logistic Regression model, did not perform as well as we hoped on our unseen test data, with a final F1 score of 0.471. Of the 6,000 clients in our test data, our model correctly predicted the default status of 4,141 clients. There were 1,129 incorrect predictions: either predicting that a customer will default on their payment when they have not, or that a customer will not default when they have. Incorrect predictions of either type can be costly for financial institutions, so we will continue to study our data and improve our model before it is put into production.
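The reported F1 score and the counts of correct and incorrect predictions all come from a confusion matrix on the test set. A minimal sketch of that evaluation with scikit-learn, using toy labels rather than the project's actual predictions:

```python
from sklearn.metrics import confusion_matrix, f1_score

# Toy labels: 1 = client defaulted, 0 = client did not default
y_true = [0, 0, 0, 1, 1, 1, 0, 1]
y_pred = [0, 1, 0, 1, 0, 1, 0, 0]

# For binary labels, ravel() yields (tn, fp, fn, tp)
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tn, fp, fn, tp)                 # → 3 1 2 2
print(f"{f1_score(y_true, y_pred):.3f}")  # → 0.571
```

Here tp + tn gives the correct predictions, while fp + fn gives the two kinds of costly mistakes discussed above.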

We use a dataset hosted by the UCI Machine Learning Repository. It was originally collected by researchers from Chung Hua University and Tamkang University. As the true probability of default cannot be directly observed, the targets were obtained through estimation, as stated by the dataset's authors. The dataset consists of 30,000 instances, each with 23 attributes and a target. The raw dataset is about 5.5 MB, and we split it into a training set (80%) and a test set (20%) for further use. The attributes include the client's gender, age, education, previous payment history, credit amount, etc.
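The 80/20 split can be sketched with scikit-learn as below; the DataFrame and column names here are stand-ins, not the project's actual code, and stratifying on the target is one reasonable choice for an imbalanced default label:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the real dataset; column names are hypothetical.
df = pd.DataFrame({
    "limit_bal": range(10),
    "default":   [0, 1] * 5,
})

# 80/20 split, stratified on the target so both sets keep the class balance.
train_df, test_df = train_test_split(
    df, test_size=0.2, random_state=123, stratify=df["default"]
)
print(len(train_df), len(test_df))  # → 8 2
```

Fixing `random_state` keeps the split reproducible across runs of the analysis pipeline.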

Editor: @flor14 Reviewers: Flora Ouedraogo, Angela Chen, Wilfred Hass, Maryan Agyby

florawendy19 commented 1 year ago

Data analysis review checklist

Reviewer: florawendy19

Conflict of interest

Code of Conduct

General checks

Documentation

Code quality

Reproducibility

Analysis report

Estimated hours spent reviewing:

Review Comments:

Your project is very concise: it has all the sections, and the plots are well suited to the problem you are solving. Below are a few points that may help improve your already great project!

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

WilfHass commented 1 year ago

Data analysis review checklist

Reviewer: @WilfHass

Conflict of interest

Code of Conduct

General checks

Documentation

Code quality

Reproducibility

Analysis report

Estimated hours spent reviewing: 2

Review Comments:

Great project!! I thought the analysis was well done and thought out. I specifically enjoyed the discussion on Type 1 and 2 errors using the final estimator. I only have a few comments about the report/analysis and any discussion is very much welcome!

  1. Changing the output of report.Rmd to include both html and github_document, so that a quick glance on GitHub lets you check the report without having to clone the entire repo. Also, the README.md usage section could use code blocks (with 3 backticks) so users can copy and paste commands, and specifying where (in which directory) the user should run the scripts is important!
  2. Changing where the results folder is stored as I believe it shouldn't exist in the data folder. Same with the eda_results folder.
  3. Having all of the imports at the top of the file is beneficial as well to avoid breaking up the flow. I did enjoy the commenting and easy readability of the code!
  4. I think the introduction of the analysis could benefit from explaining credibility further and how it is calculated. I saw a little about how credibility is determined in the classes (where the "0" class means the client pays off the debt, etc.), but I didn't feel these key points were explicitly stated in the introduction/summary. Also, since credibility is not necessarily classified like this in the real world (people can pay back parts of their bill without defaulting), I'm wondering how the project could be extrapolated to using multiple classes and how that would affect the results of the analysis (in the "Further Improvements" section!).
  5. I also think a more in-depth explanation of why the Logistic Regression model was chosen over another (such as Ridge) as the final estimator would be helpful. On a similar note, a lot of models were tried, but the research question specified looking at feature importances, so I was wondering why models like kNN were included in the comparison, since they don't have interpretable feature importances.
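On point 1, one common way to get both formats from a single report.Rmd is to list them in the YAML front matter (a sketch; the rest of the header would stay as in the repo):

```yaml
output:
  github_document: default
  html_document: default
```

Rendering with `rmarkdown::render("doc/report.Rmd", output_format = "all")` then produces both the .md file that GitHub displays inline and the standalone .html file.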

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

marianagyby commented 1 year ago

Data analysis review checklist

Reviewer: @marianagyby

Conflict of interest

Code of Conduct

General checks

Documentation

Code quality

Reproducibility

Analysis report

Estimated hours spent reviewing: ~2

Review Comments:

Overall, the project is well thought out, organized, and thorough. The scripts are well documented and easy to read, and the report does a good job of taking the reader through the steps of your analysis. Very well done!

  1. The link to the final report provided in the ReadMe (https://github.com/UBC-MDS/Credit_Card_Default_Prediction_Group13/blob/Makefile_report_hw/doc/report.md) leads to a 404 error. I believe you want to use this link instead: https://github.com/UBC-MDS/Credit_Card_Default_Prediction_Group13/blob/main/doc/report.md

  2. The methods section and EDA in the report are very well-detailed, making it easy to understand the data set you are working with and the steps of your analysis.

  3. It may not be common knowledge what "default" means in terms of credibility, so it may be helpful to the reader to clearly define what is meant by the "default" class in the report. Does "default" mean that the person is credible or not credible?

  4. The tables and plots are explained well in the discussion, but it can be useful to also include table/figure captions that summarize what they are showing. I see that you have figure and table captions specified in your code chunk options in the report.Rmd file, but they seem to be missing in the rendered report.md and report.html files.

  5. The discussion of the results and suggested improvements is very insightful and shows that the model, despite its limitations, is promising with further improvements. Well done!

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

qurat-azim commented 1 year ago

Changes to project based on received feedback

We have received valuable feedback from our peers as well as the teaching team for our project. On behalf of our group members, I would like to share the improvements we have been able to make.

  1. @WilfHass and @marianagyby: Thank you very much for suggesting the need for clarity about the classes in our prediction problem. Your valuable feedback has helped us improve our explanation of the question at hand. We agree that our readers could benefit from more clarification of the classifications, and we have added details on this in the introduction section. This can be verified in this commit.
  2. @florawendy19 and @WilfHass: Thank you again for pointing out the project's structural arrangement issues. We agree with you that it is more appropriate to have the results folder outside of the data folder; this gives readers easier access to the results without having to sift through the entire data folder. To make these changes, we have updated the source and destination folders in all our script files. We have also made the necessary amendments to the Makefile to incorporate the restructuring. This is the link to these changes committed in our project repository.
  3. @danielramandi (TA): Thank you very much for the Milestone 2 feedback about improving our scripts and functions. This has helped us make positive changes to our project. We agree that the functions were prone to accepting faulty inputs. We have now added assert statements to ensure the inputs for the functions are in proper format. We have also added comments and documentation to ensure these parts of the code are understandable. One such improvement of our code is documented in this commit.
  4. @danielramandi (TA) : Thank you for mentioning in the Milestone 2 feedback that we have not used inline R code in our report. We have made the recommended edits in the following commit where we have updated the script to write analysis results and scores in a separate file, which we then use with inline commands in the report. The incorporated changes can be verified via this commit.
  5. @danielramandi (TA) and @Flor14: Thank you very much for bringing to our attention in Milestone 2 that the code could not run reproducibly due to missing dependencies, and that the usage section command wasn't working. We have analysed all our dependencies and agree that there were reproducibility issues with the dependencies we listed. We have now added an environment.yaml file in our repo to create an environment that should suffice for a user to reproduce the project, and we have fixed the package names and versions. The command line usage section has also been updated accordingly. The project has come a long way since, and some of the commits addressing these issues are linked here, here, and here.
  6. @danielramandi (TA): Thank you for your feedback regarding the need for improved documentation for our scripts. We have come to agree that more comments and details about the code would be a good idea. We have added more comments and documentation to our scripts. In particular, we have added a short description for each script about the main functioning. This commit details the changes made in this regard.
  7. @florawendy19 and @WilfHass: You have rightly mentioned that the report for our project was not rendering in the most helpful format. We have included alternative formats for better viewing. This change is documented in the form of a commit here.
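The assert-based input validation described in point 3 can be sketched as follows; the function and column names here are hypothetical illustrations of the pattern, not the project's actual code:

```python
import pandas as pd

def split_features_target(df, target="default"):
    """Separate features from the target column, failing fast on bad input."""
    # Validate inputs up front so faulty arguments raise a clear error
    # instead of producing silently wrong results downstream.
    assert isinstance(df, pd.DataFrame), "df must be a pandas DataFrame"
    assert target in df.columns, f"df is missing the target column '{target}'"
    assert len(df) > 0, "df must not be empty"
    return df.drop(columns=[target]), df[target]

X, y = split_features_target(pd.DataFrame({"age": [25, 40], "default": [0, 1]}))
print(list(X.columns), list(y))  # → ['age'] [0, 1]
```

Calling the function with a DataFrame that lacks the target column now raises an AssertionError with a readable message rather than an obscure KeyError deep inside pandas.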

Thank you again for all the help and feedback. We have learnt a great deal from you.