Submission: GROUP 18: Credit Card Default Prediction

Submitting authors: @jamesktkim @Davidwang11 @ciciecho-ds @garhwalinauna

Repository: https://github.com/UBC-MDS/Credit-Card-Default-Prediction Report link: https://github.com/UBC-MDS/Credit-Card-Default-Prediction/blob/main/reports/_build/pdf/book.pdf

Abstract/executive summary: In this project, we attempt to build a classification model to predict whether a credit card customer is likely to default or not. Our research question is: given characteristics and payment history of a customer, is he or she likely to default on the credit card payment next month?

Our dataset contains 30,000 observations and 23 features, with no missing values. It was put together by I-Cheng Yeh at the Department of Information Management, Chun Hua University. We obtained this data from the UCI Machine Learning Repository. After training and evaluating different classification models, we selected and tuned a logistic regression model and our logistic model resulted in AUC of 0.768.

Editor: @flor14 Reviewer: @jennifer-hoang @gutermanyair @aimee0317 @zackt113

[ ] I agree to abide by MDS's Code of Conduct during the review process and in maintaining my package should it be accepted.

Peer Review

Reviewer: @gutermanyair

Conflict of interest

[x] As the reviewer I confirm that I have no conflicts of interest for me to review this work.

Code of Conduct

[x] I confirm that I read and will adhere to the MDS code of conduct.

General checks

[x] Repository: Is the source code for this data analysis available? Is the repository well organized and easy to navigate?
[x] License: Does the repository contain a plain-text LICENSE file with the contents of an OSI approved software license?

Documentation

[x] Installation instructions: Is there a clearly stated list of dependencies?
[x] Example usage: Do the authors include examples of how to use the software to reproduce the data analysis?
[x] Functionality documentation: Is the core functionality of the data analysis software documented to a satisfactory level?
[x] Community guidelines: Are there clear guidelines for third parties wishing to 1) Contribute to the software 2) Report issues or problems with the software 3) Seek support

Code quality

[x] Readability: Are scripts, functions, objects, etc., well named? Is it relatively easy to understand the code?
[x] Style guidelides: Does the code adhere to well known language style guides?
[x] Modularity: Is the code suitably abstracted into scripts and functions?
[x] Tests: Are there automated tests or manual steps described so that the function of the software can be verified? Are they of sufficient quality to ensure software robsutness?

Reproducibility

[x] Data: Is the raw data archived somewhere? Is it accessible?
[x] Computational methods: Is all the source code required for the data analysis available?
[x] Conditions: Is there a record of the necessary conditions (software dependencies) needed to reproduce the analysis? Does there exist an easy way to obtain the computational environment needed to reproduce the analysis?
[x] Automation: Can someone other than the authors easily reproduce the entire data analysis?

Analysis report

[ ] Authors: Does the report include a list of authors with their affiliations?
[x] What is the question: Do the authors clearly state the research question being asked?
[x] Importance: Do the authors clearly state the importance for this research question?
[ ] Background: Do the authors provide sufficient background information so that readers can understand the report?
[x] Methods: Do the authors clearly describe and justify the methodology used in the data analysis? Do the authors communicate any assumptions or limitations of their methodologies?
[x] Results: Do the authors clearly communicate their findings through writing, tables and figures?
[x] Conclusions: Are the conclusions presented by the authors correct?
[x] References: Do all archival references that should have a DOI list one (e.g., papers, datasets, software)?
[x] Writing quality: Is the writing of good quality, concise, engaging?

Estimated hours spent reviewing:

1.5

Review Comments:

Please provide more detailed feedback here on what was done particularly well, and what could be improved. It is especially important to elaborate on items that you were not able to check off in the list above.

First thing I would want to suggest is that your names are missing from the final report/book, I am sure you all worked very hard on this report so let the world know who you are, and that this is your work!
I do really like the plots you created for your EDA but I just wish you would go into more detail on what conclusions you derived from these plots, as you just explained what EDA you were doing but did not go into any details about what you concluded from it. A suggestion could be to go into some detail about what you learned about some of the features in your data.
I want to know more about why you decided to use ROC_AUC as your scoring metric? Why did you decide to use that one over all the others listed in your table. Maybe you guys could add a little paragraph under your table just explaining your thought process there.
Something I find myself wanting as i read your report is a small little conclusion as to what is going on for each of your plots/figures. For example, for someone that isn't too familiar with the plots you are using, it would be super helpful to just explain the conclusions you are drawing from each plot/figure. I know that you included the plots to strengthen your report and they do strengthen it, but it would strengthen your report even more if you just quickly summarized the conclusions from each one briefly below.
As far as your code and project reproducibility goes, it looks great to me! A job well done! I love how you included many many tests as this is great practice and makes me very confident in your code.
Overall I think the only place for some improvement is just your report. I just find myself wanting more background as to why you are doing certain things in your analysis. I believe that just adding short mentions as to why you are taking certain steps and making certain decisions can go a very long way in greatly improving your report. Some quick example could be for example going into more detail as to why you are applying certain transformers to your data, or another example could be to tell us a bit more about why you are hyper tuning the model and over what range of values, and even why you picked this range.
Great work, group 18! I did really enjoy reviewing your project, keep up the good work!

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

Reviewer:

Conflict of interest

[x] As the reviewer I confirm that I have no conflicts of interest for me to review this work.

Code of Conduct

[x] I confirm that I read and will adhere to the MDS code of conduct.

General checks

[x] Repository: Is the source code for this data analysis available? Is the repository well organized and easy to navigate?
[x] License: Does the repository contain a plain-text LICENSE file with the contents of an OSI approved software license?

Documentation

[x] Installation instructions: Is there a clearly stated list of dependencies?
[x] Example usage: Do the authors include examples of how to use the software to reproduce the data analysis?
[x] Functionality documentation: Is the core functionality of the data analysis software documented to a satisfactory level?
[x] Community guidelines: Are there clear guidelines for third parties wishing to 1) Contribute to the software 2) Report issues or problems with the software 3) Seek support

Code quality

[x] Readability: Are scripts, functions, objects, etc., well named? Is it relatively easy to understand the code?
[x] Style guidelides: Does the code adhere to well known language style guides?
[x] Modularity: Is the code suitably abstracted into scripts and functions?
[x] Tests: Are there automated tests or manual steps described so that the function of the software can be verified? Are they of sufficient quality to ensure software robsutness?

Reproducibility

[x] Data: Is the raw data archived somewhere? Is it accessible?
[x] Computational methods: Is all the source code required for the data analysis available?
[x] Conditions: Is there a record of the necessary conditions (software dependencies) needed to reproduce the analysis? Does there exist an easy way to obtain the computational environment needed to reproduce the analysis?
[x] Automation: Can someone other than the authors easily reproduce the entire data analysis?

Analysis report

[ ] Authors: Does the report include a list of authors with their affiliations?
[x] What is the question: Do the authors clearly state the research question being asked?
[x] Importance: Do the authors clearly state the importance for this research question?
[x] Background: Do the authors provide sufficient background information so that readers can understand the report?
[x] Methods: Do the authors clearly describe and justify the methodology used in the data analysis? Do the authors communicate any assumptions or limitations of their methodologies?
[x] Results: Do the authors clearly communicate their findings through writing, tables and figures?
[x] Conclusions: Are the conclusions presented by the authors correct?
[x] References: Do all archival references that should have a DOI list one (e.g., papers, datasets, software)?
[x] Writing quality: Is the writing of good quality, concise, engaging?

Estimated hours spent reviewing:

1.75 hrs

Review Comments:

Overall, well done, Group 18! I like the idea of credit card default prediction, very relevant to the banking industry and our daily life.

The names were missed in the report, please include your names in the report as those works are great, so please credit to yourself!
In the section on choosing the best model, it was great to include diverse models and a variety of scoring matrices, but I think it would be better to provide some details and contexts on the reason why you chose those scoring matrices instead of using only one of it, such as recall. And also, in the end, you chose logistic regression as the final model to conduct hyper-parameter optimization. What is the reason for choosing ROC AUC as the final determinant instead of recall? I think it is probably because of the class imbalance issue and we have been taught that it would be better to use ROC in the circumstance?
I think it would be great to include train scores for those models as well, as I am wondering whether the model has an overfitting or underfitting problem.
As the model takes a lot of features into account, I think it would be interesting if you can carry out feature selection to reduce the number of features to avoid overfitting or expensive data collection cost. For example, Recursive Features Elimination would be a good start.
As a reader, I would appreciate it if there is a table, which listed the magnitude of each feature. Therefore, I can conclude some interesting findings from the table. I would be interested to know which factor is the main driver of the credit card default prediction and the direction of how it's going to drive the prediction.
The reproducibility is great as I have no issue running the code.
Another thing that may be able to improve is the naming of files. For example, the final report is named "book.pdf". I think it would be better to include your project name or main findings. Just like what you did on the repo name, it was informative!
Also, one last point, it might be better to discuss more your future plan on the project in the reservations and suggestions section. For example, you mentioned it would be better if there are more relevant features that could be included and the data set is a bit outdated. How would you deal with the issues? Any preliminary or tentative plan? Those limitations are very good points, so I would like to hear more about how would you solve the issues in the future!

Again, fairly good work! And I got some great ideas for our project after reading yours! Thank you!

Zack

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

Peer Review Checklist

Reviewer:

Conflict of interest

[x] As the reviewer I confirm that I have no conflicts of interest for me to review this work.

Code of Conduct

[x] I confirm that I read and will adhere to the MDS code of conduct.

General checks

[x] Repository: Is the source code for this data analysis available? Is the repository well organized and easy to navigate?
[x] License: Does the repository contain a plain-text LICENSE file with the contents of an OSI approved software license?

Documentation

[x] Installation instructions: Is there a clearly stated list of dependencies?
[x] Example usage: Do the authors include examples of how to use the software to reproduce the data analysis?
[x] Functionality documentation: Is the core functionality of the data analysis software documented to a satisfactory level?
[x] Community guidelines: Are there clear guidelines for third parties wishing to 1) Contribute to the software 2) Report issues or problems with the software 3) Seek support

Code quality

[x] Readability: Are scripts, functions, objects, etc., well named? Is it relatively easy to understand the code?
[x] Style guidelides: Does the code adhere to well known language style guides?
[x] Modularity: Is the code suitably abstracted into scripts and functions?
[x] Tests: Are there automated tests or manual steps described so that the function of the software can be verified? Are they of sufficient quality to ensure software robsutness?

Reproducibility

[x] Data: Is the raw data archived somewhere? Is it accessible?
[x] Computational methods: Is all the source code required for the data analysis available?
[x] Conditions: Is there a record of the necessary conditions (software dependencies) needed to reproduce the analysis? Does there exist an easy way to obtain the computational environment needed to reproduce the analysis?
[x] Automation: Can someone other than the authors easily reproduce the entire data analysis?

Analysis report

[ ] Authors: Does the report include a list of authors with their affiliations?
[x] What is the question: Do the authors clearly state the research question being asked?
[x] Importance: Do the authors clearly state the importance for this research question?
[x] Background: Do the authors provide sufficient background information so that readers can understand the report?
[x] Methods: Do the authors clearly describe and justify the methodology used in the data analysis? Do the authors communicate any assumptions or limitations of their methodologies?
[x] Results: Do the authors clearly communicate their findings through writing, tables and figures?
[x] Conclusions: Are the conclusions presented by the authors correct?
[x] References: Do all archival references that should have a DOI list one (e.g., papers, datasets, software)?
[x] Writing quality: Is the writing of good quality, concise, engaging?

Estimated hours spent reviewing:

1.5 hours

Review Comments:

Overall, you've done a fantastic job and the topic itself is intriguing. Here are my comments:

Your names used to be missing from the report. It would be a great idea to remember to give yourselves credits for working on the project and successfully completing two milestones.
In the initial EDA, it would be better to talk about general strategy to solve class imbalance issue vs being specific about setting the parameter to "balanced." We cannot reasonably assume all the readers know that the sklearn package will be used; some might become perplexed regarding the syntax you are referring to. If we want to be specific by mentioning the sklearn package and the parameter early, it might be a good idea to elaborate on what setting the class_weight to "balanced" entails.
It looks like the resolution of the EDA plots is a bit lower than ideal because I cannot see the labels clearly. For instance, in the Fig2, when I look at the marriage subplot, I cannot see clearly each level contained in the marriage feature. My group received comments from the TA regarding plot resolution so I think it might be helpful to point out here as well. From my understanding, the default resolution was 2 while exporting if you are using Altair. There might be a way to increase that.
This point is very minor, but one of your titles is "splitting and cleaning the model." I believe you meant "splitting and cleaning the data." While we can reasonably assume that people in our program know what you meant, it might be helpful to make the titles more accurate so people without in-depth knowledge can understand what you are doing a bit better.
It would be helpful to show the training scores. Although we care more about the test scores, by looking at the training scores, we can infer whether we have underfitting or overfitting issues.
In the result section, it might be worth elaborating a bit more on the interpretation of the ROC curve as well as the confusion matrix in the context of your question and model.

Again, overall, I think all of you did a wonderful job! I am looking forward to seeing how this project develops towards the end of our course!

Best, Amelia

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

Data analysis review checklist

Reviewer: @jennifer-hoang

Conflict of interest

[x] As the reviewer I confirm that I have no conflicts of interest for me to review this work.

Code of Conduct

[x] I confirm that I read and will adhere to the MDS code of conduct.

General checks

[x] Repository: Is the source code for this data analysis available? Is the repository well organized and easy to navigate?
[x] License: Does the repository contain a plain-text LICENSE file with the contents of an OSI approved software license?

Documentation

[x] Installation instructions: Is there a clearly stated list of dependencies?
[x] Example usage: Do the authors include examples of how to use the software to reproduce the data analysis?
[x] Functionality documentation: Is the core functionality of the data analysis software documented to a satisfactory level?
[x] Community guidelines: Are there clear guidelines for third parties wishing to 1) Contribute to the software 2) Report issues or problems with the software 3) Seek support

Code quality

[x] Readability: Are scripts, functions, objects, etc., well named? Is it relatively easy to understand the code?
[x] Style guidelides: Does the code adhere to well known language style guides?
[x] Modularity: Is the code suitably abstracted into scripts and functions?
[x] Tests: Are there automated tests or manual steps described so that the function of the software can be verified? Are they of sufficient quality to ensure software robsutness?

Reproducibility

[x] Data: Is the raw data archived somewhere? Is it accessible?
[x] Computational methods: Is all the source code required for the data analysis available?
[x] Conditions: Is there a record of the necessary conditions (software dependencies) needed to reproduce the analysis? Does there exist an easy way to obtain the computational environment needed to reproduce the analysis?
[x] Automation: Can someone other than the authors easily reproduce the entire data analysis?

Analysis report

[x] Authors: Does the report include a list of authors with their affiliations?
[x] What is the question: Do the authors clearly state the research question being asked?
[x] Importance: Do the authors clearly state the importance for this research question?
[x] Background: Do the authors provide sufficient background information so that readers can understand the report?
[x] Methods: Do the authors clearly describe and justify the methodology used in the data analysis? Do the authors communicate any assumptions or limitations of their methodologies?
[x] Results: Do the authors clearly communicate their findings through writing, tables and figures?
[ ] Conclusions: Are the conclusions presented by the authors correct?
[ ] References: Do all archival references that should have a DOI list one (e.g., papers, datasets, software)?
[x] Writing quality: Is the writing of good quality, concise, engaging?

Estimated hours spent reviewing: 1.5 hr

Review Comments:

Excellent work, group 18! Your project and scripts were very organized and well-documented, and I learned a lot from reviewing them. I have only a few suggestions regarding the analysis report:

It was mentioned that the data was split using a default setting of 0.2 split ratio. It would be helpful to clarify for a general audience what this means - was 20% the size of the train or test set?
It would be helpful to discuss the advantages of using ROC_AUC as the main scoring metric for choosing your best model.
I was curious about which features were most useful for your prediction problem in your Logistic Regression model, and it would be interesting to see a feature importance table.
In the 'Model Results' section, it would be helpful to state the scoring metrics for your model on the test data for the reader to compare with the validation scores above.
I see that you have included an 'Executive Summary' at the start of your report, but it would be great to also have a final 'Summary' at the end of the report to summarize your test results and to interpret these metrics (ROC AUC, f1, recall) in the context of your classification problem.
References: I was able to see your References section, but wasn't able to see in-text citations in the pdf report. It would be great to integrate these citations into the report to give credit to the authors of the packages used and to help readers find the tools that you used.

Overall, this project was really well done!

Best, Jennifer

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

Hi Everyone,

Thanks for all the valuable feedbacks. We have reviewed all your feedbacks in addition to the ones from TAs and the instructor and have implemented the following changes collectively:

1/ Names were missing in the final report but they are added now (Peer) https://github.com/UBC-MDS/Credit-Card-Default-Prediction/commit/bdd3d3436cee70c3b3043bd6afc5f20f2ec3760b

2/ Paragraph about reasons for choosing ROC_AUC as the metric is added now (Peer) https://github.com/UBC-MDS/Credit-Card-Default-Prediction/commit/d062a9bcee57c66451a101bd785ff68f15d4f837

3/ One of the subtitles of the report is fixed as “splitting and cleaning the data” (Peer) https://github.com/UBC-MDS/Credit-Card-Default-Prediction/commit/d062a9bcee57c66451a101bd785ff68f15d4f837

4/ The figures / tables are now referenced in the report with numbers (Peer & TA) https://github.com/UBC-MDS/Credit-Card-Default-Prediction/commit/606ce0847f6d9e9e9c00df3299c5a8e8ba779d2a

5/ Train scores are added in the result (Peer) https://github.com/UBC-MDS/Credit-Card-Default-Prediction/commit/456429ed68cdd462d86085d07c7f9ce7bba337c1

6/ Model coefficients are added in the result to show which features were most useful (Peer) https://github.com/UBC-MDS/Credit-Card-Default-Prediction/commit/cee0bfae26d1a46f64ecb2493765184b28d04bc0

7/ Our names are added in the license file (Florencia) https://github.com/UBC-MDS/Credit-Card-Default-Prediction/commit/b9f93d3d8ddcab3ef3eeb6c6894c9e45a2e91456

8/ Our emails are added in the code of conduct file (Florencia) https://github.com/UBC-MDS/Credit-Card-Default-Prediction/commit/f37511ef09104e31098d2858b15266cc79dec9c7

9/ Executive summary is added in the report (Florencia) https://github.com/UBC-MDS/Credit-Card-Default-Prediction/commit/c14627af6fafcb31aeae6966897454d7e939bbad

Thanks again for your constructive feedbacks which greatly helped us to improve our project and the final report. If there is any other issue or concern, please do not hesitate to let us know.

UBC-MDS / data-analysis-review-2021