UBC-MDS / data-analysis-review-2021


Submission: GROUP_11: Credit default payment predictor #16


liannah commented 2 years ago

Submitting authors: @liannah @Arushi282 @thayeylolu @karanpreetkaur

Repository: https://github.com/UBC-MDS/credit_default_prediction Report link: https://htmlpreview.github.io/?https://github.com/UBC-MDS/credit_default_prediction/blob/main/doc/credit_default_prediction_report.html Abstract/executive summary:

In this project, we built a classification model using Logistic Regression to predict whether credit account holders will make a default payment next month. The model was trained on features that hold information about the client's last six months of bill and payment history, as well as several other characteristics such as age, marital status, education, and gender. Overall, we are more interested in minimizing Type I error (predicting no default payment when in reality the client made a default payment the following month) than Type II error (predicting a default payment when in reality no default payment was made by the client), so we are using f1 as our primary scoring metric. Our model performed fairly well on the test data set, with an f1 score of ~0.53. Our recall and precision are moderately high, at ~0.48 and ~0.59 respectively. These scores are consistent with the training data set scores, so we can say that the model generalizes to unseen data. However, the scores are not high, and our model is error-prone: it can correctly classify default payments roughly half of the time. Incorrectly identifying default or no default can cost the company a great deal of money and reputation, so we recommend continued study to improve this prediction model before it is put into production at credit companies. Possible directions for improvement include feature engineering and collecting larger datasets from other countries (China, Canada, Japan).
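As a hedged illustration of the scoring setup described above (not the authors' actual pipeline), the sketch below computes f1, precision, and recall with scikit-learn; the synthetic stand-in data and variable names are assumptions for self-containment.

```python
# Minimal sketch of scoring a classifier with f1, precision, and recall.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data; the real project uses the UCI credit dataset.
X, y = make_classification(n_samples=1000, weights=[0.78], random_state=123)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=123)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_pred = model.predict(X_test)

# f1 balances precision (how many predicted defaults are real) against
# recall (how many real defaults are caught), which matters because a
# missed default (the report's Type I error) is the costly mistake here.
print(f"f1:        {f1_score(y_test, y_pred):.2f}")
print(f"precision: {precision_score(y_test, y_pred):.2f}")
print(f"recall:    {recall_score(y_test, y_pred):.2f}")
```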

The data set used in the project was created by Yeh, I. C., and Lien, C. H. (Yeh and Lien 2009) and made publicly available for download in the UCI Machine Learning Repository ("default of credit card clients" 2016). The data can be found here. The dataset is based on Taiwan's credit card client default cases from April to September. It has 30,000 examples, each representing a particular client's information. Each example has 24 variables, such as gender, age, marital status, the last 6 months of bills, and the last 6 months of payments, including the final "default payment next month" column: labeled 1 (client will default) and 0 (client will not default).
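For readers who want to pull the raw file themselves, here is a minimal loading sketch; the download URL and the `header=1` spreadsheet quirk are assumptions based on the UCI repository copy, not taken from the project's own scripts.

```python
import pandas as pd  # reading the .xls file also requires the xlrd engine

# Assumed UCI download URL for "default of credit card clients".
URL = (
    "https://archive.ics.uci.edu/ml/machine-learning-databases/00350/"
    "default%20of%20credit%20card%20clients.xls"
)

# The first spreadsheet row is a grouping header, so the real column
# names (LIMIT_BAL, SEX, AGE, ...) sit on the second row (header=1).
data = pd.read_excel(URL, header=1)
print(data.shape)  # expected: (30000, 25) -- an ID column plus the 24 variables
print(data["default payment next month"].value_counts(normalize=True))
```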

Editor: @flor14
Reviewers: @Mahm00d27 @jessie14 @ming0701 @Kendy-Tan

jessie14 commented 2 years ago

Data analysis review checklist

Reviewer: @jessie14

Conflict of interest

Code of Conduct

General checks

Documentation

Code quality

Reproducibility

Analysis report

Estimated hours spent reviewing: 1.5 hours

Review Comments:

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

ming0701 commented 2 years ago

Data analysis review checklist

Reviewer: @ming0701

Conflict of interest

Code of Conduct

General checks

Documentation

Code quality

Reproducibility

Analysis report

Estimated hours spent reviewing: 1 hour

Review Comments:

  1. In the Methods section, it would be good to discuss more about:
    • why logistic regression was selected (random forest is mentioned at the end of the report; why was it not used?)
    • how the hyperparameters were chosen
    • what train/test split ratio was used
  2. It would be better to have Results and Discussion as a separate section instead of grouping them under the Methods section.
  3. In the EDA part of the final report, it would be good to show that there is class imbalance, since this issue is mentioned later in the report when explaining the confusion matrix.
  4. To help readers understand the data, the final report could include more content from EDA.ipynb, for example a heatmap showing the correlation of the features.
  5. I would suggest adding the AP score and AUC, as these two scores are meaningful when there is a class imbalance issue (see the sketch after this list).
  6. There are some typos and references to the wrong figure; for example, "Figure 3 gives a glimpse on how we went about finding the best hyperparameters for the Logistic Regression model" should refer to Figure 5.
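A hedged sketch of how points 1 and 5 could look in scikit-learn: tuning the logistic-regression hyperparameter `C` with f1 as the scoring metric, then reporting the imbalance-robust AP and AUC scores. The parameter grid and synthetic data are illustrative assumptions, not the submission's actual code.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data with roughly the report's class imbalance.
X, y = make_classification(n_samples=2000, weights=[0.78], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": np.logspace(-3, 3, 7)},  # illustrative grid
    scoring="f1",  # the report's primary metric
)
search.fit(X_train, y_train)

# AP and AUC are computed from predicted probabilities, not hard labels.
proba = search.predict_proba(X_test)[:, 1]
print(f"best C: {search.best_params_['C']}")
print(f"AP:  {average_precision_score(y_test, proba):.2f}")
print(f"AUC: {roc_auc_score(y_test, proba):.2f}")
```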

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

Kendy-Tan commented 2 years ago

Reviewer: @Kendy-Tan

Conflict of interest

Code of Conduct

General checks

Documentation

Code quality

Reproducibility

Analysis report

Estimated hours spent reviewing: 1.5 hours

Review Comments:

  1. The data is imbalanced, with the non-default class making up around 73% of it, yet the test accuracy of the updated model dropped to 77%. Even though accuracy is not the target metric, it would be better to explain the possible reasons for this decrease.
  2. In the EDA, the difference between the two classes' feature distributions is not really observable, and the non-overlapping parts may be an artifact of the imbalanced data; consider trying other types of graphs to show the difference (see the sketch after this list).
  3. The figure-number references in the report do not match the figure captions or the numbers on the figures.
  4. I suggest separating the results and the conclusion into two subsections, since the conclusion is currently hard to pick out: the ending section is part of the results from the second model, so you may consider reordering the last few paragraphs.
  5. I also suggest adding the AP score and AUC as evaluation metrics, since they are useful for showing the goodness of the model in a class-imbalance situation.
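As a hedged illustration of point 2, the sketch below draws per-class densities (each normalized separately) for one feature, so the minority default class is not visually swamped by the majority class. The column names follow the UCI file; the KDE-based plotting choice is an assumption, not the project's own plotting code.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from scipy.stats import gaussian_kde

# Assumed UCI download URL; header=1 skips the grouping row in the .xls.
URL = ("https://archive.ics.uci.edu/ml/machine-learning-databases/00350/"
       "default%20of%20credit%20card%20clients.xls")
data = pd.read_excel(URL, header=1)

target = "default payment next month"
xs = np.linspace(data["AGE"].min(), data["AGE"].max(), 200)
fig, ax = plt.subplots()
for label, name in [(0, "no default"), (1, "default")]:
    kde = gaussian_kde(data.loc[data[target] == label, "AGE"])
    ax.plot(xs, kde(xs), label=name)  # each curve integrates to 1 on its own
ax.set_xlabel("AGE")
ax.set_ylabel("density (per class)")
ax.legend()
plt.show()
```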

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

mahm00d27 commented 2 years ago

Data analysis review checklist

Reviewer: @Mahm00d27

Conflict of interest

Code of Conduct

General checks

Documentation

Code quality

Reproducibility

Analysis report

Estimated hours spent reviewing: 3 hours

Review Comments:

  1. In the "License", the copyright can be claimed by the authors, instead of referring to the "Master of Data Science at the University of British Columbia".
  2. In the "License", the "Project" could be referred, instead of "Software".
  3. The report though elaborately define the problem in hand but insufficiently shed lights on the rationale of the project. Suggestions can be, to include some information on current practices, usefulness of the prediction and criticism of existing other methods. Suggestion would be to think of a proper "signing off" in the report, where it seems that the writer has more to say. Like pointing to a sequel.
  4. The "Usage" section is written with an authoritative choice of language. Can be passive, like "By cloning this GitHub repository, the analysis can be replicated"
  5. Rather than using stacked bar, box-plots or violin-plot could have captured more insights during the exploratory data analysis. Results and discussion can easily be broken down to pieces for comfortable reading. Specially here, a "Conclusion" would be more appropriate instead of discussion, because we are interpreting actual results.
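As a hedged illustration of point 5, here is a violin-plot sketch; seaborn and the LIMIT_BAL example feature are assumed choices, not the project's actual plotting setup.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Assumed UCI download URL; header=1 skips the grouping row in the .xls.
URL = ("https://archive.ics.uci.edu/ml/machine-learning-databases/00350/"
       "default%20of%20credit%20card%20clients.xls")
data = pd.read_excel(URL, header=1)

# One violin per class: the shape shows the full distribution of the
# feature, unlike a stacked bar chart of binned counts.
ax = sns.violinplot(data=data, x="default payment next month", y="LIMIT_BAL")
ax.set_xticklabels(["no default", "default"])
plt.show()
```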

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

thayeylolu commented 2 years ago

Thank you all @mahm00d27 @jessie14 @ming0701 @Kendy-Tan for your feedback. We (@thayeylolu, @karanpreetkaur, @liannah, and @Arushi282) have made some of the changes you proposed.