UBC-MDS / data-analysis-review-2021


Submission: GROUP_11: Credit default payment predictor #16


liannah commented 2 years ago

Submitting authors: @liannah @Arushi282 @thayeylolu @karanpreetkaur

Repository: https://github.com/UBC-MDS/credit_default_prediction Report link: https://htmlpreview.github.io/?https://github.com/UBC-MDS/credit_default_prediction/blob/main/doc/credit_default_prediction_report.html Abstract/executive summary:

In this project, we built a classification model using Logistic Regression to predict whether credit account holders will make a default payment next month. The model was trained on features that hold information about the client's last six months of bill and payment history, as well as several other characteristics such as age, marital status, education, and gender. Overall, we are more interested in minimizing Type I error (predicting no default payment when in reality the client made a default payment the following month) than Type II error (predicting a default payment when in reality no default payment was made by the client), so we are using f1 as our primary scoring metric. Our model performed fairly well on the test data set, with an f1 score of ~0.53. Our recall and precision are moderately high, at ~0.48 and ~0.59 respectively. These scores are consistent with the training data set scores, so we can say that the model generalizes to unseen data. However, the scores are not high, and our model is error-prone: it can correctly classify default payments roughly half of the time. Incorrectly identifying default or no default can cost the company a great deal of money and reputation, so we recommend continued study to improve this prediction model before it is put into production at credit companies. Possible directions for improvement include feature engineering and collecting larger datasets from other countries (China, Canada, Japan).
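As a hedged illustration of the scoring setup described above (not the authors' actual pipeline), the sketch below computes f1, precision, and recall with scikit-learn; the synthetic stand-in data and variable names are assumptions for self-containment.

```python
# Minimal sketch of scoring a classifier with f1, precision, and recall.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data; the real project uses the UCI credit dataset.
X, y = make_classification(n_samples=1000, weights=[0.78], random_state=123)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=123)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_pred = model.predict(X_test)

# f1 balances precision (how many predicted defaults are real) against
# recall (how many real defaults are caught), which matters because a
# missed default (the report's Type I error) is the costly mistake here.
print(f"f1:        {f1_score(y_test, y_pred):.2f}")
print(f"precision: {precision_score(y_test, y_pred):.2f}")
print(f"recall:    {recall_score(y_test, y_pred):.2f}")
```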

The data set used in the project was created by Yeh, I. C., and Lien, C. H. (Yeh and Lien 2009) and made publicly available for download in the UCI Machine Learning Repository ("default of credit card clients" 2016). The data can be found here. The dataset is based on Taiwan's credit card client default cases from April to September. It has 30,000 examples, each representing a particular client's information. Each example has 24 variables, such as gender, age, marital status, the last 6 months of bills, and the last 6 months of payments, including the final "default payment next month" column: labeled 1 (client will default) and 0 (client will not default).
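For readers who want to pull the raw file themselves, here is a minimal loading sketch; the download URL and the `header=1` spreadsheet quirk are assumptions based on the UCI repository copy, not taken from the project's own scripts.

```python
import pandas as pd  # reading the .xls file also requires the xlrd engine

# Assumed UCI download URL for "default of credit card clients".
URL = (
    "https://archive.ics.uci.edu/ml/machine-learning-databases/00350/"
    "default%20of%20credit%20card%20clients.xls"
)

# The first spreadsheet row is a grouping header, so the real column
# names (LIMIT_BAL, SEX, AGE, ...) sit on the second row (header=1).
data = pd.read_excel(URL, header=1)
print(data.shape)  # expected: (30000, 25) -- an ID column plus the 24 variables
print(data["default payment next month"].value_counts(normalize=True))
```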

Editor: @flor14
Reviewers: @Mahm00d27 @jessie14 @ming0701 @Kendy-Tan

jessie14 commented 2 years ago

Data analysis review checklist

Reviewer: @jessie14

Conflict of interest

Code of Conduct

General checks

Documentation

Code quality

Reproducibility

Analysis report

Estimated hours spent reviewing: 1.5 hours

Review Comments:

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

ming0701 commented 2 years ago

Data analysis review checklist

Reviewer: @ming0701

Conflict of interest

Code of Conduct

General checks

Documentation

Code quality

Reproducibility

Analysis report

Estimated hours spent reviewing: 1 hour

Review Comments:

  1. In the Methods section, it would be good to discuss more about:
    • why logistic regression was selected (random forest is mentioned at the end of the report; why was it not used?)
    • how the hyperparameters were chosen
    • what train/test split ratio was used
  2. It would be better to have Results and Discussion as a separate section instead of grouping them under the Methods section.
  3. In the EDA part of the final report, it would be good to show that there is class imbalance, since this issue is mentioned later in the report when explaining the confusion matrix.
  4. To help readers understand the data, the final report could include more content from EDA.ipynb, for example a heatmap showing the correlation of the features.
  5. I would suggest adding the AP score and AUC, as these two scores are meaningful when there is a class imbalance issue (see the sketch after this list).
  6. There are some typos and references to the wrong figure; for example, "Figure 3 gives a glimpse on how we went about finding the best hyperparameters for the Logistic Regression model" should refer to Figure 5.
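A hedged sketch of how points 1 and 5 could look in scikit-learn: tuning the logistic-regression hyperparameter `C` with f1 as the scoring metric, then reporting the imbalance-robust AP and AUC scores. The parameter grid and synthetic data are illustrative assumptions, not the submission's actual code.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data with roughly the report's class imbalance.
X, y = make_classification(n_samples=2000, weights=[0.78], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": np.logspace(-3, 3, 7)},  # illustrative grid
    scoring="f1",  # the report's primary metric
)
search.fit(X_train, y_train)

# AP and AUC are computed from predicted probabilities, not hard labels.
proba = search.predict_proba(X_test)[:, 1]
print(f"best C: {search.best_params_['C']}")
print(f"AP:  {average_precision_score(y_test, proba):.2f}")
print(f"AUC: {roc_auc_score(y_test, proba):.2f}")
```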

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

Kendy-Tan commented 2 years ago

Reviewer: @Kendy-Tan

Conflict of interest

Code of Conduct

General checks

Documentation

Code quality

Reproducibility

Analysis report

Estimated hours spent reviewing: 1.5 hours

Review Comments:

  1. The data is imbalanced, with the non-default class making up around 73% of it, yet the test accuracy of the updated model dropped to 77%. Even though accuracy is not the target metric, it would be better to explain the possible reasons for this decrease.
  2. In the EDA, the difference between the two classes' feature distributions is not really observable, and the non-overlapping parts may be an artifact of the imbalanced data; consider trying other types of graphs to show the difference (see the sketch after this list).
  3. The figure-number references in the report do not match the figure captions or the numbers on the figures.
  4. I suggest separating the results and the conclusion into two subsections, since the conclusion is currently hard to pick out: the ending section is part of the results from the second model, so you may consider reordering the last few paragraphs.
  5. I also suggest adding the AP score and AUC as evaluation metrics, since they are useful for showing the goodness of the model in a class-imbalance situation.
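As a hedged illustration of point 2, the sketch below draws per-class densities (each normalized separately) for one feature, so the minority default class is not visually swamped by the majority class. The column names follow the UCI file; the KDE-based plotting choice is an assumption, not the project's own plotting code.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from scipy.stats import gaussian_kde

# Assumed UCI download URL; header=1 skips the grouping row in the .xls.
URL = ("https://archive.ics.uci.edu/ml/machine-learning-databases/00350/"
       "default%20of%20credit%20card%20clients.xls")
data = pd.read_excel(URL, header=1)

target = "default payment next month"
xs = np.linspace(data["AGE"].min(), data["AGE"].max(), 200)
fig, ax = plt.subplots()
for label, name in [(0, "no default"), (1, "default")]:
    kde = gaussian_kde(data.loc[data[target] == label, "AGE"])
    ax.plot(xs, kde(xs), label=name)  # each curve integrates to 1 on its own
ax.set_xlabel("AGE")
ax.set_ylabel("density (per class)")
ax.legend()
plt.show()
```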

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

mahm00d27 commented 2 years ago

Data analysis review checklist

Reviewer: @Mahm00d27

Conflict of interest

Code of Conduct

General checks

Documentation

Code quality

Reproducibility

Analysis report

Estimated hours spent reviewing: 3 hours

Review Comments:

  1. In the "License", the copyright can be claimed by the authors, instead of referring to the "Master of Data Science at the University of British Columbia".
  2. In the "License", the "Project" could be referred, instead of "Software".
  3. The report though elaborately define the problem in hand but insufficiently shed lights on the rationale of the project. Suggestions can be, to include some information on current practices, usefulness of the prediction and criticism of existing other methods. Suggestion would be to think of a proper "signing off" in the report, where it seems that the writer has more to say. Like pointing to a sequel.
  4. The "Usage" section is written with an authoritative choice of language. Can be passive, like "By cloning this GitHub repository, the analysis can be replicated"
  5. Rather than using stacked bar, box-plots or violin-plot could have captured more insights during the exploratory data analysis. Results and discussion can easily be broken down to pieces for comfortable reading. Specially here, a "Conclusion" would be more appropriate instead of discussion, because we are interpreting actual results.
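As a hedged illustration of point 5, here is a violin-plot sketch; seaborn and the LIMIT_BAL example feature are assumed choices, not the project's actual plotting setup.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Assumed UCI download URL; header=1 skips the grouping row in the .xls.
URL = ("https://archive.ics.uci.edu/ml/machine-learning-databases/00350/"
       "default%20of%20credit%20card%20clients.xls")
data = pd.read_excel(URL, header=1)

# One violin per class: the shape shows the full distribution of the
# feature, unlike a stacked bar chart of binned counts.
ax = sns.violinplot(data=data, x="default payment next month", y="LIMIT_BAL")
ax.set_xticklabels(["no default", "default"])
plt.show()
```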

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

thayeylolu commented 2 years ago

Thank you all @mahm00d27 @jessie14 @ming0701 @Kendy-Tan for your feedback. We (@thayeylolu, @karanpreetkaur, @liannah, and @Arushi282) have made some of the changes you proposed.