UBC-MDS / data-analysis-review-2022


Submission: Group 13: Credit Card Default Prediction #11

Open renee-kwon opened 1 year ago

renee-kwon commented 1 year ago

Submitting authors: @ChesterAiGo @hcwang24 @qurat-azim @renee-kwon

Repository: https://github.com/UBC-MDS/Credit_Card_Default_Prediction_Group13

Report link: https://github.com/UBC-MDS/Credit_Card_Default_Prediction_Group13/blob/main/doc/report.html

Abstract/executive summary: In the field of risk management, one of the most common problems is default prediction. Predicting defaults allows companies to assess each client's credibility, analyze risk levels, and optimize decisions for better business economics. In this project, we aim to learn and predict a credit card holder's credibility based on basic personal information (gender, education, age, past payment history, etc.).

Our final classifier, using the Logistic Regression model, did not perform as well as we hoped on our unseen test data, with a final F1 score of 0.471. Of the 6,000 clients in our test data, our model correctly predicted the default status of 4,141 clients. There were 1,129 incorrect predictions: either predicting that a customer will default on their payment when they have not, or that a customer will not default when they have. Incorrect predictions of either type can be costly for financial institutions, so we will continue to study our data and improve our model before it is put into production.
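The reported F1 score and the counts of correct and incorrect predictions all come from a confusion matrix on the test set. A minimal sketch of that evaluation with scikit-learn, using toy labels rather than the project's actual predictions:

```python
from sklearn.metrics import confusion_matrix, f1_score

# Toy labels: 1 = client defaulted, 0 = client did not default
y_true = [0, 0, 0, 1, 1, 1, 0, 1]
y_pred = [0, 1, 0, 1, 0, 1, 0, 0]

# For binary labels, ravel() yields (tn, fp, fn, tp)
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tn, fp, fn, tp)                 # → 3 1 2 2
print(f"{f1_score(y_true, y_pred):.3f}")  # → 0.571
```

Here tp + tn gives the correct predictions, while fp + fn gives the two kinds of costly mistakes discussed above.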

We use a dataset hosted by the UCI Machine Learning Repository. It was originally collected by researchers from Chung Hua University and Tamkang University. As the true probability of default cannot be directly observed, the targets were obtained through estimation, as stated by the dataset's authors. The dataset consists of 30,000 instances, each with 23 attributes and a target. The raw dataset is about 5.5 MB, and we split it into a training set (80%) and a test set (20%) for further use. The attributes include the client's gender, age, education, previous payment history, credit amount, etc.
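The 80/20 split can be sketched with scikit-learn as below; the DataFrame and column names here are stand-ins, not the project's actual code, and stratifying on the target is one reasonable choice for an imbalanced default label:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the real dataset; column names are hypothetical.
df = pd.DataFrame({
    "limit_bal": range(10),
    "default":   [0, 1] * 5,
})

# 80/20 split, stratified on the target so both sets keep the class balance.
train_df, test_df = train_test_split(
    df, test_size=0.2, random_state=123, stratify=df["default"]
)
print(len(train_df), len(test_df))  # → 8 2
```

Fixing `random_state` keeps the split reproducible across runs of the analysis pipeline.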

Editor: @flor14 Reviewers: Flora Ouedraogo, Angela Chen, Wilfred Hass, Maryan Agyby

florawendy19 commented 1 year ago

Data analysis review checklist

Reviewer: florawendy19

Conflict of interest

Code of Conduct

General checks

Documentation

Code quality

Reproducibility

Analysis report

Estimated hours spent reviewing:

Review Comments:

Your project is very concise: it has all the sections, and the plots are well suited to the problem you are solving. Below are a few points that may help improve your already great project!

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

WilfHass commented 1 year ago

Data analysis review checklist

Reviewer: @WilfHass

Conflict of interest

Code of Conduct

General checks

Documentation

Code quality

Reproducibility

Analysis report

Estimated hours spent reviewing: 2

Review Comments:

Great project!! I thought the analysis was well done and thought out. I specifically enjoyed the discussion on Type 1 and 2 errors using the final estimator. I only have a few comments about the report/analysis and any discussion is very much welcome!

  1. Changing the output of report.Rmd to include both html and github_document, so that a quick glance on GitHub lets you check the report without having to clone the entire repo. Also, the README.md usage section could use code blocks (with 3 backticks) so users can copy and paste commands, and specifying where (in which directory) the user should run the scripts is important!
  2. Changing where the results folder is stored as I believe it shouldn't exist in the data folder. Same with the eda_results folder.
  3. Having all of the imports at the top of the file is beneficial as well to avoid breaking up the flow. I did enjoy the commenting and easy readability of the code!
  4. I think the introduction of the analysis could benefit from explaining credibility further and how it is calculated. I saw a little about how credibility is determined in the classes (where the "0" class means the client pays off the debt, etc.), but I didn't feel these key points were explicitly stated in the introduction/summary. Also, since credibility is not necessarily classified like this in the real world (people can pay back parts of their bill without defaulting), I'm wondering how the project could be extrapolated to using multiple classes and how that would affect the results of the analysis (in the "Further Improvements" section!).
  5. I also think a more in-depth explanation of why the Logistic Regression model was chosen over another (such as Ridge) as the final estimator would be helpful. On a similar note, a lot of models were tried, but the research question specified looking at feature importances, so I was wondering why models like kNN were included in the comparison, since they don't have interpretable feature importances.
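On point 1, one common way to get both formats from a single report.Rmd is to list them in the YAML front matter (a sketch; the rest of the header would stay as in the repo):

```yaml
output:
  github_document: default
  html_document: default
```

Rendering with `rmarkdown::render("doc/report.Rmd", output_format = "all")` then produces both the .md file that GitHub displays inline and the standalone .html file.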

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

marianagyby commented 1 year ago

Data analysis review checklist

Reviewer: @marianagyby

Conflict of interest

Code of Conduct

General checks

Documentation

Code quality

Reproducibility

Analysis report

Estimated hours spent reviewing: ~2

Review Comments:

Overall, the project is well thought out, organized, and thorough. The scripts are well documented and easy to read, and the report does a good job of taking the reader through the steps of your analysis. Very well done!

  1. The link to the final report provided in the ReadMe (https://github.com/UBC-MDS/Credit_Card_Default_Prediction_Group13/blob/Makefile_report_hw/doc/report.md) leads to a 404 error. I believe you want to use this link instead: https://github.com/UBC-MDS/Credit_Card_Default_Prediction_Group13/blob/main/doc/report.md

  2. The methods section and EDA in the report are very well-detailed, making it easy to understand the data set you are working with and the steps of your analysis.

  3. It may not be common knowledge what "default" means in terms of credibility, so it may be helpful to the reader to clearly define what is meant by the "default" class in the report. Does "default" mean that the person is credible or not credible?

  4. The tables and plots are explained well in the discussion, but it can be useful to also include table/figure captions that summarize what they are showing. I see that you have figure and table captions specified in your code chunk options in the report.Rmd file, but they seem to be missing in the rendered report.md and report.html files.

  5. The discussion of the results and suggested improvements is very insightful and shows that the model, despite its limitations, is promising with further improvements. Well done!

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

qurat-azim commented 1 year ago

Changes to project based on received feedback

We have received valuable feedback from our peers as well as the teaching team for our project. On behalf of our group members, I would like to share the improvements we have been able to make.

  1. @WilfHass and @marianagyby: Thank you very much for suggesting the need for clarity about the classes in our prediction problem. Your valuable feedback has helped us improve our explanation of the question at hand. We agree that our readers could benefit from more clarification of the classifications, and we have added details on this in the introduction section. This can be verified in this commit.
  2. @florawendy19 and @WilfHass: Thank you again for pointing out the project's structural arrangement issues. We agree with you that it is more appropriate to have the results folder outside of the data folder; this gives readers easier access to the results without having to sift through the entire data folder. To make these changes, we have updated the source and destination folders in all our script files. We have also made the necessary amendments to the Makefile to incorporate the restructuring. This is the link to these changes committed in our project repository.
  3. @danielramandi (TA): Thank you very much for the Milestone 2 feedback about improving our scripts and functions. This has helped us make positive changes to our project. We agree that the functions were prone to accepting faulty inputs. We have now added assert statements to ensure the inputs for the functions are in proper format. We have also added comments and documentation to ensure these parts of the code are understandable. One such improvement of our code is documented in this commit.
  4. @danielramandi (TA) : Thank you for mentioning in the Milestone 2 feedback that we have not used inline R code in our report. We have made the recommended edits in the following commit where we have updated the script to write analysis results and scores in a separate file, which we then use with inline commands in the report. The incorporated changes can be verified via this commit.
  5. @danielramandi (TA) and @Flor14: Thank you very much for bringing to our attention in Milestone 2 that the code could not run reproducibly due to missing dependencies, and that the usage section command wasn't working. We have analysed all our dependencies and agree that there were reproducibility issues with the dependencies we listed. We have now added an environment.yaml file in our repo to create an environment that should suffice for a user to reproduce the project, and we have fixed the package names and versions. The command line usage section has also been updated accordingly. The project has come a long way since, and some of the commits addressing these issues are linked here, here, and here.
  6. @danielramandi (TA): Thank you for your feedback regarding the need for improved documentation for our scripts. We have come to agree that more comments and details about the code would be a good idea. We have added more comments and documentation to our scripts. In particular, we have added a short description for each script about the main functioning. This commit details the changes made in this regard.
  7. @florawendy19 and @WilfHass: You have rightly mentioned that the report for our project was not rendering in the most helpful format. We have included alternative formats for better viewing. This change is documented in the form of a commit here.
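The assert-based input validation described in point 3 can be sketched as follows; the function and column names here are hypothetical illustrations of the pattern, not the project's actual code:

```python
import pandas as pd

def split_features_target(df, target="default"):
    """Separate features from the target column, failing fast on bad input."""
    # Validate inputs up front so faulty arguments raise a clear error
    # instead of producing silently wrong results downstream.
    assert isinstance(df, pd.DataFrame), "df must be a pandas DataFrame"
    assert target in df.columns, f"df is missing the target column '{target}'"
    assert len(df) > 0, "df must not be empty"
    return df.drop(columns=[target]), df[target]

X, y = split_features_target(pd.DataFrame({"age": [25, 40], "default": [0, 1]}))
print(list(X.columns), list(y))  # → ['age'] [0, 1]
```

Calling the function with a DataFrame that lacks the target column now raises an AssertionError with a readable message rather than an obscure KeyError deep inside pandas.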

Thank you again for all the help and feedback. We have learnt a great deal from you.