UBC-MDS / data-analysis-review-2021


Submission: GROUP 18: Credit Card Default Prediction #31

Open ciciecho-ds opened 2 years ago

ciciecho-ds commented 2 years ago

Submitting authors: @jamesktkim @Davidwang11 @ciciecho-ds @garhwalinauna

Repository: https://github.com/UBC-MDS/Credit-Card-Default-Prediction Report link: https://github.com/UBC-MDS/Credit-Card-Default-Prediction/blob/main/reports/_build/pdf/book.pdf

Abstract/executive summary: In this project, we attempt to build a classification model to predict whether a credit card customer is likely to default or not. Our research question is: given characteristics and payment history of a customer, is he or she likely to default on the credit card payment next month?

Our dataset contains 30,000 observations and 23 features, with no missing values. It was put together by I-Cheng Yeh at the Department of Information Management, Chung Hua University. We obtained this data from the UCI Machine Learning Repository. After training and evaluating different classification models, we selected and tuned a logistic regression model, which resulted in an AUC of 0.768.

Editor: @flor14
Reviewers: @jennifer-hoang @gutermanyair @aimee0317 @zackt113

gutermanyair commented 2 years ago

Peer Review

Reviewer: @gutermanyair

Conflict of interest

Code of Conduct

General checks

Documentation

Code quality

Reproducibility

Analysis report

Estimated hours spent reviewing:

1.5

Review Comments:

Please provide more detailed feedback here on what was done particularly well, and what could be improved. It is especially important to elaborate on items that you were not able to check off in the list above.

  1. First, your names are missing from the final report/book. I am sure you all worked very hard on this report, so let the world know who you are and that this is your work!

  2. I really like the plots you created for your EDA, but I wish you went into more detail about the conclusions you drew from them. You explained what EDA you were doing but did not discuss what you concluded from it. A suggestion would be to describe what you learned about some of the features in your data.

  3. I want to know more about why you decided to use ROC AUC as your scoring metric. Why did you choose it over all the others listed in your table? Maybe you could add a short paragraph under the table explaining your thought process.

  4. Something I found myself wanting as I read your report is a short conclusion about what is going on in each of your plots/figures. For someone who isn't too familiar with the plots you are using, it would be very helpful to explain the conclusions you are drawing from each one. I know you included the plots to strengthen your report, and they do, but briefly summarizing the takeaway below each one would strengthen it even more.

  5. As far as your code and project reproducibility go, everything looks great to me! A job well done! I love how you included many tests, as this is great practice and makes me very confident in your code.

  6. Overall, I think the only place for improvement is the report itself. I find myself wanting more background on why you are doing certain things in your analysis. Adding short mentions of why you are taking certain steps and making certain decisions would go a long way toward improving the report. For example, you could go into more detail about why you are applying certain transformers to your data, or tell us a bit more about why you are tuning the model's hyperparameters, over what range of values, and why you picked that range (see the sketch after this list).

  7. Great work, group 18! I did really enjoy reviewing your project, keep up the good work!
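
To make point 6 concrete, here is a minimal sketch of the kind of search I have in mind. I am guessing at the details: the tuned parameter `C`, the log-uniform range, the use of `RandomizedSearchCV`, and the `X_train_enc` / `y_train` names are all my assumptions, not necessarily what you did.

```python
from scipy.stats import loguniform
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV

# Hypothetical search space: C is the inverse regularization strength, so a
# log-uniform range covers several orders of magnitude evenly.
param_dist = {"C": loguniform(1e-3, 1e3)}

search = RandomizedSearchCV(
    LogisticRegression(class_weight="balanced", max_iter=1000),
    param_distributions=param_dist,
    n_iter=20,
    scoring="roc_auc",
    cv=5,
    random_state=123,
)
# search.fit(X_train_enc, y_train)  # placeholders for the preprocessed training data
```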

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

zackt113 commented 2 years ago

Reviewer: @zackt113

Conflict of interest

Code of Conduct

General checks

Documentation

Code quality

Reproducibility

Analysis report

Estimated hours spent reviewing:

1.75 hrs

Review Comments:

Overall, well done, Group 18! I like the idea of credit card default prediction, very relevant to the banking industry and our daily life.

  1. The names were missing from the report. Please include your names in the report; the work is great, so give yourselves credit!
  2. In the section on choosing the best model, it was great to include diverse models and a variety of scoring metrics, but I think it would be better to provide some details and context on why you chose those metrics instead of using only one of them, such as recall. Also, in the end you chose logistic regression as the final model for hyper-parameter optimization. What is the reason for choosing ROC AUC as the final determinant instead of recall? I suspect it is because of the class imbalance issue, since we have been taught that ROC AUC is better suited to that circumstance (see the evaluation sketch after this list).
  3. I think it would be great to include train scores for those models as well, as I am wondering whether the model has an overfitting or underfitting problem (the evaluation sketch after this list shows one way to return them).
  4. As the model takes a lot of features into account, I think it would be interesting to carry out feature selection to reduce the number of features and avoid overfitting or expensive data collection costs. For example, Recursive Feature Elimination would be a good start (see the RFE sketch after this list).
  5. As a reader, I would appreciate a table listing the magnitude of each feature's coefficient, so I could draw some interesting findings from it. I would be interested to know which factor is the main driver of credit card default and in which direction it drives the prediction (see the coefficient-table sketch after this list).
  6. The reproducibility is great as I have no issue running the code.
  7. Another thing that could be improved is the naming of files. For example, the final report is named "book.pdf"; it would be better to include your project name or main findings, just like you did with the repo name, which was informative!
  8. One last point: it might be better to discuss your future plans for the project in the reservations and suggestions section. For example, you mentioned that it would be better if more relevant features could be included and that the data set is a bit outdated. How would you deal with these issues? Any preliminary or tentative plan? Those limitations are very good points, so I would like to hear more about how you would address them in the future!
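
To make points 2 and 3 concrete, here is a minimal sketch of reporting several metrics and the train scores side by side. The pipeline object `pipe` and the data names `X_train` / `y_train` are placeholders for whatever your scripts actually use:

```python
import pandas as pd
from sklearn.model_selection import cross_validate

# Score the candidate model with several metrics at once; recall and
# ROC AUC are both informative when the classes are imbalanced.
scores = cross_validate(
    pipe,                     # placeholder: your (unfitted) model pipeline
    X_train, y_train,         # placeholders for the training split
    scoring=["accuracy", "precision", "recall", "roc_auc"],
    cv=5,
    return_train_score=True,  # exposes train scores to check for over/underfitting
)
print(pd.DataFrame(scores).agg(["mean", "std"]).T)
```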
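For point 4, a rough sketch of Recursive Feature Elimination with cross-validation; `RFECV`, the logistic regression estimator, and the preprocessed feature matrix name are assumptions on my part:

```python
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression

# RFECV repeatedly drops the weakest features (by coefficient magnitude)
# and keeps the subset with the best cross-validated ROC AUC.
selector = RFECV(
    LogisticRegression(class_weight="balanced", max_iter=1000),
    step=1,
    cv=5,
    scoring="roc_auc",
)
# selector.fit(X_train_enc, y_train)  # X_train_enc: preprocessed numeric features (placeholder)
# print(selector.n_features_)         # number of features kept
```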
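And for point 5, a sketch of the kind of coefficient table I have in mind. It assumes the fitted pipeline's steps are named "columntransformer" and "logisticregression" (as `make_pipeline` would name them); your actual step names may differ:

```python
import pandas as pd

# Pair each transformed feature name with its logistic regression coefficient
# and sort by absolute value to surface the main drivers of default.
# Step names below assume make_pipeline defaults; adjust to your pipeline.
feature_names = pipe.named_steps["columntransformer"].get_feature_names_out()
coefs = pipe.named_steps["logisticregression"].coef_.flatten()

coef_table = (
    pd.DataFrame({"feature": feature_names, "coefficient": coefs})
    .assign(abs_coef=lambda d: d["coefficient"].abs())
    .sort_values("abs_coef", ascending=False)
)
print(coef_table.head(10))  # positive coefficients push predictions toward default
```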

Again, fairly good work! And I got some great ideas for our project after reading yours! Thank you!

Zack

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

aimee0317 commented 2 years ago

Peer Review Checklist

Reviewer: @aimee0317

Conflict of interest

Code of Conduct

General checks

Documentation

Code quality

Reproducibility

Analysis report

Estimated hours spent reviewing:

1.5 hours

Review Comments:

Overall, you've done a fantastic job and the topic itself is intriguing. Here are my comments:

  1. Your names were missing from the report. It would be a great idea to give yourselves credit for working on the project and successfully completing two milestones.
  2. In the initial EDA, it would be better to talk about the general strategy for handling the class imbalance issue rather than being specific about setting the parameter to "balanced." We cannot reasonably assume all readers know that the sklearn package will be used; some might be perplexed by the syntax you are referring to. If you do want to mention the sklearn package and the parameter early, it might be a good idea to elaborate on what setting class_weight to "balanced" entails (see the sketch after this list).
  3. It looks like the resolution of the EDA plots is a bit lower than ideal, because I cannot see the labels clearly. For instance, in Fig 2, when I look at the marriage subplot, I cannot clearly see each level of the marriage feature. My group received comments from the TA regarding plot resolution, so I think it is worth pointing out here as well. From my understanding, Altair applies a default scale factor when exporting, and there might be a way to increase it (see the export sketch after this list).
  4. This point is very minor, but one of your titles is "splitting and cleaning the model." I believe you meant "splitting and cleaning the data." While we can reasonably assume that people in our program know what you meant, it might be helpful to make the titles more accurate so people without in-depth knowledge can understand what you are doing a bit better.
  5. It would be helpful to show the training scores. Although we care more about the test scores, by looking at the training scores, we can infer whether we have underfitting or overfitting issues.
  6. In the result section, it might be worth elaborating a bit more on the interpretation of the ROC curve as well as the confusion matrix in the context of your question and model.
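
To illustrate point 2, a small sketch of what class_weight="balanced" entails in scikit-learn: each class is reweighted inversely to its frequency, so the minority (default) class counts for more during training. The `y_train` name is a placeholder:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.utils.class_weight import compute_class_weight

# With class_weight="balanced", each class c gets the weight
#   n_samples / (n_classes * count_c),
# which up-weights the rarer "default" class during fitting.
model = LogisticRegression(class_weight="balanced", max_iter=1000)

# The same weights can be computed explicitly and quoted in the report:
# weights = compute_class_weight("balanced", classes=np.unique(y_train), y=y_train)
```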
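And for point 3, if the figures are exported with Altair, passing a larger scale_factor to Chart.save may help with readability. The chart and file name below are hypothetical:

```python
import altair as alt
import pandas as pd

# Hypothetical stand-in for one of the EDA charts.
chart = (
    alt.Chart(pd.DataFrame({"marriage": ["married", "single", "other"], "count": [3, 5, 1]}))
    .mark_bar()
    .encode(x="marriage", y="count")
)

# A larger scale_factor exports the PNG at higher resolution,
# which keeps the axis labels legible in the rendered report.
# chart.save("eda_marriage.png", scale_factor=3.0)
```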

Again, overall, I think all of you did a wonderful job! I am looking forward to seeing how this project develops towards the end of our course!

Best, Amelia

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

jennifer-hoang commented 2 years ago

Data analysis review checklist

Reviewer: @jennifer-hoang

Conflict of interest

Code of Conduct

General checks

Documentation

Code quality

Reproducibility

Analysis report

Estimated hours spent reviewing: 1.5 hr

Review Comments:

Excellent work, group 18! Your project and scripts were very organized and well-documented, and I learned a lot from reviewing them. I have only a few suggestions regarding the analysis report:

Overall, this project was really well done!

Best, Jennifer

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

jamesktkim commented 2 years ago

Hi Everyone,

Thanks for all the valuable feedback. We have reviewed all of your feedback, in addition to that from the TAs and the instructor, and have implemented the following changes:

1/ Names were missing in the final report but they are added now (Peer) https://github.com/UBC-MDS/Credit-Card-Default-Prediction/commit/bdd3d3436cee70c3b3043bd6afc5f20f2ec3760b

2/ Paragraph about reasons for choosing ROC_AUC as the metric is added now (Peer) https://github.com/UBC-MDS/Credit-Card-Default-Prediction/commit/d062a9bcee57c66451a101bd785ff68f15d4f837

3/ One of the subtitles of the report is fixed as “splitting and cleaning the data” (Peer) https://github.com/UBC-MDS/Credit-Card-Default-Prediction/commit/d062a9bcee57c66451a101bd785ff68f15d4f837

4/ The figures / tables are now referenced in the report with numbers (Peer & TA) https://github.com/UBC-MDS/Credit-Card-Default-Prediction/commit/606ce0847f6d9e9e9c00df3299c5a8e8ba779d2a

5/ Train scores are added in the result (Peer) https://github.com/UBC-MDS/Credit-Card-Default-Prediction/commit/456429ed68cdd462d86085d07c7f9ce7bba337c1

6/ Model coefficients are added in the result to show which features were most useful (Peer) https://github.com/UBC-MDS/Credit-Card-Default-Prediction/commit/cee0bfae26d1a46f64ecb2493765184b28d04bc0

7/ Our names are added in the license file (Florencia) https://github.com/UBC-MDS/Credit-Card-Default-Prediction/commit/b9f93d3d8ddcab3ef3eeb6c6894c9e45a2e91456

8/ Our emails are added in the code of conduct file (Florencia) https://github.com/UBC-MDS/Credit-Card-Default-Prediction/commit/f37511ef09104e31098d2858b15266cc79dec9c7

9/ Executive summary is added in the report (Florencia) https://github.com/UBC-MDS/Credit-Card-Default-Prediction/commit/c14627af6fafcb31aeae6966897454d7e939bbad

Thanks again for your constructive feedback, which greatly helped us improve our project and the final report. If there are any other issues or concerns, please do not hesitate to let us know.