
Submission: Group 01: Credit Approval Prediction #15

spencergerlach opened 1 year ago

spencergerlach commented 1 year ago

Submitting authors: spencergerlach, DMerigo, Mengjun74, RussDim

Repository: https://github.com/UBC-MDS/Credit_Approval_Prediction

Report link: https://github.com/UBC-MDS/Credit_Approval_Prediction/blob/main/doc/credit-appr-predict-report.html

Abstract/executive summary:

Getting approved for a credit card depends on a number of factors. Credit card companies and banks can leverage machine learning models to help make quick and accurate decisions about who should be approved, and who should not.

This analysis used a Credit (Card) Approval Dataset from the UC Irvine Machine Learning Repository. The objective of this analysis was to build a model that could accurately predict whether an applicant will be approved for a credit card, based on information about the applicant.

We tested two classification models to help with our prediction: k-NN and Logistic Regression. After building, tuning, and scoring our models, we found that the Logistic Regression model performed best, with an accuracy of 0.86 (scored on unseen test data).

Editor: @flor14

Reviewers: Mehdi Naji Esfahani, Samson Bakos, Lisa Sequeira, Mohammad Reza Nabizadeh Shahrbabak

samson-bakos commented 1 year ago

Data analysis review checklist

Reviewer: @samson-bakos

Conflict of interest

Code of Conduct

General checks

Documentation

Code quality

It is a little misleading that (according to the README) the modelling and testing steps are the same script run with different args. I see there is a testing script that is not listed in Usage? If this is not the intended script, the modelling/testing scripts could be abstracted.

There are some try/excepts for file writing, but generally testing is not robust (though I think this was optional in the milestone description?).
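For what it's worth, here is a minimal sketch of the kind of helper plus test I have in mind; the function name, columns, and paths are made up for illustration, not taken from the project:

```python
import os
import pandas as pd

def save_results(df: pd.DataFrame, out_path: str) -> None:
    """Write a results table to CSV, creating the output directory if needed."""
    out_dir = os.path.dirname(out_path)
    if out_dir:
        os.makedirs(out_dir, exist_ok=True)
    df.to_csv(out_path, index=False)

def test_save_results(tmp_path):
    # pytest's tmp_path fixture provides an empty temporary directory
    df = pd.DataFrame({"model": ["logreg"], "accuracy": [0.86]})
    out_file = tmp_path / "results" / "scores.csv"
    save_results(df, str(out_file))
    assert out_file.exists()
```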

Reproducibility

Dependencies are listed, but if they are not already present on the reproducing device it is a little tedious to install them individually. A .yaml environment file would be nice.
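Something along these lines would do; the package list below is illustrative only, not the project's actual dependencies:

```yaml
# environment.yaml (illustrative; pin to the packages/versions the project really uses)
name: credit-approval
channels:
  - conda-forge
dependencies:
  - python=3.10
  - pandas
  - scikit-learn
  - pip
```

Reproducers could then run conda env create -f environment.yaml instead of installing each dependency by hand.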

It is tricky to do so in the current iteration because the scripts in the Usage section require manual entry of file names, which is hard to do without an in-depth understanding of the inputs and outputs of each script. Using default names in the scripts, so that the analysis can be quickly replicated without analyzing source code, would be helpful, possibly with a note stating that these file names can be changed if the user understands the overall pipeline flow.
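For example, if the scripts parse arguments with argparse, giving each flag a sensible default means the documented command runs as-is (the default output file name below is hypothetical):

```python
import argparse

parser = argparse.ArgumentParser(description="Download the raw credit approval data.")
parser.add_argument(
    "--url",
    default="https://archive.ics.uci.edu/ml/machine-learning-databases/credit-screening/crx.data",
    help="URL of the raw data file",
)
parser.add_argument(
    "--out_path",
    default="data/raw/crx.data",
    help="Where to write the downloaded file",
)
args = parser.parse_args()
```

With defaults in place, python src/download_data.py works on its own, and the flags remain available for anyone who wants to rename the inputs/outputs.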

Analysis report

The question is well defined, but it is not made entirely clear why it is significant.

There are two "insert code here" placeholders based on table values; these should be parametrized, or read in from the csv via pandas.

Results are presented effectively. Given that there is relative class balance, accuracy makes sense as a metric, but (as a very optional suggestion) it may still be interesting to include a confusion matrix to investigate the ratios of type 1 and type 2 errors relative to one of the classes, i.e. whether the model is more often rejecting people who should be accepted, or accepting people who should be rejected.
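As a rough sketch of both suggestions; the file names and column names here are assumptions, not the project's actual ones:

```python
import pandas as pd
from sklearn.metrics import ConfusionMatrixDisplay

# Pull reported numbers from the saved results instead of hard-coding them in the report
scores = pd.read_csv("results/model_scores.csv")
logreg_accuracy = scores.loc[scores["model"] == "logistic_regression", "test_score"].iloc[0]

# Optional: a confusion matrix to see which way the errors lean
preds = pd.read_csv("results/test_predictions.csv")
ConfusionMatrixDisplay.from_predictions(preds["actual"], preds["predicted"])
```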

Conclusions are correct, but limitations should include not just limitations of the conclusions but also limitations of the modelling process itself: what could be done better, and what could be done next? For example, a limiting factor of the analysis is that it considers only linear relationships to the target, and there could be hidden, useful polynomial/interaction features that could be extracted via polynomial transformation and recursive feature selection, which might improve the model. You don't have to actually do this, but you should state that it could be done.
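To be concrete, the kind of extension I mean is roughly the following, assuming the features have already been encoded as numeric; it is purely illustrative and not something the report needs to actually run:

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression

# Expand to polynomial/interaction terms, then recursively prune back to the useful ones
pipe = make_pipeline(
    StandardScaler(),
    PolynomialFeatures(degree=2, include_bias=False),
    RFECV(LogisticRegression(max_iter=2000), cv=5, scoring="accuracy"),
)
```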

There are some papers in the .bib file that are not cited in the body of the report.

Estimated hours spent reviewing:

1.5

Review Comments:

Please provide more detailed feedback here on what was done particularly well, and what could be improved. It is especially important to elaborate on items that you were not able to check off in the list above.

For the sake of clarity, I included most of my detailed feedback under each check box, both to be clear about what I was addressing and for ease of review as I moved down the checklist without needing to jump back and forth. (Note for TAs checking this peer review: scroll up for the majority of my comments.)

Overall, this has the bones of a very strong project with just a few things holding it back. It addresses a solid question, as automated credit approval prediction is a good use case for ML models. The analysis, while simple, is well thought out, well organized, and well executed. The overall structuring/attribution of the project repo is very thorough and well done. The primary thing holding this project back is the difficulty of manually reproducing the analysis, but this could be easily remedied by providing scripts with default args in the Usage section, and an environment file for easy computational environment reproduction/dependency handling. Both are things that should be handled automatically in future milestones with the inclusion of tools like Docker :)

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

mrnabiz commented 1 year ago

Data analysis review checklist

Reviewer: Mohammad Reza Nabizadeh @mrnabiz

Conflict of interest

Code of Conduct

General checks

Documentation

Code quality

Reproducibility

Analysis report

Estimated hours spent reviewing:

2 hours.

Review Comments:

  1. Making an environment YAML file would make the project more accessible and easier to reproduce.
  2. In terms of folder organization, I suggest keeping all the data in the data folder and all the scripts in the src folder. There is one data file in the src folder.
  3. The usage commands could be improved by using some prefilled values, like: python src/download_data.py --url="https://archive.ics.uci.edu/ml/machine-learning-databases/credit-screening/crx.data" --out_path="data/raw"
  4. The code is well-written and well-commented. Great job!
  5. There are some deprecated scripts, which could be tracked through version control; I guess there is no need to keep them in the repository.
  6. The EDA analysis is thorough and well-addressed. The null value replacement is handled elegantly.
  7. In the model and hyperparameter optimization, increasing the number of iterations for the logistic regression model would be helpful (see the sketch after this list).
  8. In the results section, I suggest explaining more about the context of credit approval and interpreting whether 85% accuracy is helpful or not.
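For point 7, the change is just passing a larger max_iter so the solver converges during tuning; the exact value below is an assumption:

```python
from sklearn.linear_model import LogisticRegression

# The default max_iter=100 often triggers convergence warnings; a larger cap avoids them
logreg = LogisticRegression(max_iter=2000)
```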

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

LisaSeq commented 1 year ago

Data analysis review checklist

Reviewer: Lisa Sequeira @LisaSeq

Conflict of interest

Code of Conduct

General checks

Documentation

Code quality

Reproducibility

Analysis report

Estimated hours spent reviewing: 2 hours

Review Comments:

A few minor comments below (nice to have):

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

spencergerlach commented 1 year ago

Below are five pieces of feedback our group agreed with and addressed. All other feedback was addressed as well, but the following items had the largest impact on the final product. For some pieces of feedback there were ongoing commits throughout the week, so only the most relevant commits are linked below, rather than linking 5+ commits for some of these items.

  1. Create an environment file to improve reproducibility (peer comment): link to latest related commit
  2. Fix references used in report (peer comment): link to latest related commit for report and reference update
  3. Expand sections of the report, including an explanation of why we chose certain scoring metrics, and expand upon various suggested limitations in the analysis (peer comment): link to related commit
  4. Fix major issues in the Usage section and update the scripting commands to the correct commands (peer comment): link to commit for the new Usage section (this commit relates to the new Usage section with the makefile instructions); link to Makefile commit with improved scripting commands (this is not the latest commit related to the makefile).
  5. Split long scripts/functions into smaller functions (mostly applies to the model creation script): link to related commit