Submission: Group 01: Credit Approval Prediction

Submitting authors: spencergerlach, DMerigo, Mengjun74, RussDim

Repository: https://github.com/UBC-MDS/Credit_Approval_Prediction Report link: https://github.com/UBC-MDS/Credit_Approval_Prediction/blob/main/doc/credit-appr-predict-report.html Abstract/executive summary:

Getting approved for a credit card depends on a number of factors. Credit card companies and banks can leverage machine learning models to help make quick and accurate decisions about who should be approved, and who should not.

This analysis used a Credit (Card) Approval Dataset from the UC Irvine Machine Learning Repository. The objective of this analysis was to build a model that could accurately predict whether an applicant will be approved or not for a credit card, based on various information about the applicant.

We tested two classification models to help with our prediction: k-NN and Logistic Regression. After building, tuning, and scoring our models, the Logistic Regression model was found to perform the best with an accuracy of 0.86 (scored on unseen test data).

Editor: @flor14 Reviewer: Mehdi Naji Esfahani, Samson Bakos, Lisa Sequiera, Mohammad Reza Nabizadeh Shahrbabak

[x] I agree to abide by MDS's Code of Conduct during the review process and in maintaining my package should it be accepted.

Data analysis review checklist

Reviewer: @samson-bakos

Conflict of interest

[x] As the reviewer I confirm that I have no conflicts of interest for me to review this work.

Code of Conduct

[x] I confirm that I read and will adhere to the MDS code of conduct.

General checks

[x] Repository: Is the source code for this data analysis available? Is the repository well organized and easy to navigate?
[x] License: Does the repository contain a plain-text LICENSE file with the contents of an OSI approved software license?

Documentation

[x] Installation instructions: Is there a clearly stated list of dependencies?
[x] Example usage: Do the authors include examples of how to use the software to reproduce the data analysis?
[x] Functionality documentation: Is the core functionality of the data analysis software documented to a satisfactory level?

Scripts adapted from Tiffany are attributed, but attributes/ README descriptions could be added for other major packages (i.e. brief attribution to Scikit-learn and Altair). Documentation in scripts/Analysis is fine though
[x] Community guidelines: Are there clear guidelines for third parties wishing to 1) Contribute to the software 2) Report issues or problems with the software 3) Seek support

Code quality

[x] Readability: Are scripts, functions, objects, etc., well named? Is it relatively easy to understand the code?

little bit misleading that (according to ReadMe) the modelling and testing data are the same script with different args. I see there is a testing script that is not listed in usage? If this is not the intended script, modelling/ testing scripts could be abstracted

[x] Style guidelides: Does the code adhere to well known language style guides?
[x] Modularity: Is the code suitably abstracted into scripts and functions?
[ ] Tests: Are there automated tests or manual steps described so that the function of the software can be verified? Are they of sufficient quality to ensure software robsutness?

There are some try/excepts for file writing, but generally testing is not robust (I think this was optional though in the milestone description?)

Reproducibility

[x] Data: Is the raw data archived somewhere? Is it accessible?
[x] Computational methods: Is all the source code required for the data analysis available?
[ ] Conditions: Is there a record of the necessary conditions (software dependencies) needed to reproduce the analysis? Does there exist an easy way to obtain the computational environment needed to reproduce the analysis?

dependencies listed, but if not already present on reproducing device it is a little tedious to install individually. A .yaml environment file would be nice

[ ] Automation: Can someone other than the authors easily reproduce the entire data analysis?

tricky to do so in the current iteration because the scripts in the usage arguments require manually entry of file names, which is hard to do without an in depth understanding of the inputs and outputs of each script. Using default names in the scripts so that the analysis can be quickly replicated without analyzing source code would be helpful, possibly with an addendum stating that these file names can be renamed if desired if the user understands the overall pipeline flow.

Analysis report

[x] Authors: Does the report include a list of authors with their affiliations?
[x] What is the question: Do the authors clearly state the research question being asked?
[ ] Importance: Do the authors clearly state the importance for this research question?

question is well defined, but not exactly why it is significant

[x] Background: Do the authors provide sufficient background information so that readers can understand the report?
[x] Methods: Do the authors clearly describe and justify the methodology used in the data analysis? Do the authors communicate any assumptions or limitations of their methodologies?
[x] Results: Do the authors clearly communicate their findings through writing, tables and figures?

Theres are two "insert code here" based on table values. Should be parametrized, or read in from csv via pandas? Results are presented effectively. Given that there is relative class balance accuracy makes sense as a metric, but (as a very optional suggestion) it may still be interesting to include a confusion matrix to investigate the ratios of type 1 and type 2 error relative to one of the classes, to investigate if the model is more often rejecting people who should be accepted, or accepting people who should be rejected.

[x] Conclusions: Are the conclusions presented by the authors correct?

conclusions are correct, but limitations should include not just limitations of conclusions but also limitations of the modelling process itself. What could be done better/ what could be done next?. i.e. a limiting factor of the analysis is that it considers only linear relations to the target, and there could be hidden useful polynomial/ interaction features that could be extracted via polynomial transformation and recursive feature selection, which may improve the model. Don't have to actually do this, but should state this could be done

[x] References: Do all archival references that should have a DOI list one (e.g., papers, datasets, software)?

There are some papers cited in .bib that are not in the body of the report

[x] Writing quality: Is the writing of good quality, concise, engaging?

Estimated hours spent reviewing:

1.5

Review Comments:

Please provide more detailed feedback here on what was done particularly well, and what could be improved. It is especially important to elaborate on items that you were not able to check off in the list above.

For the sake of clarity, I included most of my detailed feedback under each check box both to be clear about what I was addressing, and for ease of review as I moved down the checklist without needing to jump back and forth. <Note for TAs checking this peer review, scroll up for the majority of my comments>

Overall, this has the bones of a very strong project with just a few things holding it back. It addresses a solid question, as automated credit approval prediction is a good use case for ML models. The analysis, while simple, is well thought out, well organized, and well executed. The overall structuring/ attributions of the project repo is very thorough and well done. The primary thing holding this project back are the challenges to manual reproduction of the data, but this could be easily remedied by providing scripts with default args in the usage section, and an environment file for easy computational environment reproduction/ dependency handling. Both things that should be handled automatically in future milestones with inclusion of programs like Docker :)

Small fix that could be made, but didn't fit under any bullet point: Milestone 1 release is currently marked as 'Latest' and appears first in Repo. Milestone 2 is available, but hidden. Switch Milestone 2 Release to Latest.

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

Data analysis review checklist

Reviewer: Mohammad Reza Nabizadeh @mrnabiz

Conflict of interest

[x] As the reviewer I confirm that I have no conflicts of interest for me to review this work.

Code of Conduct

[x] I confirm that I read and will adhere to the MDS code of conduct.

General checks

[x] Repository: Is the source code for this data analysis available? Is the repository well organized and easy to navigate?
[x] License: Does the repository contain a plain-text LICENSE file with the contents of an OSI approved software license?

Documentation

[x] Installation instructions: Is there a clearly stated list of dependencies?
[ ] Example usage: Do the authors include examples of how to use the software to reproduce the data analysis?
[x] Functionality documentation: Is the core functionality of the data analysis software documented to a satisfactory level?
[x] Community guidelines: Are there clear guidelines for third parties wishing to 1) Contribute to the software 2) Report issues or problems with the software 3) Seek support

Code quality

[x] Readability: Are scripts, functions, objects, etc., well named? Is it relatively easy to understand the code?
[x] Style guidelides: Does the code adhere to well known language style guides?
[x] Modularity: Is the code suitably abstracted into scripts and functions?
[ ] Tests: Are there automated tests or manual steps described so that the function of the software can be verified? Are they of sufficient quality to ensure software robsutness?

Reproducibility

[x] Data: Is the raw data archived somewhere? Is it accessible?
[x] Computational methods: Is all the source code required for the data analysis available?
[x] Conditions: Is there a record of the necessary conditions (software dependencies) needed to reproduce the analysis? Does there exist an easy way to obtain the computational environment needed to reproduce the analysis?
[ ] Automation: Can someone other than the authors easily reproduce the entire data analysis?

Analysis report

[x] Authors: Does the report include a list of authors with their affiliations?
[x] What is the question: Do the authors clearly state the research question being asked?
[x] Importance: Do the authors clearly state the importance for this research question?
[x] Background: Do the authors provide sufficient background information so that readers can understand the report?
[ ] Methods: Do the authors clearly describe and justify the methodology used in the data analysis? Do the authors communicate any assumptions or limitations of their methodologies?
[x] Results: Do the authors clearly communicate their findings through writing, tables and figures?
[x] Conclusions: Are the conclusions presented by the authors correct?
[x] References: Do all archival references that should have a DOI list one (e.g., papers, datasets, software)?
[x] Writing quality: Is the writing of good quality, concise, engaging?

Estimated hours spent reviewing:

2 hours.

Review Comments:

Making an environment YAML file would make the project more accessible and easier to reproduce.
In terms of folder organization, I suggest keeping all the data in the data folder and all the scripts in the src folder. There is one data file in the src folder.
The usage codes could be improved by using some prefilled values like: python src/download_data.py --url="https://archive.ics.uci.edu/ml/machine-learning-databases/credit-screening/crx.data" --out_path = "data/raw"
The codes are well-written and well-commented. Great Job!
There are some deprecated scripts which the could be tracked through the version control. I guess there is no need to have them in the script.
The EDA analysis in thorough well-adressed. The null values replacement is handled elegantly.
In the model and hyper parameter optimzations, increasing the number of itterrations for the logistics regression model will be helpful.
In the results section, I suggest explaining more about the context of the credit approval interpreting how 85% accuracy is helpful or not.

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

Data analysis review checklist

Reviewer: Lisa Sequeira @LisaSeq

Conflict of interest

[x] As the reviewer I confirm that I have no conflicts of interest for me to review this work.

Code of Conduct

[x] I confirm that I read and will adhere to the MDS code of conduct.

General checks

[x] Repository: Is the source code for this data analysis available? Is the repository well organized and easy to navigate?
[x] License: Does the repository contain a plain-text LICENSE file with the contents of an OSI approved software license?

Documentation

[x] Installation instructions: Is there a clearly stated list of dependencies?
[x] Example usage: Do the authors include examples of how to use the software to reproduce the data analysis?
[x] Functionality documentation: Is the core functionality of the data analysis software documented to a satisfactory level?
[x] Community guidelines: Are there clear guidelines for third parties wishing to 1) Contribute to the software 2) Report issues or problems with the software 3) Seek support

Code quality

[x] Readability: Are scripts, functions, objects, etc., well named? Is it relatively easy to understand the code?
[x] Style guidelines: Does the code adhere to well known language style guides?
[x] Modularity: Is the code suitably abstracted into scripts and functions?
[ ] Tests: Are there automated tests or manual steps described so that the function of the software can be verified? Are they of sufficient quality to ensure software robustness? LS: See below.

Reproducibility

[x] Data: Is the raw data archived somewhere? Is it accessible?
[x] Computational methods: Is all the source code required for the data analysis available?
[x] Conditions: Is there a record of the necessary conditions (software dependencies) needed to reproduce the analysis? Does there exist an easy way to obtain the computational environment needed to reproduce the analysis?
[x] Automation: Can someone other than the authors easily reproduce the entire data analysis?

Analysis report

[x] Authors: Does the report include a list of authors with their affiliations?
[x] What is the question: Do the authors clearly state the research question being asked?
[x] Importance: Do the authors clearly state the importance for this research question?
[x] Background: Do the authors provide sufficient background information so that readers can understand the report?
[ ] Methods: Do the authors clearly describe and justify the methodology used in the data analysis? Do the authors communicate any assumptions or limitations of their methodologies? LS: See comments below.
[x] Results: Do the authors clearly communicate their findings through writing, tables and figures?
[ ] Conclusions: Are the conclusions presented by the authors correct? LS: See below, related to scoring and limitations.
[x] References: Do all archival references that should have a DOI list one (e.g., papers, datasets, software)?
[x] Writing quality: Is the writing of good quality, concise, engaging?

Estimated hours spent reviewing: 2 hours

Review Comments:

Much like the earlier reviewers, I feel like this project was very well laid out and has great potential! A question to consider is, given you don't understand what features have been provided (A1 to A16), how would one go about interpreting your model? Perhaps something worth mentioning as a limitation.
Also I'm curious given your final analysis on your model's limitations, why you didn't use a confusion matrix to see how many false positives or false negatives (or similar scoring metrics) to better evaluate its performance? It might also be a good idea to clarify directly in the analysis report what type of scoring metric you used for the final evaluations of the models, and why you used it.
In your analysis you selected two models to start off, why didn't you use a decision tree or any other classifiers? It would be great if this too could be explained.
Overall very nicely laid out folder structure, I would recommend removing any unused files such as .DS_store. Also as noted by the earlier reviewers there is a data file (.crx) in your src folder.
For a future release, you might consider adding some tests to show your code is working as expected and for future users.
For the YAML file, please test this. The structure looks similar to a file we had previously in which the dependencies were generated without using the recommend code from DSCI 521 Lecture 6 and resulted in quite some confusion within our group. In order to accommodate differences between operating systems this file should be generated using the --from-history flag.

A few minor comments below (nice to have):

In order for your project to continue to reflect the professionalism it has displayed throughout, it might be good to review any missing or broken links (I noticed one or two missing links in your analysis report, one of them mentioned by a reviewer above).
It would also be nice to have a version of the final analysis report that can be rendered directly in GitHub (perhaps a .md version?).
Additional visual graphs can be added to the analysis report to keep the reader engaged in reviewing the scoring metrics or to get your overall messages to be more impactful.

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

Five listed pieces of feedback our group agreed with and addressed. All others were addressed, but the following are those with the largest impact on the final product. For some pieces of feedback, there were ongoing commits throughout the week, so only the most relevant commits are linked below, rather than linking 5+ commits for some of these items.

Create an environment file to improve reproducibility (peer comment): link to latest related commit
Fix references used in report (peer comment): link to latest related commit for report and reference update
Expand sections of the report, including explanation of why we chose certain scoring metrics, and to expand upon various suggested limitations in the analysis (peer comment): link to related commit
Fixed major issues in the Usage section, scripting commands updated to correct commands (peer comment): link to commit for new usage section - this commit relates to the new Usage section with the makefile instructions. link to Makefile commit with improved scripting commands (this is not the latest commit related to the makefile).
Split long scripts/functions into smaller functions (mostly applies to the model creation script): link to related commit

UBC-MDS / data-analysis-review-2022