DSCI-310-2024 / data-analysis-review-2024


Submission: Group 12: Predictive Modelling: German Credit Risk #12

Open ttimbers opened 3 months ago

ttimbers commented 3 months ago

Submitting authors: Shahrukh Islam Prithibi, Sophie Yang, Yovindu Don, Jade Bouchard

Repository: https://github.com/DSCI-310-2024/DSCI310_Group-12_Credit-Risk-Classification/releases/tag/v2.0.0

Abstract/executive summary:

The goal of our analysis is to classify whether someone is a good or bad credit risk using attributes such as Credit History, Duration, and Residence. Our best-performing model is a Random Forest model. This model gave us an accuracy of 0.8 on unseen data, a decent result compared to the dummy model's accuracy of 0.7. We also obtained a precision score of 0.8, a recall score of 0.95, and an F1 score of 0.87. Our model performs decently well in terms of identifying people who are a good credit risk. However, if this model is to have a hand in real-world decision-making, precision should be improved to minimize classifying poor credit risks as good credit risks (false positives). In addition, more research should be done to ensure the model produces fair and equitable recommendations.

Editor: @ttimbers

Reviewers: Lesley Mai, Calvin Choi, Anna Czarnocka, Charles Benkard

calvinyhchoi commented 2 months ago

Data analysis review checklist

Reviewer: calvinyhchoi

Conflict of interest

Code of Conduct

General checks

Documentation

Code quality

Reproducibility

Analysis report

Time spent reviewing: ~1.5 hours

Comments:

I think you have a very well-put-together report and repository. As someone who has had a tough time applying the course material to my own project, I was very impressed by the overall robustness of this one. Take my comments with a grain of salt, as they are a bit nitpicky.

This was derived from the JOSE review checklist and the ROpenSci review checklist.

lesleymai commented 2 months ago

Data analysis review checklist

Reviewer: @lesleymai

Conflict of interest

Code of Conduct

General checks

Documentation

Code quality

Reproducibility

Analysis report

Estimated hours spent reviewing: 3h

Review Comments:

Hi Group 12! Congratulations on the impressive work on this comprehensive analysis. Your thorough approach to understanding and predicting German credit risk demonstrates strong analytical skills.

All the best! :)

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

AnnaCzarnocka commented 2 months ago

Data analysis review checklist

Reviewer: @AnnaCzarnocka

Conflict of interest

Code of Conduct

General checks

Documentation

Code quality

Reproducibility

Analysis report

Estimated hours spent reviewing: 2h

Review Comments:

Strong points:

  1. Original, interesting, and professional choice of topic and model. I think this project will fit very well in your future portfolios when applying to companies.

  2. The repository structure is logical and user-friendly, with clear separations between data, scripts, and documentation. For instance, the scripts folder contains well-named Python files like model.py, which intuitively suggests content related to model building. This organization enhances readability and navigability.

  3. The README.md documentation file is comprehensive, detailing the purpose, installation, and usage of the project. The inclusion of dependencies is a best practice, ensuring that anyone trying to replicate the analysis can do so without ambiguity.

  4. Automated test scripts found in the tests directory are indicative of a mature development process. For example, test_data_preprocessing.py is designed to ensure preprocessing steps are producing the expected transformations.

Areas for Improvement:

  1. The data directory is a core part of your project repository, and you have a lot of data files stored there. Therefore, to enhance clarity, you could consider adding a README file within the 'data' directory, detailing the structure of the data, the source of the data, and any preprocessing steps that have been applied. This would greatly aid reproducibility and help fellow researchers understand how to work with the data in their analyses.

  2. The test suite appears to me to be a clear strong point; however, ensuring consistent test coverage across all features would be ideal. For example, the script column_histogram.py could be accompanied by more comprehensive tests in test_column_histogram.py, checking for edge cases and unexpected inputs.

  3. While the code is well-structured, more in-depth inline documentation within scripts like data_preprocessing.py would greatly benefit users. Explaining the "why" behind a code block helps users and contributors understand the reasoning and could assist in debugging or further development.

  4. Some scripts, like model.py, could benefit from breaking larger functions down into smaller, more focused ones. This not only helps in understanding and maintaining the code but also facilitates unit testing. For instance, if the model training and evaluation phases are separated into distinct functions, they can be tested and debugged more efficiently. If there is a single function that performs both model fitting and evaluation, you could consider splitting it into fit_model() and evaluate_model(); see the first sketch after this list.

  5. Your src directory is neatly organized, which is commendable. However, it would be beneficial to include a README file that provides an overview of the scripts contained within, along with a brief description of each script's purpose. This would serve as a guide for users navigating through the codebase, improving the overall understandability of your project.

  6. Similarly, in the tests directory, it would be constructive to include a document explaining the testing strategy employed, alongside a summary of the results. You could add a README that describes what each test script is intended to check, and perhaps include sample output of the tests running successfully. For example, a script such as test_data_integrity.py could include assertions ensuring that no data leakage has occurred and that data types are consistent after cleanup. Documenting this will give other developers and reviewers confidence in the stability and reliability of the codebase. In test_data_preprocessing.py, you could also add a docstring at the beginning of each test function, for example a test_missing_values_handled() test whose docstring explains that it checks there are no NaNs left in the dataset after preprocessing; see the second sketch after this list.

  7. Your script cleaning_and_eda.py appears to handle data cleaning and exploratory data analysis, which are crucial steps. One recommendation would be to include more detailed inline comments explaining the rationale behind each data cleaning step and EDA process. For instance, if you are replacing missing values or encoding categorical variables, provide a brief explanation in the code comments of why you chose that specific method. This helps readers understand the decisions that shaped the data preprocessing pipeline.

  8. It's great to see that your Random Forest model performs well. However, in model.py, consider providing additional commentary or a supplementary document explaining how you interpret the model's decisions, especially the feature importances that drive its predictions. This would enhance the transparency and trustworthiness of the model, which is crucial for models used in financial decision-making. For example, after fitting the model, you could include code that plots feature importances and discuss, in the comments or an accompanying document, how each feature might be affecting credit risk predictions; see the third sketch after this list.
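To make the suggestion in point 4 more concrete, here is a minimal sketch of what separating fitting from evaluation could look like. The names fit_model() and evaluate_model(), the "good" positive label, and the hyperparameters are placeholders for illustration, not the actual API in model.py:

```python
# Hypothetical sketch only: function names, labels, and parameters are
# illustrative, not the project's actual code.
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score


def fit_model(X_train, y_train, **rf_kwargs):
    """Fit and return a random forest classifier on the training data."""
    model = RandomForestClassifier(random_state=123, **rf_kwargs)
    model.fit(X_train, y_train)
    return model


def evaluate_model(model, X_test, y_test, positive_label="good"):
    """Return a dictionary of evaluation metrics for a fitted model."""
    y_pred = model.predict(X_test)
    return {
        "accuracy": accuracy_score(y_test, y_pred),
        "precision": precision_score(y_test, y_pred, pos_label=positive_label),
        "recall": recall_score(y_test, y_pred, pos_label=positive_label),
        "f1": f1_score(y_test, y_pred, pos_label=positive_label),
    }
```

Because each function now does one thing, fit_model() can be unit tested on a tiny synthetic data frame and evaluate_model() on a pre-fitted dummy model, without rerunning the whole pipeline.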
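For point 6, here is a sketch of what a documented test could look like. The preprocess() helper and the toy data frame are stand-ins invented for illustration; they are not the project's real preprocessing function or data:

```python
# Hypothetical sketch only: preprocess() stands in for the project's real
# preprocessing step, and the toy data frame is invented.
import pandas as pd


def preprocess(df):
    """Stand-in preprocessing step: fill numeric NaNs with the column median."""
    return df.fillna(df.median(numeric_only=True))


def test_missing_values_handled():
    """Test if missing values in the dataset are filled or handled correctly.

    This test ensures that there are no NaNs in the dataset after preprocessing.
    """
    raw = pd.DataFrame({"duration": [6.0, None, 24.0], "amount": [1000.0, 2500.0, None]})
    cleaned = preprocess(raw)
    assert not cleaned.isna().any().any()
```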
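For point 8, a rough sketch of plotting feature importances from a fitted random forest is below. The function name and the output filename are assumptions for illustration, not taken from model.py:

```python
# Hypothetical sketch only: the function name and output path are assumptions.
import matplotlib.pyplot as plt
import pandas as pd


def plot_feature_importances(model, feature_names, path="feature_importances.png"):
    """Plot sorted feature importances from a fitted tree-based model and save the figure."""
    importances = pd.Series(model.feature_importances_, index=feature_names).sort_values()
    ax = importances.plot(kind="barh")
    ax.set_xlabel("Importance")
    ax.set_title("Random forest feature importances")
    plt.tight_layout()
    plt.savefig(path)
    plt.close()
```

A short paragraph next to the figure explaining how the top-ranked features relate to credit risk would go a long way toward the transparency mentioned above.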

Summary

In summary, I'm really pleased with how this project looks at the moment: it satisfies all of the course requirements, is a great piece of professional teamwork, and could serve you well beyond the course. I think implementing even one of the improvements above would make it even better.

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.