DSCI-310-2024 / data-analysis-review-2024


Submission: Group 12: Predictive Modelling: German Credit Risk #12

Open ttimbers opened 3 months ago

ttimbers commented 3 months ago

Submitting authors: Shahrukh Islam Prithibi, Sophie Yang, Yovindu Don, Jade Bouchard

Repository: https://github.com/DSCI-310-2024/DSCI310_Group-12_Credit-Risk-Classification/releases/tag/v2.0.0

Abstract/executive summary:

The goal of our analysis is to classify whether someone is a good or bad credit risk using attributes such as Credit History, Duration, and Residence. Our best-performing model is a Random Forest model. This model gave us an accuracy of 0.8 on unseen data, a decent result compared to the dummy model's accuracy of 0.7. We also obtained a precision score of 0.8, a recall score of 0.95, and an F1 score of 0.87. Our model performs decently well in terms of identifying people who are a good credit risk. However, if this model is to have a hand in real-world decision-making, precision should be improved to minimize classifying poor credit risks as good credit risks (false positives). In addition, more research should be done to ensure the model produces fair and equitable recommendations.

Editor: @ttimbers

Reviewers: Lesley Mai, Calvin Choi, Anna Czarnocka, Charles Benkard

calvinyhchoi commented 2 months ago

Data analysis review checklist

Reviewer: calvinyhchoi

Conflict of interest

Code of Conduct

General checks

Documentation

Code quality

Reproducibility

Analysis report

Time spent reviewing: ~1.5 hours

Comments:

I think you have a very well-put-together report and repository. As someone who has had a tough time applying the course material to my own project, I was very impressed by the overall robustness of this one. Take my comments with a grain of salt, as they are a bit nitpicky.

This was derived from the JOSE review checklist and the ROpenSci review checklist.

lesleymai commented 2 months ago

Data analysis review checklist

Reviewer: @lesleymai

Conflict of interest

Code of Conduct

General checks

Documentation

Code quality

Reproducibility

Analysis report

Estimated hours spent reviewing: 3h

Review Comments:

Hi Group 12! Congratulations on the impressive work on this comprehensive analysis. Your thorough approach to understanding and predicting German credit risk demonstrates strong analytical skills.

All the best! :)

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

AnnaCzarnocka commented 2 months ago

Data analysis review checklist

Reviewer: @AnnaCzarnocka

Conflict of interest

Code of Conduct

General checks

Documentation

Code quality

Reproducibility

Analysis report

Estimated hours spent reviewing: 2h

Review Comments:

Strong points:

  1. Original, interesting, and professional choice of topic and model. I think this project will fit very well in your future portfolios when applying to companies.

  2. The repository structure is logical and user-friendly, with clear separations between data, scripts, and documentation. For instance, the scripts folder contains well-named Python files like model.py, which intuitively suggests content related to model building. This organization enhances readability and navigability.

  3. The README.md documentation file is comprehensive, detailing the purpose, installation, and usage of the project. The inclusion of dependencies is a best practice, ensuring that anyone trying to replicate the analysis can do so without ambiguity.

  4. Automated test scripts found in the tests directory are indicative of a mature development process. For example, test_data_preprocessing.py is designed to ensure preprocessing steps are producing the expected transformations.

Areas for Improvement:

  1. The data directory is a core part of your project repository, and you have a lot of data files stored there. Therefore, to enhance clarity, you could consider adding a README file within the 'data' directory, detailing the structure of the data, the source of the data, and any preprocessing steps that have been applied. This would greatly aid reproducibility and help fellow researchers understand how to work with the data in their analyses.

  2. The test suite appears to me to be a clear strong point; however, ensuring consistent test coverage across all features would be ideal. For example, the script column_histogram.py could be accompanied by more comprehensive tests in test_column_histogram.py, checking for edge cases and unexpected inputs.

  3. While the code is well-structured, more in-depth inline documentation within scripts like data_preprocessing.py would greatly benefit users. Explaining the "why" behind a code block helps users and contributors understand the reasoning and could assist in debugging or further development.

  4. Some scripts, like model.py, could benefit from breaking larger functions down into smaller, more focused ones. This not only helps in understanding and maintaining the code but also facilitates unit testing. For instance, if the model training and evaluation phases are separated into distinct functions, they can be tested and debugged more efficiently. If there is a single function that performs both model fitting and evaluation, you could consider splitting it into fit_model() and evaluate_model(); see the first sketch after this list.

  5. Your src directory is neatly organized, which is commendable. However, it would be beneficial to include a README file that provides an overview of the scripts contained within, along with a brief description of each script's purpose. This would serve as a guide for users navigating through the codebase, improving the overall understandability of your project.

  6. Similarly, in the tests directory, it would be constructive to include a document explaining the testing strategy employed, alongside a summary of the results. You could add a README that describes what each test script is intended to check, and perhaps include sample output of the tests running successfully. For example, a script such as test_data_integrity.py could include assertions ensuring that no data leakage has occurred and that data types are consistent after cleanup. Documenting this will give other developers and reviewers confidence in the stability and reliability of the codebase. In test_data_preprocessing.py, you could also add a docstring at the beginning of each test function, for example a test_missing_values_handled() test whose docstring explains that it checks there are no NaNs left in the dataset after preprocessing; see the second sketch after this list.

  7. Your script cleaning_and_eda.py appears to handle data cleaning and exploratory data analysis, which are crucial steps. One recommendation would be to include more detailed inline comments explaining the rationale behind each data cleaning step and EDA process. For instance, if you are replacing missing values or encoding categorical variables, provide a brief explanation in the code comments of why you chose that specific method. This helps readers understand the decisions that shaped the data preprocessing pipeline.

  8. It's great to see that your Random Forest model performs well. However, in model.py, consider providing additional commentary or a supplementary document explaining how you interpret the model's decisions, especially the feature importances that drive its predictions. This would enhance the transparency and trustworthiness of the model, which is crucial for models used in financial decision-making. For example, after fitting the model, you could include code that plots feature importances and discuss, in the comments or an accompanying document, how each feature might be affecting credit risk predictions; see the third sketch after this list.
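To make the suggestion in point 4 more concrete, here is a minimal sketch of what separating fitting from evaluation could look like. The names fit_model() and evaluate_model(), the "good" positive label, and the hyperparameters are placeholders for illustration, not the actual API in model.py:

```python
# Hypothetical sketch only: function names, labels, and parameters are
# illustrative, not the project's actual code.
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score


def fit_model(X_train, y_train, **rf_kwargs):
    """Fit and return a random forest classifier on the training data."""
    model = RandomForestClassifier(random_state=123, **rf_kwargs)
    model.fit(X_train, y_train)
    return model


def evaluate_model(model, X_test, y_test, positive_label="good"):
    """Return a dictionary of evaluation metrics for a fitted model."""
    y_pred = model.predict(X_test)
    return {
        "accuracy": accuracy_score(y_test, y_pred),
        "precision": precision_score(y_test, y_pred, pos_label=positive_label),
        "recall": recall_score(y_test, y_pred, pos_label=positive_label),
        "f1": f1_score(y_test, y_pred, pos_label=positive_label),
    }
```

Because each function now does one thing, fit_model() can be unit tested on a tiny synthetic data frame and evaluate_model() on a pre-fitted dummy model, without rerunning the whole pipeline.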
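For point 6, here is a sketch of what a documented test could look like. The preprocess() helper and the toy data frame are stand-ins invented for illustration; they are not the project's real preprocessing function or data:

```python
# Hypothetical sketch only: preprocess() stands in for the project's real
# preprocessing step, and the toy data frame is invented.
import pandas as pd


def preprocess(df):
    """Stand-in preprocessing step: fill numeric NaNs with the column median."""
    return df.fillna(df.median(numeric_only=True))


def test_missing_values_handled():
    """Test if missing values in the dataset are filled or handled correctly.

    This test ensures that there are no NaNs in the dataset after preprocessing.
    """
    raw = pd.DataFrame({"duration": [6.0, None, 24.0], "amount": [1000.0, 2500.0, None]})
    cleaned = preprocess(raw)
    assert not cleaned.isna().any().any()
```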
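For point 8, a rough sketch of plotting feature importances from a fitted random forest is below. The function name and the output filename are assumptions for illustration, not taken from model.py:

```python
# Hypothetical sketch only: the function name and output path are assumptions.
import matplotlib.pyplot as plt
import pandas as pd


def plot_feature_importances(model, feature_names, path="feature_importances.png"):
    """Plot sorted feature importances from a fitted tree-based model and save the figure."""
    importances = pd.Series(model.feature_importances_, index=feature_names).sort_values()
    ax = importances.plot(kind="barh")
    ax.set_xlabel("Importance")
    ax.set_title("Random forest feature importances")
    plt.tight_layout()
    plt.savefig(path)
    plt.close()
```

A short paragraph next to the figure explaining how the top-ranked features relate to credit risk would go a long way toward the transparency mentioned above.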

Summary

In summary, I'm really pleased with how this project looks at the moment: it satisfies all of the course requirements, is a great piece of professional teamwork, and could serve you well beyond the course. I think implementing even one of the improvements above would make it even better.

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.