renee-kwon opened 1 year ago
Your project is very concise: it has all the sections, and the plots are well suited to the problem you are solving. Below are a few points which can help improve your project, which is already great!
First: I think it is important to change the output of the Rmd file in the doc folder to github_document so that you get an .md file which will render well. You have an html file, but it does not render well on GitHub. As an outsider, I do not know what your report looks like visually (your graphs, tables, ...); I have to go into the folders to see them. So my suggestion is to change the output from html_document to github_document:
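For example, in the YAML header of the Rmd file (a minimal sketch; the rest of your header stays as it is):

```yaml
# in the YAML header of doc/report.Rmd
output: github_document
```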
Second: If I want to run the script that produces the report locally, I will not be able to do so because it is not provided. It would be nice to also include the script for producing the report.
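For instance, something along these lines could be added to the usage instructions or a Makefile (the path is assumed from your repo layout):

```sh
Rscript -e "rmarkdown::render('doc/report.Rmd')"
```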
Third: I think it would be great, in the project description, to expand more on why credibility is important, to give some statistics, and to explain more how machine learning can be used to solve the credibility issue. Parts of this have been included in the research question, which is great, but I think it would be better to move some of the explanations from the research question to the About section and expand on them there. The research question itself should be stated in a summarized way.
Lastly: I think it would be great to have the results folder of the project outside the data folder; the data folder should only contain the raw and clean data. The results of your project should be easily accessible to the reader.
This was derived from the JOSE review checklist and the ROpenSci review checklist.
Great project!! I thought the analysis was well done and thought out. I specifically enjoyed the discussion on Type 1 and 2 errors using the final estimator. I only have a few comments about the report/analysis and any discussion is very much welcome!
- Consider changing the output of `report.Rmd` to include both `html` and `github_document`, so that a quick glance in GitHub lets you check the report without having to clone the entire repo. Also, the `README.md` usage section could use code blocks (with 3 backticks) to allow users to copy and paste any commands, and specifying where (which directory) the user should run the scripts is important! (A sketch of such a block follows this list.)
- … the `eda_results` folder.
- It would help to explain why the `Logistic Regression` model was chosen over another (such as `Ridge`) as the final estimator. On a similar note, there were a lot of models being tried, but the research question specified looking at feature importances, so I was wondering why certain models like kNN were chosen to be compared since they don't have interpretable feature importances. (See the coefficient sketch after this list.)

This was derived from the JOSE review checklist and the ROpenSci review checklist.
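For the usage point above, a fenced block in the `README.md` could look something like this (the script name and arguments are placeholders for illustration, not the project's actual commands):

```sh
# run from the root of the repository
python src/download_data.py --url=<data-url> --out_file=data/raw/credit.csv
```

And on the interpretability point, a minimal sketch (on toy data, not the project's) of why a linear model yields feature importances while kNN does not:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

# toy data standing in for the credit dataset (illustration only)
X, y = make_classification(n_samples=200, n_features=5, random_state=13)

lr = LogisticRegression(max_iter=1000).fit(X, y)
print(lr.coef_[0])            # one signed weight per feature -> interpretable

knn = KNeighborsClassifier().fit(X, y)
print(hasattr(knn, "coef_"))  # False: kNN exposes no per-feature weights
```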
Overall, the project is well thought out, organized, and thorough. The scripts are well documented and easy to read, and the report does a good job of taking the reader through the steps of your analysis. Very well done!
The link to the final report provided in the ReadMe (https://github.com/UBC-MDS/Credit_Card_Default_Prediction_Group13/blob/Makefile_report_hw/doc/report.md) leads to a 404 error. I believe you want to use this link instead: https://github.com/UBC-MDS/Credit_Card_Default_Prediction_Group13/blob/main/doc/report.md
The methods section and EDA in the report are very well-detailed, making it easy to understand the data set you are working with and the steps of your analysis.
It may not be common knowledge what "default" means in terms of credibility, so it may be helpful to the reader to clearly define what is meant by the "default" class in the report. Does "default" mean that the person is credible or not credible?
The tables and plots are explained well in the discussion, but it can be useful to also include table/figure captions that summarize what they are showing. I see that you have figure and table captions specified in your code chunk options in the report.Rmd file, but they seem to be missing in the rendered report.md and report.html files.
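For reference, a caption set through chunk options looks something like the following (the chunk label, caption text, and file path are placeholders, not your actual code):

````
```{r confusion-matrix, fig.cap = "Confusion matrix of the final classifier on the test set."}
knitr::include_graphics("results/confusion_matrix.png")
```
````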
The discussion of the results and suggested improvements are very insightful and show that the model, despite its limitations, is promising with further improvements. Well done!
This was derived from the JOSE review checklist and the ROpenSci review checklist.
We have received valuable feedback from our peers as well as the teaching team for our project. On behalf of our group members, I would like to share the improvements we have been able to make.
- We have updated the `makefile` to incorporate the restructuring. This is the link to these changes committed in our project repository.
- We have added `assert` statements to ensure the inputs for the functions are in the proper format. We have also added comments and documentation to ensure these parts of the code are understandable. One such improvement of our code is documented in this commit. (A sketch of this kind of check follows this list.)
- We have reworked the `R` code in our report. We have made the recommended edits in the following commit, where we updated the script to write analysis results and scores to a separate file, which we then use with inline commands in the report. The incorporated changes can be verified via this commit.
- We have fixed the `environment.yaml` file in our repo to create an environment that should suffice for a user to reproduce the project; we have fixed the package names and versions, and the command line usage section has also been updated accordingly. (An illustrative sketch of such a file also follows.)

The project has come a long way since, and some of the commits addressing these issues are linked here, here, and here. Thank you again for all the help and feedback. We have learnt a great deal from you.
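For readers following along, here is a minimal sketch of the kind of input check we mean (the function name and messages are illustrative, not our exact code):

```python
import pandas as pd

def clean_data(raw_df):
    """Illustrative input validation before cleaning."""
    assert isinstance(raw_df, pd.DataFrame), "raw_df must be a pandas DataFrame"
    assert not raw_df.empty, "raw_df must not be empty"
    # ... actual cleaning steps would follow here
    return raw_df
```

And the general shape of the environment file we describe (package names and versions here are illustrative placeholders, not the pinned versions in our repo):

```yaml
# environment.yaml (illustrative sketch)
name: credit_default
channels:
  - conda-forge
dependencies:
  - python=3.10
  - pandas
  - scikit-learn
```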
Submitting authors: @ChesterAiGo @hcwang24 @qurat-azim @renee-kwon
Repository: https://github.com/UBC-MDS/Credit_Card_Default_Prediction_Group13
Report link: https://github.com/UBC-MDS/Credit_Card_Default_Prediction_Group13/blob/main/doc/report.html
Abstract/executive summary: In the field of risk management, one of the most common problems is default prediction. This allows companies to predict the credibility of each person, analyze the risk level, and optimize decisions for better business economics. In this project, we aim to learn and predict a credit card holder's credibility based on his/her basic personal information (gender, education, age, history of past payment, etc.).
Our final classifier, using the Logistic Regression model, did not perform as well as we hoped on our unseen test data, with a final f1 score of 0.471. Of the 6,000 clients in our test data, our model correctly predicted the default status of 4,141 clients. There were 1,129 incorrect predictions: either predicting that a customer will default on their payment when they do not, or that a customer will not default when they do. Incorrect predictions of either type can be costly for financial institutions, and thus we will continue to study our data and improve our model before it is put into production.
We use a dataset hosted by the UCI machine learning repository. It was originally collected by researchers from Chung Hua University and Tamkang University. As the probability of default cannot actually be acquired, the targets were obtained through estimation, as stated by the authors of this dataset. The dataset consists of 30,000 instances, each consisting of 23 attributes and a target. The raw dataset is about 5.5 MB in size, and we split it into a training set (80%) and a testing set (20%) for further use. The data attributes range from the client's gender, age, and education to previous payment history, credit amount, etc.
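A minimal sketch of that split (the file path and random seed are assumptions for illustration, not our exact script):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

raw = pd.read_csv("data/raw/default_of_credit_card_clients.csv")  # hypothetical path
# 80/20 split as described above; the seed is arbitrary for this sketch
train_df, test_df = train_test_split(raw, test_size=0.2, random_state=123)
```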
Editor: @flor14
Reviewers: Flora Ouedraogo, Angela Chen, Wilfred Hass, Maryan Agyby