Open PhilsChan opened 2 years ago
Overall, the project is well structured and runs smoothly, and the final report contains all the required sections with a complete description of the project's purpose, background, methodology, data, and results. Great job! Here are some small suggestions; they probably go beyond the milestone's requirements and are meant only to polish the project further.
This was derived from the JOSE review checklist and the ROpenSci review checklist.
Please provide more detailed feedback here on what was done particularly well, and what could be improved. It is especially important to elaborate on items that you were not able to check off in the list above.
I really enjoyed reviewing your repo. Just a few comments.
It would be better to put more background information in README.md. You have a good introduction and background in the final report; maybe you can reuse some of it. I felt that README.md jumps to the conclusion right away without sufficient introduction or background.
In the README and the report, you use raw sample counts in your explanations. For example, you mention "Our classifier was able to correctly predict 13524 examples out of 16281 test examples" and "The training dataset consists of 32561 examples, while the testing set has 16281 rows". It would be more helpful for readers if you also expressed these figures as percentages.
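As a rough illustration of the percentage suggestion, the counts quoted above can be converted with a few lines of Python. This is only a sketch: the helper name `as_percent` and the variable names are made up for this example and are not taken from the project's code.

```python
def as_percent(part, whole, digits=1):
    """Return part/whole expressed as a percentage, rounded to `digits` decimals."""
    return round(100 * part / whole, digits)


# Counts quoted from the README and report
n_correct, n_test = 13524, 16281
n_train = 32561
n_total = n_train + n_test

accuracy_pct = as_percent(n_correct, n_test)
train_pct = as_percent(n_train, n_total)
test_pct = as_percent(n_test, n_total)

print(f"Correct predictions: {n_correct} of {n_test} ({accuracy_pct}%)")
print(f"Training set: {n_train} rows ({train_pct}%); test set: {n_test} rows ({test_pct}%)")
```

With these counts the classifier is correct on roughly 83% of the test examples, and the data is split roughly two-thirds training to one-third test.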
The Report link (https://ubc-mds.github.io/census-income-prediction/doc/report.html) above points outside your group repo (https://github.com/UBC-MDS/census-income-prediction). How about changing the Report link to the copy inside the repo (https://github.com/UBC-MDS/census-income-prediction/blob/main/doc/report.md)?
Since we learned SHAP this week, how about applying SHAP to your analysis?
I forked your repo and checked whether the Makefile worked. However, make all didn't work for me (even after creating a virtual environment from your YAML file). I also tried running the series of scripts in the README, and that didn't work either. The problem might be on my side, but I recommend that you check it on your side as well.
It would be helpful for readers like me if you put the command for creating the virtual environment in the README: conda env create -f census-income.yaml
This is overall a great project and I love the topic that you have chosen! Some highlights of the project that I really like:
Some comments that I feel like would make this project even better:
For the GitHub repo:
- src/eda_script.py, src/model_building.py and src/model_evaluation.py could be easier to read if they were split into separate functions based on the comments there
- .ipynb and .py files are in the same repo src, but since they serve for clarification, I think it is also reasonable to keep them if they do not serve the same function or repeat code between each other.
- … doc/.DS_Store and .gitignore to reduce any possible confusion for the readers
For the report:
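Returning to the suggestion about splitting the scripts in src/ into functions: the sketch below shows the general shape such a refactor could take. It is not the project's actual code. The function names (`load_rows`, `count_by_label`, `main`), the `income` column name, and the sample rows are all hypothetical and chosen only for illustration.

```python
import csv
import io
from collections import Counter


def load_rows(handle):
    """Read CSV rows from an open file handle into a list of dicts."""
    return list(csv.DictReader(handle))


def count_by_label(rows, label_column):
    """Count how many rows fall under each value of the label column."""
    return Counter(row[label_column] for row in rows)


def main(handle, label_column="income"):
    """Run the steps in order; each step can also be tested on its own."""
    rows = load_rows(handle)
    return count_by_label(rows, label_column)


if __name__ == "__main__":
    # Tiny in-memory stand-in for a data file (made-up rows, not the census data)
    sample = io.StringIO("age,income\n39,<=50K\n52,>50K\n31,>50K\n")
    print(dict(main(sample)))
```

Each commented step in a long script becomes one small function with a docstring, and `main` only wires them together, which makes the individual steps easier to read and to unit test.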
Your project is very interesting and well organized, and I enjoyed going through it. I do not have a lot to say about it, but I believe there is always room for improvement. You can find some detailed comments about your project below.
Thank you all for your constructive feedback. We really appreciate your valuable comments on our project. As suggested, we have incorporated the following changes:
We hope the above changes address your concerns. Again, we are grateful for your feedback, which has helped us improve the quality of the project.
Submitting authors: @PhilsChan @nd265 @sukhleen999 @Affrin101
Repository: https://github.com/UBC-MDS/census-income-prediction
Report link: https://ubc-mds.github.io/census-income-prediction/doc/report.html
Abstract/executive summary: Here we attempt to build a classification model using the Random Forest Classifier algorithm (Liaw and Wiener 2002), which uses census income data with demographic features such as level of education, age, and hours dedicated to work to predict whether a person's annual income will be greater than 50K or not. Our model was able to correctly predict 13524 examples out of 16281 test examples. Our classifier performed fairly well on unseen test data, with an ROC AUC score of 0.89, indicating that a randomly chosen example from the positive class (income >50K) is ranked above a randomly chosen negative example with probability 0.89. The average precision score of our model on the test data is 0.70 and recall is close to 0.71, indicating that among the people we predicted to earn more than 50K, about 70% actually did, and among all the people who actually earned more than 50K, we identified 71% correctly. However, it incorrectly predicted 1042 examples as false positives. These kinds of incorrect predictions could lead people into believing that they can earn more than 50K by following some other career path which might not be favourable for them; we therefore recommend continuing the study to improve this prediction model before it is put into production.
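The counts in the abstract also allow a quick breakdown of the errors. The sketch below assumes a binary classifier, so every incorrect prediction is either a false positive or a false negative. The variable names are made up for this example, and the false-negative count is derived here from the quoted figures rather than quoted from the report.

```python
# Counts quoted in the abstract
n_test = 16281
n_correct = 13524
n_false_positive = 1042

# Assumption: binary classification, so errors = false positives + false negatives
n_errors = n_test - n_correct
n_false_negative = n_errors - n_false_positive

fp_pct = 100 * n_false_positive / n_test
fn_pct = 100 * n_false_negative / n_test

print(f"Errors: {n_errors} of {n_test}")
print(f"False positives: {n_false_positive} ({fp_pct:.1f}% of test examples)")
print(f"False negatives: {n_false_negative} ({fn_pct:.1f}% of test examples)")
```

Under that assumption, the 2757 errors split into 1042 false positives and 1715 false negatives, so missed high earners are the more frequent kind of error.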
Editor: @flor14
Reviewers:
Guo Simon @y248guo
Lee John @max780228
Song Qingqing @scarlqq
Ouedraogo Flora @florawendy19
[x] I agree to abide by MDS's Code of Conduct during the review process and in maintaining my package should it be accepted.