UBC-MDS / data-analysis-review-2021

Submission: GROUP 01: Census Income Prediction #6

Open PhilsChan opened 2 years ago

PhilsChan commented 2 years ago

Submitting authors: @PhilsChan @nd265 @sukhleen999 @Affrin101

Repository: https://github.com/UBC-MDS/census-income-prediction

Report link: https://ubc-mds.github.io/census-income-prediction/doc/report.html

Abstract/executive summary: Here we attempt to build a classification model using the Random Forest Classifier algorithm (Liaw and Wiener 2002) that uses census income data with demographic features such as level of education, age, and hours dedicated to work to predict whether a person's annual income will be greater than 50K or not. Our model correctly predicted 13524 out of 16281 test examples. Our classifier performed fairly well on unseen test data with an ROC AUC score of 0.89, meaning that a randomly chosen positive example (income > 50K) receives a higher score than a randomly chosen negative example with probability 0.89. The average precision score of our model on the test data is 0.70 and recall is close to 0.71, indicating that among the people we predicted to earn more than 50K, about 70% actually do, and among all the people who actually earn more than 50K, we identified about 71% correctly. However, the model produced 1042 false positives. These kinds of incorrect predictions could lead people to believe that they can earn more than 50K by following some other career path which might not be favourable for them, so we recommend continuing the study to improve this prediction model before it is put into production.
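As a hedged aside on reading the ROC AUC figure: it equals the fraction of (positive, negative) pairs in which the positive example gets the higher score, with ties counting half. The sketch below illustrates this on made-up toy scores, not on the project's data; `pairwise_auc` is a hypothetical helper name.

```python
from itertools import product


def pairwise_auc(y_true, scores):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p, n in product(pos, neg)
    )
    return wins / (len(pos) * len(neg))


# Toy labels (1 = income > 50K) and predicted scores, invented for illustration.
y_true = [1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.6, 0.4, 0.7, 0.3, 0.2, 0.1]
auc = pairwise_auc(y_true, scores)
print(round(auc, 3))
```

Here 10 of the 12 positive/negative pairs are ordered correctly, so the toy AUC is 10/12.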

Editor: @flor14 Reviewer:

scarlqq commented 2 years ago

Data analysis review checklist

Reviewer: @scarlqq

Conflict of interest

Code of Conduct

General checks

Documentation

Code quality

Reproducibility

Analysis report

Estimated hours spent reviewing: 1.5 hours

Review Comments:

Overall, the project was well structured and ran smoothly, and the final report contains all the required sections with a complete description of the project's purpose, background, methodology, data, and results. Great job! Here are some small suggestions. They probably go beyond the milestone's requirements and are only meant to polish the project further.

  1. The flowchart in the README.md makes the running order of the project very clear, but since we now have a Makefile, maybe we could use the tool makefile2graph to generate a dependency diagram for the analysis directly from the Makefile.
  2. In the EDA section, some charts still have symbols like '_' in the axis labels. Maybe we could set explicit axis titles to make the charts more human-readable.
  3. In the feature transformation section, a table recording which transformation was applied to which feature (e.g., feature name, transformation, brief reason) would make this easier to read.
  4. In the results section, would it be better to use '<= 50K' and '> 50K' as the labels in the confusion matrix? These would be easier to read than positive/negative.
  5. In Further Development, besides changing models, maybe we could try stacking, combining multiple models to achieve better results.
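Suggestions 2 and 4 above could be sketched roughly as follows. This is a minimal illustration, not the project's code: the column names are hypothetical, and `humanize` is a helper name invented here.

```python
def humanize(name):
    """Turn a raw column name such as 'hours_per_week' into 'Hours per week'."""
    words = name.replace("_", " ").replace("-", " ").split()
    return " ".join(words).capitalize()


# Hypothetical column names; the project's actual columns may differ.
columns = ["education_num", "hours_per_week", "marital-status"]
axis_titles = {col: humanize(col) for col in columns}

# Readable class labels for the confusion matrix, replacing negative/positive.
class_labels = {0: "<= 50K", 1: "> 50K"}

print(axis_titles)
print([class_labels[k] for k in sorted(class_labels)])
```

The resulting strings can then be passed to whichever plotting library the report uses as axis titles or display labels.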

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

johnwslee commented 2 years ago

Data analysis review checklist

Reviewer: max780228

Conflict of interest

Code of Conduct

General checks

Documentation

Code quality

Reproducibility

Analysis report

Estimated hours spent reviewing: 1.5 hours

Review Comments:

Please provide more detailed feedback here on what was done particularly well, and what could be improved. It is especially important to elaborate on items that you were not able to check off in the list above.

I really enjoyed reviewing your repo. Just a few comments.

  1. It would be better to put more background information in README.md. You have a good introduction and background in the final report, so maybe you can reuse some of that. I had the feeling that the README.md jumped straight to the conclusion without sufficient introduction or background.

  2. In the README and the report, you use raw sample counts in your explanations. For example, you mention "Our classifier was able to correctly predict 13524 examples out of 16281 test examples" and "The training dataset consists of 32561 examples, while the testing set has 16281 rows". It would be more helpful for readers if you also expressed these as percentages.

  3. The Report link (https://ubc-mds.github.io/census-income-prediction/doc/report.html) above is not inside your group repo (https://github.com/UBC-MDS/census-income-prediction). What about changing the Report link to this one (https://github.com/UBC-MDS/census-income-prediction/blob/main/doc/report.md)?

  4. Since we learned SHAP this week, how about applying SHAP to your analysis?

  5. I forked your repo and checked whether the Makefile worked. However, `make all` didn't work for me, even after creating a virtual env from your yaml file. I also tried running the series of scripts listed in the README, and that didn't work either. The problem might be on my side, but I recommend checking it on your side as well.

  6. It would be helpful for readers like me if you put the command for creating the virtual env in the README: `conda env create -f census-income.yaml`
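For comment 2, the counts quoted above convert to percentages with a couple of divisions. A small sketch using only the numbers cited in this thread (the variable names are illustrative):

```python
# Counts quoted in the review comment above.
correct, n_test = 13524, 16281
n_train = 32561

accuracy = correct / n_test
test_share = n_test / (n_train + n_test)

print(f"Correct predictions: {correct} of {n_test} ({accuracy:.1%})")
print(f"Train/test split: {1 - test_share:.1%} / {test_share:.1%}")
```

This works out to roughly 83.1% correct predictions and an approximately two-thirds/one-third train/test split.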

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

y248guo commented 2 years ago

Data analysis review checklist

Reviewer: @y248guo

Conflict of interest

Code of Conduct

General checks

Documentation

Code quality

Reproducibility

Analysis report

Estimated hours spent reviewing: 1.5 hours

Review Comments:

This is overall a great project and I love the topic that you have chosen! Some highlights of the project that I really like:

Some comments that I feel like would make this project even better:

For the GitHub repo:

For the report:

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

florawendy19 commented 2 years ago

Data analysis review checklist

Reviewer: @florawendy19

Conflict of interest

Code of Conduct

General checks

Documentation

Code quality

Reproducibility

Analysis report

Estimated hours spent reviewing: 1.5 hours

Review Comments:

Your project is very interesting and well organized, and I enjoyed going through it. I do not have much to say about it, but I believe there is always room for improvement. You can find some detailed comments about your project below.

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

sukhleen999 commented 2 years ago

Thank you all for your constructive feedback. We really appreciate your valuable comments on our project. As suggested, we have incorporated the following changes:

We hope the changes above address your concerns. Again, we are grateful for your feedback, which has helped us improve the quality of the project.