Submission: GROUP 01: Census Income Prediction

Submitting authors: @PhilsChan @nd265 @sukhleen999 @Affrin101

Repository: https://github.com/UBC-MDS/census-income-prediction

Report link: https://ubc-mds.github.io/census-income-prediction/doc/report.html

Abstract/executive summary: Here we attempt to build a classification model using the Random Forest Classifier algorithm (Liaw and Wiener 2002) that uses census income data with demographic features such as level of education, age, and hours dedicated to work to predict whether a person’s annual income will be greater than 50K. Our model correctly predicted 13524 of 16281 test examples. Our classifier performed fairly well on unseen test data, with an ROC AUC score of 0.89, meaning that it ranks a randomly chosen positive example (income >50K) above a randomly chosen negative example with probability 0.89. The average precision score of our model on the test data is 0.70 and recall is close to 0.71, indicating that roughly 70% of the people we predicted to earn more than 50K actually did, and that we correctly identified 71% of all the people who actually earned more than 50K. However, it incorrectly predicted 1042 examples as false positives. These kinds of incorrect predictions could lead people into believing that they can earn more than 50K by following some other career path which might not be favourable for them; thus, we recommend continuing the study to improve this prediction model before it is put into production.

Editor: @flor14

Reviewers:

Guo Simon @y248guo
Lee John @max780228
Song Qingqing @scarlqq
Ouedraogo Flora @florawendy19
[x] I agree to abide by MDS's Code of Conduct during the review process and in maintaining my package should it be accepted.

Data analysis review checklist

Reviewer: @scarlqq

Conflict of interest

[x] As the reviewer I confirm that I have no conflicts of interest for me to review this work.

Code of Conduct

[x] I confirm that I read and will adhere to the MDS code of conduct.

General checks

[x] Repository: Is the source code for this data analysis available? Is the repository well organized and easy to navigate?
[x] License: Does the repository contain a plain-text LICENSE file with the contents of an OSI approved software license?

Documentation

[x] Installation instructions: Is there a clearly stated list of dependencies?
[x] Example usage: Do the authors include examples of how to use the software to reproduce the data analysis?
[x] Functionality documentation: Is the core functionality of the data analysis software documented to a satisfactory level?
[x] Community guidelines: Are there clear guidelines for third parties wishing to 1) Contribute to the software 2) Report issues or problems with the software 3) Seek support

Code quality

[x] Readability: Are scripts, functions, objects, etc., well named? Is it relatively easy to understand the code?
[x] Style guidelines: Does the code adhere to well-known language style guides?
[x] Modularity: Is the code suitably abstracted into scripts and functions?
[x] Tests: Are there automated tests or manual steps described so that the function of the software can be verified? Are they of sufficient quality to ensure software robustness?

Reproducibility

[x] Data: Is the raw data archived somewhere? Is it accessible?
[x] Computational methods: Is all the source code required for the data analysis available?
[x] Conditions: Is there a record of the necessary conditions (software dependencies) needed to reproduce the analysis? Does there exist an easy way to obtain the computational environment needed to reproduce the analysis?
[x] Automation: Can someone other than the authors easily reproduce the entire data analysis?

Analysis report

[x] Authors: Does the report include a list of authors with their affiliations?
[x] What is the question: Do the authors clearly state the research question being asked?
[x] Importance: Do the authors clearly state the importance for this research question?
[x] Background: Do the authors provide sufficient background information so that readers can understand the report?
[x] Methods: Do the authors clearly describe and justify the methodology used in the data analysis? Do the authors communicate any assumptions or limitations of their methodologies?
[x] Results: Do the authors clearly communicate their findings through writing, tables and figures?
[x] Conclusions: Are the conclusions presented by the authors correct?
[x] References: Do all archival references that should have a DOI list one (e.g., papers, datasets, software)?
[x] Writing quality: Is the writing of good quality, concise, engaging?

Estimated hours spent reviewing: 1.5 hours

Review Comments:

Overall, the project was well structured and ran smoothly, with the final report containing all the required sections and a complete description of the project's purpose, background, methodology, data, and results. Great job! Here are some small suggestions. They probably go beyond the milestone's requirements and are only meant to polish the project further.

The flowchart in README.md makes the running order of the project very clear, but since we have the Makefile now, maybe we can use the makefile2graph tool to generate a dependency diagram for the data analysis directly from the Makefile.
In the EDA section, I found that some charts still have symbols like '_' in the axis labels; maybe we can set custom axis titles to make the charts more human-readable.
In the feature transform section, maybe a table could be made to record what transformation was done to which feature (e.g., feature name, transformation, simple reason) so that it would be easier to read.
In the result section, would it be better to use '<= 50k' '> 50k' as the label in the confusion matrix? This would be easier to read than positive/negative.
In Further Development, besides changing models, maybe we can try stacking to use multiple models together to achieve better results.
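
As a rough illustration of the confusion-matrix suggestion above, the sketch below builds a small text table whose rows and columns carry the class names instead of positive/negative. The helper name `format_confusion_matrix` and the toy labels are assumptions made for this example and are not taken from the project's code.

```python
from collections import Counter

# Readable class names, in the order they should appear in the table.
LABELS = ["<=50K", ">50K"]


def format_confusion_matrix(y_true, y_pred, labels=LABELS):
    """Return a plain-text confusion matrix with class names as headers."""
    # Count how often each (actual, predicted) pair occurs.
    counts = Counter(zip(y_true, y_pred))
    col_heads = [f"Predicted {label}" for label in labels]
    row_heads = [f"Actual {label}" for label in labels]
    row_width = max(len(head) for head in row_heads)

    lines = [" " * row_width + " | " + " | ".join(col_heads)]
    for actual, row_head in zip(labels, row_heads):
        cells = [
            str(counts[(actual, predicted)]).rjust(len(head))
            for predicted, head in zip(labels, col_heads)
        ]
        lines.append(row_head.ljust(row_width) + " | " + " | ".join(cells))
    return "\n".join(lines)


# Toy data for illustration only; these are not the project's predictions.
y_true = ["<=50K", "<=50K", ">50K", ">50K", "<=50K", ">50K"]
y_pred = ["<=50K", ">50K", ">50K", "<=50K", "<=50K", ">50K"]

table = format_confusion_matrix(y_true, y_pred)
print(table)
```

If the plot is produced with scikit-learn, passing the same names through the `display_labels` argument of `ConfusionMatrixDisplay` should have a similar effect on the figure.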

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

Data analysis review checklist

Reviewer: @max780228

Conflict of interest

[x] As the reviewer I confirm that I have no conflicts of interest for me to review this work.

Code of Conduct

[x] I confirm that I read and will adhere to the MDS code of conduct.

General checks

[x] Repository: Is the source code for this data analysis available? Is the repository well organized and easy to navigate?
[x] License: Does the repository contain a plain-text LICENSE file with the contents of an OSI approved software license?

Documentation

[x] Installation instructions: Is there a clearly stated list of dependencies?
[x] Example usage: Do the authors include examples of how to use the software to reproduce the data analysis?
[x] Functionality documentation: Is the core functionality of the data analysis software documented to a satisfactory level?
[x] Community guidelines: Are there clear guidelines for third parties wishing to 1) Contribute to the software 2) Report issues or problems with the software 3) Seek support

Code quality

[x] Readability: Are scripts, functions, objects, etc., well named? Is it relatively easy to understand the code?
[x] Style guidelines: Does the code adhere to well-known language style guides?
[x] Modularity: Is the code suitably abstracted into scripts and functions?
[ ] Tests: Are there automated tests or manual steps described so that the function of the software can be verified? Are they of sufficient quality to ensure software robustness?

Reproducibility

[x] Data: Is the raw data archived somewhere? Is it accessible?
[x] Computational methods: Is all the source code required for the data analysis available?
[x] Conditions: Is there a record of the necessary conditions (software dependencies) needed to reproduce the analysis? Does there exist an easy way to obtain the computational environment needed to reproduce the analysis?
[ ] Automation: Can someone other than the authors easily reproduce the entire data analysis?

Analysis report

[x] Authors: Does the report include a list of authors with their affiliations?
[x] What is the question: Do the authors clearly state the research question being asked?
[x] Importance: Do the authors clearly state the importance for this research question?
[x] Background: Do the authors provide sufficient background information so that readers can understand the report?
[x] Methods: Do the authors clearly describe and justify the methodology used in the data analysis? Do the authors communicate any assumptions or limitations of their methodologies?
[x] Results: Do the authors clearly communicate their findings through writing, tables and figures?
[x] Conclusions: Are the conclusions presented by the authors correct?
[x] References: Do all archival references that should have a DOI list one (e.g., papers, datasets, software)?
[x] Writing quality: Is the writing of good quality, concise, engaging?

Estimated hours spent reviewing: 1.5 hours

Review Comments:

I really enjoyed reviewing your repo. Just a few comments.

It would be better if you could put more background information in README.md (you have a good introduction and background in the final report; maybe you can reuse some of it). I had the feeling that README.md jumped to the conclusion right away without sufficient introduction or background.
In the README and the report, you use raw sample counts in your explanations. For example, you mentioned "Our classifier was able to correctly predict 13524 examples out of 16281 test examples" or "The training dataset consists of 32561 examples, while the testing set has 16281 rows". It would be more helpful for readers if you also presented these figures as percentages.
The Report link (https://ubc-mds.github.io/census-income-prediction/doc/report.html) above is not inside your group repo (https://github.com/UBC-MDS/census-income-prediction). What about changing the Report link to this one (https://github.com/UBC-MDS/census-income-prediction/blob/main/doc/report.md)?
Since we learned SHAP this week, how about applying SHAP to your analysis?
I forked your repo and checked whether the Makefile worked. However, make all didn't work for me (even after creating the virtual environment from your YAML file). I also tried running the series of scripts in the README; that didn't work either. The problem might be on my side, but I recommend checking it on your side as well.
It would be helpful for readers like me if you put the command for creating the virtual environment in the README: conda env create -f census-income.yaml
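
To make the percentage suggestion concrete, here is a minimal sketch that converts the counts quoted above into percentages. The constant and function names are chosen for this example only.

```python
# Counts quoted in the README and report.
TRAIN_ROWS = 32561
TEST_ROWS = 16281
CORRECT_TEST_PREDICTIONS = 13524


def as_percent(part, whole):
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)


total_rows = TRAIN_ROWS + TEST_ROWS
train_share = as_percent(TRAIN_ROWS, total_rows)
test_share = as_percent(TEST_ROWS, total_rows)
accuracy = as_percent(CORRECT_TEST_PREDICTIONS, TEST_ROWS)

print(f"Training set: {TRAIN_ROWS} rows ({train_share}% of all rows)")
print(f"Test set: {TEST_ROWS} rows ({test_share}% of all rows)")
print(f"Correct test predictions: {CORRECT_TEST_PREDICTIONS} ({accuracy}%)")
```

Reporting the count and the percentage side by side keeps the exact figures while making the train/test split and the accuracy easier to judge at a glance.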

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

Data analysis review checklist

Reviewer: @y248guo

Conflict of interest

[x] As the reviewer I confirm that I have no conflicts of interest for me to review this work.

Code of Conduct

[x] I confirm that I read and will adhere to the MDS code of conduct.

General checks

[x] Repository: Is the source code for this data analysis available? Is the repository well organized and easy to navigate?
[x] License: Does the repository contain a plain-text LICENSE file with the contents of an OSI approved software license?

Documentation

[x] Installation instructions: Is there a clearly stated list of dependencies?
[x] Example usage: Do the authors include examples of how to use the software to reproduce the data analysis?
[x] Functionality documentation: Is the core functionality of the data analysis software documented to a satisfactory level?
[x] Community guidelines: Are there clear guidelines for third parties wishing to 1) Contribute to the software 2) Report issues or problems with the software 3) Seek support

Code quality

[x] Readability: Are scripts, functions, objects, etc., well named? Is it relatively easy to understand the code?
[x] Style guidelines: Does the code adhere to well-known language style guides?
[x] Modularity: Is the code suitably abstracted into scripts and functions?
[ ] Tests: Are there automated tests or manual steps described so that the function of the software can be verified? Are they of sufficient quality to ensure software robustness?

Reproducibility

[x] Data: Is the raw data archived somewhere? Is it accessible?
[x] Computational methods: Is all the source code required for the data analysis available?
[x] Conditions: Is there a record of the necessary conditions (software dependencies) needed to reproduce the analysis? Does there exist an easy way to obtain the computational environment needed to reproduce the analysis?
[x] Automation: Can someone other than the authors easily reproduce the entire data analysis?

Analysis report

[x] Authors: Does the report include a list of authors with their affiliations?
[x] What is the question: Do the authors clearly state the research question being asked?
[x] Importance: Do the authors clearly state the importance of this research question?
[x] Background: Do the authors provide sufficient background information so that readers can understand the report?
[x] Methods: Do the authors clearly describe and justify the methodology used in the data analysis? Do the authors communicate any assumptions or limitations of their methodologies?
[x] Results: Do the authors clearly communicate their findings through writing, tables and figures?
[x] Conclusions: Are the conclusions presented by the authors correct?
[x] References: Do all archival references that should have a DOI list one (e.g., papers, datasets, software)?
[x] Writing quality: Is the writing of good quality, concise, engaging?

Estimated hours spent reviewing: 1.5 hours

Review Comments:

This is overall a great project and I love the topic that you have chosen! Some highlights of the project that I really like:

The EDA was in-depth and looked at the data from several different perspectives.
The findings and results were reported very clearly, were easy to compare between models, and back up your conclusions nicely.
Repositories are well organized and all the resources are easy to access.

Some comments that I feel like would make this project even better:

For the GitHub repo:

I love the comments you added in the scripts, but one suggestion I have is to separate the long main function into smaller functions. It would help with the debugging process and also increase the readability of the code by a lot. This applies especially to src/eda_script.py, src/model_building.py and src/model_evaluation.py, which would be easier to read if split into separate functions based on the comments already there.
I am not sure it is a good idea to mix .ipynb and .py files in the same src folder, but since the notebooks serve for clarification, it seems reasonable to keep them as long as they do not duplicate the function or the code of the scripts.
Not very important, but try to avoid including irrelevant files in the GitHub repo, such as doc/.DS_Store and .gitignore, to reduce any possible confusion for readers.
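
The suggestion to split a long main function could look roughly like the sketch below. The function names, the sample columns, and the use of '?' as a missing-value marker are assumptions for illustration and are not taken from the project's scripts.

```python
import csv
import io


def load_rows(handle):
    """Read CSV rows from a file-like object into a list of dicts."""
    return list(csv.DictReader(handle))


def clean_rows(rows):
    """Strip whitespace and drop rows that contain a '?' placeholder."""
    cleaned = []
    for row in rows:
        stripped = {key: value.strip() for key, value in row.items()}
        if "?" not in stripped.values():
            cleaned.append(stripped)
    return cleaned


def summarize(rows, column):
    """Count how often each value of `column` occurs."""
    counts = {}
    for row in rows:
        counts[row[column]] = counts.get(row[column], 0) + 1
    return counts


def main(handle):
    """Orchestrate the steps; each one can be tested on its own."""
    rows = load_rows(handle)
    rows = clean_rows(rows)
    return summarize(rows, "income")


# Small in-memory sample standing in for a data file (hypothetical rows).
sample = io.StringIO(
    "age,workclass,income\n"
    "39, State-gov, <=50K\n"
    "50, ?, <=50K\n"
    "38, Private, >50K\n"
)
result = main(sample)
print(result)
```

With this layout, main only describes the order of the steps, and a failure can be traced to a single small function instead of one long block.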

For the report:

In the "Data" section, the reference was not very clear, as I was not sure which part of your content was referenced from the UCI ML repo.
For the EDA part, it is good to have a correlation plot included, but it might be better to also indicate which correlation metric was used (e.g. Pearson correlation is commonly used, but other metrics such as Spearman’s rank correlation, Kendall rank correlation and point-biserial correlation might be used as well).
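
To show why naming the correlation metric matters, the following sketch compares Pearson and Spearman correlation on toy data using only the standard library. The data and helper names are made up for this example; the report's plot may use a different implementation.

```python
import math


def pearson(x, y):
    """Pearson correlation: strength of the linear relationship."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def ranks(values):
    """1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Toy data: y increases monotonically with x, but not linearly.
x = [1, 2, 3, 4, 5]
y = [1, 8, 27, 64, 125]

print(f"Pearson:  {pearson(x, y):.3f}")
print(f"Spearman: {spearman(x, y):.3f}")
```

On monotonic but non-linear data like this, the two metrics give different values, so a heatmap is easier to interpret when its caption states which one was computed. In pandas, the `method` argument of `DataFrame.corr` selects between 'pearson', 'kendall' and 'spearman'.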

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

Data analysis review checklist

Reviewer: @florawendy19

Conflict of interest

[x] As the reviewer I confirm that I have no conflicts of interest for me to review this work.

Code of Conduct

[x] I confirm that I read and will adhere to the MDS code of conduct.

General checks

[x] Repository: Is the source code for this data analysis available? Is the repository well organized and easy to navigate?
[x] License: Does the repository contain a plain-text LICENSE file with the contents of an OSI approved software license?

Documentation

[x] Installation instructions: Is there a clearly stated list of dependencies?
[x] Example usage: Do the authors include examples of how to use the software to reproduce the data analysis?
[x] Functionality documentation: Is the core functionality of the data analysis software documented to a satisfactory level?
[x] Community guidelines: Are there clear guidelines for third parties wishing to 1) Contribute to the software 2) Report issues or problems with the software 3) Seek support

Code quality

[x] Readability: Are scripts, functions, objects, etc., well named? Is it relatively easy to understand the code?
[x] Style guidelines: Does the code adhere to well-known language style guides?
[x] Modularity: Is the code suitably abstracted into scripts and functions?
[ ] Tests: Are there automated tests or manual steps described so that the function of the software can be verified? Are they of sufficient quality to ensure software robustness?

Reproducibility

[x] Data: Is the raw data archived somewhere? Is it accessible?
[x] Computational methods: Is all the source code required for the data analysis available?
[x] Conditions: Is there a record of the necessary conditions (software dependencies) needed to reproduce the analysis? Does there exist an easy way to obtain the computational environment needed to reproduce the analysis?
[x] Automation: Can someone other than the authors easily reproduce the entire data analysis?

Analysis report

[x] Authors: Does the report include a list of authors with their affiliations?
[x] What is the question: Do the authors clearly state the research question being asked?
[x] Importance: Do the authors clearly state the importance for this research question?
[x] Background: Do the authors provide sufficient background information so that readers can understand the report?
[x] Methods: Do the authors clearly describe and justify the methodology used in the data analysis? Do the authors communicate any assumptions or limitations of their methodologies?
[x] Results: Do the authors clearly communicate their findings through writing, tables and figures?
[x] Conclusions: Are the conclusions presented by the authors correct?
[x] References: Do all archival references that should have a DOI list one (e.g., papers, datasets, software)?
[x] Writing quality: Is the writing of good quality, concise, engaging?

Estimated hours spent reviewing: 1.5 hours

Review Comments:

Your project is very interesting and well organized. I enjoyed going through it. I do not have a lot to say about it, but I believe there is always room for improvement. You can find below some detailed comments about your project.

I think the background needs more detail, especially in the README file, so that even non-technical people can understand in simple terms what the project is trying to achieve.
Also, I like the fact that you have a flowchart; it helps to know the flow of the project, and I was not lost because I could refer to it.
Also, I think using the ensembles that we learned in week 3 of 573 will help make the model more interpretable and give you more results.
The Python files in the src folder contain code in bulk, and I think breaking it down will help readers understand which part of the code does what. The scripts in the src folder achieve the intended objective; however, it is also important to make sure that the code is clearly understandable by any reader.
The EDA part of your project is well done and clear, and the charts are very informative and relevant to the question you are trying to answer in this project.

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

Thank you all for your constructive feedback. We really appreciate your valuable comments on our project. As suggested, we have incorporated the following changes:

Added SHAP to the analysis (commit link: https://github.com/UBC-MDS/census-income-prediction/pull/68/commits/8fffe89e5ff565c9be083914eabe48af835003fe)
Changed metric of choice to ROC_AUC score as suggested by the TA (commit link: https://github.com/UBC-MDS/census-income-prediction/commit/0197758911d8f2bdd63282c7e083f65a80726c7f)
Changed class names in confusion matrix to '<=50K' and '>50K' (commit link: https://github.com/UBC-MDS/census-income-prediction/commit/d03b6334d7334d6762efd1a2c4f5ba598d660734)
Added correlation heatmap as per the feedback from TA (commit link: https://github.com/UBC-MDS/census-income-prediction/commit/5798884a393cdf41bd1cc9df00fbae157508f115)
Fixed issue with makefile on Windows OS (commit link: https://github.com/UBC-MDS/census-income-prediction/commit/5088b2a677b38981f3a059db131934dfd9d92df7)
Added command to create virtual environment in README.md (commit link: https://github.com/UBC-MDS/census-income-prediction/commit/1bb17cbbb65fb9ed77837e6c9745f036fbac8957)
Updated background on Census dataset (commit link: https://github.com/UBC-MDS/census-income-prediction/commit/b19cf220616b93f234faf1b81972dbed2d7d5916)
Added feature transformation table (commit link: https://github.com/UBC-MDS/census-income-prediction/commit/4f83151429fe167a480522543c4e4e36c0850ab9)

We hope the above changes address your concerns. Again, we are grateful for your feedback, which has helped us improve the quality of the project.

UBC-MDS / data-analysis-review-2021