Submission: GROUP_29: US-Salary-Prediction

Submitting authors: @AndyYang80 @cuthchow @lirnish

Repository: https://github.com/UBC-MDS/US-Salary-Prediction

Report link: https://github.com/UBC-MDS/US-Salary-Prediction/blob/main/doc/final_report.md

Abstract/executive summary: One of the most important considerations in a job search is salary: specifically, does this job's salary meet our expectations? However, it is not easy to set proper expectations, and setting them too high or too low can both harm a job search. This project aims to help answer the question: what can we expect a person's salary to be in the US? According to Martín et al. (2018), a linear regression model evaluated with an R² score is a good combination for predicting salaries, so we will use that approach for the prediction. In the process, we wish to understand which factors provide the most predictive power when predicting a person's salary. The dataset we are analyzing comes from a salary survey on the "Ask a Manager" blog by Alison Green. It contains survey data gathered from "Ask a Manager" readers working in a variety of industries, and can be found here.

Editor: @flor14

Reviewer: Chen_Xiaohan, Hovhannisyan Lianna, Akella Lakshmi Santosha Valli, Sia_Joshua

[x] I agree to abide by MDS's Code of Conduct during the review process and in maintaining my package should it be accepted.

Data analysis review checklist

Reviewer: @joshsia

Conflict of interest

[x] As the reviewer I confirm that I have no conflicts of interest for me to review this work.

Code of Conduct

[x] I confirm that I read and will adhere to the MDS code of conduct.

General checks

[x] Repository: Is the source code for this data analysis available? Is the repository well organized and easy to navigate?
[x] License: Does the repository contain a plain-text LICENSE file with the contents of an OSI approved software license?

Documentation

[x] Installation instructions: Is there a clearly stated list of dependencies?
[x] Example usage: Do the authors include examples of how to use the software to reproduce the data analysis?
[x] Functionality documentation: Is the core functionality of the data analysis software documented to a satisfactory level?
[x] Community guidelines: Are there clear guidelines for third parties wishing to 1) Contribute to the software 2) Report issues or problems with the software 3) Seek support

Code quality

[x] Readability: Are scripts, functions, objects, etc., well named? Is it relatively easy to understand the code?
[x] Style guidelines: Does the code adhere to well known language style guides?
[x] Modularity: Is the code suitably abstracted into scripts and functions?
[ ] Tests: Are there automated tests or manual steps described so that the function of the software can be verified? Are they of sufficient quality to ensure software robustness?

Reproducibility

[x] Data: Is the raw data archived somewhere? Is it accessible?
[x] Computational methods: Is all the source code required for the data analysis available?
[x] Conditions: Is there a record of the necessary conditions (software dependencies) needed to reproduce the analysis? Does there exist an easy way to obtain the computational environment needed to reproduce the analysis?
[x] Automation: Can someone other than the authors easily reproduce the entire data analysis?

Analysis report

[ ] Authors: Does the report include a list of authors with their affiliations?
[x] What is the question: Do the authors clearly state the research question being asked?
[x] Importance: Do the authors clearly state the importance for this research question?
[x] Background: Do the authors provide sufficient background information so that readers can understand the report?
[ ] Methods: Do the authors clearly describe and justify the methodology used in the data analysis? Do the authors communicate any assumptions or limitations of their methodologies?
[x] Results: Do the authors clearly communicate their findings through writing, tables and figures?
[x] Conclusions: Are the conclusions presented by the authors correct?
[x] References: Do all archival references that should have a DOI list one (e.g., papers, datasets, software)?
[x] Writing quality: Is the writing of good quality, concise, engaging?

Estimated hours spent reviewing: 1.5 hours

Review Comments:

You have a "Results and Discussion" subsection inside your "Results & Discussion" section in the final report. This might be a bit confusing for readers.
The findings are interesting and well presented, but it might be nice to have a separate section to conclude your work so that the report structure is clearer.
The scripts in the src directory are not named consistently (some scripts use camel case while others use underscores).
I'm not sure about this but it seems a bit strange to me that you filter extreme ends of annual_salary values in the train set, but leave them in the test set and then justify not using MAPE because there might be zero annual_salary values in the test set. What do you think about filtering extreme ends of annual_salary in the entire data set as part of data wrangling before the analysis?
I'm also not sure about this but I don't understand why you define the ordering for ordinal variables with missing_value at the bottom. Maybe imputation would be useful in this scenario if there are not a lot of missing values?
Regarding the bar plots for how_old_are_you, years_of_experience_in_field etc., it might be nice to order the y-axis by the same ordinal encoding used in the fit_transform_evaluate_model.py script. Currently, the plot for how_old_are_you might be a bit misleading because it shows a gradual increase in median annual salary but the age groups are not in increasing order.
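
One way to get the age groups in increasing order on the axis is to sort the plot categories with the same explicit ordering list that is passed to the ordinal encoder. The following is only a minimal sketch in plain Python: the category labels, the helper name `order_categories`, and the median values are made-up placeholders, not values taken from the survey or the repository.

```python
# Hypothetical ordering list; the real labels live in the project's
# fit_transform_evaluate_model.py script and may differ.
AGE_ORDER = ["under 18", "18-24", "25-34", "35-44", "45-54", "55-64", "65 or over"]


def order_categories(medians, order):
    """Return (category, value) pairs sorted by position in `order`.

    Categories missing from `order` are placed at the end, alphabetically.
    """
    rank = {label: i for i, label in enumerate(order)}
    return sorted(medians.items(), key=lambda kv: (rank.get(kv[0], len(order)), kv[0]))


# Placeholder medians keyed by age group, in an arbitrary (unsorted) order.
medians = {"45-54": 3.0, "18-24": 1.0, "65 or over": 2.5, "25-34": 2.0}
ordered = order_categories(medians, AGE_ORDER)
print([label for label, _ in ordered])
```

The sorted label list could then be handed to the plotting code as the explicit axis order, so that the plot and the encoder share a single source of truth for the ordering.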

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

Data analysis review checklist

Reviewer: @liannah

Conflict of interest

[x] As the reviewer I confirm that I have no conflicts of interest for me to review this work.

Code of Conduct

[x] I confirm that I read and will adhere to the MDS code of conduct.

General checks

[x] Repository: Is the source code for this data analysis available? Is the repository well organized and easy to navigate?
[x] License: Does the repository contain a plain-text LICENSE file with the contents of an OSI approved software license?

Documentation

[x] Installation instructions: Is there a clearly stated list of dependencies?
[x] Example usage: Do the authors include examples of how to use the software to reproduce the data analysis?
[x] Functionality documentation: Is the core functionality of the data analysis software documented to a satisfactory level?
[x] Community guidelines: Are there clear guidelines for third parties wishing to 1) Contribute to the software 2) Report issues or problems with the software 3) Seek support

Code quality

[x] Readability: Are scripts, functions, objects, etc., well named? Is it relatively easy to understand the code?
[x] Style guidelines: Does the code adhere to well known language style guides?
[x] Modularity: Is the code suitably abstracted into scripts and functions?
[ ] Tests: Are there automated tests or manual steps described so that the function of the software can be verified? Are they of sufficient quality to ensure software robustness?

Reproducibility

[x] Data: Is the raw data archived somewhere? Is it accessible?
[x] Computational methods: Is all the source code required for the data analysis available?
[x] Conditions: Is there a record of the necessary conditions (software dependencies) needed to reproduce the analysis? Does there exist an easy way to obtain the computational environment needed to reproduce the analysis?
[x] Automation: Can someone other than the authors easily reproduce the entire data analysis?

Analysis report

[x] Authors: Does the report include a list of authors with their affiliations?
[x] What is the question: Do the authors clearly state the research question being asked?
[x] Importance: Do the authors clearly state the importance for this research question?
[x] Background: Do the authors provide sufficient background information so that readers can understand the report?
[ ] Methods: Do the authors clearly describe and justify the methodology used in the data analysis? Do the authors communicate any assumptions or limitations of their methodologies?
[x] Results: Do the authors clearly communicate their findings through writing, tables and figures?
[ ] Conclusions: Are the conclusions presented by the authors correct?
[x] References: Do all archival references that should have a DOI list one (e.g., papers, datasets, software)?
[ ] Writing quality: Is the writing of good quality, concise, engaging?

Estimated hours spent reviewing: 1 hour

Review Comments:

The question is well presented; however, the report gives no introduction or abstract covering the background of the problem. It would be nice to have a few sentences before the Methods section describing the problem in more detail: why it is important to set salary expectations in the right range when searching for a job.
The figures do not have captions or figure names, which makes the visualizations harder to understand. Even though the text before or after each figure discusses its purpose and findings quite well, a caption would make it easier to grasp at first glance. In addition, the tables have titles, so including captions for the figures would keep the style of the report consistent.
I am no expert here, but I think the flow of the report was a little confusing. Some statements, such as "We noticed that there are lots of null values in the additional information features (additional_context_on_job_title, additional_context_on_income, etc), and some of the variables have a lot of unique values.", are made in the report, but no solution is clearly provided. As a reader, it made me wonder what you decided to do with such values. Maybe I am missing something, but I feel it would be nicer to have the problem and its solution in the same paragraph.
Overall, the model tuning and hyperparameter optimization are very well explained. Only the part where you mention that MAPE would not be a good metric seems redundant: while reading it, I started to wonder why MAPE specifically is discussed, and not, say, RMSE.
The project is well organized; however, in the doc/ folder I did not really understand what the files data.Rmd and model_results.Rmd represent. They seem to be draft chunks of the final report, but I might be wrong. If they have no purpose, I think it would be better to remove them from the directory.
I also noticed that you do not have a license section in the README file. I am not sure what the protocol is for that, but I think it would be good to include one, as it would make it more obvious which license you are using.
The scripts are well written and commented; however, no tests are present in any of them.
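
As an illustration of what a lightweight automated test could look like, here is a sketch in the style of pytest. The helper `filter_salary_range` is hypothetical (it is not a function from the repository), and the default bounds are the 10,000 and 1,000,000 USD cut-offs that the report is described elsewhere in this thread as applying to the training data.

```python
def filter_salary_range(rows, low=10_000, high=1_000_000):
    """Keep rows whose 'annual_salary' lies within [low, high] (hypothetical helper)."""
    return [row for row in rows if low <= row["annual_salary"] <= high]


def test_filter_salary_range_drops_extremes():
    # Invented rows covering both boundaries and both extremes.
    rows = [
        {"annual_salary": 0},
        {"annual_salary": 10_000},
        {"annual_salary": 85_000},
        {"annual_salary": 1_000_000},
        {"annual_salary": 5_000_000},
    ]
    kept = [row["annual_salary"] for row in filter_salary_range(rows)]
    assert kept == [10_000, 85_000, 1_000_000]


def test_filter_salary_range_handles_empty_input():
    assert filter_salary_range([]) == []


if __name__ == "__main__":
    # Allows a quick manual run without pytest installed.
    test_filter_salary_range_drops_extremes()
    test_filter_salary_range_handles_empty_input()
    print("all tests passed")
```

Tests of this kind could sit next to each script in src, or in a separate tests directory, and be run as one step of the analysis pipeline.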

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

Data analysis review checklist

Reviewer: CHEN_Xiaohan (@anthea98)

Conflict of interest

[x] As the reviewer I confirm that I have no conflicts of interest for me to review this work.

Code of Conduct

[x] I confirm that I read and will adhere to the MDS code of conduct.

General checks

[x] Repository: Is the source code for this data analysis available? Is the repository well organized and easy to navigate?
[x] License: Does the repository contain a plain-text LICENSE file with the contents of an OSI approved software license?

Documentation

[x] Installation instructions: Is there a clearly stated list of dependencies?
[x] Example usage: Do the authors include examples of how to use the software to reproduce the data analysis?
[x] Functionality documentation: Is the core functionality of the data analysis software documented to a satisfactory level?
[x] Community guidelines: Are there clear guidelines for third parties wishing to 1) Contribute to the software 2) Report issues or problems with the software 3) Seek support

Code quality

[x] Readability: Are scripts, functions, objects, etc., well named? Is it relatively easy to understand the code?
[x] Style guidelines: Does the code adhere to well known language style guides?
[x] Modularity: Is the code suitably abstracted into scripts and functions?
[ ] Tests: Are there automated tests or manual steps described so that the function of the software can be verified? Are they of sufficient quality to ensure software robustness?

Reproducibility

[x] Data: Is the raw data archived somewhere? Is it accessible?
[x] Computational methods: Is all the source code required for the data analysis available?
[x] Conditions: Is there a record of the necessary conditions (software dependencies) needed to reproduce the analysis? Does there exist an easy way to obtain the computational environment needed to reproduce the analysis?
[x] Automation: Can someone other than the authors easily reproduce the entire data analysis?

Analysis report

[ ] Authors: Does the report include a list of authors with their affiliations?
[x] What is the question: Do the authors clearly state the research question being asked?
[x] Importance: Do the authors clearly state the importance for this research question?
[x] Background: Do the authors provide sufficient background information so that readers can understand the report?
[ ] Methods: Do the authors clearly describe and justify the methodology used in the data analysis? Do the authors communicate any assumptions or limitations of their methodologies?
[x] Results: Do the authors clearly communicate their findings through writing, tables and figures?
[ ] Conclusions: Are the conclusions presented by the authors correct?
[x] References: Do all archival references that should have a DOI list one (e.g., papers, datasets, software)?
[x] Writing quality: Is the writing of good quality, concise, engaging?

Estimated hours spent reviewing: 1 hour

Review Comments:

About the report structure and writing quality

The report is well done, but I think it may be better to put the introduction to the dataset under a separate subheading under aim and salary.
Also, since this is a prediction problem, it may be better to add a separate conclusion subheading under results and conclusion to clarify your prediction score.
The plots are easy to read and include titles, but they would be better with captions.
It is not that important, but it would be better to show a unit for annual salary in the plots ($ or something else).

About the data process and model chosen

I noticed that annual salaries of less than 10,000 USD or over 1,000,000 USD were removed from the training dataset. Where does this threshold come from? Maybe include a value count to show how many examples fall outside your range, and whether salaries of less than 10,000 USD or over 1,000,000 USD are real outliers.
I am a little confused about the choice of score: why would there be 0 values in the test data, and why can MAPE not work in that case?
An explanation of this model's limitations may be required.
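
To make the points about the salary range and the scoring metric concrete, here is a small sketch in plain Python. The salary values are invented for illustration, and `mape` and `rmse` are hand-written helpers rather than the functions used in the project. The sketch counts how many observations fall outside the 10,000 to 1,000,000 USD range, and shows that MAPE divides by the true value and so cannot be computed once a true salary of 0 appears, whereas RMSE still can.

```python
import math

LOW, HIGH = 10_000, 1_000_000  # bounds quoted from the report's training filter


def count_out_of_range(salaries, low=LOW, high=HIGH):
    """Return (n_below, n_above) for values outside [low, high]."""
    below = sum(1 for s in salaries if s < low)
    above = sum(1 for s in salaries if s > high)
    return below, above


def mape(y_true, y_pred):
    """Mean absolute percentage error; undefined when any true value is 0."""
    if any(t == 0 for t in y_true):
        raise ZeroDivisionError("MAPE divides by the true value, which is 0 here")
    return sum(abs((t - p) / t) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error; defined even when true values are 0."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


# Invented example values, not survey data.
y_true = [0, 50_000, 120_000, 2_000_000]
y_pred = [40_000, 55_000, 110_000, 300_000]

print(count_out_of_range(y_true))  # (values below LOW, values above HIGH)
print(rmse(y_true, y_pred))
try:
    mape(y_true, y_pred)
except ZeroDivisionError as err:
    print("MAPE failed:", err)
```

Reporting the two counts next to the chosen thresholds would let readers judge how many observations the filter removes.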

About the whole repo

Very well structured! The project structure is quite easy to understand, in my opinion.

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

Data analysis review checklist

Reviewer: Valli Akella (@valli180)

Conflict of interest

[x] As the reviewer I confirm that I have no conflicts of interest for me to review this work.

Code of Conduct

[x] I confirm that I read and will adhere to the MDS code of conduct.

General checks

[x] Repository: Is the source code for this data analysis available? Is the repository well organized and easy to navigate?
[x] License: Does the repository contain a plain-text LICENSE file with the contents of an OSI approved software license?

Documentation

[x] Installation instructions: Is there a clearly stated list of dependencies?
[x] Example usage: Do the authors include examples of how to use the software to reproduce the data analysis?
[x] Functionality documentation: Is the core functionality of the data analysis software documented to a satisfactory level?
[x] Community guidelines: Are there clear guidelines for third parties wishing to 1) Contribute to the software 2) Report issues or problems with the software 3) Seek support

Code quality

[x] Readability: Are scripts, functions, objects, etc., well named? Is it relatively easy to understand the code?
[x] Style guidelines: Does the code adhere to well known language style guides?
[x] Modularity: Is the code suitably abstracted into scripts and functions?
[ ] Tests: Are there automated tests or manual steps described so that the function of the software can be verified? Are they of sufficient quality to ensure software robustness?

Reproducibility

[x] Data: Is the raw data archived somewhere? Is it accessible?
[x] Computational methods: Is all the source code required for the data analysis available?
[x] Conditions: Is there a record of the necessary conditions (software dependencies) needed to reproduce the analysis? Does there exist an easy way to obtain the computational environment needed to reproduce the analysis?
[x] Automation: Can someone other than the authors easily reproduce the entire data analysis?

Analysis report

[x] Authors: Does the report include a list of authors with their affiliations?
[x] What is the question: Do the authors clearly state the research question being asked?
[ ] Importance: Do the authors clearly state the importance for this research question?
[x] Background: Do the authors provide sufficient background information so that readers can understand the report?
[ ] Methods: Do the authors clearly describe and justify the methodology used in the data analysis? Do the authors communicate any assumptions or limitations of their methodologies?
[x] Results: Do the authors clearly communicate their findings through writing, tables and figures?
[ ] Conclusions: Are the conclusions presented by the authors correct?
[x] References: Do all archival references that should have a DOI list one (e.g., papers, datasets, software)?
[ ] Writing quality: Is the writing of good quality, concise, engaging?

Estimated hours spent reviewing: 1 hour

Review Comments:

The question, the purpose and the procedure followed were quite impressive and well structured, but a little more background information would have added more meaningful insight to the question posed.
More detailed information about the data, such as a table of the columns and their descriptions somewhere in the repo, would have made the reader's life easier.
Understanding the graphs would have been easier if they had titles.
Statements on how the insights drawn from the graphs informed the modelling approach would have given the EDA more purpose.
Though the reasoning for selecting the scoring metrics was given, the rationale for choosing those particular models was not clear and was not substantiated by evidence from the code.
Extensive work on hyperparameter optimization is appreciated.
I suggest exploring the outcomes of other models, such as averaging and stacking regressors, or more in-depth feature engineering, which might help improve the predictions.
The final conclusion could have been more extensive and elaborate.

Overall: Better flow and more comprehensible transitions from one part to another could have been achieved by adding extra sections for the introduction to the data, model selection and final predictions alongside the existing sections.

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

Project Improvements

Feedback 1:

Feedback: Comment 8 in this issue https://github.com/UBC-MDS/US-Salary-Prediction/issues/26#issue-1066684949

Based on the reasoning, the original question could be narrowed in scale: instead of anyone's salary, maybe the salary of people in a certain city. It could easily be guessed that age comes in as a factor, but the actual variable of interest is experience, which is not available here as a quantitative feature (years of work).

Next steps should be more clearly stated to address some concerns about your project question and further analysis of the data

Action: Expanded the scope of the project to include a random forest model and investigation of parameter importance: https://github.com/UBC-MDS/US-Salary-Prediction/commit/809035d49a1551e07ee42c2968543e6e1db76502

Feedback 2:

Feedback: Comment 1C in this issue https://github.com/UBC-MDS/US-Salary-Prediction/issues/36#issue-1073001929

I think there needs to be some summary about how your data isn't giving you the right result or if your model isn't working properly. The advancement from milestone 1 to 2 isn't obvious. It's okay to have a negative result, but the presentation is a bit lacking. Maybe have your results and discussion separated and try to evaluate your data and then your model results.

Action: Added summary section and stated potential improvements of the model: https://github.com/UBC-MDS/US-Salary-Prediction/commit/ce71fd6b339377f3cb74180eb5c3b3d41c52048e

Feedback 3:

Feedback: Comment 3C in this issue https://github.com/UBC-MDS/US-Salary-Prediction/issues/36#issue-1073001929

Narrative of analysis and visualization was not present. Lack of in depth analysis of EDA in the proposal readme. Should interpret the plot or give captions. Did not see the evolution of proposal with short excerpts of major findings using your statistical method.

Action: Improved EDA and plot interpretation in the final report: https://github.com/UBC-MDS/US-Salary-Prediction/commit/182fc40ef5847deac48b38574377d11996e783b4

Improved narrative/storytelling aspect of final report: https://github.com/UBC-MDS/US-Salary-Prediction/commit/ce71fd6b339377f3cb74180eb5c3b3d41c52048e

Feedback 4:

Feedback: Comments 3-4 in this issue https://github.com/UBC-MDS/data-analysis-review-2021/issues/14#issuecomment-982355474

The scripts in the src directory are not named consistently (some scripts use camel case while others use underscores).

I'm also not sure about this but I don't understand why you define the ordering for ordinal variables with missing_value at the bottom. Maybe imputation would be useful in this scenario if there are not a lot of missing values?

Regarding the bar plots for how_old_are_you, years_of_experience_in_field etc., it might be nice to order the y-axis by the same ordinal encoding used in the fit_transform_evaluate_model.py script. Currently, the plot for how_old_are_you might be a bit misleading because it shows a gradual increase in median annual salary but the age groups are not in increasing order.

Action: Renamed scripts to have consistent naming and changed the ordinal encoding method to replace missing values with the mode: https://github.com/UBC-MDS/US-Salary-Prediction/commit/809035d49a1551e07ee42c2968543e6e1db76502

Changed EDA to reorder the graph axes: https://github.com/UBC-MDS/US-Salary-Prediction/commit/9a51f45960e1bb4d3bfa860ce39a4c4f7d37ac0a
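
For readers following along, the change to the ordinal encoding described in this action could look roughly like the sketch below. It is written in plain Python rather than with the project's actual preprocessing code, and the category labels, the use of `None` to mark missing answers, and the function names are all assumptions made for illustration.

```python
from collections import Counter

# Hypothetical ordering for years_of_experience_in_field; real labels may differ.
EXPERIENCE_ORDER = ["1 year or less", "2-4 years", "5-7 years", "8-10 years"]


def impute_with_mode(values):
    """Replace None entries with the most frequent non-missing value.

    Assumes at least one non-missing value is present.
    """
    observed = [v for v in values if v is not None]
    mode = Counter(observed).most_common(1)[0][0]
    return [mode if v is None else v for v in values]


def ordinal_encode(values, order):
    """Map each label to its integer position in `order`."""
    rank = {label: i for i, label in enumerate(order)}
    return [rank[v] for v in values]


# Invented answers with two missing entries.
raw = ["2-4 years", None, "5-7 years", "2-4 years", "8-10 years", None]
filled = impute_with_mode(raw)
encoded = ordinal_encode(filled, EXPERIENCE_ORDER)
print(encoded)
```

In a train/test setting the mode would normally be computed on the training split only and then reused for the test split, so that no information from the test data leaks into preprocessing.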
