UBC-MDS / data-analysis-review-2021

Submission: GROUP_29: US-Salary-Prediction #14

Open AndyYang80 opened 2 years ago

AndyYang80 commented 2 years ago

Submitting authors: @AndyYang80 @cuthchow @lirnish

Repository: https://github.com/UBC-MDS/US-Salary-Prediction

Report link: https://github.com/UBC-MDS/US-Salary-Prediction/blob/main/doc/final_report.md

Abstract/executive summary: One of the most important considerations in a job search is salary: specifically, does this job's salary meet our expectations? However, it is not easy to set proper expectations, and setting them too high or too low can both harm a job search. This project aims to help answer the question: what can we expect a person's salary to be in the US? According to Martín et al. (2018), linear regression paired with the R2 score is a good combination for predicting salaries, so we will use that for the prediction. In the process, we wish to understand which factors provide the most predictive power when predicting a person's salary. The dataset we are analyzing comes from a salary survey on the "Ask a Manager" blog by Alison Green. It contains survey data gathered from "Ask a Manager" readers working in a variety of industries, and can be found here.

Editor: @flor14 Reviewer: Chen_Xiaohan, Hovhannisyan Lianna, Akella Lakshmi Santosha Valli, Sia_Joshua

joshsia commented 2 years ago

Data analysis review checklist

Reviewer: @joshsia

Conflict of interest

Code of Conduct

General checks

Documentation

Code quality

Reproducibility

Analysis report

Estimated hours spent reviewing: 1.5 hours

Review Comments:

  1. You have a "Results and Discussion" subsection inside your "Results & Discussion" section in the final report. This might be a bit confusing for readers.
  2. The findings are interesting and well presented, but it might be nice to have a separate section to conclude your work so that the report structure is clearer.
  3. The scripts in the src directory are not named consistently (some scripts use camel case while others use underscores).
  4. I'm not sure about this but it seems a bit strange to me that you filter extreme ends of annual_salary values in the train set, but leave them in the test set and then justify not using MAPE because there might be zero annual_salary values in the test set. What do you think about filtering extreme ends of annual_salary in the entire data set as part of data wrangling before the analysis?
  5. I'm also not sure about this but I don't understand why you define the ordering for ordinal variables with missing_value at the bottom. Maybe imputation would be useful in this scenario if there are not a lot of missing values?
  6. Regarding the bar plots for how_old_are_you, years_of_experience_in_field etc., it might be nice to order the y-axis by the same ordinal encoding used in the fit_transform_evaluate_model.py script. Currently, the plot for how_old_are_you might be a bit misleading because it shows a gradual increase in median annual salary but the age groups are not in increasing order.
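As a rough illustration of comment 4, the sketch below filters extreme `annual_salary` values on the whole data set before splitting, so the train and test sets follow the same rule and MAPE is no longer ruled out by zero salaries. The salary bounds, the sample rows, and the helper names are assumptions made for this example, not the project's code.

```python
import random


def mape(y_true, y_pred):
    """Mean absolute percentage error; undefined if any true value is zero."""
    if any(t == 0 for t in y_true):
        raise ZeroDivisionError("MAPE is undefined when a true value is 0")
    return sum(abs(t - p) / abs(t) for t, p in zip(y_true, y_pred)) / len(y_true)


def filter_salaries(rows, low, high):
    """Keep rows whose annual_salary lies in [low, high] (hypothetical bounds)."""
    return [r for r in rows if low <= r["annual_salary"] <= high]


def train_test_split(rows, test_fraction=0.2, seed=123):
    """Shuffle a copy of rows and split it into train and test lists."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Made-up survey rows, including a zero and an extreme salary.
salaries = [0, 42_000, 55_000, 61_000, 78_000, 90_000,
            120_000, 5_000_000, 67_000, 73_000]
rows = [{"annual_salary": s} for s in salaries]

# Filter first, then split, so both sets follow the same rule.
clean = filter_salaries(rows, low=10_000, high=1_000_000)
train, test = train_test_split(clean)
```

With the zero and the extreme value removed up front, `mape` can be evaluated on the test set without a division by zero.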

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

liannah commented 2 years ago

Data analysis review checklist

Reviewer: @liannah

Conflict of interest

Code of Conduct

General checks

Documentation

Code quality

Reproducibility

Analysis report

Estimated hours spent reviewing: 1 hour

Review Comments:

Please provide more detailed feedback here on what was done particularly well, and what could be improved. It is especially important to elaborate on items that you were not able to check off in the list above.

Anthea98 commented 2 years ago

Data analysis review checklist

Reviewer: CHEN_Xiaohan (@anthea98)

Conflict of interest

Code of Conduct

General checks

Documentation

Code quality

Reproducibility

Analysis report

Estimated hours spent reviewing: 1 hour

Review Comments:

Please provide more detailed feedback here on what was done particularly well, and what could be improved. It is especially important to elaborate on items that you were not able to check off in the list above.

About the report structure and writing quality

valli180 commented 2 years ago

Data analysis review checklist

Reviewer: Valli Akella (@valli180)

Conflict of interest

Code of Conduct

General checks

Documentation

Code quality

Reproducibility

Analysis report

Estimated hours spent reviewing: 1 hour

Review Comments:

Overall: The report would flow better, with more comprehensible transitions from one part to the next, if it had separate sections for the data introduction, model selection, and final predictions in addition to the existing sections.

AndyYang80 commented 2 years ago

Project Improvements

Feedback 1:

Feedback: Comment 8 in this issue https://github.com/UBC-MDS/US-Salary-Prediction/issues/26#issue-1066684949

Based on the reasoning, the original question could be narrowed in scale: instead of anyone's salary, perhaps salaries in a certain city. It is easy to guess that age is a factor, but the actual variable of interest is experience, which is not available here as a quantitative feature (years of work).

Next steps should be stated more clearly to address some concerns about your project question and further analysis of the data.

Action: Expanded the scope of the project to include a random forest model and investigation of parameter importance: https://github.com/UBC-MDS/US-Salary-Prediction/commit/809035d49a1551e07ee42c2968543e6e1db76502

Feedback 2:

Feedback: Comment 1C in this issue https://github.com/UBC-MDS/US-Salary-Prediction/issues/36#issue-1073001929

I think there needs to be some summary about how your data isn't giving you the right result or if your model isn't working properly. The advancement from milestone 1 to 2 isn't obvious. It's okay to have a negative result, but the presentation is a bit lacking. Maybe have your results and discussion separated and try to evaluate your data and then your model results.

Action: Added summary section and stated potential improvements of the model: https://github.com/UBC-MDS/US-Salary-Prediction/commit/ce71fd6b339377f3cb74180eb5c3b3d41c52048e

Feedback 3:

Feedback: Comment 3C in this issue https://github.com/UBC-MDS/US-Salary-Prediction/issues/36#issue-1073001929

Narrative of analysis and visualization was not present. The proposal README lacks in-depth analysis of the EDA. Should interpret the plots or give captions. Did not see the evolution of the proposal with short excerpts of major findings using your statistical method.

Action: Improved the EDA and the plot interpretation in the final report: https://github.com/UBC-MDS/US-Salary-Prediction/commit/182fc40ef5847deac48b38574377d11996e783b4

Improved narrative/storytelling aspect of final report: https://github.com/UBC-MDS/US-Salary-Prediction/commit/ce71fd6b339377f3cb74180eb5c3b3d41c52048e

Feedback 4:

Feedback: Comments 3, 5, and 6 in this issue https://github.com/UBC-MDS/data-analysis-review-2021/issues/14#issuecomment-982355474

The scripts in the src directory are not named consistently (some scripts use camel case while others use underscores).

I'm also not sure about this but I don't understand why you define the ordering for ordinal variables with missing_value at the bottom. Maybe imputation would be useful in this scenario if there are not a lot of missing values?

Regarding the bar plots for how_old_are_you, years_of_experience_in_field etc., it might be nice to order the y-axis by the same ordinal encoding used in the fit_transform_evaluate_model.py script. Currently, the plot for how_old_are_you might be a bit misleading because it shows a gradual increase in median annual salary but the age groups are not in increasing order.

Action: Renamed scripts for consistent naming and changed the ordinal encoding method to replace missing values with the mode: https://github.com/UBC-MDS/US-Salary-Prediction/commit/809035d49a1551e07ee42c2968543e6e1db76502
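A minimal, library-free sketch of what replacing missing values with the mode before ordinal encoding might look like is given here. The age categories, their order, and the use of `None` as the missing marker are assumptions for illustration; the linked commit is the authoritative version.

```python
from collections import Counter

# Assumed category order for illustration; the survey's actual levels may differ.
AGE_ORDER = ["under 18", "18-24", "25-34", "35-44", "45-54", "55-64", "65 or over"]


def impute_mode(values, missing=None):
    """Replace missing entries with the most frequent observed value."""
    observed = [v for v in values if v is not missing]
    mode = Counter(observed).most_common(1)[0][0]
    return [mode if v is missing else v for v in values]


def ordinal_encode(values, order):
    """Map each category to its position in the given order."""
    rank = {category: i for i, category in enumerate(order)}
    return [rank[v] for v in values]


ages = ["25-34", None, "35-44", "25-34", "18-24", None]
imputed = impute_mode(ages)
encoded = ordinal_encode(imputed, AGE_ORDER)
```

In a train/test setting, the mode would be computed on the training set only and then reused for the test set, so that no information leaks from the test data.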

Changed the EDA to reorder the graph axes: https://github.com/UBC-MDS/US-Salary-Prediction/commit/9a51f45960e1bb4d3bfa860ce39a4c4f7d37ac0a
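The axis reordering can be sketched without any plotting library: compute the median salary per age group, then emit the groups in their ordinal order instead of alphabetical or value order. The category list and sample rows below are made up for this example, and the resulting ordered pairs would be handed to whatever plotting tool the EDA uses.

```python
from statistics import median

# Assumed category order for illustration; the survey's actual levels may differ.
AGE_ORDER = ["under 18", "18-24", "25-34", "35-44", "45-54", "55-64", "65 or over"]


def median_by_group(rows, group_key, value_key):
    """Return {group: median of value_key} for the given rows."""
    groups = {}
    for row in rows:
        groups.setdefault(row[group_key], []).append(row[value_key])
    return {group: median(values) for group, values in groups.items()}


def ordered_for_plot(medians, order):
    """List (group, median) pairs following the ordinal order, skipping absent groups."""
    return [(group, medians[group]) for group in order if group in medians]


rows = [
    {"how_old_are_you": "45-54", "annual_salary": 90_000},
    {"how_old_are_you": "18-24", "annual_salary": 40_000},
    {"how_old_are_you": "25-34", "annual_salary": 60_000},
    {"how_old_are_you": "18-24", "annual_salary": 44_000},
    {"how_old_are_you": "35-44", "annual_salary": 80_000},
    {"how_old_are_you": "25-34", "annual_salary": 70_000},
]

medians = median_by_group(rows, "how_old_are_you", "annual_salary")
axis = ordered_for_plot(medians, AGE_ORDER)
```

Reusing one shared order list for both the encoder and the plots keeps the two from drifting apart.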