UBC-MDS / data-analysis-review-2022


Submission: GROUP_16 : Data Science Salary Predictor #3

Open CChCheChen opened 1 year ago

CChCheChen commented 1 year ago

Submitting authors: @CChCheChen @xXJohamXx @mikeguron @tanmayag97

Repository: https://github.com/UBC-MDS/Data-Science-Salary-Predictor-DSCI522-Group16-2022

Report link: https://github.com/UBC-MDS/Data-Science-Salary-Predictor-DSCI522-Group16-2022/blob/main/documents/FinalReport.pdf

Abstract/executive summary: As we are all current students in the MDS program, a question we have is: where will we end up working after this program is over? A natural follow-up question is: how much can we expect to be compensated given our previous experience, target industry, geographic location, etc.? Wouldn't it be nice if we could create some sort of model to help us gain insight into this question? Is there anything we have learnt so far in our program that could shed some light on this conundrum? Well, you have come to the right place! Our group has found a recent and comprehensive dataset processed from the Stack Overflow Annual Developer Survey, which we will use to build a predictive machine learning model to help answer this burning question that is on our minds and the rest of our cohort's! Read on for a breakdown of our question and an overview of our approach.

Editor: @flor14
Reviewers: Mengjun Chen, Mehwish Nabi, Eric Tsai, Vikram Grewal

erictsai1208 commented 1 year ago

Data analysis review checklist

Reviewer: Eric Tsai (erictsai1208)

Conflict of interest

Code of Conduct

General checks

Documentation

Code quality

Reproducibility

Analysis report

Estimated hours spent reviewing: 1.5 hours

Review Comments:

Please provide more detailed feedback here on what was done particularly well, and what could be improved. It is especially important to elaborate on items that you were not able to check off in the list above.

  1. I like that the project is very relevant to the program, and the README is well laid out. The final report has clear header sections and is easy to follow. My only organizational comment is that the images could go in a separate folder; it does not really make sense to put everything in the documents directory.
  2. You may need to double-check some inline R code in the final document, as in the sentence "our model yielded a score of BLANK". I am assuming the BLANK should be some value.
  3. The different shades of blue in the yearly compensation boxplots are not very helpful, and the lightest shades make the median values difficult to read in some of the plots.
  4. I see that you reported the final score using the $R^2$ metric. It may be good to also report the MAPE or RMSE so that the result is easier to interpret (see the sketch after this list). It is understandable that the prediction is not super accurate, given that the data is not perfect.
  5. It would have been interesting to see a scatterplot of a numeric feature against the target (e.g. YearsCode vs. ConvertedCompYearly) in addition to the correlation matrix; a visual is much easier to interpret than raw numbers (also in the sketch below).
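A minimal sketch of what comments 4 and 5 could look like in code, assuming a fitted scikit-learn regressor `model`, a held-out split `X_test`/`y_test`, and a training frame `train_df` (all hypothetical names; only the YearsCode and ConvertedCompYearly column names come from the survey data):

```python
import matplotlib.pyplot as plt
from sklearn.metrics import (
    mean_absolute_percentage_error,
    mean_squared_error,
    r2_score,
)

# Comment 4: report RMSE and MAPE alongside R^2. RMSE is in salary units
# and MAPE is a relative error, so both are easier to interpret.
y_pred = model.predict(X_test)
print(f"R^2:  {r2_score(y_test, y_pred):.3f}")
print(f"RMSE: {mean_squared_error(y_test, y_pred, squared=False):,.0f}")
print(f"MAPE: {mean_absolute_percentage_error(y_test, y_pred):.1%}")

# Comment 5: scatter one numeric feature against the target to complement
# the correlation matrix.
plt.scatter(train_df["YearsCode"], train_df["ConvertedCompYearly"], alpha=0.2)
plt.xlabel("YearsCode")
plt.ylabel("ConvertedCompYearly")
plt.show()
```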

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

xFiveRivers commented 1 year ago

Data analysis review checklist

Reviewer: @xFiveRivers

Conflict of interest

Code of Conduct

General checks

Documentation

Code quality

Reproducibility

Analysis report

Estimated hours spent reviewing: 1 Hour

Review Comments:

  1. The report is well thought out and meaningful, but some sections, such as the Dataset section, could be formatted better. Instead of separate lines, it might be better to format the text as paragraphs so it does not feel disjointed. The features section might have been better formatted as a table: long lines are difficult to follow, and a table makes the data a little easier to picture.
  2. For the multi-selection features, it would have been nice to see the different options for each feature. This would give the reader more insight into what each feature actually entails.
  3. A little more detail about which transformations were applied to the features, and your motivation behind them, would have been nice to see. It gives an idea of how reasonable the model is, because everything rests on the feature transformations.
  4. Explaining why $R^2$ was your main scoring metric would have been good to include, as it tells the reader the significance of the score and why the model was chosen. Including the raw training and test scores would also have been useful, since they would show the impact of overfitting (see the sketch after this list).
  5. The correlation matrix is a nice touch; however, adding some colour to the table lets the viewer draw insights more quickly. An appropriate colour scheme distinguishes the high correlations from the low ones (also in the sketch below).
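A minimal sketch of comments 4 and 5, assuming the numeric training features live in a pandas DataFrame `train_df` and that `pipeline`, `X_train`, and `y_train` are the (hypothetical) sklearn pipeline and training split:

```python
import pandas as pd
from sklearn.model_selection import cross_validate

# Comment 5: shade the correlation table so high correlations stand out
# (renders as a gradient-coloured table in an HTML report).
corr = train_df.corr()
styled = corr.style.background_gradient(cmap="Blues").format("{:.2f}")

# Comment 4: report training and cross-validation scores side by side so
# the reader can see the degree of overfitting.
scores = cross_validate(
    pipeline, X_train, y_train, scoring="r2", return_train_score=True
)
print(pd.DataFrame(scores)[["train_score", "test_score"]].mean())
```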

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

MNBhat commented 1 year ago

Data analysis review checklist

Reviewer: Mehwish Nabi (@MNBhat)

Conflict of interest

Code of Conduct

General checks

Documentation

Code quality

Reproducibility

Analysis report

Estimated hours spent reviewing: 2 Hours

Review Comments:

  1. The README file is well thought out and is clear about the question and the motivation behind the project. I think information about the models used is missing from the file, and adding it would say more about the approach. Furthermore, the location and usage of the environment file could be mentioned in the README.
  2. The data preprocessing and modelling scripts have been written in a very understandable and efficient way, breaking the code into parts. They are easy to follow thanks to the function docstrings. Great job on this one!
  3. The documents folder could be better organized, with separate subfolders for the EDA and results images, to make it more evident what these files pertain to.
  4. The code of conduct file covers the enforcement and behavioural expectations well. However, I think it would be better to add a point of contact (an email address) in case someone wants to report a complaint.
  5. The final report clearly conveys the features, the correlations between features, the models used, and their performance. The comparison of the models across different scoring metrics is done clearly as well. However, the column names in the comparison tables can be confusing (e.g. "baseline" and "baseline 1" in Table 2). Also, when comparing the models after hyperparameter tuning, adding the previous scores to the table would make the comparison clearer.

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

Mengjun74 commented 1 year ago

Data analysis review checklist

Reviewer: Mengjun Chen (@Mengjun74)

Conflict of interest

Code of Conduct

General checks

Documentation

Code quality

Reproducibility

Analysis report

Estimated hours spent reviewing: 1.5 hours

Review Comments:

  1. The README file is good in general. One suggestion on Method 1 of the usage section: instead of putting all the descriptions and usage code together in one cell, it would be better to separate them piece by piece so that others can easily copy each command rather than the whole block.
  2. The final report looks really good and impressive. However, it would be a good idea to include a conclusion section in the final report instead of just mentioning the performance of your test score.
  3. It is great that the data preprocessing and model selection scripts separate the code into different functions; however, they are too long to read. Why not split these two scripts into several smaller ones so that each stays clean and short?
  4. I noticed that several functions and feature names are duplicated between the data preprocessing and model selection scripts. I do not think it is necessary for both scripts to contain the same content (see the sketch after this list).
  5. Finally, I noticed there are both an environment.yml file and an env-dsci-group-16.yaml file for the environment; I wonder which one I should use as a user. Moreover, adding the dependencies to the README is necessary.
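On comment 4, one common fix (a sketch, not the authors' actual code) is to move the duplicated helpers and feature lists into a small shared module that both scripts import, e.g. a hypothetical `utils.py`:

```python
# utils.py -- hypothetical shared module: both the preprocessing and the
# model-selection scripts import from here instead of each redefining the
# same functions and feature lists.
import pandas as pd

# Illustrative shared feature list (column names from the survey data)
NUMERIC_FEATURES = ["YearsCode", "YearsCodePro"]

def read_clean_data(path):
    """Read the survey CSV and drop rows missing the target."""
    df = pd.read_csv(path)
    return df.dropna(subset=["ConvertedCompYearly"])
```

Each script would then start with `from utils import read_clean_data, NUMERIC_FEATURES`, so any change to the cleaning logic happens in exactly one place.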

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

xXJohamXx commented 1 year ago

Improvements on Above Feedback

Milestone 1 issue from TA: updated header in README

Peer Review Comment 1 from Mehwish Nabi: added models to README

Peer Review Comment 3 from Eric Tsai: changed figure colours/labels

Peer Review Comment 1 from Eric Tsai: changed file organization/file paths

Peer Review Comment 2 from Eric Tsai: added missing value and fixed figure captions

Peer Review Comment 1 from Vikram Grewal: updated the final report feature description section

Peer Review Comment 2 from Vikram Grewal: provided the available options for multiple-selection features

Milestone 2 issue from TA: increased font size in tables

Peer Review Comment 5 from Mehwish Nabi: adjusted column names in tables