UBC-MDS / data-analysis-review-2022


Submission: GROUP_16 : Data Science Salary Predictor #3

Open CChCheChen opened 1 year ago

CChCheChen commented 1 year ago

Submitting authors: @CChCheChen @xXJohamXx @mikeguron @tanmayag97

Repository: https://github.com/UBC-MDS/Data-Science-Salary-Predictor-DSCI522-Group16-2022

Report link: https://github.com/UBC-MDS/Data-Science-Salary-Predictor-DSCI522-Group16-2022/blob/main/documents/FinalReport.pdf

Abstract/executive summary: As we are all current students in the MDS program, a question we have is: where will we end up working after this program is over? A natural follow-up question is: how much can we expect to be compensated given our previous experience, target industry, geographic location, etc.? Wouldn't it be nice if we could create some sort of model to help us gain insight into this question? Is there anything we have learnt so far in our program that could shed some light on this conundrum? Well, you have come to the right place! Our group has found a recent and comprehensive dataset processed from the Stack Overflow Annual Developer Survey, which we will use to build a predictive machine learning model to help answer this burning question that is on our minds and the rest of our cohort's! Read on for a breakdown of our question and an overview of our approach.

Editor: @flor14
Reviewers: Mengjun Chen, Mehwish Nabi, Eric Tsai, Vikram Grewal

erictsai1208 commented 1 year ago

Data analysis review checklist

Reviewer: Eric Tsai (erictsai1208)

Conflict of interest

Code of Conduct

General checks

Documentation

Code quality

Reproducibility

Analysis report

Estimated hours spent reviewing: 1.5 hours

Review Comments:

Please provide more detailed feedback here on what was done particularly well, and what could be improved. It is especially important to elaborate on items that you were not able to check off in the list above.

  1. I like that the project is very relevant to the program, and the README is well laid out. The final report has clear header sections and is easy to follow. My only organizational comment is that the images could go in a separate folder; it does not really make sense to put everything in the documents directory.
  2. You may need to double-check some inline R code in the final document, as in the sentence "our model yielded a score of BLANK". I am assuming the BLANK should be some value.
  3. The different shades of blue in the yearly compensation boxplots are not very helpful, and the lightest shades make the median values difficult to read in some of the plots.
  4. I see that you reported the final score using the $R^2$ metric. It may be good to also report the MAPE or RMSE so that the result is easier to interpret (see the sketch after this list). It is understandable that the prediction is not super accurate, given that the data is not perfect.
  5. It would have been interesting to see a scatterplot of a numeric feature against the target (e.g. YearsCode vs. ConvertedCompYearly) in addition to the correlation matrix; a visual is much easier to interpret than raw numbers (also in the sketch below).
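A minimal sketch of what comments 4 and 5 could look like in code, assuming a fitted scikit-learn regressor `model`, a held-out split `X_test`/`y_test`, and a training frame `train_df` (all hypothetical names; only the YearsCode and ConvertedCompYearly column names come from the survey data):

```python
import matplotlib.pyplot as plt
from sklearn.metrics import (
    mean_absolute_percentage_error,
    mean_squared_error,
    r2_score,
)

# Comment 4: report RMSE and MAPE alongside R^2. RMSE is in salary units
# and MAPE is a relative error, so both are easier to interpret.
y_pred = model.predict(X_test)
print(f"R^2:  {r2_score(y_test, y_pred):.3f}")
print(f"RMSE: {mean_squared_error(y_test, y_pred, squared=False):,.0f}")
print(f"MAPE: {mean_absolute_percentage_error(y_test, y_pred):.1%}")

# Comment 5: scatter one numeric feature against the target to complement
# the correlation matrix.
plt.scatter(train_df["YearsCode"], train_df["ConvertedCompYearly"], alpha=0.2)
plt.xlabel("YearsCode")
plt.ylabel("ConvertedCompYearly")
plt.show()
```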

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

xFiveRivers commented 1 year ago

Data analysis review checklist

Reviewer: @xFiveRivers

Conflict of interest

Code of Conduct

General checks

Documentation

Code quality

Reproducibility

Analysis report

Estimated hours spent reviewing: 1 Hour

Review Comments:

  1. The report is well thought out and meaningful, but some sections, such as the Dataset section, could be formatted better. Instead of separate lines, it might be better to format the text as paragraphs so it does not feel disjointed. The features section might have been better formatted as a table: long lines are difficult to follow, and a table makes the data a little easier to picture.
  2. For the multi-selection features, it would have been nice to see the different options for each feature. This would give the reader more insight into what each feature actually entails.
  3. A little more detail about which transformations were applied to the features, and your motivation behind them, would have been nice to see. It gives an idea of how reasonable the model is, because everything rests on the feature transformations.
  4. Explaining why $R^2$ was your main scoring metric would have been good to include, as it tells the reader the significance of the score and why the model was chosen. Including the raw training and test scores would also have been useful, since they would show the impact of overfitting (see the sketch after this list).
  5. The correlation matrix is a nice touch; however, adding some colour to the table lets the viewer draw insights more quickly. An appropriate colour scheme distinguishes the high correlations from the low ones (also in the sketch below).
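A minimal sketch of comments 4 and 5, assuming the numeric training features live in a pandas DataFrame `train_df` and that `pipeline`, `X_train`, and `y_train` are the (hypothetical) sklearn pipeline and training split:

```python
import pandas as pd
from sklearn.model_selection import cross_validate

# Comment 5: shade the correlation table so high correlations stand out
# (renders as a gradient-coloured table in an HTML report).
corr = train_df.corr()
styled = corr.style.background_gradient(cmap="Blues").format("{:.2f}")

# Comment 4: report training and cross-validation scores side by side so
# the reader can see the degree of overfitting.
scores = cross_validate(
    pipeline, X_train, y_train, scoring="r2", return_train_score=True
)
print(pd.DataFrame(scores)[["train_score", "test_score"]].mean())
```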

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

MNBhat commented 1 year ago

Data analysis review checklist

Reviewer: Mehwish Nabi (@MNBhat)

Conflict of interest

Code of Conduct

General checks

Documentation

Code quality

Reproducibility

Analysis report

Estimated hours spent reviewing: 2 Hours

Review Comments:

  1. The README file is well thought out and is clear about the question and the motivation behind the project. I think information about the models used is missing from the file, and adding it would say more about the approach. Furthermore, the location and usage of the environment file could be mentioned in the README.
  2. The data preprocessing and modelling scripts have been written in a very understandable and efficient way, breaking the code into parts. They are easy to follow thanks to the function docstrings. Great job on this one!
  3. The documents folder could be better organized, with separate subfolders for the EDA and results images, to make it more evident what these files pertain to.
  4. The code of conduct file covers the enforcement and behavioural expectations well. However, I think it would be better to add a point of contact (an email address) in case someone wants to report a complaint.
  5. The final report clearly conveys the features, the correlations between features, the models used, and their performance. The comparison of the models across different scoring metrics is done clearly as well. However, the column names in the comparison tables can be confusing (e.g. "baseline" and "baseline 1" in Table 2). Also, when comparing the models after hyperparameter tuning, adding the previous scores to the table would make the comparison clearer.

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

Mengjun74 commented 1 year ago

Data analysis review checklist

Reviewer: Mengjun Chen (@Mengjun74)

Conflict of interest

Code of Conduct

General checks

Documentation

Code quality

Reproducibility

Analysis report

Estimated hours spent reviewing: 1.5 hours

Review Comments:

  1. The README file is good in general. One suggestion on Method 1 of the usage section: instead of putting all the descriptions and usage code together in one cell, it would be better to separate them piece by piece so that others can easily copy each command rather than the whole block.
  2. The final report looks really good and impressive. However, it would be a good idea to include a conclusion section in the final report instead of just mentioning the performance of your test score.
  3. It is great that the data preprocessing and model selection scripts separate the code into different functions; however, they are too long to read. Why not split these two scripts into several smaller ones so that each stays clean and short?
  4. I noticed that several functions and feature names are duplicated between the data preprocessing and model selection scripts. I do not think it is necessary for both scripts to contain the same content (see the sketch after this list).
  5. Finally, I noticed there are both an environment.yml file and an env-dsci-group-16.yaml file for the environment; I wonder which one I should use as a user. Moreover, adding the dependencies to the README is necessary.
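On comment 4, one common fix (a sketch, not the authors' actual code) is to move the duplicated helpers and feature lists into a small shared module that both scripts import, e.g. a hypothetical `utils.py`:

```python
# utils.py -- hypothetical shared module: both the preprocessing and the
# model-selection scripts import from here instead of each redefining the
# same functions and feature lists.
import pandas as pd

# Illustrative shared feature list (column names from the survey data)
NUMERIC_FEATURES = ["YearsCode", "YearsCodePro"]

def read_clean_data(path):
    """Read the survey CSV and drop rows missing the target."""
    df = pd.read_csv(path)
    return df.dropna(subset=["ConvertedCompYearly"])
```

Each script would then start with `from utils import read_clean_data, NUMERIC_FEATURES`, so any change to the cleaning logic happens in exactly one place.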

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

xXJohamXx commented 1 year ago

Improvements on Above Feedback

Milestone 1 issue from TA: updated header in README

Peer Review Comment 1 from Mehwish Nabi: added models to README

Peer Review Comment 3 from Eric Tsai: changed figure colours/labels

Peer Review Comment 1 from Eric Tsai: changed file organization/file paths

Peer Review Comment 2 from Eric Tsai: added missing value and fixed figure captions

Peer Review Comment 1 from Vikram Grewal: updated the final report feature description section

Peer Review Comment 2 from Vikram Grewal: provided the available options for multiple-selection features

Milestone 2 issue from TA: increased font size in tables

Peer Review Comment 5 from Mehwish Nabi: adjusted column names in tables