UBC-MDS / data-analysis-review-2021


Submission: Group 25 - U.S. Social Determinants of Health per County #11

Open joshsia opened 2 years ago

joshsia commented 2 years ago

Submitting authors: @joshsia @morganrosenberg50 @alexYinanGu0

Repository: https://github.com/UBC-MDS/DSCI_522_US_social_determinants_of_health_by_county

Report link: https://github.com/UBC-MDS/DSCI_522_US_social_determinants_of_health_by_county/blob/main/doc/covid_socioeconomic_report.md

Abstract/executive summary: Here we attempt to build a multiple linear regression model which can be used to quantify the influence of potential factors on COVID-19 prevalence (measured by cases per 100,000) across US counties. Our final model suggests that the percentage of smokers, teenage birth rate and chlamydia rate are the three features most strongly associated with COVID-19 prevalence. However, the features selected in the model were chosen arbitrarily from 200 possible features in the original dataset. Thus, more work needs to be done to explore the association of other socioeconomic features with COVID-19 prevalence.

The data set used in this project contains county-level data on health, socioeconomics, weather, and COVID-19 cases compiled by John Davis. It can be found here, specifically, the US_counties_COVID19_health_weather_data.csv file. Each row in the data set represents a date corresponding to the number of COVID-19 cases in the county, as well as other features about the county (e.g. smokers percentage, population, income ratio, etc.).

Editor: @joshsia @morganrosenberg50 @alexYinanGu0

Reviewer: Mukund Iyer, Nikita Shymberg, Jacqueline Chong, Moid Mohammed

Jacq4nn commented 2 years ago

Data analysis review checklist

Reviewer: Jacqueline Chong @Jacq4nn

Conflict of interest

Code of Conduct

General checks

Documentation

Code quality

Reproducibility

Analysis report

Estimated hours spent reviewing:

1.5hrs

Review Comments:

EDA

Comments about report:

Data Folder

README.md

Non-coding document:

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

NikitaShymberg commented 2 years ago

Data analysis review checklist

Reviewer: NikitaShymberg

Conflict of interest

Code of Conduct

General checks

Documentation

Code quality

Reproducibility

Analysis report

Estimated hours spent reviewing:

1.5

Review Comments:

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

iamMoid commented 2 years ago

Data analysis review checklist

Reviewer: iamMoid

Conflict of interest

Code of Conduct

General checks

Documentation

Code quality

Reproducibility

Analysis report

Estimated hours spent reviewing:

1 hour and 15 mins

Review Comments:

While most of the feedback has been captured by the previous reviewers, I noticed the following:

Good choice of topic and the associated question! It is very relevant to the present-day situation and may be used for further analysis. Good luck for future milestones!

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

miyer26 commented 2 years ago

Reviewer: miyer26

Conflict of interest

Code of Conduct

General checks

Documentation

Code quality

Reproducibility

Analysis report

Estimated hours spent reviewing: 1.5 hours

Review Comments:

Overall, I think the report is well structured and the analysis is easy to follow from the EDA to the presentation of the results. The plots sync well with the narrative and help guide the reader through the analysis. The topic itself is very relevant and also difficult to model, so it's great to see this being tackled. Good job!

Here is some of my feedback which hopefully will be useful to further your analysis. Please do ignore points which you feel are irrelevant or too minute:

1) There is a timeout error when running make all or src/get_kaggle_data.R to download the data. I have the Kaggle API set up on my system. I am not sure if I missed something locally, so I apologize if this is a false flag.

2) You can provide some examples for the "wide variety of challenges" brought about by COVID mentioned in the introduction. Additionally, statistics from previous studies can be included to further contextualize the issue.

3) In general the discussion on the treatment of data is excellent. However, there is no mention of missing data and how it was dealt with, if present.

4) For Table 5 and Table 6 in the EDA section, the column headings can be made more readable. It is a little difficult to interpret the 'CHR' part of percent_unemployed_CHR. Also, both tables display the top and bottom 6 counties rather than 5. The ID can also be removed from Table 6.

5) Since the distributions and later analysis deal with the numerical features, it may be worthwhile to list the number of numeric features under the data section in addition to the total number of selected features.

6) It will be really helpful to provide the general form of the linear regression model used before Figure 2. This will make it easier to understand the plots and the discussion of the coefficients that follow. I do see that in a previous section it was mentioned that a linear regression model with interaction is being used.

7) In Figure 2, I see that some of the features have outliers which are expanding the range of the x-axis. Since the key takeaway here is the slope of the regression line, it might be better to zoom in on the plot and emphasize the difference in slopes.

8) In Table 9, rather than looking at the coefficients using a random sample, it may be better to order the coefficients by significance to get a true measure of all the significant coefficients. This will also emphasize the most important social determinants.

9) The discussion section mentions some very relevant points. However, the comparison between the coefficients should be explained in a little more detail as the 'normalizing' mentioned does not indicate how the data was treated (scaling?). If this is the case, it may be better to provide an explicit plot showing the coefficients after scaling the features. As the magnitudes of the features are very different, this is really important for comparisons to be made.

10) It would be great to explicitly mention the features most relevant to the model in the results section and possibly interpret them for further discussion.
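As a rough illustration of points 8 and 9, here is a minimal Python sketch (not the project's code) of comparing coefficients after scaling. It z-scores each feature, fits ordinary least squares through the normal equations, and ranks the coefficients by absolute size. The numbers are toy data invented for the example, the column names percent_smokers and chlamydia_rate are assumed, and the helpers standardize, solve and standardized_coefficients are hypothetical. It leaves out interaction terms and significance tests.

```python
from statistics import mean, pstdev


def standardize(values):
    """Return z-scores: (x - mean) / population standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


def solve(a, b):
    """Solve a @ x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        # Swap in the row with the largest entry in this column.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        # Eliminate this column from every other row.
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def standardized_coefficients(features, target):
    """Fit OLS on z-scored features and a centred target; return {name: coefficient}."""
    names = list(features)
    cols = [standardize(features[name]) for name in names]
    y_mean = mean(target)
    y = [t - y_mean for t in target]
    # Normal equations: (X^T X) beta = X^T y
    xtx = [[sum(p * q for p, q in zip(ci, cj)) for cj in cols] for ci in cols]
    xty = [sum(p * q for p, q in zip(ci, y)) for ci in cols]
    return dict(zip(names, solve(xtx, xty)))


# Toy data on very different scales (invented for illustration only).
smokers = [10, 12, 14, 16, 18, 20]
chlamydia = [300, 100, 500, 200, 600, 400]
# The raw coefficients are 50 and 2 by construction.
cases = [50 * s + 2 * c for s, c in zip(smokers, chlamydia)]

coefs = standardized_coefficients(
    {"percent_smokers": smokers, "chlamydia_rate": chlamydia}, cases
)
ranked = sorted(coefs.items(), key=lambda item: abs(item[1]), reverse=True)
for name, value in ranked:
    print(f"{name}: {value:.2f}")
```

In this toy case the raw coefficient on percent_smokers (50) is much larger than the one on chlamydia_rate (2), yet chlamydia_rate ranks first once both features are on the same scale. That is why the report should state exactly how the features were normalized.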

Overall great job guys, it was really nice to review your work! Best of luck for the future milestones!

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

joshsia commented 2 years ago

Hello!

Thank you very much for your detailed feedback.

Here is the list of feedback we agreed with and how we responded to them:

Feedback 1: Redundancy of get_data.R script (@Jacq4nn, @NikitaShymberg)
Response 1: Deleted the get_data.R script

Feedback 2: Visualisations did not show up in the report (@Jacq4nn, @NikitaShymberg)
Initially, we used the here package to specify the file paths to our images. However, the plots did not seem to render in the report because of this. Changing the file paths to relative paths fixed the problem.
Response 2: Used relative paths instead of the here package to specify figures in the report

Feedback 3: No information about how missing data was handled (@miyer26)
Response 3: More detailed information about how features were transformed has been added under the Analysis section of the report

Feedback 4: No function documentation in the scripts (@NikitaShymberg)
Response 4a, b: Included function documentation in relevant scripts

Feedback 5: License did not attribute copyright to the correct authors (TA, Milestone 1)
Response 5: Added author names to the license

Feedback 6: Question was not specific enough in terms of the explanatory variables to be used (TA, Milestone 1)
Response 6: Added examples of the explanatory variables to be used

Feedback 7: Insufficient interpretation of EDA figures (TA, Milestone 2)
Response 7: Added more explanations about the results of the EDA and our interpretation

Feedback 8: Empty data/processed directory and unclear instructions (@Jacq4nn)
Response 8: Added clearer instructions to refer to the README file under the Usage section and added the processed data file to the repository

Feedback 9: Timeout error when downloading the data file (@miyer26)
Response 9: Improved the implementation of the get_kaggle_data.R script by extracting zipped files immediately instead of reading the data file first
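The get_kaggle_data.R script itself is written in R and is not shown in this thread, so the following is only a Python sketch of the general idea behind Response 9: unpack the downloaded archive to disk straight away and read the data later, rather than parsing the large file as part of the download step. The function extract_archive, the archive name download.zip and the file counties.csv are hypothetical, and the demo builds a throwaway archive instead of contacting Kaggle.

```python
import tempfile
import zipfile
from pathlib import Path


def extract_archive(zip_path, out_dir):
    """Unpack every member of zip_path into out_dir; return the extracted file paths."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(out_dir)
        names = archive.namelist()
    # Skip directory entries, which end with a slash in the archive listing.
    return [out_dir / name for name in names if not name.endswith("/")]


# Demo with a small stand-in archive (no network access needed).
with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    zip_path = tmp / "download.zip"
    with zipfile.ZipFile(zip_path, "w") as archive:
        archive.writestr("counties.csv", "county,cases\nA,10\n")

    extracted = extract_archive(zip_path, tmp / "raw")
    # Reading happens only after extraction, and only when the data is needed.
    first_line = extracted[0].read_text().splitlines()[0]
    print(extracted[0].name, first_line)
```

Keeping extraction separate from reading means the download step does little more than write files, which may help when the archive is large.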