Open joshsia opened 2 years ago
1.5hrs
EDA
Comments about report:
Data Folder
README.md
Non-coding document:
This was derived from the JOSE review checklist and the ROpenSci review checklist.
1.5
covid_socioeconomic_report.md and the links to them are broken.
get_data.R file in your src folder that doesn't appear to be used - I think it should probably be removed.
testhat in your list of dependencies.
1 hour and 15 mins
While most of the feedback has been captured by the previous reviewers, I noticed the following:
Good choice of topic and the associated question! It is very relevant to the present-day situation and may be used for further analysis. Good luck for future milestones!
Overall, I think the report is well structured and the analysis is easy to follow from the EDA to the presentation of the results. The plots sync well with the narrative and help guide the reader through the analysis. The topic itself is very relevant and also difficult to model - it's great to see this being tackled. Good job!
Here is some of my feedback which hopefully will be useful to further your analysis. Please do ignore points which you feel are irrelevant or too minute:
1) There is a timeout error when running make all or src/get_kaggle_data.R to download the data. I have the Kaggle API set up on my system. I am not sure if I missed something locally, so I apologize if this is a false flag.
2) You can provide some examples for the "wide variety of challenges" brought about by COVID mentioned in the introduction. Additionally, statistics from previous studies can be included to further contextualize the issue.
3) In general the discussion on the treatment of data is excellent. However, there is no mention of missing data and how it was dealt with, if present.
4) For Table 5 and Table 6, the column headings of the tables in the EDA section can be made more readable. It is a little difficult to interpret the 'CHR' part of percent_unemployed_CHR. Also, both tables display the top and bottom 6 counties rather than 5. The ID can also be removed from Table 6.
5) Since the distributions and later analysis deal with the numerical features, it may be worthwhile to list the number of numeric features under the data section in addition to the total number of selected features.
6) It will be really helpful to provide the general form of the linear regression model used before Figure 2. This will make it easier to understand the plots and the discussion of the coefficients that follow. I do see that in a previous section it was mentioned that a linear regression model with interaction is being used.
7) In Figure 2, I see that some of the features have outliers which are expanding the range of the x-axis. Since the key takeaway here is the slope of the regression line, it might be better to zoom in on the plot and emphasize the difference in slopes.
8) In Table 9, rather than looking at the coefficients using a random sample, it may be better to order the coefficients by significance to get a true measure of all the significant coefficients. This will also emphasize the most important social determinants.
9) The discussion section mentions some very relevant points. However, the comparison between the coefficients should be explained in a little more detail as the 'normalizing' mentioned does not indicate how the data was treated (scaling?). If this is the case, it may be better to provide an explicit plot showing the coefficients after scaling the features. As the magnitudes of the features are very different, this is really important for comparisons to be made.
10) It will be great to explicitly mention the most relevant features to the model in the results section and possibly interpret them for further discussion.
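To illustrate point 9, here is a minimal sketch of why feature scaling matters when comparing coefficients. It is written in Python purely for illustration (the project's scripts are in R), it uses made-up toy values rather than the report's data, and it fits one-predictor regressions rather than the multiple regression with interaction used in the report. The names ols_slope, standardize, percent_smokers and median_income are hypothetical.

```python
from statistics import mean, pstdev


def ols_slope(x, y):
    """Slope b of the least-squares line y = a + b * x."""
    mx, my = mean(x), mean(y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    return sxy / sxx


def standardize(x):
    """Rescale values to mean 0 and standard deviation 1 (z-scores)."""
    mx, sx = mean(x), pstdev(x)
    return [(xi - mx) / sx for xi in x]


# Made-up toy values (assumption): NOT taken from the report's data set.
cases_per_100k = [1200.0, 1500.0, 1100.0, 1900.0, 1700.0]
features = {
    "percent_smokers": [14.0, 18.0, 13.0, 22.0, 20.0],               # percent scale
    "median_income": [62000.0, 51000.0, 66000.0, 43000.0, 47000.0],  # dollar scale
}

# Slopes on the original units: their sizes mostly reflect the units.
raw = {name: ols_slope(x, cases_per_100k) for name, x in features.items()}

# Slopes after standardizing each feature: change in cases per one
# standard deviation of the feature, so the sizes are comparable.
scaled = {name: ols_slope(standardize(x), cases_per_100k)
          for name, x in features.items()}

for name in features:
    print(f"{name}: raw slope = {raw[name]:.4f}, standardized slope = {scaled[name]:.1f}")
```

In this toy example the raw slopes differ by several orders of magnitude only because one feature is measured in percent and the other in dollars, while the standardized slopes are of similar size. A plot of standardized coefficients in the report would make the same comparison possible for the real features.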
Overall great job guys, it was really nice to review your work! Best of luck for the future milestones!
Hello!
Thank you very much for your detailed feedback.
Here is the list of feedback we agreed with and how we responded to them:
Feedback 1: Redundancy of get_data.R script (@Jacq4nn, @NikitaShymberg)
Response 1: Deleted get_data.R script
Feedback 2: Visualisations did not show up in report (@Jacq4nn, @NikitaShymberg)
Initially, we used the here package to specify the file path to our images. However, it seemed like the plots were not rendering in the report due to this. After changing the file paths to relative paths, the problem was fixed.
Response 2: Used relative paths instead of the here package to specify figures in the report
Feedback 3: No information about how missing data was handled (@miyer26)
Response 3: More detailed information about how features were transformed has been added under the Analysis section of the report
Feedback 4: No function documentation in the scripts (@NikitaShymberg)
Response 4a, b: Included function documentation in relevant scripts
Feedback 5: License was not copyrighted to the correct authors (TA, Milestone 1)
Response 5: Added author names in license
Feedback 6: Question was not specific enough in terms of the explanatory variables to be used (TA, Milestone 1)
Response 6: Added examples of the explanatory variables to be used
Feedback 7: Insufficient interpretation of EDA figures (TA, Milestone 2)
Response 7: Added more explanations about the results of EDA and our interpretation
Feedback 8: Empty data/processed directory and unclear instructions (@Jacq4nn)
Response 8: Added clearer instructions to refer to the README file under the Usage section and added processed data file to repository
Feedback 9: Time out error when downloading data file (@miyer26)
Response 9: Improved implementation of get_kaggle_data.R script by extracting zipped files immediately instead of reading the data file first
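The idea behind Response 9 can be sketched as follows. This is only an illustration in Python, not the actual get_kaggle_data.R implementation: it assumes the archive has already been downloaded, leaves out the Kaggle API call entirely, and uses a hypothetical helper name (extract_archive) and a tiny stand-in archive built on the fly.

```python
import tempfile
import zipfile
from pathlib import Path


def extract_archive(zip_path, out_dir):
    """Unpack every member of zip_path into out_dir without parsing any of them.

    Hypothetical helper for illustration; returns the sorted member names.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(out_dir)
        return sorted(archive.namelist())


# Demo with a small stand-in archive (assumption: the real download is a
# zip file containing the CSV; its contents here are invented).
with tempfile.TemporaryDirectory() as tmp:
    zip_path = Path(tmp) / "download.zip"
    with zipfile.ZipFile(zip_path, "w") as archive:
        archive.writestr("example.csv", "county,cases\nA,10\n")

    extracted = extract_archive(zip_path, Path(tmp) / "raw")
    extracted_text = (Path(tmp) / "raw" / "example.csv").read_text()
    print(extracted)
```

Writing the files straight to disk keeps the download step from loading a large CSV into memory, and any reading or filtering can then happen in a later script.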
Submitting authors: @joshsia @morganrosenberg50 @alexYinanGu0
Repository: https://github.com/UBC-MDS/DSCI_522_US_social_determinants_of_health_by_county
Report link: https://github.com/UBC-MDS/DSCI_522_US_social_determinants_of_health_by_county/blob/main/doc/covid_socioeconomic_report.md
Abstract/executive summary: Here we attempt to build a multiple linear regression model which can be used to quantify the influence of potential factors on COVID-19 prevalence (measured by cases per 100,000) across US counties. Our final model suggests that the percentage of smokers, teenage birth rate and chlamydia rate are the three features most strongly associated with COVID-19 prevalence. However, the features selected in the model were chosen arbitrarily from 200 possible features in the original dataset. Thus, more work needs to be done to explore the association of other socioeconomic features with COVID-19 prevalence.
The data set used in this project contains county-level data on health, socioeconomics, weather, and COVID-19 cases compiled by John Davis. It can be found here, specifically, the US_counties_COVID19_health_weather_data.csv file. Each row in the data set records the number of COVID-19 cases in a county on a given date, as well as other features about the county (e.g. smokers percentage, population, income ratio, etc.).
Editor: @joshsia @morganrosenberg50 @alexYinanGu0
Reviewer: Mukund Iyer, Nikita Shymberg, Jacqueline Chong, Moid Mohammed