Submission: Group 25 - U.S. Social Determinants of Health per County

Submitting authors: @joshsia @morganrosenberg50 @alexYinanGu0

Repository: https://github.com/UBC-MDS/DSCI_522_US_social_determinants_of_health_by_county

Report link: https://github.com/UBC-MDS/DSCI_522_US_social_determinants_of_health_by_county/blob/main/doc/covid_socioeconomic_report.md

Abstract/executive summary: Here we attempt to build a multiple linear regression model which can be used to quantify the influence of potential factors on COVID-19 prevalence (measured by cases per 100,000) across US counties. Our final model suggests that the percentage of smokers, teenage birth rate and chlamydia rate are the three features most strongly associated with COVID-19 prevalence. However, the features selected in the model were chosen arbitrarily from 200 possible features in the original dataset. Thus, more work needs to be done to explore the association of other socioeconomic features with COVID-19 prevalence.

The data set used in this project contains county-level data on health, socioeconomics, weather, and COVID-19 cases compiled by John Davis. It can be found here, specifically the US_counties_COVID19_health_weather_data.csv file. Each row in the data set records the number of COVID-19 cases in a county on a given date, along with other features of the county (e.g. smokers percentage, population, income ratio, etc.).

Editor: @joshsia @morganrosenberg50 @alexYinanGu0

Reviewer: Mukund Iyer, Nikita Shymberg, Jacqueline Chong, Moid Mohammed

[x] I agree to abide by MDS's Code of Conduct during the review process and in maintaining my package should it be accepted.

Data analysis review checklist

Reviewer: Jacqueline Chong @Jacq4nn

Conflict of interest

[x] As the reviewer I confirm that I have no conflicts of interest for me to review this work.

Code of Conduct

[x] I confirm that I read and will adhere to the MDS code of conduct.

General checks

[x] Repository: Is the source code for this data analysis available? Is the repository well organized and easy to navigate?
[x] License: Does the repository contain a plain-text LICENSE file with the contents of an OSI approved software license?

Documentation

[x] Installation instructions: Is there a clearly stated list of dependencies?
[x] Example usage: Do the authors include examples of how to use the software to reproduce the data analysis?
[x] Functionality documentation: Is the core functionality of the data analysis software documented to a satisfactory level?
[x] Community guidelines: Are there clear guidelines for third parties wishing to 1) Contribute to the software 2) Report issues or problems with the software 3) Seek support

Code quality

[ ] Readability: Are scripts, functions, objects, etc., well named? Is it relatively easy to understand the code?
[x] Style guidelines: Does the code adhere to well known language style guides?
[x] Modularity: Is the code suitably abstracted into scripts and functions?
[x] Tests: Are there automated tests or manual steps described so that the function of the software can be verified? Are they of sufficient quality to ensure software robustness?

Reproducibility

[ ] Data: Is the raw data archived somewhere? Is it accessible?
[x] Computational methods: Is all the source code required for the data analysis available?
[x] Conditions: Is there a record of the necessary conditions (software dependencies) needed to reproduce the analysis? Does there exist an easy way to obtain the computational environment needed to reproduce the analysis?
[x] Automation: Can someone other than the authors easily reproduce the entire data analysis?

Analysis report

[x] Authors: Does the report include a list of authors with their affiliations?
[x] What is the question: Do the authors clearly state the research question being asked?
[x] Importance: Do the authors clearly state the importance for this research question?
[x] Background: Do the authors provide sufficient background information so that readers can understand the report?
[ ] Methods: Do the authors clearly describe and justify the methodology used in the data analysis? Do the authors communicate any assumptions or limitations of their methodologies?
[ ] Results: Do the authors clearly communicate their findings through writing, tables and figures?
[x] Conclusions: Are the conclusions presented by the authors correct?
[ ] References: Do all archival references that should have a DOI list one (e.g., papers, datasets, software)?
[ ] Writing quality: Is the writing of good quality, concise, engaging?

Estimated hours spent reviewing:

1.5hrs

Review Comments:

In your SRC folder, there were 2 scripts to get your data. Perhaps you can include comments as to what you are testing, such that it is more readable. I am still unable to ascertain whether the short script 'get_data' is necessary.

EDA

I really liked the structure of this EDA. However, the plots are not rendering. To make it neater, I would suggest that you set echo=FALSE in the chunk options of the .Rmd file. Lastly, I would recommend that you include text between your tables to describe what I should be looking at (and include table numbers). For example: "This table outputs the summary of all the features (both numeric and categorical)."

Comments about report:

Overall, I think the report flows and is mostly easy to follow. I would bold the conclusion of your report so that it is easy to find and your argument/train of thought is easy to follow. Quantifying values such as the 'intercept term' or the correlation/feature importance would aid understanding of your report.
In your report, your visualisations did not show up in the .md file. Perhaps you need to check the path in your script. Also, there are no titles in your plots. There are also some misspelled words ('relationshipts').
In the EDA part of your report, the captions do not reflect what I see in the table (5 vs. 6 counties). Perhaps you could also order it by max_cases in descending order to make it clearer. Also, if the table is meant to highlight the max cases, then perhaps the growth rate is not as relevant here. Lastly, you could look at the indexing and standardisation of the tables.
For the text in the visualisation part, I would encourage you to describe your analysis rather than just state what you did, i.e. what should I focus on in your plot that will make your analysis more convincing?
Lastly, the report seems to have been written by people with different writing styles. You can try and get a person in the group to read it, such that it flows smoothly.
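
The table-ordering suggestion above can be sketched as follows. This is only a minimal illustration written in Python (the project's own scripts are in R); the county names, the `max_cases` values and the `top_n_by` helper are hypothetical and not taken from the repository. It sorts rows in descending order of `max_cases` and keeps exactly as many rows as the caption promises.

```python
def top_n_by(rows, key, n=5):
    """Return the n rows with the largest value of `key`, in descending order."""
    return sorted(rows, key=lambda row: row[key], reverse=True)[:n]


# Hypothetical example data, not taken from the project's data set.
counties = [
    {"county": "County A", "max_cases": 320},
    {"county": "County B", "max_cases": 900},
    {"county": "County C", "max_cases": 150},
    {"county": "County D", "max_cases": 610},
    {"county": "County E", "max_cases": 750},
    {"county": "County F", "max_cases": 480},
]

# Passing n explicitly keeps the row count in step with a "top 5" caption.
top5 = top_n_by(counties, "max_cases", n=5)
for row in top5:
    print(row["county"], row["max_cases"])
```

Stating the row count explicitly, rather than relying on a default, is one way to keep a caption and its table from drifting apart.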

Data Folder

I am unsure why there are processed and raw folders with nothing inside. I understand that your raw files are too large to include; perhaps you could add a more informative description in your .txt file instead.
For the processed file, I believe the dataset might be small enough to include. Otherwise, the same advice applies.

README.md

The references did not show up. I think you did not include them in your readme script.
The links for the report (html) are broken.

Non-coding document:

I think your group did a good job in completing this task, and they all appear to be accurate.

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

Data analysis review checklist

Reviewer: NikitaShymberg

Conflict of interest

[X] As the reviewer I confirm that I have no conflicts of interest for me to review this work.

Code of Conduct

[X] I confirm that I read and will adhere to the MDS code of conduct.

General checks

[X] Repository: Is the source code for this data analysis available? Is the repository well organized and easy to navigate?
[X] License: Does the repository contain a plain-text LICENSE file with the contents of an OSI approved software license?

Documentation

[X] Installation instructions: Is there a clearly stated list of dependencies?
[X] Example usage: Do the authors include examples of how to use the software to reproduce the data analysis?
[X] Functionality documentation: Is the core functionality of the data analysis software documented to a satisfactory level?
[ ] Community guidelines: Are there clear guidelines for third parties wishing to 1) Contribute to the software 2) Report issues or problems with the software 3) Seek support

Code quality

[X] Readability: Are scripts, functions, objects, etc., well named? Is it relatively easy to understand the code?
[X] Style guidelines: Does the code adhere to well known language style guides?
[ ] Modularity: Is the code suitably abstracted into scripts and functions?
[ ] Tests: Are there automated tests or manual steps described so that the function of the software can be verified? Are they of sufficient quality to ensure software robustness?

Reproducibility

[X] Data: Is the raw data archived somewhere? Is it accessible?
[X] Computational methods: Is all the source code required for the data analysis available?
[X] Conditions: Is there a record of the necessary conditions (software dependencies) needed to reproduce the analysis? Does there exist an easy way to obtain the computational environment needed to reproduce the analysis?
[X] Automation: Can someone other than the authors easily reproduce the entire data analysis?

Analysis report

[X] Authors: Does the report include a list of authors with their affiliations?
[X] What is the question: Do the authors clearly state the research question being asked?
[ ] Importance: Do the authors clearly state the importance for this research question?
[X] Background: Do the authors provide sufficient background information so that readers can understand the report?
[ ] Methods: Do the authors clearly describe and justify the methodology used in the data analysis? Do the authors communicate any assumptions or limitations of their methodologies?
[X] Results: Do the authors clearly communicate their findings through writing, tables and figures?
[X] Conclusions: Are the conclusions presented by the authors correct?
[X] References: Do all archival references that should have a DOI list one (e.g., papers, datasets, software)?
[X] Writing quality: Is the writing of good quality, concise, engaging?

Estimated hours spent reviewing:

1.5

Review Comments:

I'm afraid that your plots don't show up in the covid_socioeconomic_report.md and the links to them are broken.
I can see that there is a get_data.R file in your src folder that doesn't appear to be used - I think it should probably be removed.
I would like to see a list of all the features that you used and a description of each. This would give me a clear and quick overview of your dataset.
I think that your conclusion is very clear and your suggestions for future improvements are very good.
I didn't see any documentation for your functions. It would be good to add this to make your code easier to understand.
I would like to see more than just 1 or 2 tests for each function. More tests would ensure that your code works correctly and would let you make adjustments in the future without worrying about breaking things.
Your report wasn't verbose at all - it got straight to the point and was brief. I liked this!
I think you forgot to include testthat in your list of dependencies.

Data analysis review checklist

Reviewer: iamMoid

Conflict of interest

[x] As the reviewer I confirm that I have no conflicts of interest for me to review this work.

Code of Conduct

[X] I confirm that I read and will adhere to the MDS code of conduct.

General checks

[X] Repository: Is the source code for this data analysis available? Is the repository well organized and easy to navigate?
[X] License: Does the repository contain a plain-text LICENSE file with the contents of an OSI approved software license?

Documentation

[X] Installation instructions: Is there a clearly stated list of dependencies?
[X] Example usage: Do the authors include examples of how to use the software to reproduce the data analysis?
[X] Functionality documentation: Is the core functionality of the data analysis software documented to a satisfactory level?
[X] Community guidelines: Are there clear guidelines for third parties wishing to 1) Contribute to the software 2) Report issues or problems with the software 3) Seek support

Code quality

[X] Readability: Are scripts, functions, objects, etc., well named? Is it relatively easy to understand the code?
[X] Style guidelines: Does the code adhere to well known language style guides?
[x] Modularity: Is the code suitably abstracted into scripts and functions?
[ ] Tests: Are there automated tests or manual steps described so that the function of the software can be verified? Are they of sufficient quality to ensure software robustness?

Reproducibility

[X] Data: Is the raw data archived somewhere? Is it accessible?
[X] Computational methods: Is all the source code required for the data analysis available?
[X] Conditions: Is there a record of the necessary conditions (software dependencies) needed to reproduce the analysis? Does there exist an easy way to obtain the computational environment needed to reproduce the analysis?
[X] Automation: Can someone other than the authors easily reproduce the entire data analysis?

Analysis report

[X] Authors: Does the report include a list of authors with their affiliations?
[X] What is the question: Do the authors clearly state the research question being asked?
[ ] Importance: Do the authors clearly state the importance for this research question?
[X] Background: Do the authors provide sufficient background information so that readers can understand the report?
[ ] Methods: Do the authors clearly describe and justify the methodology used in the data analysis? Do the authors communicate any assumptions or limitations of their methodologies?
[X] Results: Do the authors clearly communicate their findings through writing, tables and figures?
[X] Conclusions: Are the conclusions presented by the authors correct?
[ ] References: Do all archival references that should have a DOI list one (e.g., papers, datasets, software)?
[X] Writing quality: Is the writing of good quality, concise, engaging?

Estimated hours spent reviewing:

1 hour and 15 mins

Review Comments:

While most of the feedback has been captured by the previous reviewers, I noticed the following:

There are over 10 models/files/images in the 'results' folder which makes it a little difficult to determine which file is the output of which script. I would suggest adding number prefixes to filenames or reorganizing by creating subfolders.
The EDA.md file is very descriptive. Although the later tables have a caption/title, the first three tables do not seem to have a caption/title defined.
In the 'eda_covid_socioeconomics.R' script, I noticed that the code saving the tables to RDS could be factored into a function that takes the data frame and the output filename as inputs and saves the file to the assigned directory.
I noticed the 'plotly' library being imported in the 'eda_covid_socioeconomics.R' script, however, I did not see it under the 'References' section of the final report or in the 'covid_socioeconomic_refs.bib' bibliography file.
The 'covid_socioeconomic_report.md' report looks nicely structured and well laid out. One observation was that the 'Discussion' section begins with "PLEASE EDIT". Not sure if this was added later on to improve the text but nevertheless it must be removed.
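
The helper suggested above for the table-saving code could take roughly the following shape. This is a sketch under assumptions, written in Python for illustration only: the project's script is in R, where the body would presumably call `saveRDS` instead of writing CSV, and the name `save_table`, its parameters and the example table are hypothetical.

```python
import csv
import tempfile
from pathlib import Path


def save_table(rows, filename, out_dir):
    """Write a list of dict rows to out_dir/filename as CSV and return the path."""
    out_path = Path(out_dir) / filename
    # Create the output directory if it does not exist yet.
    out_path.parent.mkdir(parents=True, exist_ok=True)
    fieldnames = list(rows[0].keys()) if rows else []
    with out_path.open("w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
    return out_path


# Hypothetical usage: one call per table replaces repeated save code.
with tempfile.TemporaryDirectory() as tmp_dir:
    summary = [{"feature": "percent_smokers", "mean": 17.4}]
    saved = save_table(summary, "summary_table.csv", tmp_dir)
    print(saved.name)
```

Keeping the output directory as a parameter means every table goes through the same code path, which also makes the save step easier to test.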

Good choice of topic and the associated question! It is very relevant to the present-day situation and may be used for further analysis. Good luck for future milestones!

Reviewer: miyer26

Conflict of interest

[x] As the reviewer I confirm that I have no conflicts of interest for me to review this work.

Code of Conduct

[x] I confirm that I read and will adhere to the MDS code of conduct.

General checks

[x] Repository: Is the source code for this data analysis available? Is the repository well organized and easy to navigate?
[x] License: Does the repository contain a plain-text LICENSE file with the contents of an OSI approved software license?

Documentation

[x] Installation instructions: Is there a clearly stated list of dependencies?
[x] Example usage: Do the authors include examples of how to use the software to reproduce the data analysis?
[x] Functionality documentation: Is the core functionality of the data analysis software documented to a satisfactory level?
[x] Community guidelines: Are there clear guidelines for third parties wishing to 1) Contribute to the software 2) Report issues or problems with the software 3) Seek support

Code quality

[x] Readability: Are scripts, functions, objects, etc., well named? Is it relatively easy to understand the code?
[x] Style guidelines: Does the code adhere to well known language style guides?
[x] Modularity: Is the code suitably abstracted into scripts and functions?
[x] Tests: Are there automated tests or manual steps described so that the function of the software can be verified? Are they of sufficient quality to ensure software robustness?

Reproducibility

[x] Data: Is the raw data archived somewhere? Is it accessible?
[x] Computational methods: Is all the source code required for the data analysis available?
[x] Conditions: Is there a record of the necessary conditions (software dependencies) needed to reproduce the analysis? Does there exist an easy way to obtain the computational environment needed to reproduce the analysis?
[ ] Automation: Can someone other than the authors easily reproduce the entire data analysis?

Analysis report

[x] Authors: Does the report include a list of authors with their affiliations?
[x] What is the question: Do the authors clearly state the research question being asked?
[ ] Importance: Do the authors clearly state the importance for this research question?
[x] Background: Do the authors provide sufficient background information so that readers can understand the report?
[x] Methods: Do the authors clearly describe and justify the methodology used in the data analysis? Do the authors communicate any assumptions or limitations of their methodologies?
[x] Results: Do the authors clearly communicate their findings through writing, tables and figures?
[ ] Conclusions: Are the conclusions presented by the authors correct?
[x] References: Do all archival references that should have a DOI list one (e.g., papers, datasets, software)?
[x] Writing quality: Is the writing of good quality, concise, engaging?

Estimated hours spent reviewing: 1.5 hours

Review Comments:

Overall, I think the report is well structured and the analysis is easy to follow from the EDA to the presentation of the results. The plots sync well with the narrative and help guide the reader through the analysis. The topic itself is very relevant and also difficult to model - it's great to see this being tackled. Good job!

Here is some of my feedback which hopefully will be useful to further your analysis. Please do ignore points which you feel are irrelevant or too minute:

1) There is a time out error when running make all or src/get_kaggle_data.R to download the data. I have Kaggle API set up on my system. I am not sure if I missed something locally, so I apologize if this is a false flag.

2) You can provide some examples for the "wide variety of challenges" brought about by COVID mentioned in the introduction. Additionally, statistics from previous studies can be included to further contextualize the issue.

3) In general the discussion on the treatment of data is excellent. However, there is no mention of missing data and how it was dealt with, if present.

4) For Table 5 and Table 6, the column headings of the tables in the EDA section can be made more readable. It is a little difficult to interpret the 'CHR' part of percent_unemployed_CHR. Also, both tables display the top and bottom 6 counties rather than 5. The ID can also be removed from Table 6.

5) Since the distributions and later analysis deal with the numerical features, it may be worthwhile to list the number of numeric features under the data section in addition to the total number of selected features.

6) It will be really helpful to provide the general form of the linear regression model used before Figure 2. This will make it easier to understand the plots and the discussion of the coefficients that follow. I do see that in a previous section it was mentioned that a linear regression model with interaction is being used.

7) In Figure 2, I see that some of the features have outliers which are expanding the range of the x-axis. Since the key takeaway here is the slope of regression line, it might be better to zoom in on the plot and emphasize the difference in slopes.

8) In Table 9, rather than looking at the coefficients using a random sample, it may be better to order the coefficients by significance to get a true measure of all the significant coefficients. This will also emphasize the most important social determinants.

9) The discussion section mentions some very relevant points. However, the comparison between the coefficients should be explained in a little more detail as the 'normalizing' mentioned does not indicate how the data was treated (scaling?). If this is the case, it may be better to provide an explicit plot showing the coefficients after scaling the features. As the magnitudes of the features are very different, this is really important for comparisons to be made.

10) It will be great to explicitly mention the most relevant features to the model in the results section and possibly interpret them for further discussion.
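
To illustrate why point 9 matters, here is a minimal sketch of how a regression coefficient depends on the units of its feature, and how standardizing removes that dependence. It is written in Python with made-up numbers purely for illustration (the project's analysis is in R), and the `slope` and `standardize` helpers are hypothetical, not the authors' method.

```python
from statistics import mean, pstdev


def slope(x, y):
    """Ordinary least-squares slope of y on a single predictor x."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx


def standardize(values):
    """Rescale values to mean 0 and standard deviation 1 (z-scores)."""
    m, s = mean(values), pstdev(values)
    return [(v - m) / s for v in values]


# Made-up data: the same predictor recorded as a percentage and as a fraction.
percent = [10.0, 12.0, 15.0, 18.0, 20.0]
fraction = [p / 100 for p in percent]
cases = [500.0, 560.0, 640.0, 700.0, 760.0]

# Raw slopes differ by a factor of 100 purely because of the units.
raw_percent = slope(percent, cases)
raw_fraction = slope(fraction, cases)

# After standardizing, both encodings give the same slope.
std_percent = slope(standardize(percent), cases)
std_fraction = slope(standardize(fraction), cases)

print(raw_percent, raw_fraction)
print(std_percent, std_fraction)
```

Coefficients fitted on standardized features are expressed per standard deviation of each feature, which is what makes comparing their magnitudes across features meaningful.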

Overall great job guys, it was really nice to review your work! Best of luck for the future milestones!

Hello!

Thank you very much for your detailed feedback.

Here is the list of feedback we agreed with and how we responded to them:

Feedback 1: Redundancy of get_data.R script (@Jacq4nn, @NikitaShymberg)
Response 1: Deleted get_data.R script

Feedback 2: Visualisations did not show up in report (@Jacq4nn, @NikitaShymberg)
Initially, we used the here package to specify the file paths to our images. However, the plots did not seem to render in the report because of this. After changing the file paths to relative paths, the problem was fixed.
Response 2: Used relative paths instead of the here package to specify figures in the report

Feedback 3: No information about how missing data was handled (@miyer26)
Response 3: More detailed information about how features were transformed has been added under the Analysis section of the report

Feedback 4: No function documentation in the scripts (@NikitaShymberg)
Response 4a, b: Included function documentation in relevant scripts

Feedback 5: License was not copyrighted to the correct authors (TA, Milestone 1)
Response 5: Added author names in license

Feedback 6: Question was not specific enough in terms of the explanatory variables to be used (TA, Milestone 1)
Response 6: Added examples of the explanatory variables to be used

Feedback 7: Insufficient interpretation of EDA figures (TA, Milestone 2)
Response 7: Added more explanations about the results of EDA and our interpretation

Feedback 8: Empty data/processed directory and unclear instructions (@Jacq4nn)
Response 8: Added clearer instructions to refer to the README file under the Usage section and added processed data file to repository

Feedback 9: Time out error when downloading data file (@miyer26)
Response 9: Improved implementation of get_kaggle_data.R script by extracting zipped files immediately instead of reading the data file first
