amitp06 / COVID-19

Novel Coronavirus (COVID-19) Cases, provided by JHU CSSE
https://systems.jhu.edu/research/public-health/ncov/

Final Assignment #7

Closed aksomers closed 3 years ago

aksomers commented 3 years ago

| Criteria | Ratings |
| --- | --- |
| What are the 3 most interesting (to you) things you discovered about your main scientific question(s)? | 30.0 pts - Full Marks; 20.0 pts - 2 things; 10.0 pts - 1 thing; 0.0 pts - No Marks |
| What was the biggest challenge you faced in this project? Be specific, discussing how the project would have proceeded faster/smoother without this issue. | 30.0 pts - Full Marks; 24.0 pts - Some lack of specificity - Clearly state the challenge but without talking about how the project would have been different without it; 18.0 pts - Substantial lack of specificity - Vaguely stating the challenge; 0.0 pts - No Marks |
| If you started the project over from the beginning right now, knowing what you know, what is a major way in which your project would have turned out differently? | 25.0 pts - Full Marks; 0.0 pts - No Marks |
| R Shiny app - Provide a link to a working R Shiny app that I can navigate to that displays some aspect of your project. Use this app to discuss one of the above bullet points in your presentation. | 15.0 pts - Full Marks; 0.0 pts - No Marks |

aksomers commented 3 years ago

Putting together a script and planning issue while we figure out exactly how many points to shoot for. For example, I'm sure we'll do the video, at most 10 minutes long. I'm all for NOT doing the optional RMarkdown :).

aksomers commented 3 years ago

Video Script

Reintroduce Topic

Andrew -*- Our main question is "what is the relationship between COVID and community mobility?" We are mainly trying to assess how different categories of community mobility correspond with COVID case growth at the US county level.

The three most interesting things we were able to answer were...

1 - The intuitive set of variables selected. Higher mean work and grocery mobility corresponded to "high growth" counties. An increase in mean residential mobility also corresponded to "high growth" counties. Higher parks mobility corresponded to "low growth" counties.

All of these make intuitive sense to us based on what we know about the disease--if more people are "back to work" and/or shopping for groceries, there are more opportunities for exposure and thus case growth. This also goes for residential mobility--if people are visiting each other's houses, presumably they are often in close contact with others for extended periods of time, which intuitively should also correspond with growth. On the other hand, higher parks (i.e., more outdoor) mobility may indicate areas where people are doing outdoor activities in lieu of indoor ones, which ought to carry less risk.

Show the Shiny app at some point while talking: https://amitp06.shinyapps.io/COVID_Mobility/

This Shiny app helps show a little of what we were getting at--what is the relationship between July mobility categories, measured on the x-axis, and July-to-August growth, measured on the y-axis? Each dot on the chart represents a county. We can see that mobility trended downward (negative) for most categories, but parks and residential were higher. It is harder to ascertain an overall univariate trend in these results, which is why the multivariate regression we ran is more useful for understanding the ultimate impacts.

2 - Unintuitive results. Other selected variables were population, changes in transit mobility, and mean retail mobility. The coefficients on these selections made less intuitive sense, at least on the surface--all else equal, a higher population would have suggested to us more potential for growth. However, there could be confounding variables related to population density, and more highly populated counties may have a stronger preventative response. Transit and retail mobility levels are harder to explain and may require more research, since intuitively we would have expected more cases with higher levels of these variables. Of course, since each regression coefficient is the value GIVEN all other variables in the model, we have to remember that its direction can be counterintuitive. -*-
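To illustrate that last point, here is a minimal sketch of how a coefficient's sign can flip once a correlated variable is in the model. The numbers are synthetic and chosen only for the illustration; they are not our county data, and `x1`, `x2`, and `y` are hypothetical stand-ins for two correlated mobility measures and case growth.

```python
from statistics import mean

def simple_slope(x, y):
    """Slope from regressing y on one predictor x (with intercept)."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

def two_predictor_slopes(x1, x2, y):
    """OLS slopes for y ~ x1 + x2 (with intercept), from the centered normal equations."""
    m1, m2, my = mean(x1), mean(x2), mean(y)
    c1 = [a - m1 for a in x1]
    c2 = [a - m2 for a in x2]
    cy = [a - my for a in y]
    s11 = sum(a * a for a in c1)
    s22 = sum(a * a for a in c2)
    s12 = sum(a * b for a, b in zip(c1, c2))
    s1y = sum(a * b for a, b in zip(c1, cy))
    s2y = sum(a * b for a, b in zip(c2, cy))
    det = s11 * s22 - s12 * s12  # nonzero as long as x1 and x2 are not collinear
    b1 = (s22 * s1y - s12 * s2y) / det
    b2 = (s11 * s2y - s12 * s1y) / det
    return b1, b2

# Synthetic data (assumption): x2 moves closely with x1, and y = 3*x1 - x2 exactly.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x2 = [1.5, 1.5, 3.5, 3.5, 5.5, 5.5]
y = [3 * a - b for a, b in zip(x1, x2)]

alone = simple_slope(x2, y)                     # x2 on its own: positive slope
b1, b2 = two_predictor_slopes(x1, x2, y)        # x2 given x1: negative slope
print(f"x2 alone: {alone:.2f}, x2 given x1: {b2:.2f}")
```

On its own, `x2` looks like it raises `y` because it travels with `x1`; once `x1` is held fixed, the remaining effect of `x2` is negative. Something similar could be going on with our transit and retail coefficients, though this sketch does not show that it is.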

Amit -*- (need to add more here when speaking)

3 - Performance degradation over time. We know so much more now about how to mitigate risky scenarios and what is and isn't risky, so it's maybe not surprising that growth patterns are changing and that the things that indicated high growth in the past no longer necessarily indicate it, at least not to the same extent.

Biggest Challenge

The largest challenge in this project was joining two datasets from different sources and trying to get the full value out of both of them. The mobility data from Google had a lot of missing values, especially for smaller counties, presumably due to privacy reasons. The COVID data was more complete, but had some unusual formatting when it came to counties and dates. It took some time to figure out how to combine the two into one dataset, and we had to reshape and clean both.
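A minimal sketch of that reshape-and-join step is below. It is not our actual pipeline: it assumes the case table is wide (one row per county, one column per date labeled like `7/1/20`, with the county code stored as a number) and that mobility rows are already one per county per day. The field names (`FIPS`, `fips`, `date`, `workplaces`) and all values are hypothetical.

```python
from datetime import date, datetime

def normalize_fips(value):
    """Turn 1001, 1001.0 or "01001" into the 5-character string "01001"."""
    return str(int(float(value))).zfill(5)

def melt_cases(wide_rows):
    """Reshape wide case rows into a {(fips, date): cases} lookup."""
    long = {}
    for row in wide_rows:
        fips = normalize_fips(row["FIPS"])
        for column, value in row.items():
            if column == "FIPS":
                continue
            day = datetime.strptime(column, "%m/%d/%y").date()
            long[(fips, day)] = int(value)
    return long

def join_mobility(mobility_rows, cases_long):
    """Inner join on (fips, date), dropping rows whose mobility value is missing."""
    joined = []
    for row in mobility_rows:
        key = (normalize_fips(row["fips"]), row["date"])
        if row["workplaces"] is None or key not in cases_long:
            continue
        joined.append({
            "fips": key[0],
            "date": key[1],
            "workplaces": row["workplaces"],
            "cases": cases_long[key],
        })
    return joined

# Made-up rows: the second county has no mobility values at all.
wide_cases = [
    {"FIPS": 1001.0, "7/1/20": 10, "7/2/20": 12},
    {"FIPS": 56045.0, "7/1/20": 3, "7/2/20": 3},
]
mobility = [
    {"fips": "01001", "date": date(2020, 7, 1), "workplaces": -30.0},
    {"fips": "01001", "date": date(2020, 7, 2), "workplaces": None},
    {"fips": "56045", "date": date(2020, 7, 1), "workplaces": None},
]
joined = join_mobility(mobility, melt_cases(wide_cases))
print(len(joined), "of", len(mobility), "mobility rows survive the join")
```

Even in this toy version, a county with no mobility values disappears from the joined data entirely.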

Even once we had a single dataset, the missing values limited which methods we could use and how much we could generalize. The counties were not missing at random. Imputation was also difficult given the nature of this data. It didn't seem realistic to make broad assumptions about case growth for missing counties. So, we had to reduce the scope of the model to counties with mobility data (which may not represent all counties). -*-

Andrew -*- Without this challenge, the project would've progressed faster because we would've invested more of our time in trying different modeling methods without worrying about the impact of nulls. For example, we tried to better account for the county-level structure and exogenous variables using VARMAX, but ultimately didn't find a solution we were happy with. Our data covered roughly 200 days, and we had thousands of counties, plus the exogenous variables--so we quickly ran into dimensionality problems and never really found the time to reformulate our code, or our ultimate scientific question, to accommodate them.
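A back-of-the-envelope count shows why the dimensionality bites. This sketch counts only the coefficients of a vector autoregression with exogenous terms (no moving-average terms and no error covariance, so it understates a full VARMAX). The 3,000 counties, single lag, and 6 exogenous variables are assumptions for illustration, not our exact setup.

```python
def varx_coefficient_count(n_series, n_lags, n_exog):
    """Coefficients in a VARX model: every series gets a coefficient on every
    lag of every series, plus one per exogenous variable, plus an intercept."""
    lag_coefs = n_lags * n_series ** 2
    exog_coefs = n_series * n_exog
    intercepts = n_series
    return lag_coefs + exog_coefs + intercepts

n_counties = 3000   # assumption standing in for "thousands of counties"
n_days = 200        # roughly the length of our data
n_exog = 6          # assumption: a handful of mobility categories

coefficients = varx_coefficient_count(n_counties, n_lags=1, n_exog=n_exog)
observations = n_counties * n_days

print(f"coefficients to estimate: {coefficients:,}")
print(f"county-day observations:  {observations:,}")
print(f"coefficients per observation: {coefficients / observations:.1f}")
```

Because the lag term grows with the square of the number of series, treating every county as its own series needs far more coefficients than we have county-days, even with one lag.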

If we were to start over from the beginning right now

We would spend less time on VARMAX, a "theoretically nice" setup for this type of problem (where we technically have many different concurrent time series at a county level and a lot of exogenous variables), and more time on some less classical methods that may be better equipped to handle situations where we have more predictors than observations (p > n). We found papers where people had success using "machine learning" techniques for time series forecasting (e.g., random forest for H5N1 outbreaks here https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-15-276). -*-

Amit -*- Other data scientists have described success using XGBoost to help handle missing values (https://towardsdatascience.com/classical-time-series-vs-machine-learning-methods-80290850bd5b). Missing values were abundant in our data, since a large part of it depended on Google's penetration into the cell phone market. With an approach like that, it's possible we could have had more data to work with and been able to better model less dense counties. -*-

amitp06 commented 3 years ago

One idea for the app is to showcase how the categories are different. X axis = mean mobility score bucket (< -20, -20 to 0, 0 to 20, > 20), Y axis = case growth (as a %), and there is a filter above the graph that lets you switch the categories between work, grocery, parks, etc. That's just one example of a layout that makes sense to me. If it's impractical or you want to continue building on your prototype, I'm open to something completely different.
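Here's a rough sketch of the two calculations that layout would need, written in Python just to pin down the logic (the app itself is R Shiny). Which bucket the edge values -20, 0, and 20 fall into is my assumption, and the function names are made up.

```python
def mobility_bucket(score):
    """Map a mean mobility score to one of the four x-axis buckets.
    Assumption: -20 and 0 go to the bucket above them, 20 to the bucket below."""
    if score < -20:
        return "< -20"
    if score < 0:
        return "-20 to 0"
    if score <= 20:
        return "0 to 20"
    return "> 20"

def percent_growth(start_cases, end_cases):
    """Case growth as a percentage; None when there is no starting count."""
    if start_cases == 0:
        return None
    return (end_cases - start_cases) / start_cases * 100

# Example with made-up numbers: a county at -12.5 mean work mobility
# that went from 100 to 150 cases.
print(mobility_bucket(-12.5), percent_growth(100, 150))
```

Counties with zero starting cases have no defined percentage growth, so the app would need to either drop them or show them separately.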

I can add specific text to the script later, but here are my general thoughts on what we can talk about in the video:

For 3 most interesting things, I would stick with the two you have plus the fact that the performance degraded over time. That indicates how quickly the growth patterns are changing. This whole question is kind of a recap of our last update.

For biggest challenge, I would focus on the "joining two datasets" challenge. We haven't discussed it much yet, but the amount of missing data and various join issues really limited the scope of what we could do.

For the "start over" question, I agree with what you have written already. Not much more needed to answer it.

amitp06 commented 3 years ago

We're done!