amitp06 / COVID-19

Novel Coronavirus (COVID-19) Cases, provided by JHU CSSE
https://systems.jhu.edu/research/public-health/ncov/

Work and Script for Nov 15 (really nov 11 :) ) #5

Closed aksomers closed 3 years ago

aksomers commented 3 years ago

Work to do:

Answer:

new work before the video

aksomers commented 3 years ago

I updated the COVID data and the Google data and pushed it to our remote. So I think you'll just need to do a git pull to get everything through early November.

I adjusted the data code slightly to split out an out-of-time (OOT) set. It wasn't really necessary but I kept it anyway since it's kind of nice to have it separate.

I created predict_sept and predict_oct code files that take that OOT set, adjust the naming to make things easy, and score it to predict low/high for Sept and Oct. They then return the same stats that you computed before. I need to look closely, but I think our precision went down while recall was fairly stable, which would mean we are returning more false positives. Maybe we need to play with the threshold?

I think the items outlined above are easily 5 minutes already. Is there anything else you want to try before the deadline? Should our final few weeks be trying to get some kind of simple time series/varmax going? I think the last video is sort of an overall summary so we can reiterate all the prior stuff, plus any new stuff we glean from varmax (or a discussion of why it didn't go well).

aksomers commented 3 years ago

What is the main scientific question (or questions) that you are trying to answer? Be specific!

Andrew -*- A short and broad version of our main question is "what is the relationship between COVID and community mobility?" There are multiple topics that fall underneath this category. We are mainly trying to assess how different types of community mobility correspond with COVID case growth. Specifically, of the following categories of mobility: Grocery & pharmacy; Parks; Public Transit Stations; Retail & recreation; Residential; and Workplace, how do mobility changes in these categories impact COVID case growth at the US county level?

There are also multiple sub-questions that we may or may not assess depending on the data. Does this relationship persist across time (i.e., is it predictive for future months)? Does the relationship differ by regional geography? Does the relationship differ by county population? -*-

What methods are you using to answer this question? Again, be specific! By specific, I mean: just saying multiple linear regression is not specific. Specify the supervisor and generally state the features and how this method can be used to answer one of the main scientific questions. If you are using the lasso, how are you choosing the tuning parameter? Are you standardizing your features?

Note--I think this section is worth half the points, so we should spend the most time on it.

Andrew -*- By county, we record the number of cases at the end of each month and calculate the percent change in cases from the end of the prior month. Using the median percent change as a cutoff, we divide the counties into "low growth" and "high growth" counties. The supervisor is this binary growth label for July to August. Because the supervisor is binary, logistic regression is a natural framework. We select variables using forward stepwise selection, with AIC as the criterion for which variable to add at each step. To estimate out-of-sample performance rather than training error, we evaluate the model using 10-fold cross-validation.
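As a rough illustration of the labeling step described above (not our actual data code), here is a minimal Python sketch. The function names and the case counts are hypothetical, and treating counties exactly at the median as "low" is an assumption, since the script does not say how ties are handled.

```python
from statistics import median


def percent_change(prev_cases, curr_cases):
    """Percent change in cumulative cases from one month-end to the next."""
    return 100.0 * (curr_cases - prev_cases) / prev_cases


def label_growth(prev, curr):
    """Label each county 'high' or 'low' using the median percent change as the cutoff.

    Assumption: counties exactly at the median are labeled 'low'.
    Counties with zero prior cases are skipped to avoid dividing by zero.
    """
    changes = {c: percent_change(prev[c], curr[c]) for c in prev if prev[c] > 0}
    cutoff = median(changes.values())
    labels = {c: ("high" if pct > cutoff else "low") for c, pct in changes.items()}
    return labels, cutoff


# Hypothetical month-end cumulative case counts (not real data).
july = {"A": 100, "B": 200, "C": 400, "D": 50}
august = {"A": 108, "B": 300, "C": 440, "D": 100}

labels, cutoff = label_growth(july, august)
print(cutoff)  # median of the four percent changes
print(labels)
```

Because the cutoff is recomputed from each month's own percent changes, the same county can flip labels from month to month even if its growth rate stays the same.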

The candidate predictors/features are the mean mobility values (by category) for the month of August, the overall county population in August, and the change in mean mobility from July to August (by category). Negative values represent a decrease in mobility while positive values represent an increase in mobility. So putting this all together, the model will tell us how mean mobility changes and the mean mobility level correspond to whether a county is "high growth" or "low growth" in cases. -*-

Amit -*- As a concrete example: Navajo County, AZ had a Work Mobility score of -19 in August but a change value of +5 in Work Mobility from July to August. Similar variables were calculated for other categories. The overall growth in COVID cases was approximately +8% from July to August which was far below the county median. Therefore, the true label for this county was "Low Growth". We used all of the county data in this manner to build the model with stepwise AIC and cross-validation.
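To make the two kinds of features concrete, here is a small Python sketch of how a monthly level and a month-over-month change could be computed for one mobility category. The function name and the daily values are hypothetical; the daily values were chosen only so that the monthly means match the example above (an August mean of -19 and a change of +5, which implies a July mean of -24).

```python
from statistics import mean


def mobility_features(daily_prev, daily_curr):
    """Return the current month's mean mobility and its change from the prior month."""
    level = mean(daily_curr)
    change = level - mean(daily_prev)
    return {"mean": level, "change": change}


# Hypothetical daily workplace mobility values (percent vs. baseline).
july_work = [-26, -24, -22, -24]    # mean is -24
august_work = [-20, -18, -19, -19]  # mean is -19

features = mobility_features(july_work, august_work)
print(features)
```

A county can therefore have a negative level (still below baseline) and a positive change (recovering toward baseline) at the same time, which is why both enter the model as separate candidates.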

Since fitting and saving this model object, we now have data through the end of October, so we can use the trained model object to see how it predicts for these months, and whether it corresponds to what actually happened.

(show table of results in google doc and/or talk through the numbers; calibration plots are included)

We start with our test data performance from the August cross-validation (not the training error). The model metrics are not very high, but it's clear that the model is providing some predictive value. This is encouraging for a small set of variables in an incredibly complex problem. We were happy to see that the model is reasonably well-calibrated too.
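For anyone unfamiliar with what "well-calibrated" means here, the sketch below shows one common way to check it: bin the predicted probabilities and compare the average prediction in each bin with the observed share of "high growth" counties. This is a generic illustration in Python, not the code behind our calibration plots, and the labels and probabilities are made up.

```python
def calibration_table(y_true, probs, n_bins=5):
    """Compare mean predicted probability with the observed rate in equal-width bins.

    Returns a list of (bin_index, count, mean_predicted, observed_rate) tuples,
    skipping empty bins.
    """
    bins = [[] for _ in range(n_bins)]
    for y, p in zip(y_true, probs):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx].append((y, p))

    rows = []
    for i, items in enumerate(bins):
        if not items:
            continue
        mean_pred = sum(p for _, p in items) / len(items)
        observed = sum(y for y, _ in items) / len(items)
        rows.append((i, len(items), mean_pred, observed))
    return rows


# Hypothetical labels (1 = high growth) and predicted probabilities.
y_true = [0, 0, 1, 0, 1, 1, 1, 0]
probs = [0.1, 0.15, 0.3, 0.35, 0.65, 0.7, 0.9, 0.95]

rows = calibration_table(y_true, probs)
for row in rows:
    print(row)
```

In a well-calibrated model the last two columns are close to each other in every bin. A model that over-predicts "high growth" shows observed rates below the mean predicted probability.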

Over time, the model increasingly over-predicts “high growth”: it flags more counties as high growth than actually turn out to be. This may be because the trends that indicated high growth counties in August are starting to shift. Still, the model retains some predictive power.

We can change the model’s decision threshold to get more balanced precision and recall. For example, in October, if we change the prediction threshold from 0.5 to 0.6, the new precision is 0.602 and the new recall is 0.574. This doesn’t affect the AUC or calibration diagnostics, though, since those depend on the predicted probabilities rather than on the threshold. It also doesn’t change the fact that the model's performance degraded over time.
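The threshold trade-off can be sketched in a few lines of Python. The helper name and the toy labels and probabilities below are hypothetical, so the numbers it prints will not match the October figures quoted above.

```python
def precision_recall(y_true, probs, threshold=0.5):
    """Precision and recall for the 'high growth' class at a given probability threshold."""
    preds = [1 if p >= threshold else 0 for p in probs]
    tp = sum(1 for y, yhat in zip(y_true, preds) if y == 1 and yhat == 1)
    fp = sum(1 for y, yhat in zip(y_true, preds) if y == 0 and yhat == 1)
    fn = sum(1 for y, yhat in zip(y_true, preds) if y == 1 and yhat == 0)
    precision = tp / (tp + fp) if (tp + fp) else float("nan")
    recall = tp / (tp + fn) if (tp + fn) else float("nan")
    return precision, recall


# Hypothetical labels (1 = high growth) and predicted probabilities.
y_true = [1, 1, 1, 0, 1, 0, 0, 0]
probs = [0.9, 0.8, 0.65, 0.55, 0.52, 0.58, 0.3, 0.2]

p_50, r_50 = precision_recall(y_true, probs, threshold=0.5)
p_60, r_60 = precision_recall(y_true, probs, threshold=0.6)
print(p_50, r_50)
print(p_60, r_60)
```

Raising the threshold removes the least confident "high growth" calls, so false positives fall (precision rises) while some true positives are lost (recall falls). The ranking of counties by predicted probability is untouched.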

One more note: evaluating this model requires defining “high growth” counties as those above the median for the month of interest. This median is shifting as time goes on too, which is another reason why time is an important factor with this data.

In the future, we will try out VARMAX to address this concern. We may also try an alternate variable selection method to get the variables for VARMAX. -*-

What is the most interesting (to you) thing you have discovered so far about your main scientific question(s)?

Andrew -*- So far, our most interesting finding is the set of variables selected. We found that:

  1. Higher mean work and grocery mobility corresponded to "high growth" counties (should we quote actual %/probability changes/etc.? -- I think we can just state general trends for now and include final numbers in final video)
  2. Increase in mean residential mobility also corresponded to "high growth" counties.
  3. Higher parks mobility corresponded to "low growth" counties.
  4. All of these make intuitive sense to us based on what we know about the disease--if more people are "back to work" and/or shopping for groceries, there are more opportunities for exposure and thus case growth. This also goes for residential mobility--if people are visiting each other's houses, presumably they are often coming in close contact with others for extended periods of time, which intuitively should also correspond with growth. On the other hand, higher parks (i.e., more outdoor) mobility may indicate areas where people are doing outdoor activities in lieu of indoor ones, which ought to carry less risk. -*-

Amit -*-

  1. Other selected variables were population, changes in transit mobility, and mean retail mobility. The coefficients on these selections made less intuitive sense, at least on the surface--all else equal, we would have expected a higher population to suggest more potential for growth. However, there could be confounding variables related to population density, and more highly populated counties may have a stronger preventative response. Transit and retail mobility levels are harder to explain and may require more research, since intuitively we would have expected more cases with higher levels of these variables. Of course, since each regression coefficient is the estimated effect GIVEN all other variables in the model, we have to remember the direction can be counterintuitive.

However, these variables also had higher p-values, i.e., lower significance (is this even acceptable to say? -- I think so). Since we used a stepwise process, the p-values are no longer valid, so we left these variables in the model.

Brief wrap up -*-

amitp06 commented 3 years ago

I spent some time looking at the results we have so far. See the doc I linked at the end of this post for my summary. The model definitely is suffering from the time element though it's still capturing some signal which is nice to see. I think we should go ahead with the idea to try VARMAX after our next milestone. We can also take a second stab at variable selection with the new methods we've learned. I will work on the script next after we discuss.

https://docs.google.com/document/d/1TTCNccmk3qHrnbuk9zY1mxO5GJQocNGkhCbsvweB9lE/edit?usp=sharing

amitp06 commented 3 years ago

Closing. Addressed in last two videos.