amitp06 / COVID-19

Novel Coronavirus (COVID-19) Cases, provided by JHU CSSE
https://systems.jhu.edu/research/public-health/ncov/

What Types of Models to Run #3

Closed aksomers closed 3 years ago

aksomers commented 3 years ago

I started playing with a simple time series model and quickly confused myself, mainly because the problem is trickier than it first appears. What we really have is many time series, one per county, so modeling it as a single time series gets weird.

I'm sure there is a way to do some kind of hierarchical model or pooled model, but I guess it isn't as simple as I had hoped.

There are probably ways to start simple and roll this up to a CW level or state level, just to get something going before we think about the full-blown solution.

I also wonder if it makes sense to throw some bagging/boosting or even standard regression methods at it first. There's something so theoretically appealing about treating it as a hierarchical situation rather than fitting a bunch of separate models or throwing something else at it...but it starts to get complicated.

Thoughts? I know I kind of rambled above, but hopefully this helps get across what I'm thinking about.

amitp06 commented 3 years ago

My thought is that we could aggregate time series by month and then use cast() or similar to reshape the dataset. Right now, we have one row per day-county combination. After aggregating by month and reshaping, we would end up with the following columns roughly (where the suffix represents the month):

state, county, cases_03, deaths_03, mobility_vars_03, cases_04, deaths_04, mobility_vars_04, ...

This gives us one row per county which is much more manageable for modeling purposes. From there, I would start with a standard regression or classification model depending on the specific question we start with. Each column can be a feature and we would feature engineer to get more relevant ones.

Example: cases_diff_04 = cases_04 - cases_03; cases_mean = mean of the cases_* columns
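A minimal sketch of the aggregate-then-reshape idea, using a toy stand-in for the real long dataset (one row per day-county). The column names and values here are illustrative, not the repo's actual schema; `reshape2::dcast` stands in for the `cast()` mentioned above.

```r
library(reshape2)

# Toy long dataset: one row per day-county, as in the current data.
daily <- data.frame(
  state  = "NY",
  county = c("A", "A", "A", "A"),
  date   = as.Date(c("2020-03-01", "2020-03-31", "2020-04-01", "2020-04-30")),
  cases  = c(1, 4, 6, 10),
  deaths = c(0, 1, 1, 2)
)
daily$month <- format(daily$date, "%m")

# One row per county-month; counts are cumulative, so take the month's max.
monthly <- aggregate(cbind(cases, deaths) ~ state + county + month,
                     data = daily, FUN = max)

# Long -> wide: one row per county, columns like cases_03, deaths_03, ...
wide <- dcast(melt(monthly, id.vars = c("state", "county", "month")),
              state + county ~ variable + month)

# Example engineered features from the thread:
wide$cases_diff_04 <- wide$cases_04 - wide$cases_03
wide$cases_mean    <- rowMeans(wide[grep("^cases_\\d+$", names(wide))])
```

The `variable + month` right-hand side of the `dcast` formula is what produces the suffixed column names described above.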

With the number of counties, I think we have a solid amount of data for a regression or classification problem. Of course, I'm not opposed to doing something more hierarchical, but would need to give it more thought how to approach such a thing. Maybe we start simpler and expand later if we can.

aksomers commented 3 years ago

In general I like this idea. My big question is: how feasible is it to aggregate? The mobility data is % change from some baseline numbers over some week in February (I think). Do we know what the baseline is?

Agree with starting with a simple regression problem and growing it from there.

amitp06 commented 3 years ago

Google says this about the baseline: The baseline is the median value, for the corresponding day of the week, during the 5-week period Jan 3–Feb 6, 2020.

Since the baseline is a fixed quantity, I think we can treat the difference from the baseline as a score that can be aggregated, rather than treating it purely as a percentage. In other words, the aggregated mobility number may no longer represent the % difference from baseline, but it still captures the signal of "mobility compared to normal". What do you think of that?
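A tiny illustration of treating the daily "% change from baseline" values as scores and summarizing each county-month with the median (as agreed later in the thread). The `mobility_pct` values are made up.

```r
# Toy daily mobility data: % change from Google's fixed Jan-Feb baseline.
mob <- data.frame(
  county       = "A",
  month        = c("03", "03", "04", "04"),
  mobility_pct = c(-10, -20, -35, -45)
)

# Monthly median "mobility vs. normal" score per county.
mob_monthly <- aggregate(mobility_pct ~ county + month, data = mob, FUN = median)
```

The result is no longer literally a "% change from baseline" for the month, but it preserves the ordering of more vs. less mobile counties.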

aksomers commented 3 years ago

I think that's reasonable. Mean or median, maybe?

amitp06 commented 3 years ago

Yeah, I was also planning to start with the median of the month's days to represent each month.

aksomers commented 3 years ago

I pushed code (data_munge2.R) that creates a wide dataset with monthly columns. Probably won't have a chance to do much more until later.

Should be easy to create some of the features you describe. What do we want to predict, exactly? Changes in cases/deaths for months 7/8 based on prior mobility data as a start? Something else? You're a better modeler than I am, so I'd definitely like your thoughts.

amitp06 commented 3 years ago

Thanks for the updated dataset and compliment! I was brainstorming different things to model, but I think a lot of them would require a more complex time series and/or spatial statistics setup than we'd like to do for now. Let's go with what you said since it's the most fundamental question. To be specific, I propose this version:

Given the mobility data and population in February - August, can we predict the growth rate of August cases within a county? Growth rate = (August cases / July cases - 1); if July cases are 0 (so that expression is undefined), fall back to (August cases - 1). The hope is that the growth rate response and population variable make counties of different sizes more comparable.
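The proposed response variable, written as a small hypothetical helper (not from the repo), including the fallback for counties with zero July cases:

```r
# Growth rate = August/July - 1; when July cases are 0 the ratio is
# undefined, so fall back to (August cases - 1) as defined above.
growth_rate <- function(cases_aug, cases_jul) {
  ifelse(cases_jul > 0, cases_aug / cases_jul - 1, cases_aug - 1)
}

growth_rate(150, 100)  # 0.5, i.e. 50% growth
growth_rate(5, 0)      # 4, the fallback case
```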

We can start with a linear regression directly. I suspect we will end up turning it into some sort of classification problem (high growth county vs. low growth county). Let's see how it performs first. One caveat for the regression is that there is likely some correlation between neighboring counties. We might move on to other methods like a random forest depending on how bad that is.
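A minimal sketch of that starting point: a plain linear regression of county growth rate on mobility and population features. The data frame and column names are simulated stand-ins, not the project's real variables.

```r
# Simulated stand-in for the wide county dataset.
set.seed(1)
counties <- data.frame(
  growth_rate = runif(50, 0, 2),
  mobility_03 = rnorm(50, -20, 10),
  mobility_04 = rnorm(50, -40, 10),
  population  = rlnorm(50, 10, 1)
)

# Baseline linear regression; log(population) tames the heavy right tail.
fit <- lm(growth_rate ~ mobility_03 + mobility_04 + log(population),
          data = counties)
summary(fit)
```

One caveat echoed above: `lm` assumes independent errors, which correlation between neighboring counties would violate, so the standard errors should be read skeptically.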

amitp06 commented 3 years ago

After running different models, I settled on a logistic regression selected using a stepwise AIC process (AUROC = 0.66). It's not insanely predictive since we only have mobility/population predictors, but all of the diagnostics confirmed that mobility is useful for predicting the growth rate. This is more than I expected for such a complex problem. Pretty exciting! You can run the latest version of the code to see the model diagnostics in terms of the confusion matrix, ROC curve, and calibration plot. They all look reasonable to me.
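A rough sketch of the recipe described here: a logistic regression on a high- vs. low-growth label, selected by stepwise AIC, scored with AUROC. It uses `MASS::stepAIC` and `pROC::roc` on simulated data, so the numbers will not match the reported AUROC of 0.66.

```r
library(MASS)
library(pROC)

# Simulated stand-in data with some signal tied to mobility.
set.seed(1)
n <- 200
d <- data.frame(
  mobility_03 = rnorm(n, -20, 10),
  mobility_04 = rnorm(n, -40, 10),
  population  = rlnorm(n, 10, 1)
)
p <- plogis(0.05 * d$mobility_04 + rnorm(n))
d$high_growth <- rbinom(n, 1, p)

# Full logistic model, then stepwise selection by AIC.
full <- glm(high_growth ~ mobility_03 + mobility_04 + log(population),
            data = d, family = binomial)
sel  <- stepAIC(full, direction = "both", trace = FALSE)

# AUROC of the selected model on the training data.
auc(roc(d$high_growth, predict(sel, type = "response"), quiet = TRUE))
```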

I chose to limit the model to counties where the mobility data was not missing. The implication is that our model and analysis will only apply to large counties, since Google masked small counties. If you run the model again including counties that don't have mobility data, it still works, but the performance is dragged down by the simple imputation I used. We can either limit the scope of our inferences to larger counties, or invest time into researching imputation to make the inferences more broadly applicable. I personally prefer the tradeoff of making more precise inferences at the cost of generalization.
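The complete-case restriction described here can be sketched like this; `wide` and the column pattern are hypothetical stand-ins for the real dataset.

```r
# Toy wide dataset where Google has masked mobility for small counties (NA).
wide <- data.frame(
  county      = c("A", "B", "C"),
  mobility_03 = c(-10, NA, -25),
  mobility_04 = c(-30, -40, NA)
)

# Keep only counties with fully observed mobility columns.
mob_cols <- grep("^mobility_", names(wide), value = TRUE)
complete <- wide[complete.cases(wide[mob_cols]), ]
```

Only counties with no masked months survive, which is exactly why the resulting inferences apply to larger counties.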

Lastly, I saved my selected model as a caret .rds object in the repo. If you happen to find your own model candidate, feel free to save it off as well. I suggest we use these files to predict Sep/Oct data in the coming weeks, since we've only used Feb-Aug data to train so far. We can load our model(s) in a new script and see how they compare to the truth.
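The save/reload pattern proposed here, sketched with a trivial stand-in model and data (the real repo file is the saved caret object, not this one):

```r
# Trivial stand-in model; in the project this would be the selected caret fit.
set.seed(1)
fit <- glm(y ~ x, family = binomial,
           data = data.frame(x = rnorm(20), y = rbinom(20, 1, 0.5)))

# Save to disk, then reload in a later scoring script.
path <- file.path(tempdir(), "selected_model.rds")
saveRDS(fit, path)
model <- readRDS(path)

# Score new data (stand-in for the Sep/Oct counties).
new_months <- data.frame(x = c(-1, 0, 1))
preds <- predict(model, newdata = new_months, type = "response")
```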

aksomers commented 3 years ago

This is great! I think the tradeoff you mention and the decision you made about it makes total sense.

I started writing our October 11 script. I think we have at least 2 candidates for the Feynman method, and I'm sure there are others, so I wanted us to throw our ideas for that on the page before we pick a route. I also wrote the "1 minute" script for you to edit as you wish.

If we can land on a topic for the method today, I am hopeful that I can have a draft script by end of weekend, and then you and I can finalize it during the week as time permits and record near end of week/next weekend. Does that plan make sense to you and your current work/school balance?

amitp06 commented 3 years ago

Yup, that sounds like a good plan. I'll go over to the script thread now.

amitp06 commented 3 years ago

Closing. Addressed in last two videos.