Video Scripts for Oct 11

aksomers commented 3 years ago

1. A 1 minute video with the same objective as Task 1. However, make a new video incorporating any relevant updates, such as new things you've learned in class or about the project (or even, that you've shifted your original project entirely).

(some of this maybe gets moved to the second video...)

Andrew: We are continuing to work with the COVID-19 repository maintained by Johns Hopkins. Our major goal was to analyze this data with Google's community mobility reports. Joining the data accurately was a challenge for a few reasons: one dataset had dates as columns, the other had them in rows; date formats were different; county naming conventions are not consistent; unexplained NAs existed in both datasets that caused unexpected joining behavior. After reshaping and cleaning the data, we were able to create a single useful dataset for analysis. Each county is represented by its population, mobility levels, cases, and deaths.
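A minimal sketch of the reshape-and-join step described above, using pandas. The tiny frames, the column names, the date formats, and the `fips` join key are all hypothetical stand-ins, not the actual schema of either dataset.

```python
import pandas as pd

# Hypothetical miniature of the case data: one row per county, one column per date.
cases_wide = pd.DataFrame({
    "fips": ["08069", "08013"],
    "county": ["Larimer", "Boulder"],
    "7/1/20": [100, 200],
    "7/2/20": [105, 210],
})

# Hypothetical miniature of the mobility data: one row per county per date,
# including one row whose key is missing.
mobility_long = pd.DataFrame({
    "fips": ["08069", "08069", "08013", None],
    "date": ["2020-07-01", "2020-07-02", "2020-07-01", "2020-07-01"],
    "workplaces": [-30.0, -28.0, -35.0, -40.0],
})

# Reshape dates-as-columns into one row per county per date.
cases_long = cases_wide.melt(
    id_vars=["fips", "county"], var_name="date", value_name="cases"
)

# Parse each source's date format explicitly so both sides share one dtype.
cases_long["date"] = pd.to_datetime(cases_long["date"], format="%m/%d/%y")
mobility_long["date"] = pd.to_datetime(mobility_long["date"], format="%Y-%m-%d")

# Drop rows with a missing key before joining; pandas matches null keys to each other.
mobility_clean = mobility_long.dropna(subset=["fips"])

# Join on a stable code rather than on county names, and keep an audit column.
merged = cases_long.merge(
    mobility_clean, on=["fips", "date"], how="left", indicator=True
)

# Rows that found no mobility match are worth inspecting by hand.
unmatched = merged[merged["_merge"] == "left_only"]
```

The `indicator=True` column is one way to surface rows that silently failed to join, which is where naming and NA problems tend to show up.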

Amit: We have run a series of simple regression and classification models thus far on one "time slice". We determined that, for the July-August slice, an increase in work mobility consistently correlates with an increase in cases, and an increase in Parks mobility consistently correlates with a decrease in cases. This is exciting because it corresponds with our common intuition about the disease: communities with more essential workers and people returning to business-as-usual are at greater risk. Communities that have shifted some of their activities to outdoor "Parks" activities are at lower risk.

Andrew: Future work includes testing the validity of our initial models and investigating a separate time series component.

2. A 5 minute video recording going through the Feynman method for a topic relevant to your project. This could be about the underlying science for the project, things related to coding, or a method you are planning on using. I want you to step through the 4 part Feynman method. For part 3., identify two concepts from part 2. that need to be addressed. Perform step 4. on these concepts and include your discussion on your new understanding of these concepts.

Write down clearly and concisely what you are trying to learn. Don't write down jargon and be as specific as is reasonable.

Going to list some ideas for this method for us to discuss--obviously they'll need refinement but I think that's the point :)

Amit: We want to learn how to analyze this data with time series methodology, or to find reasonable approximations or workarounds for analysis if time series approaches prove extremely difficult.

Explain the concept in simple language. Be on the lookout for moments in which you use terminology from this class. Seek to use the definition instead. Include a very simple example demonstrating the underlying idea.

Andrew: There are a few thousand counties in the United States. For each county, we have daily data on how many cases and how many deaths there have been. We also have a measure of "change in mobility" or "mobility trend" (essentially, changes in how often people visit places in particular categories) from the Google data, where the change is measured from a baseline (the median value, for the corresponding day of the week, during the 5-week period Jan 3–Feb 6, 2020--i.e. the weeks directly pre-COVID). We also know the population of each county and the state it belongs to. Change in mobility is measured by category. The categories are selected based on Google's determination of which ones are "essential" and which ones they think might be important for social distancing.

Grocery & pharmacy: Mobility trends for places like grocery markets, food warehouses, farmers markets, specialty food shops, drug stores, and pharmacies.

Parks: Mobility trends for places like local parks, national parks, public beaches, marinas, dog parks, plazas, and public gardens.

Transit stations: Mobility trends for places like public transport hubs such as subway, bus, and train stations.

Retail & recreation: Mobility trends for places like restaurants, cafes, shopping centers, theme parks, museums, libraries, and movie theaters.

Residential: Mobility trends for places of residence.

Workplaces: Mobility trends for places of work.
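To make the baseline idea concrete, here is a small sketch of how a "change from baseline" could be computed for one category. The raw visit counts are made up for illustration (the Google reports only publish the relative change, not raw counts), and the function names are hypothetical.

```python
from datetime import date, timedelta
from statistics import median

# Baseline window quoted above: the 5 weeks from Jan 3 to Feb 6, 2020.
BASELINE_START = date(2020, 1, 3)
BASELINE_END = date(2020, 2, 6)


def weekday_baselines(visits):
    """Median visit count per weekday (0 = Monday) over the baseline window."""
    by_weekday = {}
    for day, count in visits.items():
        if BASELINE_START <= day <= BASELINE_END:
            by_weekday.setdefault(day.weekday(), []).append(count)
    return {wd: median(counts) for wd, counts in by_weekday.items()}


def percent_change(day, count, baselines):
    """Percent change of one day's count versus the baseline for that weekday."""
    base = baselines[day.weekday()]
    return 100.0 * (count - base) / base


# Made-up counts: 100 visits on weekdays and 60 on weekends during the baseline.
visits = {}
d = BASELINE_START
while d <= BASELINE_END:
    visits[d] = 60 if d.weekday() >= 5 else 100
    d += timedelta(days=1)

baselines = weekday_baselines(visits)

# Monday, July 6, 2020 with 70 visits is compared with the Monday baseline only.
change = percent_change(date(2020, 7, 6), 70, baselines)
```

Comparing each day with the baseline for its own day of the week keeps the ordinary weekday/weekend rhythm from showing up as a "change".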

Amit: The way we've analyzed the data thus far is by side-stepping the time series element of it. We've so far aggregated each variable to a "monthly" number and limited our explanatory variables to the mobility change from July to August. Our target variable is the August case count. However, we'd eventually like to disaggregate each variable back to the daily (or weekly) level and find a way to get some power out of using the full time series from February to present. We have a time series of case counts for each county with associated covariates like mobility trends and population.
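A rough sketch of that "one time slice" aggregation in pandas. The column names and numbers are invented, and the choice to average mobility within a month and sum new cases is an assumption for illustration, not necessarily how the monthly numbers were actually built.

```python
import pandas as pd

# Hypothetical daily records for two counties, two days per month.
daily = pd.DataFrame({
    "fips": ["A"] * 4 + ["B"] * 4,
    "date": pd.to_datetime(
        ["2020-07-01", "2020-07-02", "2020-08-01", "2020-08-02"] * 2
    ),
    "workplaces": [-30.0, -32.0, -20.0, -22.0, -40.0, -40.0, -45.0, -43.0],
    "new_cases": [5, 7, 10, 12, 3, 3, 2, 1],
})
daily["month"] = daily["date"].dt.month

# One row per county, one column per (variable, month).
monthly = (
    daily.groupby(["fips", "month"])
    .agg(workplaces=("workplaces", "mean"), new_cases=("new_cases", "sum"))
    .unstack("month")
)

# Explanatory variable: mobility change from July to August.
# Target variable: August case count.
features = pd.DataFrame({
    "work_change_jul_to_aug": monthly[("workplaces", 8)] - monthly[("workplaces", 7)],
    "august_cases": monthly[("new_cases", 8)],
})
```

Each county collapses to a single row, which is exactly why this approach side-steps the time dependence: the ordering of days within the slice is thrown away.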

Andrew: The tricky thing is that most time series methods you run across are for dealing with a single time series (data for a particular cohort associated with a particular series of dates) with maybe some associated covariates. In our situation, we have thousands of time series problems to solve and covariates that should impact ALL counties (like federal actions) versus covariates that should impact only that particular county (like municipality-level actions). Methods that handle this are harder to find information on.

Amit: We've also read about some hierarchical methods that might apply to this situation. This would be a way to model all of the time series systematically, allowing some of the fitted parameters to remain specific to the particular county at hand, and others to be shared among all the counties. We are not sure whether Bayesian techniques will be required for this, or whether there's another way to fit all the separate models and ensemble them together.
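As a very simple, non-Bayesian illustration of "some parameters shared, others county-specific", the sketch below fits one slope shared by all counties and a separate intercept per county (a fixed-effects regression). This is a swapped-in simplification, not a full hierarchical model: the intercepts are not pooled toward each other at all. The function name and data are hypothetical.

```python
import numpy as np


def fit_shared_slope(county_ids, x, y):
    """Fit y = intercept[county] + slope * x with one slope for all counties."""
    county_ids = np.asarray(county_ids)
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)

    # Per-county means of the predictor and the response.
    counties = [str(c) for c in np.unique(county_ids)]
    x_mean = {c: x[county_ids == c].mean() for c in counties}
    y_mean = {c: y[county_ids == c].mean() for c in counties}

    # Centering within each county removes the county-specific intercepts.
    x_centered = x - np.array([x_mean[str(c)] for c in county_ids])
    y_centered = y - np.array([y_mean[str(c)] for c in county_ids])

    # Shared slope from least squares on the centered data.
    slope = float((x_centered @ y_centered) / (x_centered @ x_centered))

    # Recover each county's own intercept from its means.
    intercepts = {c: float(y_mean[c] - slope * x_mean[c]) for c in counties}
    return slope, intercepts


# Made-up data: both counties share a slope of 2 but start from different levels.
ids = ["A", "A", "A", "B", "B", "B"]
x = [0, 1, 2, 0, 1, 2]
y = [10, 12, 14, 50, 52, 54]
slope, intercepts = fit_shared_slope(ids, x, y)
```

A hierarchical model would sit between this and fitting every county separately, shrinking the county-level parameters toward a common distribution.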

During the course of 2., you'll run into moments where your explanation is vague or there is something you don't understand or can't relay using non-technical language. Identify these moments here, using a list.

Andrew:
Mobility trend - change in how often people visit a type of place, where change is measured from a particular baseline.
Hierarchical time series modeling - have a better explanation of it.
Associated covariates - how does a time series normally handle the signal from covariates?
Bayesian methods - which Bayesian methods exist for time series?
Ensemble methods - which ensemble methods exist for time series?

Seek to solidify these concepts.

Amit: We will likely start with a multiple linear regression approach and do our best to glean insights from that, first. This will basically be assuming that the correlation structure of the errors is more or less handled by the features included in the model, which may be reasonable since some of our features will be time-dependent and/or lagged in some way. We definitely suspect that lagged versions of the supervisor and the predictors will be useful in this model--after all, higher density of cases should beget more future cases/deaths, and since there is a lag between exposure and a case developing, it's reasonable to think that lagged mobility metrics are predictive of future cases/deaths as well. It may very well be reasonable to assume that once we've included our time-dependent and lagged features, the model errors will not be autocorrelated (correlated with lagged versions of themselves).
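A sketch of what that could look like for a single county: build lagged features, fit a linear regression, then check whether the residuals are still correlated with their own previous value. The simulated data, the lag lengths, and the function names are all assumptions made for illustration.

```python
import numpy as np


def lag(series, k):
    """Shift a 1-D series back by k >= 1 steps, padding the start with NaN."""
    series = np.asarray(series, dtype=float)
    out = np.full_like(series, np.nan)
    out[k:] = series[:-k]
    return out


def fit_lagged_regression(cases, mobility, case_lag=1, mobility_lag=14):
    """Regress cases on an intercept, lagged cases, and lagged mobility."""
    y = np.asarray(cases, dtype=float)
    X = np.column_stack(
        [np.ones_like(y), lag(y, case_lag), lag(mobility, mobility_lag)]
    )
    # Drop the first rows, where the lagged values do not exist yet.
    keep = ~np.isnan(X).any(axis=1)
    coef, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)
    residuals = y[keep] - X[keep] @ coef
    return coef, residuals


def lag1_autocorrelation(residuals):
    """Correlation between the residuals and the residuals one step earlier."""
    r = residuals - residuals.mean()
    return float((r[1:] @ r[:-1]) / (r @ r))


# Simulated county: cases depend on yesterday's cases and mobility 3 days ago.
rng = np.random.default_rng(0)
n = 200
mobility = rng.normal(size=n)
cases = np.full(n, 10.0)
for t in range(3, n):
    cases[t] = 5 + 0.5 * cases[t - 1] + 2 * mobility[t - 3] + rng.normal(scale=0.1)

coef, residuals = fit_lagged_regression(cases, mobility, case_lag=1, mobility_lag=3)
rho = lag1_autocorrelation(residuals)
```

If `rho` is close to zero, the lagged features have absorbed most of the time dependence and the plain regression assumption is more defensible; a large value would suggest the errors still need their own time series structure.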

Andrew: One approach to the time series modeling would be VARMAX. This is an extension of ARMA methods, a classical way of modeling a time series by regressing a variable on its own past values and assuming the errors are also related to their past values. ARMAX extends ARMA by allowing additional "exogenous" variables to be added to the model--important since we think there are mobility variables that will be predictive of cases/deaths. Finally, the V, which stands for vector, will be useful if we want to model the time series for every county in the same model (so the supervisor Y at each time point is a vector, where each component is the case count or death count from a particular county).
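For reference, the standard textbook form of these models can be written as follows. The symbols are introduced here for illustration and do not come from our data: y_t is the supervisor at time t, x_t the exogenous variables (e.g. mobility), and ε_t the error.

```latex
% ARMAX(p, q): one series, its own past values, past errors, and exogenous variables
y_t = c + \sum_{i=1}^{p} \phi_i \, y_{t-i}
        + \sum_{j=1}^{q} \theta_j \, \varepsilon_{t-j}
        + \beta^{\top} x_t + \varepsilon_t

% VARMAX(p, q): the same structure with a vector supervisor (one component per county)
\mathbf{y}_t = \mathbf{c} + \sum_{i=1}^{p} A_i \, \mathbf{y}_{t-i}
        + \sum_{j=1}^{q} M_j \, \boldsymbol{\varepsilon}_{t-j}
        + B \, \mathbf{x}_t + \boldsymbol{\varepsilon}_t
```

In the vector form the coefficients become matrices, so with thousands of counties the number of parameters grows very quickly unless the matrices are restricted in some way.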

Amit: One common method for Bayesian time series analysis is the Bayesian Structural Time Series. After brief research, we decided not to invest further time here in favor of the solutions mentioned above. The same is true for ensemble methods like boosting. Instead, we will spend time analyzing the covariate effects and correlation structure in a regression or VARMAX setting. We believe we can capture more of the "time series nature" of this data using these tools.

amitp06 commented 3 years ago

For script #1, the outline sounds good and I don't think we can fit more than that in a minute. I would suggest that whoever talks about the first half goes into slightly more detail about the merging since that has been most of our work so far. And it's fine if that takes some time from the modeling piece since we haven't even finalized that whole approach yet. I suspect the models will be the main topic of the final updates.

For script #2, I think the time series topic is a good one. It would obviously be nice to have a separate model analyzing the series as a whole. We sidestepped the dependence issue so far by applying a regression/ML type model to one slice of the series. I'm not sure either how to relate the whole time series back to mobility in a systematic way. It would be a good one to explore in this assignment.

aksomers commented 3 years ago

OK, wrote out a bit more for each, feel free to edit. Will return to it tomorrow with more thoughts.

aksomers commented 3 years ago

What do you think about a discussion post like this? Also an aside on hierarchical time series modeling. And another on machine learning alternatives.

Dr. Homrighausen,

As we analyze COVID data, we are realizing that a discussion of both some broad and finer points of time series modeling could be useful. Here are some questions that are coming up as we start to think about more complex ways of handling the data. Hoping it would be possible to discuss some of these questions in an upcoming "office hours" session or in any other format you deem most appropriate.

We technically have time series data at the county level--which means thousands of time series. Are you aware of a good hierarchical or other method that allows a set of time series of this nature to share some parameter estimates but not others? It makes intuitive sense to us that the time series will naturally be impacted similarly by some things (maybe starting population density) but then might have some more location-specific features depending on how the county/city itself responded to the crisis.
Would such a method necessarily be Bayesian? The term Bayesian Structural Time Series came up a few times while trying to find information, but not completely sure of its necessity and relevance yet.
Are there other ensemble/reconciliation methods that exist for handling complicated time series data of this nature?
In general, how are covariates usually handled in the time series framework?
Do you have recommendations for how to start out with this problem? Are we biting off more than we can chew by thinking about all of the counties? We've taken a simple, aggregated approach to start with in order to get our hands around it. We avoided the time series complexity by aggregating each county's data in July-August and only using regression/classification features available within July-August instead of the whole series. Is just expanding on that approach likely to be more successful?
Do you have any advice when it comes to using machine learning alternatives to time series forecasting? I've heard just a little about this, but it seems promising.

amitp06 commented 3 years ago

I added some edits to the posts. I agree that asking for feedback on those could be useful. I'll check out the links later on.

amitp06 commented 3 years ago

Closing. Addressed in last two videos.

amitp06 / COVID-19
