Decide what method we want to use

MosesStewart commented 7 months ago

I have class until 2:00 pm but here are my initial thoughts:

CHANNELS: The prompt asks us to identify which channels we think COVID spread through. After looking at the dataset, some possible "channels" are:
- They have a column for the proportion of people in each zip code who use public transport, so one hypothesis could be it spread through there.
- Another interesting channel could be average family size, implying COVID may have spread through schools.
METHOD: The prompt seems to be pushing us towards a regression, since they link code from "Hsiang" for standard errors that are robust to spatial autocorrelation. We do have latitude data, which I think we need for this. I have no idea how spatial autocorrelation works, but since they believe this should only take us 4-6 hours, I'm guessing they expect us to use the linked code. I would prefer implementing in Matlab over Stata.
MODEL: They are asking for causal analysis. The two variables I suggested as channels are both continuous random variables. I'm honestly not familiar with proving causal relationships with non-indicator random variables, so any ideas on how to structure the model would be appreciated. I can also try to look at some stuff later. If we go with regression, all we really need to decide is which variables we want to include.

Other questions I haven't been able to figure out:

- They ask why spatial autocorrelation correction is more appropriate than a simple heteroskedasticity correction. I do not know. I would have to do some reading to answer this.
- They purposely excluded data for 2 days. I'm not sure what the best way to handle that is.
- We have COVID data that stretches over 2 months, but for the CHANNELS I suggested, there is only fixed data from one time point. If we want to regress over zip codes (locations), how do we incorporate the COVID data stretching over time? It may be helpful to look at other papers. Should we focus on the early weeks of the pandemic like they say in the abstract? Should we do a weighted average of cases, with heavier weights during the early weeks?

zoe-shleifer commented 7 months ago

They seem focused on how we measure COVID rates. Remember when people were measuring COVID rates in the sewer system? I'd be interested in how much of this data there is. Will look after 1:30 when my class ends.

MosesStewart commented 7 months ago

> They seem focused on how we measure COVID rates. Remember when people were measuring COVID rates in the sewer system? I'd be interested in how much of this data there is.

They gave examples of dependent variables such as infections per capita and positive testing rate that we can already construct with the data they provided.
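As a rough sketch of how those two dependent variables could be built per zip code (the argument names `cases`, `tests`, and `population` are my own placeholders, not the actual column names in the dataset, and the numbers are made up):

```python
def outcome_rates(cases, tests, population):
    """Return (infections per capita, positive test rate) for one zip code."""
    # Guard against empty denominators rather than dividing by zero.
    per_capita = cases / population if population > 0 else float("nan")
    positivity = cases / tests if tests > 0 else float("nan")
    return per_capita, positivity


# Made-up numbers for a single zip code, only to show the arithmetic.
per_capita, positivity = outcome_rates(cases=120, tests=800, population=40000)
print(per_capita, positivity)
```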

If you want to take the initiative to get additional data, that's fine, but keep in mind that this is only a qualification round. I think they're just looking to make sure we can show a causal relationship with comprehensible reasoning in around 4-6 hours of work.

MosesStewart commented 7 months ago

I would recommend looking at the paper they cited. They use the same data we were given, and it answers several of my questions:

CHANNELS: They found a correlation between occupations and COVID rates: people in jobs that required more in-person interaction had higher rates. After controlling for occupations, they found that the length of commute was not significant. They also found a significant correlation with household size.
MODEL: I think, similar to the paper, it would be best to run a few regressions, including more variables each time.
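A minimal sketch of what "a few regressions, including more variables each time" could look like. The data here is synthetic and the variable names (`household_size`, `share_transit`, `share_essential`) are stand-ins I made up, not the real columns; it uses NumPy rather than Matlab just to illustrate the structure:

```python
import numpy as np


def ols(y, X):
    """OLS coefficients for y = X b + e; X must already contain an intercept column."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta


rng = np.random.default_rng(0)
n = 200
# Synthetic stand-ins for zip-code covariates (hypothetical names).
household_size = rng.normal(3.0, 0.5, n)
share_transit = rng.uniform(0.0, 0.6, n)
share_essential = rng.uniform(0.1, 0.5, n)
# Synthetic outcome: transit has no true effect in this toy example.
y = 0.5 * household_size + 2.0 * share_essential + rng.normal(0.0, 0.1, n)

const = np.ones(n)
specs = {
    "(1) household": [household_size],
    "(2) + transit": [household_size, share_transit],
    "(3) + occupation": [household_size, share_transit, share_essential],
}

results = {}
for name, cols in specs.items():
    X = np.column_stack([const, *cols])  # intercept first, then the regressors
    results[name] = ols(y, X)
    print(name, np.round(results[name], 3))
```

Comparing how a coefficient moves across the columns is what lets us say something like "commute was not significant after controlling for occupations".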

> I'm honestly not familiar with proving causal relationships with non-indicator random variables, so any ideas on how to structure the model would be appreciated. I can also try to look at some stuff later.

In the paper they cited, they just reported p-values for the test that each coefficient is greater than zero. (Edit) I'm leaning towards simply using a z-test for all of our p-values if no one objects.
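For reference, a one-sided z-test on a coefficient only needs the estimate and its standard error. This is a sketch under the assumption that the estimate is approximately normal; `one_sided_p_value` is a helper name I made up:

```python
import math


def one_sided_p_value(coef, std_err):
    """P-value for H0: coef <= 0 against H1: coef > 0, using a standard normal."""
    z = coef / std_err
    # Upper-tail probability of the standard normal, 1 - Phi(z).
    return 0.5 * math.erfc(z / math.sqrt(2.0))


# Illustrative numbers only: a coefficient 1.96 standard errors above zero.
print(one_sided_p_value(1.96, 1.0))
```

Whichever standard errors we end up using would be plugged in as `std_err`.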

> They purposely excluded data for 2 days. I'm not sure what the best way to handle that is.

> We have COVID data that stretches over 2 months, but for the CHANNELS I suggested, there is only fixed data from one time point. If we want to regress over zip codes (locations), how do we implement the COVID data stretching over time?

In the paper they just averaged over weeks. We can do the same and I think not worry about it.
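Averaging over weeks just collapses the panel to one number per zip code. A small sketch, assuming the data comes as (zip code, week, rate) records; the record layout and the choice to skip missing values are my assumptions, and the zip labels are placeholders:

```python
from collections import defaultdict


def average_by_zip(rows):
    """Collapse (zip_code, week, rate) records to one mean rate per zip code."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for zip_code, _week, rate in rows:
        if rate is None:  # skip missing observations instead of counting them as zero
            continue
        totals[zip_code] += rate
        counts[zip_code] += 1
    return {z: totals[z] / counts[z] for z in totals}


# Placeholder records; zip "A" has one missing week.
rows = [
    ("A", 1, 0.10), ("A", 2, 0.20), ("A", 3, None),
    ("B", 1, 0.30), ("B", 2, 0.50),
]
averages = average_by_zip(rows)
print(averages)
```

The result has one row per zip code, so it lines up with the fixed covariates for the cross-sectional regression.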

> They ask why spatial autocorrelation correction is more appropriate than a simple heteroskedasticity correction. I do not know. I would have to do some reading to answer this.

I still don't know this, or why a regression would be well-suited to this situation and what weaknesses it has. I have class again, but I plan to start writing the code/text around 7:00 pm, so it would be nice if we can finalize the direction we want to go for CHANNEL and MODEL.

MosesStewart commented 7 months ago

> They ask why spatial autocorrelation correction is more appropriate than a simple heteroskedasticity correction.

This was a lot easier than I thought: a heteroskedasticity correction assumes that the error terms are independent across observations, which we wouldn't expect if nearby zip codes are correlated with each other.
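To make the difference concrete, here is a rough sketch of the two sandwich variance estimators. This is not the Hsiang code, just my own illustration: it uses synthetic data, made-up planar coordinates with Euclidean distance (real latitude/longitude would need a proper distance formula), an arbitrary cutoff, and a simple uniform kernel, which is not guaranteed to give a positive variance in every sample.

```python
import numpy as np


def sandwich_cov(X, resid, weights):
    """(X'X)^-1 [sum_ij w_ij e_i e_j x_i x_j'] (X'X)^-1."""
    bread = np.linalg.inv(X.T @ X)
    score = X * resid[:, None]  # row i is e_i * x_i
    meat = score.T @ weights @ score
    return bread @ meat @ bread


def pairwise_distance(coords):
    """Euclidean distance between every pair of rows in coords."""
    diff = coords[:, None, :] - coords[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


rng = np.random.default_rng(0)
n = 300
coords = rng.uniform(0.0, 10.0, size=(n, 2))  # hypothetical planar coordinates
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
y = 1.0 + 0.5 * x + rng.normal(size=n)

beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta

dist = pairwise_distance(coords)
cutoff = 1.0  # arbitrary, in the same units as coords

w_hc = np.eye(n)                            # heteroskedasticity only: keep i == j terms
w_spatial = (dist <= cutoff).astype(float)  # also keep pairs within the cutoff

cov_hc = sandwich_cov(X, resid, w_hc)
cov_spatial = sandwich_cov(X, resid, w_spatial)
print("HC0 variances:    ", np.diag(cov_hc))
print("spatial variances:", np.diag(cov_spatial))
```

The only difference between the two is the weight matrix: the heteroskedasticity correction drops every cross term between different zip codes, while the spatial version keeps the cross terms for neighbours. Setting the cutoff to zero reduces the spatial version to the heteroskedasticity-robust one.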

MosesStewart commented 7 months ago

I will start working on an implementation of the code they provided in Matlab tonight. If we want to add data re https://github.com/MosesStewart/uofc_prelim/issues/1#issuecomment-1938940774 then we can do that later. After the implementation, we will still need to discuss independent variables and start writing.
