Rational-IM closed 4 years ago
This is a good insight; I'm looking forward to an answer too.
The database for sure uses a "1/x" function to fill in some missing data. For instance, maybe they have a figure for a certain state, but not for all of its counties. Therefore they assume that some locations have an infection rate of y1, others an infection rate of y2, etc. Below is a series of plots (in red) for y = range(0, 10). Hopefully, they will eventually distinguish between actual and fitted data points.
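For what it's worth, here is a minimal sketch of how curves of that shape can be generated. It assumes each band corresponds to a fixed integer number of cases k (1 to 10 here), so the rate is k divided by the county population. The function name `band_rate` and the population values are made up for illustration; none of this comes from the dataset itself.

```python
def band_rate(cases, population, per=1_000_000):
    """Cases per `per` residents for a county with `cases` confirmed cases."""
    return cases * per / population

# Hypothetical county sizes, not taken from the real data.
populations = [5_000, 10_000, 50_000, 100_000]

# One curve per integer case count k: y = k / x, scaled to cases per million.
bands = {k: [band_rate(k, p) for p in populations] for k in range(1, 11)}

for k in (1, 2, 3):
    print(k, bands[k])
```

Every county with the same small case count lands on the same k/x curve, so a scatter of rate against population would show one curve per value of k.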
You mean that a few counties really have only 1, 2, ... 10 cases and therefore they form the bands? With such a widespread, highly infectious virus, that shouldn't be the case (i.e., what are the chances of only one person having the flu in a county of 50,000 people during the month of March?). And given how many cases we had in places like NYC, New Orleans, etc., it means that the virus has been around for a while. However, the bands might just indicate that we are lacking testing to such a large degree that very few tests were performed (so we don't even know how many people are infected). I will keep checking whether these bands persist for long; if so, I will post a question here again.
After reading your reply, I think you might be right.
Thank you for the observations - your original comment made me take a look at the progression of the chart over time. So instead of always getting the "last available day" I can now choose a specific date. So take a look at the charts below, using data from 4/7/20, 3/31/20 and 3/24/20. The amount of data coming in is very high. I will wait for the data to become more "stable" before doing any analysis across different counties.
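In case it helps anyone, this is roughly what picking a specific date instead of the last available day could look like. It is only a sketch: the small DataFrame and the helper name `cases_on` are hypothetical, and it assumes the date columns are labeled like "4/7/20", as in the dates mentioned above.

```python
import pandas as pd

# Hypothetical stand-in for the confirmed-cases table: rows are counties,
# columns are dates.
conf_per_county = pd.DataFrame(
    {"3/24/20": [0, 12], "3/31/20": [1, 40], "4/7/20": [3, 100]},
    index=["County A", "County B"],
)

def cases_on(df, date=None):
    """Return the column for `date`, or the last available day if date is None."""
    if date is None:
        return df.iloc[:, -1]
    return df[date]

latest = cases_on(conf_per_county)               # last available day
snapshot = cases_on(conf_per_county, "3/31/20")  # a specific date
print(latest)
print(snapshot)
```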
How are you calculating your confirmed cases per million? Are integers, significant-digit precision, or rounding errors involved? Just curious whether this is a spreadsheet thing.
No precision issues - the charts are plotted using Python, and because it is a small amount of data, I'm using DataFrames (easier to visualize - see below). The code for this specific calculation is:
pop_adj_conf_ct = conf_per_county.div(US_pop, axis=0) * 1000000 * true_inf_r
Where "conf_per_county" is the original data downloaded from GitHub; "US_pop" is also from Johns Hopkins (they have a column in their dataset with population info per county); and "true_inf_r" is a figure I derive by comparing (I) the mortality rate in the overall population in Italy (more than 12% per confirmed case) and (II) the mortality rate of health professionals there (around 0.35% per confirmed case).
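Here is a small self-contained sketch of that calculation on made-up numbers. It assumes the operators between the three terms are multiplications, that the rows of "conf_per_county" and the index of "US_pop" are both keyed by county, and it uses a placeholder of 1.0 for "true_inf_r", since the actual value is not given here.

```python
import pandas as pd

# Hypothetical stand-ins for the real data: rows are counties, columns are dates.
conf_per_county = pd.DataFrame(
    {"3/31/20": [1, 40], "4/7/20": [3, 100]},
    index=["County A", "County B"],
)
US_pop = pd.Series({"County A": 50_000, "County B": 2_000_000})

true_inf_r = 1.0  # placeholder only; the real adjustment factor is not stated

# axis=0 aligns US_pop with the row index, so each county's row is divided
# by that county's population before scaling to cases per million.
pop_adj_conf_ct = conf_per_county.div(US_pop, axis=0) * 1_000_000 * true_inf_r
print(pop_adj_conf_ct)
```

With floating-point division there is no integer truncation, so any bands in the chart would come from the integer case counts themselves rather than from rounding.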
I guess the issue is indeed how "new" the data is. I even made a GIF showing the evolution of the points over the next 30 days - it looks like these blue points will be moving up for a while: https://imgflip.com/gif/3w0fec
I looked for any column in the JHU datasets which has the population per county. I don't see it. Can you point me to it?
It is on the "death count" file. When I use the code below...
import pandas as pd
url = "https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_deaths_US.csv"
deaths_US = pd.read_csv(url, sep=",")
...I create a DataFrame variable "deaths_US". Look at the column just before the death count per day starts. They have the population figure:
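To show the idea without downloading anything, here is a sketch on a tiny inline CSV that imitates the layout of that file. The rows are made up and most of the real columns are left out; it assumes the population column is named "Population" and that "Combined_Key" identifies each county.

```python
import io
import pandas as pd

# Made-up rows in a trimmed-down version of the file's layout: identifying
# columns first, then Population, then one column per date.
sample_csv = io.StringIO(
    "Admin2,Province_State,Combined_Key,Population,3/31/20,4/7/20\n"
    'County A,State X,"County A, State X, US",50000,0,1\n'
    'County B,State Y,"County B, State Y, US",2000000,5,12\n'
)
deaths_sample = pd.read_csv(sample_csv, sep=",")

# Population per county, keyed by the combined county/state label.
US_pop = deaths_sample.set_index("Combined_Key")["Population"]
print(US_pop)
```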
I plotted a scatter chart with (i) county population on the x-axis and (ii) confirmed cases divided by population on the y-axis (see below). You can clearly see a few "bands" in the lower-left corner. This signals to me that the confirmed cases for many counties were generated by fitting them to a model. Is this assumption correct? If so, is it possible to know which counties have actual reported data vs. the ones that were estimated? I want to eventually run some correlations of confirmed cases vs. county-specific characteristics (e.g., the predominance of public transportation, density, the average temperature in March/April, proximity to an international airport, etc.). Data fitted to a model would only add noise to such an analysis. Maybe a new column could be included in the dataset identifying the fitted data?