Suggestions for additional datasets

kirienko commented 3 years ago

It seems that the level=2 data for Russia comes from this repo, which is itself no longer being updated. Of course it's easy to say «That's not our issue but theirs.» But no.

eguidotti commented 3 years ago

Have you tried to contact them? Can you suggest other data sources we can use instead? Thanks

kirienko commented 3 years ago

No, I didn't try. There is an open issue there. And although the repo has been updated since that issue was opened, the updates don't seem to happen on a regular basis (i.e. in an automated fashion).

I would suggest using this source, which looks more automated. But since I'm by no means affiliated with those people, I cannot guarantee it works better.

kirienko commented 3 years ago

Well, it turned out that the main JHU repo has actual level=2 data for Russia. I think it's the best choice.

eguidotti commented 3 years ago

Thanks for the suggestion. Unfortunately, I'm not able to find the historical data for the regions in the JHU repo (see here). I'm cross-checking the data with the source you suggested so that we can move to this new one.

eguidotti commented 3 years ago

Hi @kirienko. After spending some time on it, I was not able to validate the data from the other repo. Moreover, I'm afraid it may be discontinued as well in the future. I agree that the best choice would be JHU, since it seems there is no open governmental data for Russia (this is what they said when we tried to contact them some months ago). I opened an issue at the JHU repo to ask if they are going to add the Russian regions to the time series dataset. That would be great and easy to integrate. Otherwise, we'd need to do that by hand, which would require a lot of work.

kirienko commented 3 years ago

Hi @eguidotti. Thank you so much for your efforts! I really appreciate your work!

greg-minshall commented 3 years ago

hi. i have been pulling and cleaning up the JHU data, more or less since the beginning. just having been pointed in your direction, i was thinking of converting to your data. but, it's true that the data i've been processing might have some data you want. if you want daily changes, you'd want the columns whose names end in "_Changes_1".
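A minimal sketch of picking out those daily-change columns, assuming a CSV whose header contains names ending in "_Changes_1". The sample rows and every column name other than that suffix are hypothetical, not taken from the actual files.

```python
import csv
import io

# Hypothetical sample; only the "_Changes_1" suffix comes from the thread.
SAMPLE = """Date,Province_State,Confirmed,Confirmed_Changes_1,Deaths,Deaths_Changes_1
2021-04-03,Moscow,100,5,10,1
2021-04-04,Moscow,107,7,10,0
"""


def daily_change_columns(fieldnames, suffix="_Changes_1"):
    """Return the column names that end with the given suffix."""
    return [name for name in fieldnames if name.endswith(suffix)]


reader = csv.DictReader(io.StringIO(SAMPLE))
change_cols = daily_change_columns(reader.fieldnames)

# Keep only the daily-change columns, converted to integers.
rows = [{col: int(row[col]) for col in change_cols} for row in reader]

print(change_cols)
print(rows)
```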

my github repository is here, though i wouldn't want to be the first person other than me trying to actually do a build.

if you're interested, and have any questions, please let me know. cheers!

eguidotti commented 3 years ago

Hi @greg-minshall and thanks for your message. Do you have region-wise data for Russia? That would be quite interesting. Also, are you hosting the data you generate somewhere? I couldn't find them. Thanks!

greg-minshall commented 3 years ago

hosting is here. yes, there is oblast, etc., for Russia, for example, here -- assuming that's sort of what you are hoping for.

greg-minshall commented 3 years ago

btw: the README.org in my repo had old links; i've just changed that (in case you poke around), hopefully correctly. cheers.

eguidotti commented 3 years ago

@greg-minshall Well actually, it looks quite interesting! It seems to me the file I could easily integrate in this repository is the following: https://somenumbers.info/covid-19/csvs/coleaned.csv.gz

Just a couple of questions:

is the file updated daily?
is there a link to the cleaning performed on the JHU data?
do you have a title/citation for your project? I'd need to put this in the data sources if we decide to integrate it.

Thanks!

greg-minshall commented 3 years ago

@eguidotti coleaned is basically a cleaned up version of the entire run of JHU "daily" files.

yes, the files are updated daily. i think, in the past six months, there have been only a few glitches, maybe one time when something changed in the JHU data.

i don't have a citation. you can just say "Greg Minshall", some such. or, a pointer to the repo.

if coleaned works, that's great, as it's the smallest file, and you'll be the closest to JHU (in terms of me messing with the data).

the cleaning performed? you can look in covid19.org in the repo (not that you'd want to), in a section colean.R. part (compute_intervals) is dealing with early times when, e.g., Australia (i think) was represented for a while by Australia, then by some of the regions, then back to the country, etc.

then (propagating, but also in colean) deals with making sure that every entity has an entry for every date after it first appears. it is very slow, doing lots of dplyr::group_by(), etc. (at some point i'd like to switch to data.table, partly in the hopes it might be faster; and, actually, it's my memory of the complexity of the cleaning code that gives me pause.)
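The propagation idea can be sketched roughly as follows. This is not the colean.R code: it is a small Python illustration that assumes gaps are filled by carrying the most recent earlier value forward, and the function name, entities, and numbers are all hypothetical.

```python
from datetime import date, timedelta


def propagate(observations, last_date):
    """Give each entity one row per day from its first date up to last_date.

    observations maps (entity, date) -> cumulative count. Missing days carry
    forward the most recent earlier value (an assumption about gap filling).
    """
    # Find the first date on which each entity appears.
    first_seen = {}
    for entity, day in observations:
        if entity not in first_seen or day < first_seen[entity]:
            first_seen[entity] = day

    filled = {}
    for entity, start in first_seen.items():
        day = start
        last_value = None
        while day <= last_date:
            # Use the observed value if present, else the previous one.
            last_value = observations.get((entity, day), last_value)
            filled[(entity, day)] = last_value
            day += timedelta(days=1)
    return filled


obs = {
    ("A", date(2021, 4, 1)): 10,
    ("A", date(2021, 4, 3)): 12,
    ("B", date(2021, 4, 2)): 5,
}
filled = propagate(obs, date(2021, 4, 3))
for key in sorted(filled):
    print(key, filled[key])
```

Entity "B" gets no row for 2021-04-01 because it had not appeared yet; only dates after the first appearance are filled.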

i also drop some JHU columns

known_excludes <- c("Incidence_Rate", "Case-Fatality_Ratio", "Incident_Rate", "Case_Fatality_Ratio")

as i think i can derive those from the existing data. (though, in fact, i don't.)
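As an illustration only, here is how the dropped columns could be removed and one of them recomputed in Python. The formula (deaths per 100 confirmed cases) is an assumption about how the ratio is defined, and the sample row is hypothetical. An incidence rate would additionally need population figures, so it is not attempted here.

```python
# Column names copied from the known_excludes list in the thread.
KNOWN_EXCLUDES = {
    "Incidence_Rate",
    "Case-Fatality_Ratio",
    "Incident_Rate",
    "Case_Fatality_Ratio",
}


def drop_excluded(row):
    """Return a copy of the row without the excluded columns."""
    return {key: value for key, value in row.items() if key not in KNOWN_EXCLUDES}


def case_fatality_ratio(row):
    """Deaths per 100 confirmed cases, or None when nothing is confirmed.

    Assumed definition; check it against the upstream documentation.
    """
    confirmed = int(row["Confirmed"])
    if confirmed == 0:
        return None
    return 100.0 * int(row["Deaths"]) / confirmed


# Hypothetical row for illustration.
row = {
    "Province_State": "Alabama",
    "Confirmed": "200",
    "Deaths": "5",
    "Incident_Rate": "10.5",
    "Case_Fatality_Ratio": "2.5",
}
slim = drop_excluded(row)
cfr = case_fatality_ratio(slim)
print(slim)
print(cfr)
```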

there's filtering: remove duplicates, take only the last observation (Last_Update) from a given date.
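A rough Python sketch of that filtering step, assuming each row carries a Combined_Key identifying the entity and a Last_Update timestamp in "YYYY-MM-DD HH:MM:SS" form. The timestamp format and the sample rows are assumptions for illustration.

```python
from datetime import datetime


def keep_last_per_day(rows):
    """For each (entity, calendar date), keep the row with the latest Last_Update.

    Exact duplicates collapse naturally because they share the same key.
    """
    latest = {}
    for row in rows:
        stamp = datetime.strptime(row["Last_Update"], "%Y-%m-%d %H:%M:%S")
        key = (row["Combined_Key"], stamp.date())
        if key not in latest or stamp > latest[key][0]:
            latest[key] = (stamp, row)
    # Return rows ordered by entity, then by timestamp.
    ordered = sorted(latest.values(), key=lambda item: (item[1]["Combined_Key"], item[0]))
    return [row for _, row in ordered]


# Hypothetical rows: two updates on the same day, one exact duplicate.
rows = [
    {"Combined_Key": "Alabama, US", "Last_Update": "2021-04-04 04:20:00", "Confirmed": "100"},
    {"Combined_Key": "Alabama, US", "Last_Update": "2021-04-04 23:20:00", "Confirmed": "105"},
    {"Combined_Key": "Alabama, US", "Last_Update": "2021-04-04 23:20:00", "Confirmed": "105"},
    {"Combined_Key": "Alabama, US", "Last_Update": "2021-04-05 04:20:00", "Confirmed": "110"},
]
cleaned = keep_last_per_day(rows)
for row in cleaned:
    print(row["Last_Update"], row["Confirmed"])
```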

i add FIPS and Iso3c columns.

there are also some textual transformations (dosed and friends -- see the file fixups.sed in the repo) on the .csv files, to deal with anomalies early on.

that seems to be about it.

greg-minshall commented 3 years ago

i realized in my listing of transformations i missed some bits that used to be in the file fixups.sed, but are now embedded in covid19.org, in a table csvsedtable inside the csvsed header. these mostly normalize names at the Country_Region, Province_State, and Admin2 levels; plus some fiddling with the odd FIPS.
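A table of name fixups like this could look roughly as follows in Python. The patterns below are illustrative guesses at the kind of normalization involved, not entries copied from csvsedtable or fixups.sed.

```python
import re

# Ordered (pattern, replacement) pairs, applied one after another.
# All entries are hypothetical examples.
FIXUPS = [
    (re.compile(r"^\s+|\s+$"), ""),              # trim stray whitespace
    (re.compile(r"^Mainland China$"), "China"),  # unify an older spelling
    (re.compile(r"^UK$"), "United Kingdom"),     # expand an abbreviation
]


def normalize_name(name):
    """Apply every fixup in order and return the normalized name."""
    for pattern, replacement in FIXUPS:
        name = pattern.sub(replacement, name)
    return name


for raw in [" Mainland China ", "UK", "Canada"]:
    print(repr(raw), "->", repr(normalize_name(raw)))
```

Keeping the rules in one ordered table makes it easy to see which rewrite fires first, which matters when one rule's output feeds the next.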

eguidotti commented 3 years ago

Thanks @greg-minshall for the information. I have integrated the data for Russia. Let's wait a couple of hours for the workflow to complete and see if we can close this long-standing issue.

I was also interested in the recovered cases for the USA. But I see only very few observations (dates) for each state. E.g. in https://somenumbers.info/covid-19/csvs/coleaned.csv.gz Alabama has only about 10 observations. Is it the same in the JHU data?

As far as I understand, the other files aggregate the numbers, e.g. they compute the totals for Alabama by summing all entries that include Alabama as the upper level in the combined key. Is that correct? At an early stage of this project, I was also aggregating the data in this way, but then I noticed that it usually doesn't work. In my experience, the aggregated totals almost never matched the data provided directly for the upper level. For instance, if only one city is missing from the data, the aggregated state-wise counts are biased downward. Moreover, the data released for the upper level may include travelers or cases whose exact location is not known. So unfortunately I won't be able to use the aggregated data.
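The downward bias can be seen in a small sketch. All names and numbers below are hypothetical: two counties are present, a third is missing from the lower-level data, and the upper-level figure is assumed to come from a separate official source.

```python
from collections import defaultdict


def aggregate_to_upper(lower_rows):
    """Sum lower-level counts into a total per upper-level unit."""
    totals = defaultdict(int)
    for state, _county, confirmed in lower_rows:
        totals[state] += confirmed
    return dict(totals)


# Hypothetical county-level rows; one county is missing from the data.
lower = [
    ("Alabama", "Autauga", 120),
    ("Alabama", "Baldwin", 300),
]

# Hypothetical count released directly for the state.
reported_upper = {"Alabama": 450}

aggregated = aggregate_to_upper(lower)

# Positive gap: the aggregate undercounts the directly reported figure.
gaps = {state: reported_upper[state] - aggregated.get(state, 0) for state in reported_upper}

print(aggregated)
print(gaps)
```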

greg-minshall commented 3 years ago

@eguidotti, you're welcome. i hope it helps. let me know.

i only use JHU data.

i think 'Recovered' comes from JHU's csse_covid_19_data/csse_covid_19_daily_reports_us series. i have a (very recent) "issue" in my repo to remove that series. i think they started recording that, then discontinued (probably the data wasn't reliable).

yes, you're right about the aggregation technique. i think when i originally did that work i did some verification. if i look at the JHU data now, for example, for California on 2021-04-04 (csse_covid_19_data/csse_covid_19_daily_reports/04-04-2021.csv), i don't seem to see numbers at the state level, only at the Admin2 (county) level. for Canada, i only see data for the provinces/territories, not for the country as a whole. so, in my experience, there is no "data provided directly for the upper level" (a situation which makes me happy, being a believer in second normal form :).

were you looking at these "daily reports"? or, the more often used "time series" (that's a set of data i don't use so am not familiar with).

eguidotti commented 3 years ago

@greg-minshall yes, it works and I'm going to close this issue. Thanks a lot!

> were you looking at these "daily reports"? or, the more often used "time series"

Time series data

> there is no "data provided directly for the upper level"

I guess that's the case for JHU. What I mean by "the data released for the upper level" is data released directly by the government for the upper level (not necessarily the US, but around the world). In general, when I aggregated data from the lower levels, I never got the counts provided for the upper level. Also, in many cases JHU data (aggregated or not) do not match the data available from open governmental sources. That's basically the motivation behind this repo :) We try to pull the data from the official providers whenever possible. But in many cases it is not possible, and work like yours is very useful!

greg-minshall commented 3 years ago

@eguidotti ah, "ground truth", or whatever the saying is. no, i decided early on that for me, JHU == Truth.

btw, i've killed off the embarrassing Recovered, et al.

also, if you ever wanted (as a backup, say, to my build process), probably producing a daily coleaned.csv file would be reasonably easy for you to do in-house (using the R script i provide).

eguidotti commented 3 years ago

Ok I downloaded your repo as a backup, but I hope everything will go smoothly. Thanks again!

greg-minshall commented 3 years ago

is it legal, useful, to post to a closed issue?

anyway, @eguidotti, you might look at this issue on my site.

i won't do anything about this soon, but that data set might also appeal to you (instead of my coleaned.csv). i'll be curious of your thoughts.

cheers.

eguidotti commented 3 years ago

Hi @greg-minshall, thanks for posting this! It looks quite interesting to me, not only for the data itself, but also as a way to standardize the Geospatial ID. I guess this would make it much easier for users to match the data by administrative area with external providers. I'll reopen this as a reminder for me to get it done, or maybe some volunteer will show up :) Many thanks!

greg-minshall commented 3 years ago

> It looks quite interesting to me. Not only for the data itself, but also to standardize the Geospatial ID. I guess this would make much easier for users to match the data by administrative area with external providers.

yes, i agree. cheers!

eguidotti commented 3 years ago

After months of work... it's done! The new version is available. Please see the changelog.

greg-minshall commented 3 years ago

Emanuele, congratulations. are you still pulling from my data (for states/provinces/oblasts)? just so i can feel un-guilty if/when my builds break... :)

eguidotti commented 3 years ago

Hi Greg, I have switched to the JHU unified dataset as you suggested. Many thanks for your package and your input, it has been very useful!

greg-minshall commented 3 years ago

good -- enjoy!

covid19datahub / COVID19

Suggestions for additional datasets #135