CSSEGISandData / COVID-19

Novel Coronavirus (COVID-19) Cases, provided by JHU CSSE
https://systems.jhu.edu/research/public-health/ncov/

This is very poorly maintained. Please add comments. #1971

Open everettwolf opened 4 years ago

everettwolf commented 4 years ago

I want to see if people agree. There are a lot of smart people trying to use the data here, but the number of obvious (and fixable) errors, plus the odd midstream changes in data format made without retroactively updating the older data or forking it, makes it pretty difficult to do anything constructive with.

The US data took a total dump when a “Recovered” field was added, lumping a total in there, making daily cases zero. This is so obviously a bad thing to do, as far as being able to come up with recovery rate timelines, etc, that it makes me really hope this isn’t a source of data we’re seeing all over the news.

Can people chime in, so that maybe this will be rectified? Took me about an hour to find a ton of problems that have repeatedly been reported here, and I’m no genius.

texadactyl commented 4 years ago

I wrote Python code and maintain a database of a subset of the daily report information, allowing me to slice and dice the reports data any way I want. I suggest people do the same in the programming language of their choice.

Yeah, I ran into the column heading renames around the 3rd week of March 2020 and updated my source. Since I use the Python pandas package for reading/filtering the CSV files, I select and reorder columns as best fits my database loader. The "documentation" for me is what I see on https://gisanddata.maps.arcgis.com/apps/opsdashboard/index.html#/bda7594740fd40299423467b48e9ecf6 and other reputable places (not always in agreement!).
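For anyone doing the same, here's a minimal sketch of that normalize-and-reorder step. The alias map reflects the heading renames I remember from around that format change; treat it as an assumption and adjust it to whatever you actually see in the files:

```python
import pandas as pd

# Map the pre- and post-rename column headings to one canonical set.
# (These aliases are from memory of the repo's March 2020 format change;
# verify against the actual daily-report files before relying on them.)
COLUMN_ALIASES = {
    "Province/State": "Province_State",
    "Country/Region": "Country_Region",
    "Last Update": "Last_Update",
}

def load_daily_report(path_or_buffer):
    """Read one daily-report CSV and normalize its column names/order."""
    df = pd.read_csv(path_or_buffer)
    df = df.rename(columns=COLUMN_ALIASES)
    # Keep only the columns the database loader cares about, in a fixed order.
    wanted = ["Province_State", "Country_Region", "Last_Update",
              "Confirmed", "Deaths", "Recovered"]
    present = [c for c in wanted if c in df.columns]
    return df[present]
```

With that in place, files from before and after the rename load into the same schema, which is all the loader really needs.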

So, documentation is not really an issue for me. I'm the type who never reads the manual anyways.

Sometimes, the data is suspect. E.g. everything from China. Oh well.

I am sympathetic with JHU as they are pedaling as fast as they can with the few resources that they have for this data collection.

everettwolf commented 4 years ago

Yeah, I scripted it out as well, and transposed the data to put the dates into rows rather than columns (not sure why they did it that way, but I'm sure it's for a good reason).

I made it a process that starts with a git pull, then does what I need from scratch, in the hopes that they would fix up the data retroactively. Instead, I have to keep building up ever-increasing patching of their data as it comes in. Stuff such as keeping the field names consistent from day to day would make that a lot easier. Adding a column for "state/province" (US/CA) would mean the county doesn't need a ", KY" appended to it, which currently makes it difficult to parse. Simple stuff like that.
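To make the reshaping and patching concrete, here's a sketch of the two steps I mean, with hypothetical helper names; the column headings follow the global time-series files, and the ", KY" split assumes the early daily-report convention of encoding the state inside the county field:

```python
import pandas as pd

def transpose_time_series(df):
    """Melt the one-column-per-date layout into one row per (place, date)."""
    id_cols = [c for c in ("Province/State", "Country/Region", "Lat", "Long")
               if c in df.columns]
    date_cols = [c for c in df.columns if c not in id_cols]
    long_df = df.melt(id_vars=id_cols, value_vars=date_cols,
                      var_name="date", value_name="confirmed")
    long_df["date"] = pd.to_datetime(long_df["date"])
    return long_df

def split_county_state(value):
    """Split 'Jefferson, KY' into ('Jefferson', 'KY'); pass other values through."""
    if isinstance(value, str) and "," in value:
        county, _, state = value.rpartition(",")
        return county.strip(), state.strip()
    return value, None
```

Dates-as-rows then loads straight into a database table, which is presumably why most of us end up transposing it.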

What's ironic is the models you linked to got hit by the "Recovered" change and it was reporting 0 recoveries (in total) for the US for several days running. I thought this was THEIR data.

As far as being sympathetic and forgiving due to few resources and going as fast as they can, I lean toward NOT, as this really isn't a lot of data (I deal with far more on a daily level), and it would really just take one very diligent person with the right access to clean it up.

I get that the data represents something horrible that's happening as we speak, but that actually makes it LESS of an excuse to be sloppy about it.

I'm starting to find better maintained datasets (U of W) so will probably just abandon this one. But I really hope they fix it.

kiran1302 commented 4 years ago

The big question is: who else can we go to for the "Single Source of Truth"? Is there an alternate source of data for the whole world that's updated better than this?

Like a lot of people, I also started off with this dataset which seemed to give enough information for us to start with some interesting dashboards.

Everything was fine as long as we were showing information at Country level for the whole world.

And then came the request for US States, we waited for a few weeks and then received 2 new files breaking the US data into province/states and further details. That was great, we used that for updating the US related dashboards.

But for the last couple of days, I noticed that the total of all the US states in time_series_covid19_confirmed_US.csv does not match the US total in time_series_covid19_confirmed_global.csv.

For 04-04-2020: time_series_covid19_confirmed_global.csv reports a US total of 308,850, while the states in time_series_covid19_confirmed_US.csv sum to 308,845.
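A quick way to catch this kind of drift automatically is a cross-check like the sketch below (the function name is mine, and the column/file conventions assumed are those of the two time-series files mentioned above):

```python
import pandas as pd

def us_totals_match(us_df, global_df, date_col):
    """Compare the per-state sum in the US file against the US row
    in the global file for one date column."""
    us_sum = int(us_df[date_col].sum())
    global_us = int(
        global_df.loc[global_df["Country/Region"] == "US", date_col].sum())
    return us_sum == global_us, us_sum, global_us
```

Running this over every date column once per day would flag mismatches like the 308,850 vs 308,845 one the moment they appear.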

We know it's a tedious task to maintain multiple datasets and to ensure that numbers everywhere match properly, but we'd really appreciate it being done correctly, even if it means a few hours of delay (considering that a lot of people are consuming this data for reports, dashboards, and analysis).

We hope you'll fix this soon.

everettwolf commented 4 years ago

@kiran1302 I've got a lot going on at the moment; if you find a better source of truth, can you reply here with it?

I'm looking to do some time-series data for the US, by State, with a lot of focus on recovery rates.

When they switched to the US county breakdown, I guess a decision was made to remove the "recovered" value and not track the "active" value (which would have made "recovered" inferable). And rather than lumping the "recovered" value at the state level (still a loss of granularity, but workable), it's lumped at the COUNTRY level, so it's really untrackable at the state level.

Also, if you find some data modeling from another source that doesn't just focus on confirmed/active/death (as if that's all that matters), would you mind shooting them my way as well?

As a side note, the latest free version of Splunk has some free Covid-19 dashboards with it, but they run into the same problems I've outlined. Also, seeing a lot of git repositories springing up using THIS as the source of truth, focused on the US, and it's really bumming me out. ;o) If we can get some good data, I expect this would explode into something real. The best public site I've seen is from https://worldometers.info, but a) I'm not sure they're publishing their data, and b) they don't quite track it the way I'd like to see.

Thanks!

HuidaeCho commented 4 years ago

There is no doubt that this data repository is very important, and I really appreciate all the effort put into it. However, as others mentioned above, I found the format of this dataset very unreliable, which makes it challenging to process cleanly into something productive. I wrote a Python script that fetches daily reports from here and from their REST service, and tries to create clean GeoJSON and CSV files for my open source web app. I can only hope they won't make other significant changes in the future.
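For reference, the fetch side of such a script mostly boils down to building per-date URLs against the raw GitHub content. The path below matches the repo layout as I understand it (csse_covid_19_daily_reports/MM-DD-YYYY.csv); verify it against the repository before relying on it:

```python
from datetime import date

# Base path for raw daily-report CSVs (assumed layout; check the repo).
BASE = ("https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/"
        "csse_covid_19_data/csse_covid_19_daily_reports")

def daily_report_url(d: date) -> str:
    """Build the raw-content URL for one daily report, named MM-DD-YYYY.csv."""
    return f"{BASE}/{d.strftime('%m-%d-%Y')}.csv"
```

Pair that with a normalizing loader and you can re-fetch the whole history from scratch after each format change.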

blisx264 commented 4 years ago

99% of programming is data validation. They're doing well by letting the world use the data; let's be proactive and productive with it.

texadactyl commented 4 years ago

@blisx264 That 99% of our code includes a lot of error recovery/reporting. Things could be worse. We could be writing these programs in ALC or COBOL at a 3270 green screen emulator/terminal to run on z/OS Batch.

cipriancraciun commented 4 years ago

Unfortunately, at the moment the JHU dataset seems to be the "preferred source of truth". (Even Google advertises it as part of their BigQuery endorsed datasets.)

However, there are also the ECDC and the NY Times, which provide alternative data sources (see the issue and repository below for details). (And there are a few other independent scraping- or volunteer-based approaches.)

That being said, like the others who have commented, I have just created a lot of "mapping" and "cleaning" code, and published my own JHU-derived dataset, as described in #1281 and available at https://github.com/cipriancraciun/covid19-datasets

everettwolf commented 4 years ago

@cipriancraciun thanks for the comment.

Worldometers is slowly adding in some fields I wanted to see (still not the granular US state recovery info that got dropped here, though), so I have switched over to their sources.

Eventually, maybe after the pandemic is "over", we'll probably see a full set of data somewhere, and we'll be able to model it the way we (I) wish we could now.

IOW, for what I want, this data can’t be cleaned and mapped.

everettwolf commented 4 years ago

A million times over, this: https://covidly.com/?country=United%20States&showStates=1

Can't vouch for its accuracy, but it has elements that can NOT be produced from the data in this repository alone, which is the stuff I was going for.

THIS is what should dominate the news, IMO. Maybe people here have already seen it, but a friend forwarded it to me.

In his FAQ, he mentions that he had to abandon the data from here for certain datapoints due to all the format changes. But rather than just remove those from his dashboards, as Johns Hopkins et al. have, he gets them from one of the other sources and includes them (again, only as accurate as aggregating spotty data gets). AND he lets the user decide what data is relevant to them, rather than canning the results to something like JUST death rates. This is what I was TRYING to do with this data alone.