CSSEGISandData / COVID-19

Novel Coronavirus (COVID-19) Cases, provided by JHU CSSE
https://systems.jhu.edu/research/public-health/ncov/

I created an alternative dataset #704

Open owahltinez opened 4 years ago

owahltinez commented 4 years ago

I created an alternative dataset, since this data is so important to many of us and it looks like the maintainers of this dataset are overwhelmed. The new dataset is based on the latest snapshot of this one (including the reported issues) and has been designed to be crowd-sourced, so hopefully we won't run into this issue: https://github.com/open-covid-19/data

If this dataset is fixed, we can get rid of the alternative dataset. I'm not trying to create any competition, only to ensure timeliness for the data.

JiPiBi commented 4 years ago

Hi @owahltinez,

I read through the CSV file in your link.

For example, the data for the 12th that were criticized in this repo (as you pointed out) are the same in your repo, so what do you want from all of us? That we modify the incorrect past values and continuously add new values to this file every day?

I'm not an expert in GitHub, nor in collaborative work at that scale, but does that mean hundreds of changes every day (and as many PRs), plus issues with the different reporting hours of each country, plus the risk that so many people adding lines create just as many formatting issues?

owahltinez commented 4 years ago

Good questions! Thanks for taking a look.

> For example, the data for the 12th that were criticized in this repo (as you pointed out) are the same in your repo, so what do you want from all of us?

It is still unknown how the data in this repo is updated. The maintainers do not appear to accept pull requests from external contributors (so far). I propose to put our efforts towards curating a new dataset where everyone who wants to contribute, can contribute.

> I'm not an expert in GitHub, nor in collaborative work at that scale, but does that mean hundreds of changes every day (and as many PRs), plus issues with the different reporting hours of each country, plus the risk that so many people adding lines create just as many formatting issues?

The alternative dataset I created is closer to a "true" time series, with each datapoint representing an individual event. Because each line is a different event, and Git is pretty good at diffing text files, I expect collisions to be minimal. That's in contrast to the dataset in this repo, where each new date is a column added to the CSV file, which I don't find to be an appropriate format for time series data.
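
The difference between the two layouts can be sketched in a few lines of Python. This is only an illustration: the column names (`Country`, `Date`, `Confirmed`), the helper `wide_to_long`, and the sample values are assumptions made up for the example, not the actual schema or data of either repository.

```python
import csv
import io

# Hypothetical wide-format sample: one column per date (values are made up).
WIDE_CSV = """Country,2020-03-10,2020-03-11,2020-03-12
Italy,100,120,150
Spain,50,70,90
"""


def wide_to_long(wide_text):
    """Turn one-column-per-date rows into one record per (date, country)."""
    reader = csv.DictReader(io.StringIO(wide_text))
    records = []
    for row in reader:
        country = row["Country"]
        for column, value in row.items():
            if column == "Country":
                continue  # every other column header is a date
            records.append(
                {"Date": column, "Country": country, "Confirmed": int(value)}
            )
    # Sort by date, then country, so a new day only adds lines at the end.
    records.sort(key=lambda r: (r["Date"], r["Country"]))
    return records


long_records = wide_to_long(WIDE_CSV)
for record in long_records:
    print(record["Date"], record["Country"], record["Confirmed"])
```

In the long layout a new day adds rows rather than a column, so existing lines stay untouched.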

Ideally, the WHO will begin publishing data in a more consumable format. Or at least they will be consistent in the formatting of their PDF reports. Until then, we will have to figure out a way to scrape data from many different sources and not all of that can be automated (at least reliably); so crowd-sourced data seemed like the way to go in such a time-critical problem.

JiPiBi commented 4 years ago

To limit the issues, perhaps we need correspondents for a country or a group of countries (as people in the US are trying to organize in one of the issues).

Is there an explanation somewhere for this sort of collaborative work? I don't understand how it could work:

So many questions, but interesting ones...

owahltinez commented 4 years ago

The explanation is in the README of the new repo: https://github.com/open-covid-19/data

Essentially, the idea is that someone who finds an issue with the data can just edit it via pull request, which can be reviewed by anyone else. If enough people are interested, more people can be added with rights to accept pull requests.

> Is there an official file for people who only consume the results, and a daily file first filled in as a draft by the community, which someone in the community checks at the end of the day and aggregates into the official file?

Yes, that would be the files under https://github.com/open-covid-19/data/tree/master/output. I also added a JSON format so people can use it directly from their JavaScript applications.

> If the draft daily file is opened by someone, it must be locked until the file is saved and closed. So you must wait for the file to become available? (If it were a real database like Access, but with another app on the web, perhaps that could avoid this issue, because it permits multiple simultaneous modifications to different records.)

It's not a database, just a plain-text CSV file on which we can collaborate using Git and GitHub's pull requests for updates and conflict resolution.
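
As a rough sketch of why a line-per-datapoint CSV suits this workflow, the snippet below uses Python's standard `difflib` module to show which lines two independent edits touch. The file contents and values are made up for illustration, and `changed_lines` is a hypothetical helper; this is not how Git itself computes or merges changes, only an approximation of a line-based diff.

```python
import difflib

# Hypothetical long-format file: one line per datapoint (values are made up).
base = [
    "Date,Country,Confirmed\n",
    "2020-03-11,France,30\n",
    "2020-03-11,Italy,120\n",
    "2020-03-11,Spain,70\n",
]

# Contributor A appends a new datapoint; contributor B corrects an old one.
edit_a = base + ["2020-03-12,France,40\n"]
edit_b = [line.replace("France,30", "France,35") for line in base]


def changed_lines(old, new):
    """Return only the added/removed lines of a unified diff, without headers."""
    diff = difflib.unified_diff(old, new, lineterm="")
    return [
        d.rstrip("\n")
        for d in diff
        if d.startswith(("+", "-")) and not d.startswith(("+++", "---"))
    ]


changes_a = changed_lines(base, edit_a)
changes_b = changed_lines(base, edit_b)
print(changes_a)  # only the appended line
print(changes_b)  # only the corrected line, as a removal plus an addition
```

The two edits touch different lines of the file, which is the situation line-based tools tend to handle well.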

> How do you manage the time limit for the whole community to fill in data for a given day?

Data can be backfilled if necessary. For example, I'm slowly fixing data for 2020-03-12 for the countries where I can find an official data source.
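
Backfilling can be thought of as an "insert or replace" keyed on date and country. The following is a minimal sketch under that assumption; the field names, the `backfill` helper, and the sample values are hypothetical and not taken from the actual dataset.

```python
def backfill(records, date, country, confirmed):
    """Replace the datapoint for (date, country) if present, otherwise add it."""
    key = (date, country)
    updated = [r for r in records if (r["Date"], r["Country"]) != key]
    updated.append({"Date": date, "Country": country, "Confirmed": confirmed})
    # Keep the file ordered by date, then country.
    updated.sort(key=lambda r: (r["Date"], r["Country"]))
    return updated


# Hypothetical records; the values are made up for illustration.
records = [
    {"Date": "2020-03-11", "Country": "Italy", "Confirmed": 120},
    {"Date": "2020-03-12", "Country": "Italy", "Confirmed": 120},
]

# Correct the 2020-03-12 value for Italy, then add a missing row for Spain.
records = backfill(records, "2020-03-12", "Italy", 150)
records = backfill(records, "2020-03-12", "Spain", 90)
for r in records:
    print(r["Date"], r["Country"], r["Confirmed"])
```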

> Does it need as many pull requests as there are modifications in the draft of the daily file?

Ideally, each pull request uses a single source with updated numbers. If a source updates many countries (like the WHO situation report) then it's not necessarily 1 pull request per datapoint but, yes, there is a pull request / commit for each modification.

JiPiBi commented 4 years ago

If you are searching for valid data, this one seems quite reliable to me: https://qap.ecdc.europa.eu/public/extensions/COVID-19/COVID-19.html

owahltinez commented 4 years ago

That looks very interesting. I only see a dashboard, where is the source of the data?

JiPiBi commented 4 years ago

It's difficult to know where the source is. On certain graphs below, there is an icon at the upper right to access the selected data, but for the moment I haven't succeeded in exporting it. The values seem to be published at 13:00. I have to check to understand how they manage that, but I don't have the time now. See you later.

dmamalis commented 4 years ago

> That looks very interesting. I only see a dashboard, where is the source of the data?

https://www.ecdc.europa.eu/en/publications-data/download-todays-data-geographic-distribution-covid-19-cases-worldwide

;-)

owahltinez commented 4 years ago

Woah! This is perfect. I'll replace my dataset's source data with this, and pull daily. Thanks @dmamalis !

greg-minshall commented 4 years ago

Here's another, probably not so well maintained (if I do say so myself), aggregated and somewhat modified version of the JHU files.

https://gitlab.com/minshall/covid-19

JochemDeen commented 4 years ago

Looks way better!

benjiqq commented 4 years ago

@owahltinez any ideas on how to create more awareness for independently sourced datasets that leverage the wisdom of the crowd?

DavidGeeraerts commented 4 years ago

See issue #558. The data will be posted here: http://blog.lazd.net/coronadatascraper/

matjung commented 4 years ago

The number of data slave providers is going up, but some sources are updated only once a day. This one reveals the name of the Excel file that can be downloaded from the EU site www.ecdc.europa.eu.

HTML readme: https://micro-work.net/covid
JSON GET: https://micro-work.net/covid/ecdcweb.php

{"response":{"source":"https:\/\/www.ecdc.europa.eu\/en\/publications-data\/download-todays-data-geographic-distribution-covid-19-cases-worldwide","urlexcel":"https:\/\/www.ecdc.europa.eu\/sites\/default\/files\/documents\/COVID-19-geographic-disbtribution-worldwide-2020-03-15.xls","excel":"COVID-19-geographic-disbtribution-worldwide-2020-03-15.xls"}}

A service to get the Excel content in CSV / JSON format is a work in progress.
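
For readers who want to use that response, here is a minimal Python sketch that parses the JSON quoted in this comment and extracts the Excel URL. The response text is copied from the comment rather than fetched live, and the variable names are just illustrative; whether the endpoint still returns this shape is an assumption.

```python
import json

# Response body copied from the comment above (not fetched live).
RESPONSE_TEXT = r'''{"response":{"source":"https:\/\/www.ecdc.europa.eu\/en\/publications-data\/download-todays-data-geographic-distribution-covid-19-cases-worldwide","urlexcel":"https:\/\/www.ecdc.europa.eu\/sites\/default\/files\/documents\/COVID-19-geographic-disbtribution-worldwide-2020-03-15.xls","excel":"COVID-19-geographic-disbtribution-worldwide-2020-03-15.xls"}}'''

# json.loads turns the escaped "\/" sequences back into plain "/".
payload = json.loads(RESPONSE_TEXT)["response"]
excel_url = payload["urlexcel"]
excel_name = payload["excel"]

print(excel_name)
print(excel_url)
```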

greg-minshall commented 4 years ago

CSV/JSON seem really desirable to me.