CSSEGISandData / COVID-19

Novel Coronavirus (COVID-19) Cases, provided by JHU CSSE
https://systems.jhu.edu/research/public-health/ncov/

Can the community help? #957

Open judepayne opened 4 years ago

judepayne commented 4 years ago

This data is the de facto standard for covering the COVID-19 outbreak. For me personally, the time series data in particular, which I can't find anywhere else, is invaluable for understanding the outbreak. Yet I, like many others here raising PRs, am noticing (often quite minor) day-to-day problems with the data accuracy which are often not fixed retrospectively.

Is there anything the community can do to help you (Johns Hopkins) with the gathering and maintenance of this data? For example, could we write and maintain code to gather the data and cross-check it more effectively? Could there be a larger group of maintainers who are able to examine and merge in all the open PRs?

What are everyone's thoughts on this?

S-Wallace-OH commented 4 years ago

Check #558

judepayne commented 4 years ago

Thanks - I read through everything. I'm personally not so interested in having deeper county-level data, more in ensuring that the main time series data is accurate as much of the time as possible.

So, I volunteer to verify PRs on this repository and help you merge them, whatever the workflow is.

Please let me know if you're interested.

Separately, I will contact lazd to see if I can help there. I notice that his code doesn't seem to do any cross-checking. For example, it could check a country-day's numbers against the current state of the Wikipedia page (which gets many community updates) and automatically raise an issue on this repo if the difference is outside a threshold.
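
To make the threshold idea concrete, here is a rough sketch in Python. It is not lazd's code: the `CountryDay` record, the function names, and the 5% default threshold are all assumptions of mine for illustration, and fetching the numbers from either source is left out entirely.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CountryDay:
    """One country's confirmed count on one day (hypothetical record type)."""
    country: str
    date: str       # ISO date string
    confirmed: int


def relative_difference(primary: int, reference: int) -> float:
    """Absolute difference as a fraction of the larger value (0.0 if both are zero)."""
    largest = max(primary, reference)
    if largest == 0:
        return 0.0
    return abs(primary - reference) / largest


def needs_issue(primary: CountryDay, reference: CountryDay,
                threshold: float = 0.05) -> bool:
    """True when the two sources disagree by more than `threshold` (assumed 5%)."""
    if (primary.country, primary.date) != (reference.country, reference.date):
        raise ValueError("records describe different country-days")
    return relative_difference(primary.confirmed, reference.confirmed) > threshold
```

A relative threshold seems more useful than an absolute one, since a difference of 50 cases matters far more for a small country than for a large one, but the right value would need tuning against real data.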

Bost commented 4 years ago

> For example, could we write and maintain code to gather the data and cross-check it more effectively? Could there be a larger group of maintainers who are able to examine and merge in all the open PRs?

Instead of this, I think the task of gathering data into one repo should be distributed and have a hierarchy. E.g. one person responsible for his/her continent. And - if needed, going deeper in the hierarchy: one person responsible for countries A,B,C another person for D,E,F etc.

judepayne commented 4 years ago

@Bost that sounds like a great idea. I would volunteer to initially take Europe.

judepayne commented 4 years ago

@CSSEGISandData, @enshengdong, @hongru94 Please let us know if you would like our efforts to help you handle the issues and PRs on this repo.

I have been in touch with @lazd here and he has added a cross-checking section to the scrape report for cases where there are multiple sources of the same data.

I have also raised this as a further addition to coronadatascraper, so that it can post an automated GitHub issue summarising differences, the idea being to get ahead of the issues and PRs raised here.
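
Roughly what I have in mind for the automated issue, as a Python sketch. This is not coronadatascraper's code: the `reports` layout (source name mapped to per-location counts), the function names, and the 5% threshold are assumptions for illustration. It only builds the `title` and `body` of an issue; actually posting it to GitHub is left out.

```python
from typing import Dict, List, Optional


def find_disagreements(reports: Dict[str, Dict[str, int]],
                       threshold: float = 0.05) -> List[str]:
    """Return one summary line per location whose sources disagree.

    `reports` maps a source name to {location: count} (assumed layout).
    A location is flagged when (max - min) / max exceeds `threshold`.
    """
    locations = sorted({loc for counts in reports.values() for loc in counts})
    lines = []
    for loc in locations:
        # Collect the count from every source that covers this location.
        values = {src: counts[loc] for src, counts in reports.items() if loc in counts}
        if len(values) < 2:
            continue  # nothing to cross-check against
        low, high = min(values.values()), max(values.values())
        if high > 0 and (high - low) / high > threshold:
            detail = ", ".join(f"{src}: {n}" for src, n in sorted(values.items()))
            lines.append(f"- {loc}: {detail}")
    return lines


def build_issue(reports: Dict[str, Dict[str, int]], date: str,
                threshold: float = 0.05) -> Optional[Dict[str, str]]:
    """Build an issue title and body, or None when all sources agree."""
    lines = find_disagreements(reports, threshold)
    if not lines:
        return None
    return {
        "title": f"Cross-check differences for {date}",
        "body": "\n".join(lines),
    }
```

Returning `None` when everything agrees means no issue gets opened on a clean day, which keeps the noise down for the maintainers.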

Bost commented 4 years ago

> @Bost that sounds like a great idea. I would volunteer to initially take Europe

@judepayne Then you need to step up and take over the initiative. And START. But it's gonna be a full time job. Be aware. In any case even if you're not available 24/7 you may start by raising awareness about our "how to scale" idea.

Edit:

> @CSSEGISandData, @enshengdong, @hongru94 Please let us know

I guess they're overloaded and have no time for potentially fruitless discussions. You need to present some proper results and then they will listen.

Bost commented 4 years ago

@judepayne please have a look at https://github.com/CSSEGISandData/COVID-19/issues/1035#issuecomment-601337814

JiPiBi commented 4 years ago

I think that the links I indicated in #1035 are known by some of you:

https://coronadatascraper.com/#sources
https://github.com/opencovid19-fr/data
https://github.com/pcm-dpc/COVID-19

You could perhaps also read #1046 for Germany.

jgehrcke commented 4 years ago

Quick feedback from my side about data from Germany: I have been working on https://github.com/jgehrcke/covid-19-germany-gae, providing both spatially coarse-grained and fine-grained data for Germany through clean interfaces. The README in that repository also discusses some insights about the freshness of the data coming from the various entities in Germany.

For starters, I contributed a corresponding scraper for the spatially coarse-grained data for Germany to https://github.com/lazd/coronadatascraper. I will contribute the scraper(s) for the more spatially fine-grained data ASAP. I think that https://github.com/lazd/coronadatascraper is very promising. The quality of its README and website has certainly convinced me to contribute there.

I love this collaboration, it's a nice world that we live in. Cheers!

@bost @JiPiBI @lasz @judepayne I will probably not be available full time, but I want to help, have strong technical opinions and relevant data mangling expertise -- please never hesitate to reach out.

judepayne commented 4 years ago

Seems that there will be quite a few of us willing to help out!

I feel solving this is a case of making the overall workflow more sustainable by automating as many parts as possible and then scaling the non-automatable parts by using the community to handle the (reduced) volume of issues and PRs.

I also feel that this site is the standard for the global data - I've seen it cited everywhere - so I would be very reluctant to start a parallel effort rather than direct our energies to making this data as accurate as possible.

We need to hear from the repo's maintainers...