cipriancraciun opened this issue 4 years ago
Hi, I have been thinking along similar lines, but reached a slightly different conclusion/question.
Like many others here I also use the dataset for analytics, but due to the data quality issues I have had to maintain a running set of patches of my own. I suspect the JHU team is overwhelmed by the number of issues and PRs here, since there seems to be little response.
As I understand it, the data here is gathered automatically by (an instance of?) the corona data scraper maintained by lazd.
I have reconciled the country-level data here against the (community-maintained) Wikipedia country-level table and the table on Worldometer. Both seem to be of better quality overall, but neither is a time series nor in a format that makes one easy to build. This implies that although the data can largely be gathered automatically, some manual maintenance is needed on top to keep it reliably clean.
My question is: would it be worthwhile to have a repository with the same automatically gathered data as here, but where a community of volunteers does any fixing up required on top? We could assist the fixing-up process by building automated reconciliations against other sources. I'm reluctant to move away from the JHU data since it's the de facto standard, but I think this represents a decent compromise.
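To sketch what such an automated reconciliation could look like (the file names, column names, and the 2% tolerance here are all assumptions for illustration, not an existing tool):

```python
# Compare country-level case totals from two snapshot CSVs and flag
# discrepancies. File and column names are illustrative assumptions.
import csv

def load_totals(path, country_col, cases_col):
    """Load a CSV into a {country: cases} mapping."""
    with open(path, newline="") as f:
        return {row[country_col]: int(row[cases_col]) for row in csv.DictReader(f)}

jhu = load_totals("jhu_country_totals.csv", "Country/Region", "Confirmed")
reference = load_totals("reference_totals.csv", "country", "confirmed")

for country, cases in sorted(jhu.items()):
    ref = reference.get(country)
    if ref is None:
        # Real reconciliation would need name normalization first
        # ("US" vs "United States", "Korea, South" vs "South Korea", ...).
        print(f"{country}: no match in the reference source")
    elif abs(cases - ref) / max(ref, 1) > 0.02:
        print(f"{country}: JHU={cases} reference={ref}")
```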
Thoughts?
@judepayne Given that I am not affiliated with the JHU team, I can't say for sure what happens behind the scenes. I can only speculate that they run, at best, some form of semi-automatic, supervised scraping system.
I myself have created a derived dataset (described in #1281 and available at https://github.com/cipriancraciun/covid19-datasets) based on the JHU data, which I've augmented with values from the CIA Factbook (country population and area) and from Wikipedia (US state population and area, and the county-to-state mapping).
Your suggestion of an automated process whose output is then corrected by people might be a solution; however, it could be a logistical nightmare: how would one validate a given correction? Perhaps each volunteer could take their own country and cross-check the values against the official statements; if a discrepancy is found, they would note it in a single issue dedicated to that country (no duplicate issues for the same country; all of its discrepancies should go into that one issue).
In fact I think I can easily re-purpose my scripts to apply such "patches", but the question of validity still remains. (A sketch of what such a patch mechanism could look like follows.)
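To make the idea concrete, here is a minimal sketch of a possible patch mechanism; the patch format, the column names, the file name, and the example value are all hypothetical, not taken from my actual scripts:

```python
# Hypothetical "patch" format: each entry overrides a single cell of the
# daily records -- (country, province, date, column) -> corrected value.
import csv

patches = {
    # A made-up correction, purely for illustration:
    ("France", "", "2020-03-22", "Confirmed"): 16018,
}

def apply_patches(rows):
    """Yield daily-report rows, overriding any cell that has a patch."""
    for row in rows:
        for column in ("Confirmed", "Deaths", "Recovered"):
            key = (row["Country/Region"], row["Province/State"], row["Date"], column)
            if key in patches:
                row[column] = patches[key]
        yield row

# Usage sketch: stream a daily report through the patch filter.
with open("daily_report.csv", newline="") as f:
    for row in apply_patches(csv.DictReader(f)):
        pass  # write out the corrected row
```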
Now that I think about it, how about I create a set of HTML files, one for each country, listing the values per day? If one finds a problem, they can click a link that automatically creates a GitHub issue with template contents (pre-filling the country/province/administrative-unit and the date), so it can be easily tracked.
GitHub's documentation describes this kind of pre-filled link (query parameters for new issues).
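For illustration, such a link could be generated like this; the title and body layout below are just an example I made up, not an agreed-upon template:

```python
# Build a pre-filled "new issue" link using GitHub's query parameters
# for issues (title, body, etc.). The title/body layout is a made-up
# example, not an agreed-upon template.
from urllib.parse import urlencode

def issue_link(country, province, date):
    base = "https://github.com/CSSEGISandData/COVID-19/issues/new"
    params = {
        "title": f"[data] {country} / {province or '-'} / {date}",
        "body": (
            f"Country: {country}\n"
            f"Province: {province or '-'}\n"
            f"Date: {date}\n"
            "Expected value:\n"
            "Actual value:\n"
            "Source (official report URL):\n"
        ),
    }
    return base + "?" + urlencode(params)

print(issue_link("Italy", None, "2020-03-25"))
```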
I have already started tracking down duplicate issues, pointing people to the one ticket that seems to contain the most comments, and kindly asking the original poster to follow that other one and close their own duplicate.
Some help in this endeavor would be nice; the following two links search the open issues for particular keywords:

- `recovered` (~200 issues at the moment) -- https://github.com/CSSEGISandData/COVID-19/issues?q=is%3Aissue+is%3Aopen+recovered
- `summary` (and related) (~60 at the moment) -- https://github.com/CSSEGISandData/COVID-19/issues?q=is%3Aissue+is%3Aopen+%28combined+OR+summary+OR+summarization+OR+totals%29

@CSSEGISandData, @hongru94, @enshengdong, @arthurzhang434 (as owners of this repository, or at least as committers), would you consider tidying up the current issues in this repository?
I'm asking because, as it stands, when people want to submit an issue and see how many there already are, they don't even bother searching for duplicates, and just keep submitting new ones...
As it stands, I would suggest closing all tickets that haven't been updated in the last two days:

`is:issue is:open updated:<=2020-03-25`

(or the link to GitHub search), which at the moment yields ~800 open issues.

I could go over the remaining ones from the last two days, try to sort out duplicates, and ask people either to follow another issue or, if the problem is no longer present, to just close it themselves:

`is:issue is:open updated:>=2020-03-26`

(the link to GitHub search.)

Afterwards, I could keep trying to watch over new issues and point people to existing ones. (A sketch of running these searches through the GitHub API follows below.)
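To make triaging repeatable, the same searches can be run through GitHub's issue-search API. A minimal sketch (unauthenticated, so heavily rate-limited; pagination beyond the first 100 results is omitted):

```python
# List the stale open issues matched by the search query above, via
# GitHub's issue search API. Unauthenticated requests are rate-limited,
# and only the first page (up to 100 items) is fetched here.
import json
from urllib.parse import urlencode
from urllib.request import urlopen

query = "repo:CSSEGISandData/COVID-19 is:issue is:open updated:<=2020-03-25"
url = "https://api.github.com/search/issues?" + urlencode(
    {"q": query, "sort": "updated", "per_page": 100}
)

with urlopen(url) as response:
    result = json.load(response)

print(f"{result['total_count']} stale open issues")
for issue in result["items"]:
    print(f"#{issue['number']} ({issue['updated_at']}): {issue['title']}")
```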
(Perhaps pin this issue, or another one that kindly asks people to start looking for duplicates. Alternatively, we could use GitHub issue templates, which can serve this purpose.)
Old Proposal #778
Hi @cipriancraciun, this is a noble cause, but unfortunately I haven't seen much (or any) evidence that issues or PRs get responses, or any interaction from the owners of this repo.
I've asked whether the corona data scraper software is in fact used by JHU.
In terms of community maintenance with automation support, yes, we could split along country/region lines. I think if, say, 6-12 people were interested in helping out, it should be doable. It would be that maintenance team's job to spot errors (with some automated reports/reconciliations to help them) and fix them, as well as respond to further issues raised by anyone. I think the trick is to work out a pretty slick and sustainable workflow before we start.
In my experience, most of the issues here are fairly obvious (some failure in the gathering logic, a change in a website layout, etc.) and the trick is to fix them before they get baked into the time-series record.
Minor errors are perhaps not so important, as the events (new cases etc.) in the real world are happening all the time and the daily time series is just an arbitrary snapshot. If a small portion of those events is not captured in one day's record, they will tend to be captured in the next, since totals seem to be the prevalent form of data.
@judepayne:

> In terms of community maintenance with automation support, yes we could split along country/region lines. [...] I think the trick is to think about having a pretty slick and sustainable workflow before we start.
I have already spent some time writing automated scripts (https://github.com/cipriancraciun/covid19-datasets) that take and transform the JHU dataset; I've started working on integrating the NY Times dataset (for the US) and I plan to include the ECDC dataset (for the rest of the world). If you want to discuss this further, I suggest moving that discussion to that repo.
However the JHU dataset still has valuable information and could be used to double-check other datasets.
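For a flavor of what such a transformation involves, here is a minimal sketch (not an excerpt from my repository) that pivots the JHU global confirmed-cases time series from its wide layout, one column per date, into per-day records:

```python
# Pivot the JHU wide time series (one column per date) into long format.
# The file path matches this repository's layout at the time of writing.
import pandas as pd

RAW = (
    "https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/"
    "csse_covid_19_data/csse_covid_19_time_series/"
    "time_series_covid19_confirmed_global.csv"
)

wide = pd.read_csv(RAW)
long = wide.melt(
    id_vars=["Province/State", "Country/Region", "Lat", "Long"],
    var_name="Date",
    value_name="Confirmed",
)
long["Date"] = pd.to_datetime(long["Date"], format="%m/%d/%y")
print(long.tail())
```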
I think the owners are too busy to deal with these issues, but outsiders (like us) are unable to solve the problem efficiently without the owners' help.
For example, I think making a list of duplicated issues, pinning some important issues, and using labels would help people stop raising duplicates and find what they want to know.
@cipriancraciun your site looks pretty good! Have a look at the https://github.com/lazd/coronadatascraper project as well, and at their website hosting the data, https://coronadatascraper.com/ (I've asked if it could also be made available on GitHub). This project seems to already have an active contributor base for the software.
Perhaps a better solution for the JHU team is to state clearly in the readme that they provide only the raw data, and that any other groupings, summations, augmentations, etc. are out of scope. They could also point to a few other projects that provide these augmentations (with a disclaimer of non-endorsement).
I second @chAwater; there are people willing to help, but they can't, because only the original poster or the repository contributors can actually close issues...
Yesterday I replied to at least 30 tickets, pointing people to duplicates and asking them to follow those and close their own tickets. This morning, however, I see new issues being opened on exactly the same topics, so I gave up... The maintainers of this repository need to step in and do something, or else it will become a swamp.
(I still look over the issues, and if I see anyone who can be helped by my derived dataset, I point them there.)
@judepayne I have already integrated the NY Times dataset, and I am working on integrating the European CDC one into my own, because I believe those teams have more man-power behind the collection and validation of the data points.
@cipriancraciun I agree with your suggestion that JHU should make clear that this site is for raw data only, provided 'as is', since they don't respond to issues.
Then there's the separate question of what we do about it. Do we copy their data over to a new repo every day and let a set of volunteers maintain that, responding to issues etc.? I'm not sure it would get critical mass, but I'm happy to discuss.
In terms of improved automatic data collection, please do look at the https://github.com/lazd/coronadatascraper project, which is the open-source data-gathering effort. They've had 34+ contributors. You can contribute a scraper, and it appears in the data the next day.
@judepayne regarding the "what do we do about feature requests" question: yes, I suggest that people who want to provide alternative formats or augmented variants do exactly that: copy the raw data into their own repositories (for repeatability of the processing), provide their own re-processed data for others to use, and update it daily. (At least this is what I did.)
Regarding the scraper project you pointed to: at the moment I have no opinion, and since I don't do scraping myself I don't think I can contribute anything useful to it. At some point, if it proves to be of good enough quality, I could include it in my re-processed/aggregated dataset (as I did for JHU, NY Times and ECDC).
First of all I would like to thank the JHU team for the effort they have put behind this enormous dataset. (Yes there are inconsistencies, errors, omissions and the like, but this is to be expected given that many of the official reports are actually PDF files that have to be manually scraped.)
Therefore, given how many duplicate issues I've seen, I would propose the following:

- Before submitting a new issue report, please look through at least the current page of existing open issues to check whether the issue was already reported; if it was, contribute and subscribe to that one.
- If an issue you have opened has already been solved (or is no longer of interest to you), please close it.
- If you spot a duplicate of an earlier issue, kindly point out the number of the earlier one in a message on the new duplicate's thread, and kindly ask its author to close it.
This will help the JHU team pin-point the important issues that are still unsolved. As the issue tracker stands, it is practically impossible for anyone to take any useful action, especially given the pressure on the JHU team to gather new data (and correct the existing data).
As a disclaimer I am not affiliated with the JHU team, but I do use their dataset.