cipriancraciun / covid19-datasets

COVID-19 derived and augmented datasets (based on JHU, NY Times, ECDC) exported as JSON, TSV, SQL, SQLite DB (plus visualizations)
https://scratchpad.volution.ro/ciprian/eedf5eb117ec363ca4f88492b48dbcd3/

Just accidentally (when reviewing my inspection utility) I found some tiny issues... #36

Closed: gottfriedhelms closed this issue 4 years ago

gottfriedhelms commented 4 years ago

These issues are tiny bugs at most, but they may indicate a problem with your automated adaptation of the JHU files. I came across this when I saw that the "Aruba" data in your "combined-values" seems not to have been adapted optimally. The JHU originals contain double reporting on two days, with different confirmed values. Also, the data record from 11.03 seems to be missing in your CIP dataset.
See the inspection forms in parallel for your CIP data and for the JHU data.
aruba_base
The upper-right form is the inspection tool for the JHU serial data and the lower-left form is for your "combined" data. In the JHU-serial-data form I've also documented the uncorrected original country-province-date entries, for better reference and for detecting errors in the conversion process itself. They are also arranged per daily file (see column "filenr"), which shows that the same record has been repeated across several sequential files by JHU.

The first problem occurred with the data record of 11.03. It does not occur in your dataset at all. Maybe a bug in your import routine? See here:
aruba_11_03

Next is the data for 18.03. In JHU, Aruba is now documented twice, even with different values (2 and 4) in the same daily file! You've combined these to get 6 cases, which might be sensible. In the next file JHU simply repeats that double reporting, but from 20.03 onwards they combine it. Now they increase either from 2 or from 4 to 5, and you correctly get the same number 5. That means, however, that in your dataset the numbers decrease from 6 to 5, which produces errors in graphs with logarithmic scales!

aruba_18_03

For the moment these are only small reminders; I'm not in an intense verifying/checking process. Also, this stems from your data from 25.03 and might not occur in more current datasets.

If I find something more like this I'll add observations here.

cipriancraciun commented 4 years ago

Thanks for reporting.

Regarding the Aruba case, apparently this issue was reported multiple times in the JHU repository, but no action was taken by them:

I've also noticed a major issue with French Polynesia that basically breaks any model for that region (basically, for one day they reported all of France's cases under French Polynesia):


Now, I've thought a little bit about how to handle these inconsistencies:

And my current decision is the last one, "do nothing", because in the end my main goal is to provide the data in a more "usable" format; my goal is not to meddle with the data.


That being said, I do ponder about introducing another dataset that takes the original data from JHU (and the others) and "smooths" inconsistencies.

I still have to think about the mathematical model that I would need to employ, but I could lay out the following requirements:

How can I achieve this? I don't know; perhaps you have a suggestion.

(My thought is a moving window average?)
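Just to make the idea concrete, here is a minimal sketch (plain Python, with made-up sample values and hypothetical helper names, not code from this repository) of a centered moving-window average followed by a monotonicity pass, so that corrections like the Aruba 6 -> 5 would no longer produce decreasing cumulative values:

```python
# A minimal sketch (hypothetical helpers, made-up values) of a centered
# moving-window average over a cumulative series, plus a monotonicity pass.

def smooth_series(values, window=3):
    # Centered moving-window average; the window shrinks at the edges.
    smoothed = []
    half = window // 2
    for i in range(len(values)):
        lo = max(0, i - half)
        hi = min(len(values), i + half + 1)
        chunk = values[lo:hi]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed

def enforce_monotonic(values):
    # Cumulative counts should never decrease; carry the running maximum forward.
    result = []
    current = float("-inf")
    for value in values:
        current = max(current, value)
        result.append(current)
    return result

# Made-up Aruba-like series containing the 6 -> 5 correction described above.
raw = [2, 6, 5, 5, 12, 17]
print(enforce_monotonic(smooth_series(raw)))
```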


(I'll post another reply about the other issue.)

cipriancraciun commented 4 years ago

Regarding the missing data, checking the latest files (as of April 18th), the first Aruba case is on 13th March. It appears correctly in my dataset, i.e. just as in the JHU files. Thus it might be an issue that was corrected later on. (I would strongly suggest using the latest files.)


However, you are right, there are gaps in my dataset; for example, in the case of Aruba I have the dates 13, 17, 18, 20, 22, 24, etc. for March. (The other dates in between are missing.)

But these gaps are by design, because the actual values (for any of the metrics: confirmed, deaths, and recovered) have not changed, due to any of the following reasons:

In any case, my reasoning is that just repeating the same values (as JHU does) is wrong, because it basically interpolates data and messes with statistics (like median, average, etc.) and especially with "deltas" and "speed".

Therefore I decided to drop any rows that don't actually provide any new data.
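(For anyone who does need a dense daily series, here is a minimal sketch, using pandas, of how the dropped "no change" days could be forward-filled on the consumer side; the column names and sample values are hypothetical placeholders, not the actual dataset schema.)

```python
# A minimal sketch (hypothetical column names and made-up values) of
# forward-filling the dropped "no change" days on the consumer side, so
# that a dense daily series can be reconstructed from the sparse rows.

import pandas as pd

def densify(frame, date_column="date", value_columns=("confirmed", "deaths", "recovered")):
    # Re-index onto a full daily calendar and carry the last known value forward.
    frame = frame.set_index(pd.to_datetime(frame[date_column])).sort_index()
    full_index = pd.date_range(frame.index.min(), frame.index.max(), freq="D")
    dense = frame.reindex(full_index)
    dense[list(value_columns)] = dense[list(value_columns)].ffill()
    return dense

# Made-up sparse rows, with the unchanged days already dropped.
sparse = pd.DataFrame({
    "date": ["2020-03-13", "2020-03-17", "2020-03-18", "2020-03-20"],
    "confirmed": [2, 3, 5, 5],
    "deaths": [0, 0, 0, 0],
    "recovered": [0, 0, 0, 1],
})
print(densify(sparse))
```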

gottfriedhelms commented 4 years ago

In any case, my reasoning is that just repeating the same values (as JHU does) is wrong, because it basically interpolates data and messes with statistics (like median, average, etc.) and especially with "deltas" and "speed".

Therefore I decided to drop any rows that don't actually provide any new data.

Yes, this is, I think, the best decision. Only for the timeseries data might it be recommended to "fill the gaps", because the user might not have an easy instrument to do this themselves (I can use MS Access with its "crosstab" function, which can produce the JHU timeseries format instantly).
In general, seeing how the JHU team themselves do not handle the multitude of data issues, I don't think a big private effort is warranted: they seem to break the "fixed" structure and any agreement arbitrarily and erratically, they won't update their earlier documentation, and I expect this will happen again in the future...

Again: applause for your great work here, Ciprian!

cipriancraciun commented 4 years ago

In general, seeing how the JHU team themselves do not handle the multitude of data issues, I don't think a big private effort is warranted: they seem to break the "fixed" structure and any agreement arbitrarily and erratically, they won't update their earlier documentation, and I expect this will happen again in the future...

That is why I also imported the ECDC (for global data) and NY Times (for US data) datasets as alternatives to JHU.

Thus I would strongly advise you to also use the ECDC-derived data. (It uses exactly the same format as the JHU-derived one.)
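For example, since both exports share exactly the same format, switching sources should require nothing more than pointing the loader at a different file; a minimal sketch (with hypothetical placeholder file names, not the actual export paths):

```python
# A minimal sketch (hypothetical file names, not the actual export paths) of
# loading either the JHU-derived or the ECDC-derived TSV with the same code,
# since both exports share the same format.

import csv

def load_dataset(path):
    with open(path, newline="") as stream:
        return list(csv.DictReader(stream, delimiter="\t"))

jhu_rows = load_dataset("jhu-combined.tsv")    # placeholder file name
ecdc_rows = load_dataset("ecdc-combined.tsv")  # placeholder file name
```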