cipriancraciun / covid19-datasets

COVID-19 derived and augmented datasets (based on JHU, NY Times, ECDC) exported as JSON, TSV, SQL, SQLite DB (plus visualizations)
https://scratchpad.volution.ro/ciprian/eedf5eb117ec363ca4f88492b48dbcd3/

Just accidentally (when reviewing my inspection utility) I found some tiny issues... #36

Closed: gottfriedhelms closed this issue 4 years ago

gottfriedhelms commented 4 years ago

These issues are tiny bugs at most, but they may indicate a problem with your automated adaptation of the JHU files. I came across this when I saw that the "Aruba" data in your "combined-values" seems not to have been adapted optimally. The JHU originals contain double reporting on two days, with different confirmed values. Also, the data record from 11.03 seems to be missing in your CIP dataset.
See the inspection forms in parallel for your CIP data and for the JHU data.
aruba_base
The upper-right form is the inspection tool for the JHU serial data and the lower-left form is for your "combined" data. In the JHU-serial-data form I've also documented the uncorrected original country-province-date entries, for better reference and for detecting errors in the conversion process itself. They are also arranged per daily file (see column "filenr"), which shows that the same record has been repeated across several sequential files by JHU.

The first problem occurred with the data record of 11.03. It does not occur in your dataset at all. Maybe a bug in your import routine? See here:
aruba_11_03

Next is the data for 18.03. In JHU, Aruba is now documented twice, even with different values (2 and 4) in the same daily file! You've combined these to get 6 cases, which might be sensible. In the next file JHU simply repeats that double reporting, but from 20.03 onwards they combine it. Now they increase either from 2 or from 4 to 5, and you correctly get the same number 5. That means, however, that in your dataset the numbers decrease from 6 to 5, which produces errors in graphs with logarithmic scales!

aruba_18_03

For the moment these are only small reminders; I'm not in an intense verifying/checking process. Also, this stems from your data from 25.03 and might not occur in more current datasets.

If I find something more like this I'll add observations here.

cipriancraciun commented 4 years ago

Thanks for reporting.

Regarding the Aruba case, apparently this issue was reported multiple times in the JHU repository, but no action was taken by them:

I've also noticed a major issue with French Polynesia that basically breaks any model for that region (basically, for one day they reported all of France's cases under French Polynesia):


Now, I've thought a little bit about how to handle these inconsistencies:

And my current decision is the last one, "do nothing", because in the end my main goal is to provide the data in a more "usable" format; my goal is not to meddle with the data.


That being said, I do ponder about introducing another dataset that takes the original data from JHU (and the others) and "smooths" inconsistencies.

I still have to think about the mathematical model that I would need to employ, but I could lay out the following requirements:

How can I achieve this? I don't know; perhaps you have a suggestion.

(My thought is a moving window average?)
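Just to make the idea concrete, here is a minimal sketch (plain Python, with made-up sample values and hypothetical helper names, not code from this repository) of a centered moving-window average followed by a monotonicity pass, so that corrections like the Aruba 6 -> 5 would no longer produce decreasing cumulative values:

```python
# A minimal sketch (hypothetical helpers, made-up values) of a centered
# moving-window average over a cumulative series, plus a monotonicity pass.

def smooth_series(values, window=3):
    # Centered moving-window average; the window shrinks at the edges.
    smoothed = []
    half = window // 2
    for i in range(len(values)):
        lo = max(0, i - half)
        hi = min(len(values), i + half + 1)
        chunk = values[lo:hi]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed

def enforce_monotonic(values):
    # Cumulative counts should never decrease; carry the running maximum forward.
    result = []
    current = float("-inf")
    for value in values:
        current = max(current, value)
        result.append(current)
    return result

# Made-up Aruba-like series containing the 6 -> 5 correction described above.
raw = [2, 6, 5, 5, 12, 17]
print(enforce_monotonic(smooth_series(raw)))
```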


(I'll post another reply about the other issue.)

cipriancraciun commented 4 years ago

Regarding the missing data, checking the latest files (as of April 18th), the first Aruba case is on 13th March. It appears correctly in my dataset, i.e. just as in the JHU files. Thus it might be an issue that was corrected later on. (I would strongly suggest using the latest files.)


However, you are right, there are gaps in my dataset; for example, in the case of Aruba I have the dates 13, 17, 18, 20, 22, 24, etc. for March. (The other dates in between are missing.)

But these gaps are by design, because the actual values (for any of the metrics: confirmed, deaths, and recovered) have not changed, due to any of the following reasons:

In any case, my reasoning is that just repeating the same values (as JHU does) is wrong, because it basically interpolates data and messes with statistics (like median, average, etc.) and especially with "deltas" and "speed".

Therefore I decided to drop any rows that don't actually provide any new data.
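(For anyone who does need a dense daily series, here is a minimal sketch, using pandas, of how the dropped "no change" days could be forward-filled on the consumer side; the column names and sample values are hypothetical placeholders, not the actual dataset schema.)

```python
# A minimal sketch (hypothetical column names and made-up values) of
# forward-filling the dropped "no change" days on the consumer side, so
# that a dense daily series can be reconstructed from the sparse rows.

import pandas as pd

def densify(frame, date_column="date", value_columns=("confirmed", "deaths", "recovered")):
    # Re-index onto a full daily calendar and carry the last known value forward.
    frame = frame.set_index(pd.to_datetime(frame[date_column])).sort_index()
    full_index = pd.date_range(frame.index.min(), frame.index.max(), freq="D")
    dense = frame.reindex(full_index)
    dense[list(value_columns)] = dense[list(value_columns)].ffill()
    return dense

# Made-up sparse rows, with the unchanged days already dropped.
sparse = pd.DataFrame({
    "date": ["2020-03-13", "2020-03-17", "2020-03-18", "2020-03-20"],
    "confirmed": [2, 3, 5, 5],
    "deaths": [0, 0, 0, 0],
    "recovered": [0, 0, 0, 1],
})
print(densify(sparse))
```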

gottfriedhelms commented 4 years ago

In any case, my reasoning is that just repeating the same values (as JHU does) is wrong, because it basically interpolates data and messes with statistics (like median, average, etc.) and especially with "deltas" and "speed".

Therefore I decided to drop any rows that don't actually provide any new data.

Yes, this is, I think, the best decision. Only for the timeseries data might it be recommended to "fill the gaps", because the user might not have an easy instrument to do this themselves (I can use MS Access with its "crosstab" function, which can produce the JHU timeseries format instantly).
In general, seeing how the JHU team themselves do not handle the multitude of data issues, I don't think a big private effort is warranted: they seem to break the "fixed" structure and any agreement arbitrarily and erratically, they won't update their earlier documentation, and I expect this will happen again in the future...

Again: applause for your great work here, Ciprian!

cipriancraciun commented 4 years ago

In general, seeing how the JHU team themselves do not handle the multitude of data issues, I don't think a big private effort is warranted: they seem to break the "fixed" structure and any agreement arbitrarily and erratically, they won't update their earlier documentation, and I expect this will happen again in the future...

That is why I also imported the ECDC (for global data) and NY Times (for US data) datasets as alternatives to JHU.

Thus I would strongly advise you to also use the ECDC-derived data. (It uses exactly the same format as the JHU-derived one.)
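For example, since both exports share exactly the same format, switching sources should require nothing more than pointing the loader at a different file; a minimal sketch (with hypothetical placeholder file names, not the actual export paths):

```python
# A minimal sketch (hypothetical file names, not the actual export paths) of
# loading either the JHU-derived or the ECDC-derived TSV with the same code,
# since both exports share the same format.

import csv

def load_dataset(path):
    with open(path, newline="") as stream:
        return list(csv.DictReader(stream, delimiter="\t"))

jhu_rows = load_dataset("jhu-combined.tsv")    # placeholder file name
ecdc_rows = load_dataset("ecdc-combined.tsv")  # placeholder file name
```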