cipriancraciun / covid19-datasets

COVID-19 derived and augmented datasets (based on JHU, NY Times, ECDC) exported as JSON, TSV, SQL, SQLite DB (plus visualizations)
https://scratchpad.volution.ro/ciprian/eedf5eb117ec363ca4f88492b48dbcd3/
25 stars 5 forks source link

Observations on `v1-*.tsv` -- proposal to repair inconsistencies #7

Closed gottfriedhelms closed 4 years ago

gottfriedhelms commented 4 years ago

Hi Ciprian - I've downloaded your nice file and have imported it into Excel and Msaccess.
With the query-tools in msaccess I've looked a bit at consistency of country-/province-issues that I've seen already in the JHU-original datasets. Something like

Have been just peeking into the dataset, no rigorous protocol so far. If you don't think this is of concern for the real use of the dataset I'm fine with it, but I can also try to contribute a more detailed protocol of (possible/guesses) issues.

cipriancraciun commented 4 years ago

Thank you for the interest! One of the main reasons I've created this derived dataset is to cleanup and correct inconsistencies. Therefore any help is welcome.

So, how to handle this:

cipriancraciun commented 4 years ago

BTW, which fields are you looking at? (There are country, province and administrative, which are the "cleaned" ones; but until yesterday I also put country_original, province_original and administrative_original which gave the original values.)

gottfriedhelms commented 4 years ago

Ah, you also appended the original fields contents to your file, I just wanted to propose this. I can at the moment not detail on your conceptual aspects in your first comment, but possibly, to see what issues I've found in the JHU-files from 20-01-22 to 20-03-25 I could send you my corrected file, which contain the country/province in changed an in original form, which also allows to find all my automated changes with some query ("select where field_new <>field_orig" or so) . For reference to the JHU_original history I also prefixed the records in my file with the JHU-filenr and JHU_recnr-in-file and also with a new individual recId. (Or I can send the file with only the records after filtering for differences, but then the logic/reason of the editing might be intransparent in some cases) .

If my take is useful at all I'd propose I do a rigid [country];[province];[administrative] inspection with my msaccess-tools (like that of which I've posted a screenshot with my first comment to you) and either a protocol of the issues found/assumed or just an edited version of your file - and you have then the option to reflect on this file-differences.

cipriancraciun commented 4 years ago

OK, regarding Aruba and Saint Barthelemy, apparently they are both "countries" and "provinces" at the same time, according to Wikipedia:

They are "countries" in the sense that they have international country codes, but at the same time are not independent states.

So, without getting into geo-politics, I'll do what I did with Hong-Kong and the Palestinian states: if a province has an international country code, then I'll treat it as a country. This also helps with the consistency of geographical zones, because in case of Aruba and st. Barthelemy they actually contribute counts towards Americas not towards Europe.

(I'm updating my lookup tables right now, and I'll look for other provinces that might be "countries" in the sense described above.)

gottfriedhelms commented 4 years ago

A possible rationale for the association of "country" or "province" in such ambiguous cases might be, whether they have the health-organization under control of the "motherland" or not, because this might be important for the judging the taken measures by the appropriate governmental health system. But I feel not firm with that perspective at all, I've never delved deeper in this in my active professional research time...

cipriancraciun commented 4 years ago

Or I can send the file with only the records after filtering for differences, but then the logic/reason of the editing might be intransparent in some cases.

That would be helpful. (Please put the file in a Google Drive / Dropbox, anywhere and give me the link.) Also please make it in TSV/CSV/Excel format, as I don't have MS-Access. (I'm using Linux.)


Or what would be even more useful, is if I would upload to this repository a list of location re-mappings, and filter your changes to apply only to this. (I say this, because if you did your work on the original JHU dataset, I might have already solved some issues in there).

gottfriedhelms commented 4 years ago

To your last comment: I've done consistency checks on the JHU-original files, while on the same time restructured to .tsv - file , but my current comments apply to the file I've downloaded from yours (let's call it "cip-file" accordingly). We might talk about JHU, JHU-g, CIP and CIP-g files with "-g" for the by me edited versions.

cipriancraciun commented 4 years ago

OK, then please send me both JHU-g and CIP-g files.

And I'll take a look and apply the CIP-g file, while the JHU-g file I'll keep for reference.

gottfriedhelms commented 4 years ago

For the more thorough preparing of the CIP-g file I need the afternoon or evening. I'll post a comment here then.

cipriancraciun commented 4 years ago

In order to aid you, I have uploaded to the repository a complete list of all the unique locations (across all three datasets, JHU, NY Times and ECDC) and the way I have mapped them:

The primary key should be (key_original, type) or (key_original, dataset).

I agree of ignoring the administrative level entries, thus only select those with type either country or province.

If you use this file, please insert an additional column like country_remapping and province_remapping and only fill it if the value currently in country_normalized or province_normalized is not correct.

(I'll now take a look at the files you've uploaded.)

cipriancraciun commented 4 years ago

I've taken a quick look at JHU-g 0325 and for some reason the CSV/TSV format seems jumbled for some records. (Try to use only tab separated without quoting, or upload an Excel file.)

cipriancraciun commented 4 years ago

@gottfriedhelms You were right about Bavaria, I initially assumed that for Germany other provinces have also been reported. But given that only Bavaria was singled-out, I've now removed it and merged it into Germany.

I have also mapped all the Denmark and Netherlands provinces to their respective countries.

Moving now to inspect UK, France, and Australia.

gottfriedhelms commented 4 years ago

I've taken a quick look at JHU-g 0325 and for some reason the CSV/TSV format seems jumbled for some records. (Try to use only tab separated without quoting, or upload an Excel file.)

Could you please provide the recordnumber (first field in JHU-g 0325) of one of the jumbled cases? I've to inspect my software-code...

cipriancraciun commented 4 years ago

Could you please provide the recordnumber (first field in JHU-g 0325) of one of the jumbled cases? I've to inspect my software-code...

From corona_acc_0325_g.csv, for example: 459, 471 to 476, 529, and others.

cipriancraciun commented 4 years ago

OK, outside of US, only the following provinces have been left non-normalized:

Australia / Australian Capital Territory
Australia / External territories
Australia / Jervis Bay Territory
Australia / (mainland)
Australia / New South Wales
Australia / Northern Territory
Australia / Queensland
Australia / South Australia
Australia / Tasmania
Australia / Victoria
Australia / Western Australia
Canada / Alberta
Canada / British Columbia
Canada / Calgary, Alberta
Canada / Edmonton, Alberta
Canada / London, ON
Canada / (mainland)
Canada / Manitoba
Canada / Montreal, QC
Canada / New Brunswick
Canada / Newfoundland and Labrador
Canada / Northwest Territories
Canada / Nova Scotia
Canada / Ontario
Canada / Prince Edward Island
Canada / Quebec
Canada / Saskatchewan
Canada / Toronto, ON
Canada / Yukon
China / Anhui
China / Beijing
China / Chongqing
China / Fujian
China / Gansu
China / Guangdong
China / Guangxi
China / Guizhou
China / Hainan
China / Hebei
China / Heilongjiang
China / Henan
China / Hubei
China / Hunan
China / Inner Mongolia
China / Jiangsu
China / Jiangxi
China / Jilin
China / Liaoning
China / Ningxia
China / Qinghai
China / Shaanxi
China / Shandong
China / Shanghai
China / Shanxi
China / Sichuan
China / Tianjin
China / Tibet
China / Xinjiang
China / Yunnan
China / Zhejiang

I think none of these are actually independent "countries" as discussed in previous comments.

Thus @gottfriedhelms if you do any comparisons, please use the latest version of the locations-diff.tsv.

gottfriedhelms commented 4 years ago

Could you please provide the recordnumber (first field in JHU-g 0325) of one of the jumbled cases? I've to inspect my software-code...

From corona_acc_0325_g.csv, for example: 459, 471 to 476, 529, and others.

Ah, thanks, just found the problem and its root. It's in my csv-reader routine, mishandling quoted strings in incoming fields. I'll have to update my routine. I'll re_upload my files when ready.
I'll make an excel-2000 version of JHU-g 0325 available on my webspace in some minutes.

gottfriedhelms commented 4 years ago

Thus @gottfriedhelms if you do any comparisons, please use the latest version of the locations-diff.tsv.

Thanks for the reminder... I'll start investigating that file seriously in the evening or tomorrow morning. (Holy quarantine ... :-) )

gottfriedhelms commented 4 years ago

I've taken a quick look at JHU-g 0325 and for some reason the CSV/TSV format seems jumbled for some records. (Try to use only tab separated without quoting, or upload an Excel file.)

An Excel-file JHU-g 0325 is (zipped) on my webspace: JHU-g 0325 excel2000

gottfriedhelms commented 4 years ago

Using your 'combined'-file -v1.tsv of today morning I got only the following handful odd observations:

Maybe these are all wanted and meaningful entries, I just report them by my look&feel ; I find that dataset even with that possible bugs already superb and much over the JHU-originals!

Thank you thousand-fold for your big effort!

cipriancraciun commented 4 years ago

OK, so there are multiple issues here:

cipriancraciun commented 4 years ago

I have added some code that will warn me in future if any province "appears" to be the name of a "country".

(I've also corrected Guam, again US Virgin Islands and another US dependency.)

cipriancraciun commented 4 years ago

@gottfriedhelms If there aren't any more issues with this, I would rather close it tomorrow.

If you find new problems, and I close it, please re-open it.

gottfriedhelms commented 4 years ago

Am 31.03.2020 um 19:35 schrieb Ciprian Dorin Craciun:

@gottfriedhelms https://github.com/gottfriedhelms If there aren't any more issues with this, I would rather close it tomorrow.

If you find new problems, and I close it, please re-open it.

Dear Ciprian -

I didn't find any more issues and think there are no more noteworthy oddities in your dataset.


On my side it was a nice experience to develop such a data-parser myself, some refreshing of my old rusted programming abilities (mainly for statistical software), but since you provide such a much better curated dataset and also are willing to update daily, I don't think I need to continue (aside of the playground with my pattern-script).

Thank you very much for your engagement!

Information is always a resource for the emancipation of the man from ignorance; it should never be kept secret, in bureaucracy or in private stores -

Gottfried Helms