Closed gottfriedhelms closed 4 years ago
Thank you for the interest! One of the main reasons I've created this derived dataset is to cleanup and correct inconsistencies. Therefore any help is welcome.
So, how to handle this:
total-country
, total-region
, total-subregion
, and only for US total-province
; thus I would let alone the province
records mainly for tractability of how the values were computed;BTW, which fields are you looking at? (There are country
, province
and administrative
, which are the "cleaned" ones; but until yesterday I also put country_original
, province_original
and administrative_original
which gave the original values.)
Ah, you also appended the original fields contents to your file, I just wanted to propose this. I can at the moment not detail on your conceptual aspects in your first comment, but possibly, to see what issues I've found in the JHU-files from 20-01-22 to 20-03-25 I could send you my corrected file, which contain the country/province in changed an in original form, which also allows to find all my automated changes with some query ("select where field_new <>field_orig" or so) . For reference to the JHU_original history I also prefixed the records in my file with the JHU-filenr and JHU_recnr-in-file and also with a new individual recId. (Or I can send the file with only the records after filtering for differences, but then the logic/reason of the editing might be intransparent in some cases) .
If my take is useful at all I'd propose I do a rigid [country];[province];[administrative] inspection with my msaccess-tools (like that of which I've posted a screenshot with my first comment to you) and either a protocol of the issues found/assumed or just an edited version of your file - and you have then the option to reflect on this file-differences.
OK, regarding Aruba and Saint Barthelemy, apparently they are both "countries" and "provinces" at the same time, according to Wikipedia:
They are "countries" in the sense that they have international country codes, but at the same time are not independent states.
So, without getting into geo-politics, I'll do what I did with Hong-Kong and the Palestinian states: if a province has an international country code, then I'll treat it as a country. This also helps with the consistency of geographical zones, because in case of Aruba and st. Barthelemy they actually contribute counts towards Americas not towards Europe.
(I'm updating my lookup tables right now, and I'll look for other provinces that might be "countries" in the sense described above.)
A possible rationale for the association of "country" or "province" in such ambiguous cases might be, whether they have the health-organization under control of the "motherland" or not, because this might be important for the judging the taken measures by the appropriate governmental health system. But I feel not firm with that perspective at all, I've never delved deeper in this in my active professional research time...
Or I can send the file with only the records after filtering for differences, but then the logic/reason of the editing might be intransparent in some cases.
That would be helpful. (Please put the file in a Google Drive / Dropbox, anywhere and give me the link.) Also please make it in TSV/CSV/Excel format, as I don't have MS-Access. (I'm using Linux.)
Or what would be even more useful, is if I would upload to this repository a list of location re-mappings, and filter your changes to apply only to this. (I say this, because if you did your work on the original JHU dataset, I might have already solved some issues in there).
To your last comment: I've done consistency checks on the JHU-original files, while on the same time restructured to .tsv - file , but my current comments apply to the file I've downloaded from yours (let's call it "cip-file" accordingly). We might talk about JHU, JHU-g, CIP and CIP-g files with "-g" for the by me edited versions.
OK, then please send me both JHU-g
and CIP-g
files.
And I'll take a look and apply the CIP-g
file, while the JHU-g
file I'll keep for reference.
Here is JHU-g up to 20-03-21 (zipped) JHU-g 0321.
A .tsv-version with edits only, fields reordered for better display JHU_0321_g_edits_only. All records where either [country]<>[orig_country] or [province]<>[orig_province], ordered by [country],[province],[jhu-filenr],[jhu-file-recnr] (output format generated by msaccess-exportfunction) -remark: Checking this again visually some edits might have been not useful/not most sensical and are open to be altered after discussion.
Here is JHU-g up to 20-03-25 (zipped) JHU-g 0325. (I've not attempted to adapt/check the new admin-level US-entries).
Here is the scriptfile for automated editing of the JHU (accumulated) daily files up to 0321 resp 0325 (zipped) script 0325
For the more thorough preparing of the CIP-g file I need the afternoon or evening. I'll post a comment here then.
In order to aid you, I have uploaded to the repository a complete list of all the unique locations (across all three datasets, JHU, NY Times and ECDC) and the way I have mapped them:
The primary key should be (key_original, type)
or (key_original, dataset)
.
I agree of ignoring the administrative level entries, thus only select those with type
either country
or province
.
If you use this file, please insert an additional column like country_remapping
and province_remapping
and only fill it if the value currently in country_normalized
or province_normalized
is not correct.
(I'll now take a look at the files you've uploaded.)
I've taken a quick look at JHU-g 0325
and for some reason the CSV/TSV format seems jumbled for some records. (Try to use only tab separated without quoting, or upload an Excel file.)
@gottfriedhelms You were right about Bavaria, I initially assumed that for Germany other provinces have also been reported. But given that only Bavaria was singled-out, I've now removed it and merged it into Germany.
I have also mapped all the Denmark and Netherlands provinces to their respective countries.
Moving now to inspect UK, France, and Australia.
I've taken a quick look at
JHU-g 0325
and for some reason the CSV/TSV format seems jumbled for some records. (Try to use only tab separated without quoting, or upload an Excel file.)
Could you please provide the recordnumber (first field in JHU-g 0325) of one of the jumbled cases? I've to inspect my software-code...
Could you please provide the recordnumber (first field in JHU-g 0325) of one of the jumbled cases? I've to inspect my software-code...
From corona_acc_0325_g.csv
, for example: 459, 471 to 476, 529, and others.
OK, outside of US, only the following provinces have been left non-normalized:
Australia / Australian Capital Territory
Australia / External territories
Australia / Jervis Bay Territory
Australia / (mainland)
Australia / New South Wales
Australia / Northern Territory
Australia / Queensland
Australia / South Australia
Australia / Tasmania
Australia / Victoria
Australia / Western Australia
Canada / Alberta
Canada / British Columbia
Canada / Calgary, Alberta
Canada / Edmonton, Alberta
Canada / London, ON
Canada / (mainland)
Canada / Manitoba
Canada / Montreal, QC
Canada / New Brunswick
Canada / Newfoundland and Labrador
Canada / Northwest Territories
Canada / Nova Scotia
Canada / Ontario
Canada / Prince Edward Island
Canada / Quebec
Canada / Saskatchewan
Canada / Toronto, ON
Canada / Yukon
China / Anhui
China / Beijing
China / Chongqing
China / Fujian
China / Gansu
China / Guangdong
China / Guangxi
China / Guizhou
China / Hainan
China / Hebei
China / Heilongjiang
China / Henan
China / Hubei
China / Hunan
China / Inner Mongolia
China / Jiangsu
China / Jiangxi
China / Jilin
China / Liaoning
China / Ningxia
China / Qinghai
China / Shaanxi
China / Shandong
China / Shanghai
China / Shanxi
China / Sichuan
China / Tianjin
China / Tibet
China / Xinjiang
China / Yunnan
China / Zhejiang
I think none of these are actually independent "countries" as discussed in previous comments.
Thus @gottfriedhelms if you do any comparisons, please use the latest version of the locations-diff.tsv
.
Could you please provide the recordnumber (first field in JHU-g 0325) of one of the jumbled cases? I've to inspect my software-code...
From
corona_acc_0325_g.csv
, for example: 459, 471 to 476, 529, and others.
Ah, thanks, just found the problem and its root. It's in my csv-reader routine, mishandling quoted strings in incoming fields. I'll have to update my routine. I'll re_upload my files when ready.
I'll make an excel-2000 version of JHU-g 0325 available on my webspace in some minutes.
Thus @gottfriedhelms if you do any comparisons, please use the latest version of the
locations-diff.tsv
.
Thanks for the reminder... I'll start investigating that file seriously in the evening or tomorrow morning. (Holy quarantine ... :-) )
I've taken a quick look at
JHU-g 0325
and for some reason the CSV/TSV format seems jumbled for some records. (Try to use only tab separated without quoting, or upload an Excel file.)
An Excel-file JHU-g 0325 is (zipped) on my webspace: JHU-g 0325 excel2000
Using your 'combined'-file -v1.tsv of today morning I got only the following handful odd observations:
[country]
"United states Virginia Islands" occuring with [loc-type] "country"and "total-country
" and at the same time [country]
"United States" , [province]
"Virgin Island" with type "province" and "total-province"
[country]
"St Barthelemy" has the first "e" as unicode-character (as in the wikipedia, but shown as "Barthélemy"). Is this really better than to give an "e" ? I don't know what usual character-based search-routines do with this?
Same as above with [country]
"Reunion" and [country]
"Curacao"
"Puerto Rico" occurs in [country]
as well as in [country]
"United states" [province]
"Puerto Rico", (but each occurence with an appropriate [loc-type]
) May be this is in order due to your general definitions/intentions
The britsh islands around GB like "Isle of Man" and "Guernsey" are used as [country]
: is this really meaningful in the sight of the use of the field for [province]
in the United-states casees?
in the canadian entries I've earlier seen, that for instance with entries from "ontario" (sub entries "London, ON" and another one ) one of them was initially used for all "Canada" documentation, but since one daily file this documentation changed to show only the whole "Ontario" data, so possibly one of it should be renamed and the other one should be deleted. This concerns only few cases, but might smooth a bit the whole "Ontario" -documentation. Similar with province "Alberta" and its two sub-entries. Didn't check the full set of province documentation so far.
There are datasets with completely empty [country] [province] [administrative]
. I didn't check until now whether their contents could be recreated by inspection of the source-file of JHU (as far as they are JHU-records, didn't see this with my current tool, better tool is only ready today evening)
Maybe these are all wanted and meaningful entries, I just report them by my look&feel ; I find that dataset even with that possible bugs already superb and much over the JHU-originals!
Thank you thousand-fold for your big effort!
OK, so there are multiple issues here:
I've remapped Puerto Rico
and US Virgin Islands
as "countries", as per the previous discussion;
regarding the GB dependencies, as per the same decision (i.e. to use "dependent" provinces as "countries" when they have international country codes), the Island of Man
(and other GB dependencies) are mapped as countries; (thus no action taken here;)
regarding Canada, Australia and China (the only countries besides US that have "provinces" in the dataset), I choose not to meddle with the names given by the JHU team, mainly because I don't have a province dictionary for those countries; (thus no action taken here yet;) (although I do suggest people to ignore all provinces except US, and just use the total-country
records;)
regarding the fields with an empty value for both country
, province
and administrative
-- these are mainly cruise ships that were randomly assigned to a country or another; they have the location_type
of unknown
, and the label
field lists them as Cruise ship
;
Regarding the Saint Barthelemy
, Reunion
, Curacao
the files are in UTF-8
, thus you should just import them as UTF-8
; (most likely you are importing them in a "code page" which yields these issues;) however I've rewritten them to use ASCII only characters;
I have added some code that will warn me in future if any province "appears" to be the name of a "country".
(I've also corrected Guam
, again US Virgin Islands
and another US dependency.)
@gottfriedhelms If there aren't any more issues with this, I would rather close it tomorrow.
If you find new problems, and I close it, please re-open it.
Am 31.03.2020 um 19:35 schrieb Ciprian Dorin Craciun:
@gottfriedhelms https://github.com/gottfriedhelms If there aren't any more issues with this, I would rather close it tomorrow.
If you find new problems, and I close it, please re-open it.
Dear Ciprian -
I didn't find any more issues and think there are no more noteworthy oddities in your dataset.
On my side it was a nice experience to develop such a data-parser myself, some refreshing of my old rusted programming abilities (mainly for statistical software), but since you provide such a much better curated dataset and also are willing to update daily, I don't think I need to continue (aside of the playground with my pattern-script).
Thank you very much for your engagement!
Information is always a resource for the emancipation of the man from ignorance; it should never be kept secret, in bureaucracy or in private stores -
Gottfried Helms
Hi Ciprian - I've downloaded your nice file and have imported it into Excel and Msaccess.
With the query-tools in msaccess I've looked a bit at consistency of country-/province-issues that I've seen already in the JHU-original datasets. Something like
Have been just peeking into the dataset, no rigorous protocol so far. If you don't think this is of concern for the real use of the dataset I'm fine with it, but I can also try to contribute a more detailed protocol of (possible/guesses) issues.