Closed gottfriedhelms closed 4 years ago
Thank you for providing your mapping script. This was one of the main reasons I have created my own scripts to derive the JHU dataset.
At the moment I have already built mappings like this into my own scripts, based on the following workflow:
I have done the same for the US states and US counties.
For the provinces of other countries, I just ignore that label (though I still provide it in the dataset). (At a first look they seem OK, however.)
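The mapping-plus-warning workflow described above could be sketched roughly as follows. This is only an illustration of the idea in Python, not the actual scripts either of us uses, and the alias entries shown are hypothetical examples:

```python
import warnings

# Hypothetical alias table -- illustrative entries only, not the
# actual mappings used in the real scripts.
COUNTRY_ALIASES = {
    "Mainland China": "China",
    "Korea, South": "South Korea",
    "Taiwan*": "Taiwan",
}

# Canonical names are the values of the alias table.
CANONICAL = set(COUNTRY_ALIASES.values())

def canonical_country(name):
    """Map a raw country label to its canonical name.

    Emits a warning when the label is neither a known alias nor
    already canonical, mirroring the "warn when a name can not be
    found" behaviour described above.
    """
    if name in COUNTRY_ALIASES:
        return COUNTRY_ALIASES[name]
    if name not in CANONICAL:
        warnings.warn("unknown country name: %r" % name)
    return name
```

The same pattern would apply for US state and county labels, just with separate alias tables.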
On 28.03.2020 at 20:26, Ciprian Dorin Craciun wrote:
Thank you for providing your mapping script. This was one of the main reasons I have created my own scripts to derive the JHU dataset.
Hi Ciprian -
thanks for your reply. Wow, that was already a lot of work; it looks much deeper and much more reliable than my scripts... It's night here at the moment and I won't continue long for now; tomorrow I'll be working a bit more on my tool. On the other hand, the JSON-based processes that you point me to might eventually be the more flexible solution.
My idea again: can the script be curated through community activity? That would be really good - we could have a half-automated translator/bug-remover/filter for the JHU dataset... I'm curious...
For the moment I very much like the simplicity of my scripts (though it may turn out to be too limited)
I'll look at JHU's and your GitHub spaces more regularly from now on -
see you -
Gottfried Helms
Of course we could easily manage such a lookup table via either a file in this repository, another dedicated repository, a shared Google Spreadsheet, etc.
However, at the moment, after applying only the few transformations I've pointed out above, I now get 0 inconsistencies. (My automated process emits warnings when a country name or US county/state can not be found.)
Therefore at the moment, I think a simple issue on this repository should be enough.
BTW, a similar community-driven approach was discussed on the JHU repository for patching inconsistencies in the data itself:
Given that we've started a new issue on the same topic (#7) I'll close this one and move the activity to the other one.
Due to the immense data inconsistencies in the [country] and/or [province] references, I've made a scripting tool for easily defining corrections to the fields of the JHU daily files. If I see misspellings in the [country] or [province] fields, I can simply add the reference and a correction to the script and re-process the combined JHU file with the updated script into a TSV file (readable directly by e.g. MS Access or Excel).
"commands" in the scriptfile are simple lists:
The basic idea came from my own needs, so it is incorporated into my translation tool from JHU files into (readable) TSV files (with simpler quoting rules for string fields).
So far my reformulating script is based on observations of typos/errors/mislocations up to the JHU 03-25 CSV file. If someone is interested in using this, I'll make it available for everyone - it is a Windows/Delphi32 application, and I think of it as a free tool.
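Since the tool itself is a Delphi application and the real command syntax is in the attached scriptfile, here is only a rough Python sketch of the underlying idea - a list-style correction table applied while converting the combined JHU CSV into a TSV. The field names and correction entries below are hypothetical, not taken from the actual scriptfile:

```python
import csv
import io

# Each "command" is a simple list: (field, wrong value, corrected value).
# Hypothetical example entries, not the real scriptfile contents.
CORRECTIONS = [
    ("country", "UK", "United Kingdom"),
    ("country", "Mainland China", "China"),
]

def recode(csv_text):
    """Apply the correction list to JHU-style CSV text and emit TSV."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    for row in rows:
        for field, wrong, right in CORRECTIONS:
            if row.get(field) == wrong:
                row[field] = right
    # TSV output side-steps the CSV quoting rules for string fields,
    # so the result loads directly into MS Access or Excel.
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=list(rows[0].keys()),
                            delimiter="\t")
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()
```

When a new misspelling shows up in a daily file, only the correction list grows; the conversion itself stays unchanged.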
After I have the scripting tool so far, I've more ideas how to evolve, but I am interested in exchange with possible users (and of course have to overcome the experimental- and alpha phase...).
The cool idea with this is, that the script can be refined in a collaborative manner (as well as I can expand the script-language and -concept as needed).
An inspection tool (in MS Access) helps to find/locate/correct inconsistencies, whose resolution can then be incorporated into the script. See for instance my desktop at the moment, where I inspect the data check for the Canada entries and the day-to-day changes in province naming. The province names are already adapted by the script, but we see that the use of "Alberta", "Calgary, Alberta" and "Edmonton, Alberta" is inconsistent (the same with "Ontario" and so on). To formulate a new script command to resolve this, it helps that I appended the original field contents (before correction) to each record. The data field "Filenr" refers to the daily JHU file and gives the sorting order, and together with the "Last update" information it helps to identify duplicates.
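The Canada case above - collapsing "Calgary, Alberta" and "Edmonton, Alberta" onto plain "Alberta" while keeping the original value for inspection - could be sketched like this. Again a hypothetical Python illustration of the idea; the real resolution rules live in the scriptfile:

```python
def normalize_province(province):
    """Collapse a 'City, Province' label to the province part.

    Hypothetical helper for the Canada case: "Calgary, Alberta" and
    "Edmonton, Alberta" both become "Alberta"; plain "Alberta" is
    returned unchanged.
    """
    if "," in province:
        province = province.split(",")[-1].strip()
    return province

def recode_province(raw):
    """Return (corrected, original) so the original field contents
    stay appended to the record for later inspection."""
    return normalize_province(raw), raw
```

Keeping the pre-correction value alongside the corrected one is what makes it possible to spot which daily files introduced a naming change.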
At the end of this post I've attached the current state of the script. For better readability, all comments may be removed (everything from "//" to the end of the line can be deleted).
I'm new to GitHub and don't know the good methods of communication here. You can always use my email: helms (at) uni-kassel.de
Current scriptfile: recode_seqfile_script.txt