Closed gottfriedhelms closed 4 years ago
Thank you for providing your mapping script. This was one of the main reasons I have created my own scripts to derive the JHU dataset.
At the moment I have already built mappings like this into my own scripts, based on the following workflow:
I have done the same for the US states and US counties.
For the provinces of other countries, I just ignore that label (though I still provide it in the dataset). (At a first look they seem OK, however.)
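The mapping-plus-warning workflow described above could be sketched roughly as follows. This is only an illustration of the idea in Python, not the actual scripts either of us uses, and the alias entries shown are hypothetical examples:

```python
import warnings

# Hypothetical alias table -- illustrative entries only, not the
# actual mappings used in the real scripts.
COUNTRY_ALIASES = {
    "Mainland China": "China",
    "Korea, South": "South Korea",
    "Taiwan*": "Taiwan",
}

# Canonical names are the values of the alias table.
CANONICAL = set(COUNTRY_ALIASES.values())

def canonical_country(name):
    """Map a raw country label to its canonical name.

    Emits a warning when the label is neither a known alias nor
    already canonical, mirroring the "warn when a name can not be
    found" behaviour described above.
    """
    if name in COUNTRY_ALIASES:
        return COUNTRY_ALIASES[name]
    if name not in CANONICAL:
        warnings.warn("unknown country name: %r" % name)
    return name
```

The same pattern would apply for US state and county labels, just with separate alias tables.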
On 28.03.2020 at 20:26, Ciprian Dorin Craciun wrote:
Thank you for providing your mapping script. This was one of the main reasons I have created my own scripts to derive the JHU dataset.
Hi Ciprian -
thanks for your reply. Wow, that was already a lot of work; it looks much deeper and much more reliable than my scripts... It's night here at the moment and I won't continue long for now; tomorrow I'll be working a bit more on my tool. On the other hand, the JSON-based processes that you point me to might eventually be the more flexible solution.
My idea again: can the script be curated through community activity? That would be really good - we could have a half-automated translator/bug-remover/filter for the JHU dataset... I'm curious...
For the moment I very much like the simplicity of my scripts (though it may turn out to be too limited)
I'll look at JHU's and your GitHub spaces more regularly from now on -
see you -
Gottfried Helms
Of course we could easily manage such a lookup table via either a file in this repository, another dedicated repository, a shared Google Spreadsheet, etc.
However, at the moment, after applying only the few transformations I've pointed out above, I now get 0 inconsistencies. (My automated process emits warnings when a country name or US county/state can not be found.)
Therefore at the moment, I think a simple issue on this repository should be enough.
BTW, a similar community-driven approach was discussed on the JHU repository for patching inconsistencies in the data itself:
Given that we've started a new issue on the same topic (#7) I'll close this one and move the activity to the other one.
Due to the immense data inconsistencies in the [country] and/or [province] references, I've made a scripting tool for easily defining corrections to the fields of the JHU daily files. If I see misspellings in the [country] or [province] fields, I can simply add the reference and a correction to the script and re-process the combined JHU file with the updated script into a TSV file (readable directly by e.g. MS Access or Excel).
"commands" in the scriptfile are simple lists:
The basic idea came from my own needs, so it is incorporated into my translation tool from JHU files into (readable) TSV files (with simpler quoting rules for string fields).
So far my reformulating script is based on observations of typos/errors/mislocations up to the JHU 03-25 CSV file. If someone is interested in using this, I'll make it available for everyone - it is a Windows/Delphi32 application, and I think of it as a free tool.
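Since the tool itself is a Delphi application and the real command syntax is in the attached scriptfile, here is only a rough Python sketch of the underlying idea - a list-style correction table applied while converting the combined JHU CSV into a TSV. The field names and correction entries below are hypothetical, not taken from the actual scriptfile:

```python
import csv
import io

# Each "command" is a simple list: (field, wrong value, corrected value).
# Hypothetical example entries, not the real scriptfile contents.
CORRECTIONS = [
    ("country", "UK", "United Kingdom"),
    ("country", "Mainland China", "China"),
]

def recode(csv_text):
    """Apply the correction list to JHU-style CSV text and emit TSV."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    for row in rows:
        for field, wrong, right in CORRECTIONS:
            if row.get(field) == wrong:
                row[field] = right
    # TSV output side-steps the CSV quoting rules for string fields,
    # so the result loads directly into MS Access or Excel.
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=list(rows[0].keys()),
                            delimiter="\t")
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()
```

When a new misspelling shows up in a daily file, only the correction list grows; the conversion itself stays unchanged.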
After I have the scripting tool so far, I've more ideas how to evolve, but I am interested in exchange with possible users (and of course have to overcome the experimental- and alpha phase...).
The cool idea with this is, that the script can be refined in a collaborative manner (as well as I can expand the script-language and -concept as needed).
An inspection tool (in MS Access) helps to find/locate/correct inconsistencies, whose resolution can then be incorporated into the script. See for instance my desktop at the moment, where I inspect the data check for the Canada entries and the day-to-day changes in province naming. The province names are already adapted by the script, but we see that the use of "Alberta", "Calgary, Alberta" and "Edmonton, Alberta" is inconsistent (the same with "Ontario" and so on). To formulate a new script command to resolve this, it helps that I appended the original field contents (before correction) to each record. The data field "Filenr" refers to the daily JHU file and gives the sorting order, and together with the "Last update" information it helps to identify duplicates.
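The Canada case above - collapsing "Calgary, Alberta" and "Edmonton, Alberta" onto plain "Alberta" while keeping the original value for inspection - could be sketched like this. Again a hypothetical Python illustration of the idea; the real resolution rules live in the scriptfile:

```python
def normalize_province(province):
    """Collapse a 'City, Province' label to the province part.

    Hypothetical helper for the Canada case: "Calgary, Alberta" and
    "Edmonton, Alberta" both become "Alberta"; plain "Alberta" is
    returned unchanged.
    """
    if "," in province:
        province = province.split(",")[-1].strip()
    return province

def recode_province(raw):
    """Return (corrected, original) so the original field contents
    stay appended to the record for later inspection."""
    return normalize_province(raw), raw
```

Keeping the pre-correction value alongside the corrected one is what makes it possible to spot which daily files introduced a naming change.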
At the end of this post I've attached the current state of the script. For better readability, all comments may be removed (everything from "//" to the end of the line can be deleted).
I'm new to GitHub and don't know the good methods of communication here. You can always use my email: helms (at) uni-kassel.de
Current scriptfile: recode_seqfile_script.txt