I have been iterating on and releasing a free app for Splunk Enterprise (see links at the end).
Along the way I've encountered a number of issues in the data. They are very easy to fix on the fly in the app, but I thought I'd pass them along so that they could potentially be fixed in the underlying daily reports, and other people wouldn't have to do the same fixes/normalizations.
The March 23 report lists French Polynesia as having 19,874 Confirmed cases and 860 Deaths. As this is very much an outlier, it seems to be a mistake.
If there is no breakdown by Province/State, then the Province/State simply repeats the Country/Region value, and this is quite reasonable. However, this convention is broken for Austria, Iraq and Lebanon, which all list a Province/State of "None".
In the earlier phase, US data was broken down with Province/State values that were counties. It would be a nice convenience if you could use a script to go back and propagate these values over to the new county field "Admin2". In Canada, incidentally, the same thing could be done with the values ending in ", QC", ", Alberta", and ", Ontario", all of which are present in the daily files.
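To illustrate the kind of script I mean, here is a minimal Python sketch (not what the Splunk app itself does) that splits an old-style "County, ST" value into a county and a state. The `STATE_NAMES` lookup, the function name, and the choice to drop the word "County" from the result are all assumptions made for illustration; a real script would need the full abbreviation table.

```python
import re

# Hypothetical lookup; only a few abbreviations are shown for illustration.
STATE_NAMES = {"OR": "Oregon", "WA": "Washington", "QC": "Quebec"}

# Matches "<place>, <region>" and tolerates stray leading/trailing spaces.
COUNTY_PATTERN = re.compile(r"^\s*(?P<place>[^,]+?)\s*,\s*(?P<region>[^,]+?)\s*$")


def split_county(province_state):
    """Return (admin2, province_state) for values like 'Jackson County, OR '."""
    match = COUNTY_PATTERN.match(province_state)
    if match is None:
        # No comma: treat the value as a plain province/state name.
        return "", province_state.strip()
    # Assumption: Admin2 should hold the bare name without the word "County".
    admin2 = re.sub(r"\s+County$", "", match.group("place"))
    region = match.group("region")
    # Expand known abbreviations; leave full names such as "Alberta" alone.
    return admin2, STATE_NAMES.get(region, region)


print(split_county("Jackson County, OR "))  # ('Jackson', 'Oregon')
print(split_county(" Montreal, QC"))        # ('Montreal', 'Quebec')
print(split_county("Calgary, Alberta"))     # ('Calgary', 'Alberta')
```

Note that values such as "Washington, D.C." or "Virgin Islands, U.S." also contain a comma, so they would need to be normalized before a split like this is applied.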
For the Province_State of "Jackson County, OR", there is a troublesome trailing space character on all values.
Similarly, the Province_State of "Montreal, QC" has a leading space on all values.
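The whitespace problems, along with the "None" placeholder mentioned earlier, are simple to clean up at read time. A minimal Python sketch, assuming the two fields arrive as plain strings (the function name `clean_province` is made up for illustration):

```python
def clean_province(province, country):
    """Trim stray whitespace and apply the 'repeat the country' convention."""
    province = province.strip()
    # "None" is the placeholder seen for Austria, Iraq and Lebanon.
    if province == "None":
        return country.strip()
    return province


print(repr(clean_province("Jackson County, OR ", "US")))  # 'Jackson County, OR'
print(repr(clean_province(" Montreal, QC", "Canada")))    # 'Montreal, QC'
print(repr(clean_province("None", "Austria")))            # 'Austria'
```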
On March 08 you have both a "Washington, D.C." and a "District of Columbia". The rest of the data only has "District of Columbia".
For the US Virgin Islands, you have three different entries, and different days have different ones. The three distinct values are: "United States Virgin Islands", "Virgin Islands, U.S." and "Virgin Islands".
On March 14, French Guiana is misspelled as "Fench Guiana".
China is split between Country_Region="China" and Country_Region="Mainland China". Most days have either one or the other, but some days have both (Feb 23 and Mar 8).
Hong Kong appears in three ways. First as Country_Region="Hong Kong" and Province_State="Hong Kong", until March 10 when it becomes Country_Region="Hong Kong SAR" and Province_State="Hong Kong"; then on March 11 it becomes Country_Region="China" and Province_State="Hong Kong". Which of the latter two is correct is perhaps a tricky political question, but it would be better if they were consistent. Macau SAR had the same issue at one point, but that seems to have been fixed.
South Korea appears in three ways: "Korea, South" until March 10, and then "South Korea" and "Republic of Korea" (see March 10, and March 11 through March 30).
The United Kingdom appears with Country/Region values of both "UK" and "United Kingdom", being "UK" until March 11, when it switches over to "United Kingdom".
Vietnam appears as both "Vietnam" and "Viet Nam" (see March 10).
Taiwan appears as both "Taiwan" and "Taiwan*". It switches to the latter on March 10.
Russia is listed as "Russia" except on March 10, when it is "Russian Federation".
Iran is listed as "Iran" except on March 10, when it is "Iran (Islamic Republic of)".
Moldova is listed as "Moldova" except on March 10, when it is "Republic of Moldova".
Ireland is listed as "Ireland" except on March 8, when it is "Republic of Ireland".
There's a Country/Region value of "North Ireland", but only on Feb 28.
Czech Republic is listed as "Czech Republic" until March 10, when it becomes "Czechia".
Vatican City is listed as "Holy See" except on March 06, when it is "Vatican City".
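For anyone who has to work around the naming variants above in the meantime, a single alias table covers most of them. Here is a minimal Python sketch; the choice of which spelling is "canonical" is my own assumption, not something the data dictates, and the names `ALIASES` and `normalize_name` are made up for illustration.

```python
# Variant spelling -> canonical name. The canonical choices are assumptions.
ALIASES = {
    "Mainland China": "China",
    "Korea, South": "South Korea",
    "Republic of Korea": "South Korea",
    "UK": "United Kingdom",
    "Viet Nam": "Vietnam",
    "Taiwan*": "Taiwan",
    "Russian Federation": "Russia",
    "Iran (Islamic Republic of)": "Iran",
    "Republic of Moldova": "Moldova",
    "Republic of Ireland": "Ireland",
    "Czech Republic": "Czechia",
    "Vatican City": "Holy See",
    "Fench Guiana": "French Guiana",
    "Washington, D.C.": "District of Columbia",
    "United States Virgin Islands": "Virgin Islands",
    "Virgin Islands, U.S.": "Virgin Islands",
}


def normalize_name(value):
    """Trim whitespace, then map known variants to one canonical spelling."""
    value = value.strip()
    # Names without a known alias pass through unchanged.
    return ALIASES.get(value, value)


print(normalize_name("Mainland China"))       # China
print(normalize_name(" Republic of Korea "))  # South Korea
print(normalize_name("France"))               # France
```

Hong Kong and "North Ireland" are deliberately left out of the table, since the right mapping for those depends on a decision about how the rows should be grouped rather than on a spelling fix.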
@mealy i'm a daily user myself. you might look at the file fixups.sed in my repository, in case there are some other useful tidbits to massage the daily reports. cheers.
LINKS: If you already have Splunk, the app takes just a few minutes to install from Splunkbase: https://splunkbase.splunk.com/app/4925/
You can also get it as a tar.gz from our site: https://sideviewapps.com/apps/covid19-reporting/
And here are our release notes: https://sideviewapps.com/apps/covid19-reporting/release-notes/