GoogleCloudPlatform / covid-19-open-data

Datasets of daily time-series data related to COVID-19 for over 20,000 distinct locations around the world.
Apache License 2.0

Some entries in index.csv appear to be incorrect #156

Closed winwiz1 closed 4 years ago

winwiz1 commented 4 years ago

Hi,

I assume a Level 2 or Level 3 entry in index.csv refers to a subregion2_name or locality_name respectively - with both being located within a state/province L1 area denoted by subregion1_name. This led me to presume that the index literal for L2 and L3 entries should be of the form L0_L1_L2/3.

I tried to verify this assumption, but in the README the links to the schema documentation and the data loading tutorial under the "Notes about the data" heading are broken. Is the assumption correct?

The following L2 and L3 indexes don't comply with the L0_L1_L2/3 format:

LY_NQ
LY_SR
LY_ST
UA_KBP
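
The check described above can be sketched in a few lines. This is only an illustration under stated assumptions: the column names `key` and `aggregation_level`, the `XX` placeholder rows, and the level assigned to `LY_NQ` are hypothetical and not taken from the real index.csv.

```python
import csv
import io

# Hypothetical excerpt of index.csv. The column names and the XX_* rows are
# assumptions for illustration; LY_NQ is assumed to be a level 2 entry.
SAMPLE_INDEX = """key,aggregation_level
XX,0
XX_AA,1
XX_AA_001,2
LY_NQ,2
UA_KBP,3
"""


def nonconforming_keys(index_csv):
    """Return keys of level 2/3 entries that lack the three-part L0_L1_L2/3 form."""
    bad = []
    for row in csv.DictReader(io.StringIO(index_csv)):
        level = int(row["aggregation_level"])
        parts = row["key"].split("_")
        # A level 2 or 3 key is expected to have exactly three components.
        if level in (2, 3) and len(parts) != 3:
            bad.append(row["key"])
    return bad


print(nonconforming_keys(SAMPLE_INDEX))  # ['LY_NQ', 'UA_KBP']
```

Counting underscore-separated components is a deliberately crude test; it flags candidates for manual review rather than proving that an entry is wrong.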

owahltinez commented 4 years ago

Thanks for reporting this! You are mentioning multiple issues here, let me try to answer them one at a time.

> I assume a Level 2 or Level 3 entry in index.csv refers to a subregion2_name or locality_name respectively - with both being located within a state/province L1 area denoted by subregion1_name. This led me to presume that the index literal for L2 and L3 entries should be of the form L0_L1_L2/3.

This is almost right. The one caveat is that L3 is a "special cases" level: most of the time it refers to a city, which is likely to be located within an L1 or L2 region, but that may not always be the case.

> I tried to verify this assumption, but in the README the links to the schema documentation and the data loading tutorial under the "Notes about the data" heading are broken. Is the assumption correct?

We will fix the links. The link to the schema should point to this: https://github.com/GoogleCloudPlatform/covid-19-open-data/blob/main/docs/table-index.md.

> The following L2 and L3 indexes don't comply with the L0_L1_L2/3 format:

The Libya subregions are indeed a bug; we will fix those. Unfortunately, UA_KBP appears to be a special case, since it's a city but it's reported as admin level 1: https://en.wikipedia.org/wiki/ISO_3166-2:UA.

owahltinez commented 4 years ago

The pull request I'm about to submit will close this issue, but feel free to reopen if you have any more questions.

winwiz1 commented 4 years ago

Thank you for the clarification and for the fixes.

> Unfortunately, UA_KBP appears to be a special case since it's a city but it's reported as admin level 1: https://en.wikipedia.org/wiki/ISO_3166-2:UA.

The ISO standard you referred to places the city at level 1, as mentioned above. This is correctly reflected in index.csv by the index UA_30 having aggregation_level=1. The index UA_KBP seems to refer to the same city, with the same wikidata=Q1899, which additionally places the city at level 3 and leads to entries with duplicated case counts in epidemiology.csv:

2020-09-09,UA_30,310,4,84,,15821,250,4910,
2020-09-09,UA_KBP,310,4,84,,15821,250,4910,
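
Duplicates like the two rows above can be spotted with a short script. This is a rough sketch that assumes only what the rows themselves show: the first field is a date, the second is the location key, and the rest are the reported values. The function name is made up for illustration.

```python
import csv
import io
from collections import defaultdict

# The two epidemiology.csv rows quoted above, without a header line.
SAMPLE_ROWS = """2020-09-09,UA_30,310,4,84,,15821,250,4910,
2020-09-09,UA_KBP,310,4,84,,15821,250,4910,
"""


def keys_with_identical_counts(rows_csv):
    """Group keys whose rows carry the same date and the same values."""
    groups = defaultdict(list)
    for row in csv.reader(io.StringIO(rows_csv)):
        if not row:
            continue  # skip blank lines
        date, key, *values = row
        groups[(date, tuple(values))].append(key)
    # Keep only the groups in which more than one key shares the values.
    return [keys for keys in groups.values() if len(keys) > 1]


print(keys_with_identical_counts(SAMPLE_ROWS))  # [['UA_30', 'UA_KBP']]
```

Two unrelated locations could report identical numbers by coincidence, so a match here is a hint to compare the wikidata IDs in index.csv, not proof of duplication.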

owahltinez commented 4 years ago

That's correct, UA_30 is equivalent to UA_KBP. We're trying to indicate that UA_KBP is a city, whereas UA_30 is an admin level 1 region.

They are both the same, but if we omitted UA_KBP it would be hard to find for someone who is looking for cities. As it stands, you can search for aggregation_level=3 and find cities from all around the world, regardless of whether they are descendants of levels 0, 1 or 2.
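
A minimal sketch of that lookup, plus a way for consumers to avoid double counting, might look like this. The column names `key`, `aggregation_level` and `wikidata` are assumptions about index.csv, the helper names are hypothetical, and the sample contains only the two entries discussed in this thread.

```python
import csv
import io

# Hypothetical excerpt of index.csv holding the two entries discussed above.
SAMPLE_INDEX = """key,aggregation_level,wikidata
UA_30,1,Q1899
UA_KBP,3,Q1899
"""


def keys_at_level(index_csv, level):
    """Return every key whose aggregation_level equals the given level."""
    return [
        row["key"]
        for row in csv.DictReader(io.StringIO(index_csv))
        if int(row["aggregation_level"]) == level
    ]


def same_entity_keys(index_csv, key):
    """Return the other keys that share a wikidata ID with the given key."""
    rows = list(csv.DictReader(io.StringIO(index_csv)))
    target = next((r["wikidata"] for r in rows if r["key"] == key), "")
    if not target:
        return []  # unknown key or missing wikidata ID: nothing to match
    return [r["key"] for r in rows if r["wikidata"] == target and r["key"] != key]


print(keys_at_level(SAMPLE_INDEX, 3))            # ['UA_KBP']
print(same_entity_keys(SAMPLE_INDEX, "UA_KBP"))  # ['UA_30']
```

Anyone summing case counts across levels could use the second helper to drop one of the equivalent keys before aggregating.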