UK Information is incorrect

GoogleCloudPlatform / covid-19-open-data

Datasets of daily time-series data related to COVID-19 for over 20,000 distinct locations around the world.

Apache License 2.0

UK Information is incorrect #418

Open jamesrswift opened 3 years ago

jamesrswift commented 3 years ago

19/02/2021	United Kingdom	calculated total	4,105,399 (+1,447)	no data	no data	120,363 (+98)

Actual data for UK:

19/02/2021	United Kingdom	calculated total	4,095,269 (+12,027)	no data	no data	119,920 (+553)

jamesrswift commented 3 years ago

The rest of the data is also incorrect, however, I included only the most recent for brevity.

winwiz1 commented 3 years ago

The count in question 4,105,399 (+1,447) is taken from virusquery.com output:

Date	Country	State/Province	Confirmed	Recovered	Active	Deaths
2021-02-19	United Kingdom	England	3596965 (+1232)	no data	no data	106170 (+95)
2021-02-19	United Kingdom	Northern Ireland	110440 (+0)	no data	no data	2027 (+1)
2021-02-19	United Kingdom	Scotland	196642 (+214)	no data	no data	6945 (+1)
2021-02-19	United Kingdom	Wales	201352 (+1)	no data	no data	5221 (+1)
2021-02-19	United Kingdom	calculated total	4105399 (+1447)	no data	no data	120363 (+98)
2021-02-19	United Kingdom	total	4103952 (+0)	no data	no data	120363 (+98)
2021-02-18	United Kingdom	England	3595733	no data	no data	106075
2021-02-18	United Kingdom	Northern Ireland	110440	no data	no data	2026
2021-02-18	United Kingdom	Scotland	196428	no data	no data	6944
2021-02-18	United Kingdom	Wales	201351	no data	no data	5220
2021-02-18	United Kingdom	calculated total	4103952	no data	no data	120265
2021-02-18	United Kingdom	total	4103952	no data	no data	120265

'Calculated total' means the count has been calculated as the running sum of the State/Province counts shown. 'Total' means the country-wide total was taken from the dataset.
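As a minimal sketch of that definition, the 2021-02-19 'calculated total' row can be reproduced from the four State/Province rows in the table above. This reads "running sum" as a plain column-wise sum, which is an assumption; the function name `calculated_total` is made up for illustration and is not virusquery.com code.

```python
# State/Province rows for 2021-02-19, copied from the table above:
# (confirmed, new confirmed, deaths, new deaths)
rows = {
    "England": (3596965, 1232, 106170, 95),
    "Northern Ireland": (110440, 0, 2027, 1),
    "Scotland": (196642, 214, 6945, 1),
    "Wales": (201352, 1, 5221, 1),
}


def calculated_total(rows):
    """Sum each column across the State/Province rows (hypothetical helper)."""
    return tuple(sum(column) for column in zip(*rows.values()))


confirmed, new_confirmed, deaths, new_deaths = calculated_total(rows)
print(f"calculated total: {confirmed} (+{new_confirmed}) cases, {deaths} (+{new_deaths}) deaths")
```

The result matches the 'calculated total' row, while the 'total' row for the same date still carries the previous day's confirmed count.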

themonk911 commented 3 years ago

@winwiz1 From today's https://virusquery.com/ it looks like the UK-wide total is updated later than the per-nation totals, e.g. 2021-02-19 now agrees (but 2021-02-20 does not).

[Screenshot, 2021-02-22 at 15:36:40]

winwiz1 commented 3 years ago

@themonk911 Yes, this is what seems to be happening.

winwiz1 commented 3 years ago

May I suggest replacing cumCasesBySpecimenDate with cumCasesByPublishDate in the Public Health England API calls? The former is highly volatile (a later API call can, and frequently will, return a different result from an earlier call for the same date and nation), whereas the latter is stable and is what various publications appear to refer to. The replacement will yield the calculated count of 4,095,269 cases dated 19 Feb that OP referred to. A similar replacement applies to the new cases metric.

Would be good to get the previously collected cumCasesBySpecimenDate data replaced as well.

themonk911 commented 3 years ago

From https://coronavirus.data.gov.uk/details/developers-guide, they claim to not support publish date for regions, only for nations.

Some metrics are not available for specific areaType values. For instance, we have newCasesByPublishDate and cumCasesByPublishDate only available for areaType=nation but not for region,utla, or ltla. Conversely, we have newCasesBySpecimenDate and cumCasesBySpecimenDate available for region, utla, and ltla but not for nation.

So there will be a mismatch between region/nation, but given it appears to resolve within a day, I'm not convinced this is a huge issue.

winwiz1 commented 3 years ago

As of now the page https://coronavirus.data.gov.uk/details/cases shows:

At the top:

People tested positive
Total
4,126,150

At the bottom:

Cases by area (whole pandemic)
United Kingdom 4,126,150

This latest count is already quoted by hundreds of sources: https://www.google.com/search?q=UK+COVID+cases+%224%2C126%2C150%22+site%3Auk and this is what users of the dataset expect to see. It comes from newCasesByPublishDate. Switching to this metric also resolves the issue of the missing count of 4,095,269 cases for 19 Feb, which OP correctly pointed to as the expected count.

As for the availability of this data for various levels, the bottom heading "Cases by area (whole pandemic)" shows counts at different levels and my initial assumption would be that all this data is consistent e.g. comes from one API metric.

You can compare data for the cumCasesBySpecimenDate and cumCasesByPublishDate using link: https://api.coronavirus.data.gov.uk/v2/data?areaType=nation&metric=cumCasesByPublishDateRate&metric=cumCasesBySpecimenDate&format=csv&release=2021-02-22

date,areaType,areaCode,areaName,cumCasesByPublishDate,cumCasesBySpecimenDate

"2021-02-22",nation,N92000002,Northern Ireland,111166,
"2021-02-21",nation,N92000002,Northern Ireland,110979,111166
"2021-02-20",nation,N92000002,Northern Ireland,110716,110985
"2021-02-19",nation,N92000002,Northern Ireland,110440,110747
"2021-02-18",nation,N92000002,Northern Ireland,110127,110437
"2021-02-17",nation,N92000002,Northern Ireland,109785,110144
"2021-02-16",nation,N92000002,Northern Ireland,109488,109800
"2021-02-15",nation,N92000002,Northern Ireland,109147,109476

"2021-02-22",nation,W92000004,Wales,202007,
"2021-02-21",nation,W92000004,Wales,201688,202007
"2021-02-20",nation,W92000004,Wales,201352,202006
"2021-02-19",nation,W92000004,Wales,200989,201858
"2021-02-18",nation,W92000004,Wales,200456,201592
"2021-02-17",nation,W92000004,Wales,200166,201256
"2021-02-16",nation,W92000004,Wales,199793,200884
"2021-02-15",nation,W92000004,Wales,199518,200444

"2021-02-22",nation,S92000003,Scotland,198184,
"2021-02-21",nation,S92000003,Scotland,197469,198183
"2021-02-20",nation,S92000003,Scotland,196642,197971
"2021-02-19",nation,S92000003,Scotland,195839,197372
"2021-02-18",nation,S92000003,Scotland,194954,196504
"2021-02-17",nation,S92000003,Scotland,194269,195629
"2021-02-16",nation,S92000003,Scotland,193148,194798
"2021-02-15",nation,S92000003,Scotland,192375,193889

"2021-02-22",nation,E92000001,England,3614793,
"2021-02-21",nation,E92000001,England,3605373,3614793
"2021-02-20",nation,E92000001,England,3596965,3613045
"2021-02-19",nation,E92000001,England,3588001,3607093
"2021-02-18",nation,E92000001,England,3577705,3598559
"2021-02-17",nation,E92000001,England,3566965,3588797
"2021-02-16",nation,E92000001,England,3556039,3578869
"2021-02-15",nation,E92000001,England,3546803,3568069

The difference is substantial, does not resolve within a day, and depends on the date specified as part of the API call: cumCasesBySpecimenDate data is volatile and past values do change, in contrast to cumCasesByPublishDate, which produces stable counts.
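To put numbers on that comparison, here is a small sketch that parses an excerpt of the CSV above (the 2021-02-19 row for each nation, as returned with release=2021-02-22) and reports the gap between the two metrics. The helper name `compare_metrics` is hypothetical, and the specimen-date values would likely differ under another release date.

```python
import csv
import io

# Excerpt of the CSV above: the 2021-02-19 row for each nation.
EXCERPT = """date,areaType,areaCode,areaName,cumCasesByPublishDate,cumCasesBySpecimenDate
"2021-02-19",nation,N92000002,Northern Ireland,110440,110747
"2021-02-19",nation,W92000004,Wales,200989,201858
"2021-02-19",nation,S92000003,Scotland,195839,197372
"2021-02-19",nation,E92000001,England,3588001,3607093
"""


def compare_metrics(csv_text):
    """Return {areaName: (publish, specimen, difference)} for rows with both values."""
    result = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        publish = row["cumCasesByPublishDate"]
        specimen = row["cumCasesBySpecimenDate"]
        if not publish or not specimen:
            continue  # e.g. the newest date has no specimen-date value yet
        result[row["areaName"]] = (int(publish), int(specimen), int(specimen) - int(publish))
    return result


per_nation = compare_metrics(EXCERPT)
uk_publish = sum(publish for publish, _, _ in per_nation.values())
uk_specimen = sum(specimen for _, specimen, _ in per_nation.values())

for name, (publish, specimen, diff) in per_nation.items():
    print(f"{name}: {publish} vs {specimen} ({diff:+d}, {100 * diff / publish:.2f}%)")
print(f"UK: {uk_publish} vs {uk_specimen}")
```

Summing the publish-date column gives the 4,095,269 figure quoted by OP for 19 Feb.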

themonk911 commented 3 years ago

I see. The problem at the moment is that we'd like to be consistent between the different levels. cumCasesByPublishDate is not available for L3 and Regions data (e.g. https://api.coronavirus.data.gov.uk/v2/data?areaType=region&metric=cumCasesByPublishDate&metric=cumCasesBySpecimenDate&format=csv&release=2021-02-22 shows a blank column for cumCasesByPublishDate).

At one point we were using PublishDate for nations + UK, and SpecimenDate for the rest (L3 + regions), but then you have a different set of inconsistencies than we currently have. I'm not sure whether one is really better than the other, and we're limited by the data available to us. @owahltinez not sure whether you have thoughts on this matter?

themonk911 commented 3 years ago

correction: regions data has cumCasesByPublishDate for only the most recent date.

winwiz1 commented 3 years ago

cumCasesByPublishDate is not available for L3 and Regions data

Right, so there are issues with L3 and Regions as far as switching to cumCasesByPublishDate is concerned. I assume regions are NUTS regions. The issues seem to be solvable.

correction: regions data has cumCasesByPublishDate for only the most recent date.

To be more precise, an API call for a region will return the cumCasesByPublishDate data for the single date specified as a part of the API call URL. So getting both the latest and historical data is possible, however the latter requires multiple API calls (one call per calendar date) subject to throttling. Since cumCasesByPublishDate counts are stable, getting historical data can be done once.

Issue with L3. If I understand correctly, currently there is only one L3 geographical entity in UK: London. The API call https://api.coronavirus.data.gov.uk/v2/data?areaType=region&areaCode=E12000007&metric=cumCasesByPublishDate&format=csv&release=2021-02-22 returns data:

date,areaType,areaCode,areaName,cumCasesByPublishDate
"2021-02-22",region,E12000007,London,691393

Multiple calls like that can be used as described above to get London historical data.
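The one-call-per-date pattern described above could look roughly like the sketch below. The query parameters mirror the London URL in this comment; the helper names `release_urls` and `fetch_all` and the one-second pause are assumptions made for illustration, not documented API limits, and the endpoint may no longer respond.

```python
import time
import urllib.request
from datetime import date, timedelta
from urllib.parse import urlencode

BASE_URL = "https://api.coronavirus.data.gov.uk/v2/data"


def release_urls(area_type, area_code, first, last, metric="cumCasesByPublishDate"):
    """Build one URL per calendar date, mirroring the London call above."""
    urls = []
    day = first
    while day <= last:
        query = urlencode({
            "areaType": area_type,
            "areaCode": area_code,
            "metric": metric,
            "format": "csv",
            "release": day.isoformat(),
        })
        urls.append(f"{BASE_URL}?{query}")
        day += timedelta(days=1)
    return urls


def fetch_all(urls, pause_seconds=1.0):
    """Fetch each URL in turn, sleeping between calls to respect throttling.

    The pause length is a guess, not a documented limit.
    """
    bodies = []
    for url in urls:
        with urllib.request.urlopen(url) as response:
            bodies.append(response.read().decode("utf-8"))
        time.sleep(pause_seconds)
    return bodies


# Build (but do not fetch) the URLs for three consecutive release dates.
london = release_urls("region", "E12000007", date(2021, 2, 20), date(2021, 2, 22))
print("\n".join(london))
```

Because the publish-date counts are described as stable, the results of `fetch_all` could be cached after a single backfill and only the newest date requested afterwards.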

Issue with NUTS regions.

Regions data (e.g. https://api.coronavirus.data.gov.uk/v2/data?areaType=region&metric=cumCasesByPublishDate&metric=cumCasesBySpecimenDate&format=csv&release=2021-02-22) shows a blank column for cumCasesByPublishDate.

The list of regions returned includes London and cumCasesByPublishDate data can be queried for each region like for London using the same technique.

The issue, however, is that those regions seem to be different from NUTS regions (including those in index.csv). Glancing over the list, I can see that some index.csv entries such as nuts/UKC11 Hartlepool and nuts/UKD41 Blackburn with Darwen are not among the regions returned by this API. The regions returned by this API can be found on the UK government page under Area name when Area type is set to 'Region'. The locations named similarly to NUTS regions (I don't know if those have the same boundaries as NUTS regions), e.g. Blackburn with Darwen, are utla and can be found on the same page under Area name when Area type is set to 'Upper Tier Local Authority'.

Again, for utla like Blackburn with Darwen the same multiple API call technique can be used - provided you are sure utla has the same boundaries as the corresponding NUTS region:

https://api.coronavirus.data.gov.uk/v2/data?areaType=utla&areaCode=E06000008&metric=cumCasesByPublishDate&format=csv&release=2020-12-12
Output:
date,areaType,areaCode,areaName,cumCasesByPublishDate
"2020-12-12",utla,E06000008,Blackburn with Darwen,10313

Root cause of the issue with NUTS regions.

On the UK government page there is a link to the document Hierarchical Representation of UK Statistical Geographies (December 2020). It tells us there are eight UK Statistical Geographies as of December 2020, each with its own hierarchy.

It would be reasonable to assume Public Health England focuses on the Health Geography, its hierarchy and ensures the API supports it. Whereas NUTS regions belong to Eurostat Geography. It would appear blending both geographies into one dataset was based on best intentions to accommodate a request from a researcher but it created issues down the track since PHE caters mostly for Health Geography.

owahltinez commented 3 years ago

@owahltinez not sure whether you have thoughts on this matter?

I don't have a very strong preference here. I recall going back and forth about this and eventually settling on our current metric; it was an informed decision, not an arbitrary one. It seems that the difference between the two API calls for the larger regions is <1%, so either way it wouldn't be the end of the world.

If users expect an exact count and the inconsistency across the different levels is such a small difference, I wouldn't oppose changing the metric used to match what is being reported elsewhere (while keeping the more reliable metric for smaller subregions).

That said, @themonk911 is the local expert and has been working the longest with this data so I'll defer to their decision.

It would be reasonable to assume Public Health England focuses on the Health Geography, its hierarchy and ensures the API supports it. Whereas NUTS regions belong to Eurostat Geography. It would appear blending both geographies into one dataset was based on best intentions to accommodate a request from a researcher but it created issues down the track since PHE caters mostly for Health Geography.

Blending data from multiple datasets is the core value-proposition of our project. We harmonize geographical locations as much as we can so the data from different sources can be merged seamlessly. Sometimes a few regions are present in one system but not another. The UK has actually been the most challenging to work with because of the many different ways there are to divide the country into smaller admin regions.

To the best of my knowledge, the NUTS regions from the UK that we report data for have identical boundaries. Using the same example of Blackburn with Darwen, you can see in the Wikidata page that it has multiple identifiers associated with it — one is NUTS and another UTLA.

In some cases we only have the name of a region to go by, for example the Google Mobility Reports. So the matching of regions is not an exact science but we find it close enough to be useful — although the mobility reports recently started publishing an identifier we can use to disambiguate, so this will be a smaller problem in the future.

winwiz1 commented 3 years ago

Blending data from multiple datasets is the core value-proposition of our project.

Sure. However the value derived from a particular blending depends on factors like correctness and completeness of the implementation. Once all that is factored in, the value needs to be balanced against the cost of functional regressions it caused if any.

I understand the implementation of NUTS regions initially caused undesirable inconsistency (concurrent use of both metrics) and later contributed to switching to cumCasesBySpecimenDate only to avoid mixing it up with cumCasesByPublishDate. In my view it comes at the cost of causing user confusion and making the data not quite meaningful to general public (as far as UK is concerned). The statistical research that uses cumCasesByPublishDate data would be affected as well. Hopefully this can be rectified by using the suggested API call pattern (e.g. one call per calendar date) and by switching uniformly to cumCasesByPublishDate.

As a side note, much of the research value of cumCasesBySpecimenDate lies in letting a researcher see how the data for this metric changed over time for the same calendar date (it gives insight into the updates made on subsequent days, reflecting specimen processing delays etc.). This is what PHE API provides by allowing to query not only the initial value but also the later updates. Even though the dataset doesn't have this capacity, I still think it would be a good idea to store the cumCasesBySpecimenDate data.

I was thinking about the L3 level mentioned here. Depending on the country, it could be a cluster of housing estates, a major hospital complex, a metropolis like London, or have no meaning at all; please correct me if I'm wrong. Maybe in this spirit the dataset could additionally have a ‘total/cumulative auxiliary case count’ that would store SpecimenDate for UK and something else or nothing for other countries.

To the best of my knowledge, the NUTS regions from the UK that we report data for have identical boundaries. Using the same example of Blackburn with Darwen, you can see in the Wikidata page that it has multiple identifiers associated with it — one is NUTS and another UTLA.

Correct for this particular NUTS level 3 region named after a single local authority. Looking at the Wiki page we can see in the table that names of some NUTS 3 regions include more than one local authority. For example, the first region is UKC11 “Hartlepool and Stockton-on-Tees”. Searching index.csv for UKC11 and then getting epidemiology data for this region yields case counts identical to what an API call (based on the latest date and cumCasesBySpecimenDate metric) returns for Hartlepool. The data for Stockton-on-Tees appears to be missing. Other NUTS 3 regions with the word ‘and’ in their names would have the same issue because the implementation doesn’t map each NUTS 3 region to one or more local authorities.

There are 174 NUTS 3 regions in UK. The dataset contains index entries for 49 level 3 regions. Data for each region can be collected either by direct API call (in cases when there is one-to-one match between a region and a local authority) or by summing up the counts provided by the relevant local authorities.
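The summing approach could be sketched as follows. The mapping table is hypothetical (the dataset has no such table, and only the UKC11 and UKD41 names come from this thread), and the counts are placeholders for illustration, not real case numbers.

```python
# Hypothetical mapping from a NUTS 3 code to the local authorities it covers.
NUTS3_TO_AUTHORITIES = {
    "UKC11": ["Hartlepool", "Stockton-on-Tees"],  # several authorities: sum them
    "UKD41": ["Blackburn with Darwen"],           # one-to-one match
}


def nuts3_count(nuts_code, counts_by_authority):
    """Sum the counts of every local authority mapped to a NUTS 3 region.

    Returns None when any authority is missing, rather than a partial total.
    """
    authorities = NUTS3_TO_AUTHORITIES[nuts_code]
    if any(name not in counts_by_authority for name in authorities):
        return None
    return sum(counts_by_authority[name] for name in authorities)


# Placeholder counts for illustration only, not real case numbers.
example_counts = {"Hartlepool": 100, "Stockton-on-Tees": 250}

print(nuts3_count("UKC11", example_counts))
print(nuts3_count("UKD41", example_counts))
```

Returning None for an incomplete region is a design choice in this sketch: it avoids silently reporting one authority's count as the whole region, which is the Hartlepool symptom described above.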

It looks like the UK government renamed NUTS regions to ITLs, so this area could be looked at some time in the future, in contrast to fixing the metrics, which is a more urgent issue.

owahltinez commented 3 years ago

In my view it comes at the cost of causing user confusion and making the data not quite meaningful to general public (as far as UK is concerned). The statistical research that uses cumCasesByPublishDate data would be affected as well.

It may be confusing if you're expecting the counts to exactly match other sources, but since the difference is <1% I don't think it will affect research significantly. We determined that the currently used metric was more accurate and consistent across aggregation levels, but based on your feedback we are evaluating switching to the metric that matches other data sources.

This is what PHE API provides by allowing to query not only the initial value but also the later updates. Even though the dataset doesn't have this capacity, I still think it would be a good idea to store the cumCasesBySpecimenDate data.

If I understand what you are saying correctly, this is technically possible to do with our dataset but sadly very difficult. You can get a "snapshot" of what our dataset looked like at any arbitrary point in time by accessing the object versioning of the file. I hope to make some time in the future to provide step-by-step examples of how to do this...

Maybe in this spirit the dataset could additionally have a ‘total/cumulative auxiliary case count’ that would store SpecimenDate for UK and something else or nothing for other countries.

This is not a bad idea, but it would incur a huge penalty in the total file size since it would be an empty column for nearly all rows. I would much rather choose one metric or the other.

For example, the first region is UKC11 “Hartlepool and Stockton-on-Tees”. Searching index.csv for UKC11 and then getting epidemiology data for this region yields case counts identical to what an API call (based on the latest date and cumCasesBySpecimenDate metric) returns for Hartlepool. The data for Stockton-on-Tees appears to be missing. Other NUTS 3 regions with the word ‘and’ in their names would have the same issue because the implementation doesn’t map each NUTS 3 region to one or more local authorities.

This sounds like a bug in our mapping of regions from the NUTS 3 system to ours. This is not done automatically based on the region name, so a region having the word 'and' is not the root cause (although it probably makes it more likely that we got confused when mapping the regions). In this case, fortunately, it's only a datacommons ID issue since the mapping is actually blank.

There are 174 NUTS 3 regions in UK. The dataset contains index entries for 49 level 3 regions. Data for each region can be collected either by direct API call (in cases when there is one-to-one match between a region and a local authority) or by summing up the counts provided by the relevant local authorities.

The choice of which UK regions to use for data reporting was made based on what epidemiological data was available nearly a year ago. It seems now many more regions are covered so we would probably make different choices today. I believe there is an ongoing effort to include more of the newly available regions in our dataset as part of the "catch-all" aggregation level 3, but I don't know what the timeline of that is nor if they correspond to the NUTS3 admin breakdown or something else entirely.