epiforecasts / covidregionaldata

An interface to subnational and national level COVID-19 data. For all countries supported, this includes a daily time-series of cases. Wherever available we also provide data on deaths, hospitalisations, and tests. National level data is also supported using a range of data sources as well as linelist data and links to intervention data sets.
https://epiforecasts.io/covidregionaldata/

Duplicate rows for some authorities in UK regional data #71

Closed TimTaylor closed 4 years ago

TimTaylor commented 4 years ago

library(covidregionaldata)
library(dplyr)

uk_regional <- get_regional_data("UK", include_level_2_regions = TRUE)
#>  |======================================================================| 100%

uk_regional[duplicated(uk_regional), ] %>% 
  group_by(authority) %>% 
  tally()

#> # A tibble: 4 x 2
#>   authority                 n
#>   <chr>                 <int>
#> 1 Dumfries and Galloway   230
#> 2 Fife                    230
#> 3 Highland                230
#> 4 Powys                   230

seabbs commented 4 years ago

Thanks for this @tjtnew,

Did you have any luck determining whether this is a feature of the data or the result of an issue in our processing?

TimTaylor commented 4 years ago

I have not looked at this since raising the issue.

rboyes commented 4 years ago

From my inspection, this looks like a processing issue inside get_authority_lookup_table. I think that, for example, upper_tier_auth and ni_auth contain the same region2 information, so when their rows are combined that information appears twice.

authority_lookup_table <- get_authority_lookup_table()

authority_lookup_table %>% group_by(region_level_2) %>% tally() %>% filter(n > 1)
# A tibble: 15 x 2
   region_level_2                           n
   <chr>                                <int>
 1 Antrim and Newtownabbey                  2
 2 Ards and North Down                      2
 3 Armagh City, Banbridge and Craigavon     2
 4 Belfast                                  2
 5 Causeway Coast and Glens                 2
 6 Derry City and Strabane                  2
 7 Dumfries and Galloway                    2
 8 Fermanagh and Omagh                      2
 9 Fife                                     2
10 Highland                                 2
11 Lisburn and Castlereagh                  2
12 Mid Ulster                               2
13 Mid and East Antrim                      2
14 Newry, Mourne and Down                   2
15 Powys                                    2
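
As a rough illustration of how this kind of duplication can arise, here is a minimal sketch using toy data. The two tibbles only borrow the names upper_tier_auth and ni_auth from the description above; their contents ("Region A", "X01", and so on) are made up for the example and are not the package's real lookup data.

```r
library(dplyr)

# Hypothetical stand-ins for two source tables that overlap on one region.
# All codes and names here are invented for illustration.
upper_tier_auth <- tibble(
  level_2_region_code = c("X01", "X02"),
  region_level_2      = c("Region A", "Region B"),
  level_1_region_code = c("L1", NA)
)
ni_auth <- tibble(
  level_2_region_code = "X02",
  region_level_2      = "Region B",
  level_1_region_code = "L2"
)

# Stacking the tables keeps both copies of the overlapping region.
combined <- bind_rows(upper_tier_auth, ni_auth)

# Same check as above: which regions appear more than once?
dupes <- combined %>%
  count(region_level_2) %>%
  filter(n > 1)
```

In this toy case dupes contains a single row for "Region B", because that region is present in both inputs.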

This duplication can be removed with:

authority_lookup_table <- authority_lookup_table %>%
  dplyr::arrange(level_1_region_code) %>%
  dplyr::distinct(level_2_region_code, region_level_2, .keep_all = TRUE)

Note that the arrange step is required: it sorts rows with an NA level_1_region_code to the bottom, so the distinct step, which keeps the first row for each combination, removes them.
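
To show why the ordering matters, here is a small sketch on a made-up lookup table (the codes and region names are hypothetical, only the column names come from the snippet above). It compares distinct on its own with arrange followed by distinct.

```r
library(dplyr)

# Toy lookup table: "Region B" appears twice, and the copy with an NA
# level_1_region_code happens to come first. Values are invented.
lookup <- tibble(
  level_2_region_code = c("X02", "X01", "X02"),
  region_level_2      = c("Region B", "Region A", "Region B"),
  level_1_region_code = c(NA, "L1", "L2")
)

# distinct() keeps the first row per key, so here the NA copy survives.
without_arrange <- lookup %>%
  distinct(level_2_region_code, region_level_2, .keep_all = TRUE)

# arrange() always places NA values last, so the complete row is kept.
with_arrange <- lookup %>%
  arrange(level_1_region_code) %>%
  distinct(level_2_region_code, region_level_2, .keep_all = TRUE)
```

Both results have one row per region, but only with_arrange is free of NA values in level_1_region_code.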

Happy to submit this as a pull request if you like; please let me know.

kathsherratt commented 4 years ago

Thanks very much for looking at this and identifying exactly where the error is, @rboyes. I hadn't checked back on this code in a while, so it is great to have someone else look at it. This should now be fixed in master, using your code (plus a very slightly cleaner lookup process).

kathsherratt commented 4 years ago

@rboyes - I should have mentioned this earlier: we will add you as a package contributor (#83) unless you let us know otherwise. Thanks again.