UK data processing issue at level 2

epiforecasts / covidregionaldata

An interface to subnational and national level COVID-19 data. For all countries supported, this includes a daily time-series of cases. Wherever available we also provide data on deaths, hospitalisations, and tests. National level data is also supported using a range of data sources as well as linelist data and links to intervention data sets.

https://epiforecasts.io/covidregionaldata/


UK data processing issue at level 2 #288

Closed seabbs closed 3 years ago

seabbs commented 3 years ago

Although the tests are passing, updating the README produces an empty plot and a very slow download with the following code. This suggests a potential issue that needs investigation.

We can also explore data for level 2 regions (here Upper-tier local authorities),

uk_nots_2 <- get_regional_data(country = "UK", level = "2", verbose = FALSE)
uk_nots_2

now as an example we can plot cases in the East Midlands,

uk_nots_2 %>%
  filter(region %in% c("East Midlands")) %>%
  ggplot() +
  aes(x = date, y = cases_new, col = authority) +
  geom_line(alpha = 0.4) +
  labs(x = "Date", y = "Reported Covid-19 cases") +
  scale_y_continuous(labels = comma) +
  theme_minimal() +
  theme(legend.position = "top") +
  guides(col = guide_legend(title = "Authority"))

Level 2 data is only available for some countries, see get_available_datasets for supported nations.

seabbs commented 3 years ago

Looking into this, it appears there has been a breaking change in the data source:

r$> unlist(uk_nots_2[1,])                                                            
                        date                       region 
                     "18291"              "East Midlands" 
                  iso_3166_2                    authority 
                 "E12000004"                      "Derby" 
             ons_region_code                    cases_new 
                 "E06000015"                           NA 
                 cases_total                   deaths_new 
                          NA                           NA 
                deaths_total                recovered_new 
                          NA                           NA 
             recovered_total                     hosp_new 
                          NA                           NA 
                  hosp_total                   tested_new 
                          NA                           NA 
                tested_total                     areaType 
                          NA                           NA 
       cumCasesByPublishDate       cumCasesBySpecimenDate 
                          NA                           NA 
       newCasesByPublishDate       newCasesBySpecimenDate 
                          NA                           NA 
  cumDeaths28DaysByDeathDate cumDeaths28DaysByPublishDate 
                          NA                           NA 
  newDeaths28DaysByDeathDate newDeaths28DaysByPublishDate 
                          NA                           NA

This clearly indicates a gap in the tests: all data entries being NA should really be detected as an error! We should probably test whether there is some data in either the new or total columns for cases or deaths. If there is none, then throw an error.
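A minimal sketch of such a check, using base R only. The helper name `check_has_data` and its default column list are hypothetical (the column names are taken from the output above); this is an illustration under those assumptions, not the package's actual implementation.

```r
# Hypothetical helper: error if every value in the key count columns is NA.
check_has_data <- function(dt,
                           cols = c("cases_new", "cases_total",
                                    "deaths_new", "deaths_total")) {
  # Only check the columns that are actually present in the data
  present <- intersect(cols, names(dt))
  if (length(present) == 0) {
    stop("None of the expected columns are present: ",
         paste(cols, collapse = ", "))
  }
  # TRUE if at least one non-missing value exists in any checked column
  has_data <- any(vapply(
    present,
    function(col) any(!is.na(dt[[col]])),
    logical(1)
  ))
  if (!has_data) {
    stop("All values are NA in: ", paste(present, collapse = ", "))
  }
  invisible(dt)
}

# Toy data for illustration only; the values are made up
ok <- data.frame(cases_new = c(NA, 3), deaths_new = c(NA, NA))
bad <- data.frame(cases_new = c(NA, NA), deaths_new = c(NA, NA))

check_has_data(ok)                               # passes silently
res <- try(check_has_data(bad), silent = TRUE)   # captures the error
inherits(res, "try-error")
```

Checking across several columns rather than only `cases_new` follows the suggestion above that data in either cases or deaths should be enough to pass.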

joseph-palmer commented 3 years ago

I may have misunderstood the problem here, but when I run

uk <- get_regional_data(country = "UK", level = "2", verbose = FALSE)
uk %>%
  group_by(authority) %>%
  summarise(not_na = length(which(!is.na(cases_new)))) %>%
  filter(not_na < 1)

This returns a tibble with no rows, indicating that there are no authorities with only NAs, so I can't see that the data is coming in differently.
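As an illustration of why this check and the all-NA first row shown earlier need not conflict, here is a small base R sketch on made-up data (the data frame `toy` is hypothetical): a single row can be NA while every authority still has some non-NA values.

```r
# Toy data for illustration only; the values are made up
toy <- data.frame(
  authority = c("A", "A", "B", "B"),
  cases_new = c(NA, 5, 2, NA)
)

# The first row on its own looks empty ...
is.na(toy$cases_new[1])

# ... but counting non-NA values per authority shows each one has data
not_na <- tapply(!is.na(toy$cases_new), toy$authority, sum)

# Authorities with no non-NA values at all (none in this toy example)
names(not_na)[not_na < 1]
```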

I can get the plot to work by removing scale_y_continuous(labels = comma) +. With this line I get the error "Error: Breaks and labels are different lengths" and a blank plot; once it is removed, the plot renders fine. It looks like the highest number of new cases on a single day is 951.

seabbs commented 3 years ago

Hmm, how odd. Maybe it is fine then and I just imagined it/am an idiot 😆. I think it would be good to add a test, if we don't already have one, checking that the data is not all NA.

something like:

expect_true(
  nrow(dt %>% filter(!is.na(cases_new))) > 0
)

joseph-palmer commented 3 years ago

Cool, I will create a PR for the test.