epiforecasts / covidregionaldata

An interface to subnational and national level COVID-19 data. For all countries supported, this includes a daily time-series of cases. Wherever available we also provide data on deaths, hospitalisations, and tests. National level data is also supported using a range of data sources as well as linelist data and links to intervention data sets.
https://epiforecasts.io/covidregionaldata/
Other
37 stars 18 forks source link

Don't rely on exact column ordering when grouped data is involved #445

Closed DavisVaughan closed 2 years ago

DavisVaughan commented 2 years ago

We are planning on releasing tidyr 1.2.0 towards the end of this month.

We noticed in revdeps that this package was broken. An easy way to reproduce is to install the dev version of tidyr and run your tests:

      > test_check("covidregionaldata")
      ══ Failed tests ════════════════════════════════════════════════════════════════
      ── Failure (test-processing.R:102:3): complete_cumulative_columns works ────────
      colnames(actual_data) (`actual`) not equal to colnames(expected_data) (`expected`).

          actual                | expected                 
      [1] "level_1_region"      - "date"                [1]
      [2] "date"                - "level_1_region"      [2]
      [3] "level_1_region_code" | "level_1_region_code" [3]
      [4] "cases"               | "cases"               [4]
      [5] "cases_total"         | "cases_total"         [5]

      [ FAIL 1 | WARN 3 | SKIP 0 | PASS 269 ]
      Error: Test failures
      Execution halted

The problem comes down to the fact that you were expecting exactly the same column ordering between these two results. This has changed in the new version of tidyr. The problem really comes down to the fact that tidyr::complete() now works correctly for grouped data frames (it was a little broken before). One of the changes is that it previously would shift the columns you are completing on to the front of the result, but now it also shifts the grouping columns that you are completing "within" to the front as well. This results in the most intuitive order to read the resulting completed data frame in.

This PR just sorts the column names before checking equality.

This should work on both the current and development version of tidyr, so you should be able to go ahead and do a patch release. We would greatly appreciate if you could do this so we can release tidyr! Thank you!

DavisVaughan commented 2 years ago

We've also made a fix to complete() to ensure that it always works as expected with grouped data frames. One of the results of this fix is that you can no longer supply group variables to complete() (if you have a grouped data frame, complete() will work "within" each group so you shouldn't have access to them). See https://github.com/tidyverse/tidyr/pull/1300 for more details.

This seems to have also broken a few tests. In particular it seems related to https://github.com/epiforecasts/covidregionaldata/blob/b93e19dcb1e39b67828c3e4b6038cbf2b158ef1d/R/processing.R#L60 At some point in your tests you pass this a grouped data frame with level_1_region as a group column, and then you try and complete on that column too, which is not allowed / well-defined. It is possible that all you really wanted was to group by the level_ columns and then call complete(date = full_seq(data$date, period = 1)) (i.e. within each level group, complete the full range of dates from the original data set), but I am not entirely sure.

If you could take a look at that one too, I'd appreciate it. I spent a bit of time trying to figure out the best place to ungroup, but it was a little tough for me to know where the best place would be.