epiforecasts / covidregionaldata

An interface to subnational and national level COVID-19 data. For all supported countries, this includes a daily time series of cases. Wherever available, we also provide data on deaths, hospitalisations, and tests. National level data is available from a range of data sources, along with linelist data and links to intervention data sets.
https://epiforecasts.io/covidregionaldata/
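For orientation, a minimal usage sketch of the package's documented entry points, `get_regional_data()` and `get_national_data()` (not part of this PR; the country chosen is just an example):

```r
library(covidregionaldata)

# daily time series of cases (plus deaths etc. where available) by region
italy <- get_regional_data(country = "Italy")

# national level data from one of the supported sources
national <- get_national_data()
```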

More additional tests on data in regional data #312

Open RichardMN opened 3 years ago

RichardMN commented 3 years ago

Implementation of more tests from #302, intended to be merged into #307

This PR adds the following checks (a rough sketch follows):

- the number of regions at level 2 is at least the number of regions at level 1
- there is at most 1 level 1 region coded to NA
- there is at most 1 level 2 region coded to NA
- the number of level 1 regions is identical in downloads of level 1 and level 2 data
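A sketch of what these checks might look like in testthat. The column names (`level_1_region`, `level_1_region_code`, ...) and the Italy example are illustrative assumptions, not necessarily what this PR's code uses:

```r
library(testthat)
library(covidregionaldata)

# level 1 and level 2 downloads for one country (Italy is just an example);
# localise = FALSE keeps the generic level_1_region / level_2_region columns
data_l1 <- get_regional_data(country = "Italy", level = "1", localise = FALSE)
data_l2 <- get_regional_data(country = "Italy", level = "2", localise = FALSE)

test_that("regional structure of the downloads is sane", {
  # distinct (code, name) pairs at each level
  regions_l1 <- unique(data_l1[, c("level_1_region_code", "level_1_region")])
  regions_l2 <- unique(data_l2[, c("level_2_region_code", "level_2_region")])

  # at least as many level 2 regions as level 1 regions
  expect_gte(nrow(regions_l2), nrow(regions_l1))

  # at most one region name coded to NA at each level
  expect_lte(sum(is.na(regions_l1$level_1_region)), 1)
  expect_lte(sum(is.na(regions_l2$level_2_region)), 1)

  # the number of level 1 regions matches across the two downloads
  expect_equal(
    length(unique(data_l1$level_1_region)),
    length(unique(data_l2$level_1_region))
  )
})
```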

This is a rough implementation.

It currently uses a stand-alone wrapper to apply purrr, because the checks run only once per country class but need to be told the maximum level available for that class. Notionally, this could allow us to go beyond level 2, but the code doesn't have that recursive flexibility yet. It may be possible to merge the wrapper with the existing file. I've removed the download option because these tests only make sense with download = TRUE.
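Roughly the shape of that wrapper, as I understand it; the `max_levels` values and the `check_region_structure()` helper are hypothetical placeholders, not this PR's actual code:

```r
library(covidregionaldata)
library(purrr)

# maximum level available for each country class (illustrative values)
max_levels <- list(italy = 2, uk = 2, cuba = 1)

# iwalk() passes each list element and its name to the function,
# so the checks run once per country class, for every supported level
iwalk(max_levels, function(max_level, country) {
  for (level in seq_len(max_level)) {
    dat <- get_regional_data(
      country = country, level = as.character(level), localise = FALSE
    )
    # hypothetical helper running the structural checks sketched above
    check_region_structure(dat, level)
  }
})
```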

The files, and possibly the functions, should probably be renamed.

It calls get_regional_data, and I think it should, since this is the "front end" for most users and there could still be glitches between the data class and the output of get_regional_data.
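To illustrate the two paths that could drift apart (a hedged sketch; the R6 workflow below follows the package's documented class design, not this PR's test code):

```r
library(covidregionaldata)

# via the user-facing front end, which the tests exercise
via_front_end <- get_regional_data(country = "Italy", level = "1")

# versus driving the underlying R6 data class directly, step by step
italy <- Italy$new(level = "1")
italy$download()
italy$clean()
italy$process()
direct <- italy$return()
```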

It's not quick.

RichardMN commented 3 years ago

It's now hard to see from the comparison against the older version of the branch what this changes, but the changes are fairly minor. I think it's probably useful to have roughly three separate strands of tests (not all of which need be framed as tests):

  1. unit tests - does the code do what we expect it to do under various (mostly canned) inputs, working mostly at small unit levels (we have this)
  2. data availability tests - does our system successfully extract "some data" from the sources (we have this and run it nightly and I like the dashboard)
  3. data sanity checks - does the data our system pulls make some sort of sense? Has there been some change to the underlying source that doesn't break the data availability test but means the information we provide isn't reliable, whether for new data, for older data, or for both?

I think this PR and #307 are both trying to provide something like 3, and that it's useful to have. It's sort of a canary.
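As an example of the sort of canary I have in mind (a sketch only; the 14-day freshness threshold and the cases_total column name are assumptions for illustration):

```r
library(testthat)
library(covidregionaldata)

test_that("downloaded data still looks believable", {
  dat <- get_regional_data(country = "Italy")

  # the source is still updating: newest date within the last fortnight
  expect_gte(max(dat$date, na.rm = TRUE), Sys.Date() - 14)

  # cumulative case counts are never negative
  expect_true(all(dat$cases_total >= 0, na.rm = TRUE))
})
```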

It's not necessary, and we've done without it for a long time. I can tinker with my PR and am happy to try to work with @joseph-palmer on #307, but I'm also content if the feeling right now is that this should be set aside in favour of other efforts.

github-actions[bot] commented 3 years ago

This PR has been flagged as stale due to lack of activity.