epiforecasts / covidregionaldata

An interface to subnational and national level COVID-19 data. For all countries supported, this includes a daily time series of cases. Wherever available we also provide data on deaths, hospitalisations, and tests. National-level data is supported from a range of data sources, along with linelist data and links to intervention data sets.
https://epiforecasts.io/covidregionaldata/

Extend data processing #322

Closed joseph-palmer closed 3 years ago

joseph-palmer commented 3 years ago

This PR makes data processing more modular by allowing users to 'switch off' individual functions used during processing. The clearest use case is a user who wants to keep negative values rather than have them replaced with zeros. This PR also supports turning off other functions (calculate_columns_from_existing_data, for example), but doing so may cause errors.

This works by the user passing a list of processing options (process_options) at initialisation. This must be a list of functions, e.g. process_options = list("set_negative_values_to_zero" = FALSE). At initialisation, format_process_options takes these arguments, checks them, and updates a list of known functions, dropping those set to FALSE. During processing, only the function names that remain TRUE are called, so steps marked FALSE are skipped.
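As a rough illustration, the filtering described above could look something like the sketch below (the default list and the exact checks are assumptions for illustration, not the package's internals):

# illustrative defaults: all known processing steps enabled
known_fns <- list(
  "calculate_columns_from_existing_data" = TRUE,
  "set_negative_values_to_zero" = TRUE
)

# merge user-supplied process_options into the defaults and
# keep only the names of steps that are still switched on
format_process_options <- function(process_options, known_fns) {
  stopifnot(is.list(process_options))
  unknown <- setdiff(names(process_options), names(known_fns))
  if (length(unknown) > 0) {
    stop("Unknown processing options: ", paste(unknown, collapse = ", "))
  }
  known_fns[names(process_options)] <- process_options
  names(known_fns)[unlist(known_fns)]
}

format_process_options(list("set_negative_values_to_zero" = FALSE), known_fns)
#> [1] "calculate_columns_from_existing_data"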

Switching off processing functions at the minute only really works for 'set_negative_values_to_zero'; switching off others causes errors because what they add gets used further down the line. Nevertheless, this method is scalable (and we can remove the others from the options that can be changed) should we add any other processing steps that a user may want to play with.

TODO

closes #317

github-actions[bot] commented 3 years ago

👋 Thanks for opening this pull request! Can you please run through the following checklist before requesting review (ticking each item as complete or not relevant).

Thank you again for the contribution. If making large scale changes consider using our pre-commit hooks (see the contributing guide) to more easily comply with our guidelines.

seabbs commented 3 years ago

So I was thinking here we would have a core set of options we always run and then an optional list you can alter?

So in my mind DataClass would have a field like

process_fns = c(replace_zero_with_na)

which can get changed at initialisation or in a certain class.

Then, similar to how you have things implemented, we would have:

process_internal <- function(process_fns) {
  # clean is assumed to come from the class (e.g. self$clean)
  processed <- default_fns(clean)
  processed <- optional_fns(processed, process_fns = process_fns)
  return(processed)
}

Where optional_fns is:

optional_fns <- function(processed, process_fns) {
  # apply each user-supplied processing function in turn
  for (i in seq_along(process_fns)) {
    processed <- do.call(process_fns[[i]], list(processed))
  }
  return(processed)
}
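For illustration only, calling the optional_fns sketch above with a placeholder data frame and a single user-supplied step might look like this (the data and the body of replace_zero_with_na here are stand-ins, not package code):

# placeholder data and optional step, purely for illustration
clean <- data.frame(cases_new = c(-2, 0, 5))
replace_zero_with_na <- function(df) {
  df$cases_new[df$cases_new == 0] <- NA
  df
}
processed <- optional_fns(clean, process_fns = c(replace_zero_with_na))
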
joseph-palmer commented 3 years ago

I have changed the way this works and the tests. It now has a field in DataClass which stores a list of functions passed in at initialisation, with set_negative_values_to_zero as a default item in the list. Users can pass any function that operates on the data, and these are applied after the default functions have run.
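For example, usage might look something like the sketch below; the UK$new() call and exactly where process_fns is passed are my guesses at the eventual interface, not confirmed syntax:

# hypothetical: add a custom step that runs after the default functions
drop_missing_dates <- function(data) data[!is.na(data$date), ]
uk <- UK$new(level = "1", process_fns = list(drop_missing_dates))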

Need to think about where they go with grouping the data; at the moment they are applied to the grouped data, but this can be modified. I will look over the process_internal function and see if I can reformat it to work faster and be more verbose.

seabbs commented 3 years ago

There were some merge conflicts not showing as merge conflicts so I just jumped in to fix them.

In doing so I saw a few edge cases that were a little strange (I don't think you could empty the default of process_fns?) and expanded testing a little. @joseph-palmer can you check and see if you are happy?