epiforecasts / covidregionaldata

An interface to subnational and national level COVID-19 data. For all countries supported, this includes a daily time series of cases. Wherever available we also provide data on deaths, hospitalisations, and tests. National-level data is supported from a range of data sources, along with linelist data and links to intervention data sets.
https://epiforecasts.io/covidregionaldata/

Extend data processing #322

Closed joseph-palmer closed 3 years ago

joseph-palmer commented 3 years ago

This PR makes data processing more modular by allowing users to 'switch off' individual functions used during processing. The clearest use case is a user who wants to keep negative values rather than have them replaced with zeros. This PR also supports turning off other functions (calculate_columns_from_existing_data, for example), but doing so may cause errors.

This works by the user passing a list of processing options (process_options) at initialisation. This must be a list of functions, e.g. process_options = list("set_negative_values_to_zero" = FALSE). At initialisation, format_process_options takes these arguments, checks them, and updates a list of known functions, dropping those set to FALSE. During processing, only the function names that remain TRUE are called, so steps marked FALSE are skipped.
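As a rough illustration, the filtering described above could look something like the sketch below (the default list and the exact checks are assumptions for illustration, not the package's internals):

# illustrative defaults: all known processing steps enabled
known_fns <- list(
  "calculate_columns_from_existing_data" = TRUE,
  "set_negative_values_to_zero" = TRUE
)

# merge user-supplied process_options into the defaults and
# keep only the names of steps that are still switched on
format_process_options <- function(process_options, known_fns) {
  stopifnot(is.list(process_options))
  unknown <- setdiff(names(process_options), names(known_fns))
  if (length(unknown) > 0) {
    stop("Unknown processing options: ", paste(unknown, collapse = ", "))
  }
  known_fns[names(process_options)] <- process_options
  names(known_fns)[unlist(known_fns)]
}

format_process_options(list("set_negative_values_to_zero" = FALSE), known_fns)
#> [1] "calculate_columns_from_existing_data"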

Switching off processing functions at the minute only really works for 'set_negative_values_to_zero'; switching off others causes errors because what they add gets used further down the line. Nevertheless, this method is scalable (and we can remove the others from the options that can be changed) should we add any other processing steps that a user may want to play with.

TODO

closes #317

github-actions[bot] commented 3 years ago

👋 Thanks for opening this pull request! Can you please run through the following checklist before requesting review (ticking each item as complete or not relevant).

Thank you again for the contribution. If making large scale changes consider using our pre-commit hooks (see the contributing guide) to more easily comply with our guidelines.

seabbs commented 3 years ago

So I was thinking here we would have a core set of options we always run and then an optional list you can alter?

So in my mind DataClass would have a field like

process_fns = c(replace_zero_with_na)

which can get changed at initialisation or in a certain class.

Then, similar to how you have things implemented, we would have:

process_internal <- function(process_fns) {
  # clean is assumed to come from the class (e.g. self$clean)
  processed <- default_fns(clean)
  processed <- optional_fns(processed, process_fns = process_fns)
  return(processed)
}

Where optional_fns is:

optional_fns <- function(processed, process_fns) {
  # apply each user-supplied processing function in turn
  for (i in seq_along(process_fns)) {
    processed <- do.call(process_fns[[i]], list(processed))
  }
  return(processed)
}
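For illustration only, calling the optional_fns sketch above with a placeholder data frame and a single user-supplied step might look like this (the data and the body of replace_zero_with_na here are stand-ins, not package code):

# placeholder data and optional step, purely for illustration
clean <- data.frame(cases_new = c(-2, 0, 5))
replace_zero_with_na <- function(df) {
  df$cases_new[df$cases_new == 0] <- NA
  df
}
processed <- optional_fns(clean, process_fns = c(replace_zero_with_na))
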
joseph-palmer commented 3 years ago

I have changed the way this works and the tests. It now has a field in DataClass which stores a list of functions passed in at initialisation, with set_negative_values_to_zero as a default item in the list. Users can pass any function that operates on the data, and these are applied after the default functions have run.
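For example, usage might look something like the sketch below; the UK$new() call and exactly where process_fns is passed are my guesses at the eventual interface, not confirmed syntax:

# hypothetical: add a custom step that runs after the default functions
drop_missing_dates <- function(data) data[!is.na(data$date), ]
uk <- UK$new(level = "1", process_fns = list(drop_missing_dates))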

Need to think about where they go with grouping the data; at the moment they are applied to the grouped data, but this can be modified. I will look over the process_internal function and see if I can reformat it to work faster and be more verbose.

seabbs commented 3 years ago

There were some merge conflicts not showing as merge conflicts so I just jumped in to fix them.

In doing so I saw a few edge cases that were a little strange (I don't think you could empty the default of process_fns?) and expanded testing a little. @joseph-palmer can you check and see if you are happy?