epiverse-trace / episoap

[Not published - under active development] A Store of Outbreak Analytics Pipelines Provided as Rmarkdown Report Templates
https://epiverse-trace.github.io/episoap/

Data preparation for episoap pipelines #123

Open CarmenTamayo opened 5 months ago

CarmenTamayo commented 5 months ago

Currently the transmissibility pipeline starts with aggregated data, and the severity pipeline includes options for both individual and aggregated data.

After some discussion, we've agreed that a preceding pipeline module dedicated to data cleaning and formatting would be beneficial, so that users' data can be prepared for use in either of the pipelines (and in future ones as well). This would also mean that the structure of the pipeline modules follows the structure of the Epiverse pipeline map (early-middle-late tasks).

Issue #89 is related to this

@Bisaloo do you agree with this approach, or would you rather include data cleaning and formatting steps in each of the report templates separately?

CarmenTamayo commented 5 months ago

@chartgerink this might be a good discussion to participate in to get familiar with {episoap} and the rationale for building the package, if you are interested 😃

CarmenTamayo commented 5 months ago

Related to this, {linelist} is currently used in the report template to tag the data. However, this isn't done directly: objects are first created that hold the names of the dataset columns, which differ from the tags used by {linelist}. See below:

date_var <- "date"
group_var <- "region"
count_var <- "n"

dat <- dat_raw %>%
  make_linelist(
    date_admission = date_var,
    location = group_var,
    counts = count_var,
    allow_extra = TRUE
  )

I find this a bit convoluted, and later on in the report the object names and the data column names seem to be used arbitrarily (at least as far as I can tell), which can result in errors (see issue #117).

Is there a best practice when it comes to naming variables, i.e. using objects vs. tags vs. column names?

chartgerink commented 5 months ago

Thanks for tagging me for this discussion @CarmenTamayo 😊

Data preparation

If I understand correctly, I think this is a good idea. It is much more informative and educational to use your own data throughout the various modules. 👍

The only downside I see is that we don't know whether people will walk through every module in the pipeline. Somebody who just wants to do the data analysis, but not the data preparation, should also be supported. In other words, I would consider it important to offer them an alternative (maybe this remark is unnecessary and this is already happening!).

Naming

In general, my approach and experience with naming is to leave as little to the imagination as possible. So if I'm coding a submit button for a website, it will be called SubmitButton (or alternatively Button with a submit argument).

In the scenario you mention, looking at how linelist works and is used here, I don't see any reason to keep the date_var naming, as it is literally only used a few lines later. Here it is indeed convoluted and can be cleaned up. I also tried it out and the template runs without issues.
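As a rough sketch of what that cleanup could look like, the tags can be mapped straight to the column names, without the intermediate date_var, group_var and count_var objects. The small dat_raw data frame below is made up for illustration and only stands in for whatever data the template actually loads; the column names are the ones from the snippet above.

```r
library(linelist)

# Hypothetical aggregated data standing in for the template's `dat_raw`
dat_raw <- data.frame(
  date = as.Date("2024-01-01") + 0:2,
  region = c("north", "south", "north"),
  n = c(3L, 5L, 2L)
)

# Map tags directly to column names, with no intermediate *_var objects
dat <- make_linelist(
  dat_raw,
  date_admission = "date",
  location = "region",
  counts = "n",
  allow_extra = TRUE # `counts` is not one of the default linelist tags
)

# Inspect which column each tag points to
tags(dat)
```

If it goes this way, the rest of the template would need to refer consistently to either the column names or the tags, rather than to the removed objects.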


I hope that is of some help to the discussion. Still getting the hang of how everything works and relates, so feel free to call out anything as wrong :-)