how to know which variables are available to subset by in a `multi_epidist` object?

joshwlambert commented 7 months ago

Discussed in https://github.com/epiverse-trace/epiparameter/discussions/217

Originally posted by **avallecam** November 30, 2023

From the examples, we can filter by `sample_size` (code 1). In a `multi_epidist` object from `epidist_db()` of two elements (code 2), I tried to apply that filter but `object 'sample_size' not found` (code 3). Is there a way to see what variables are available for a `multi_epidist` object?

``` r
# code 1
epiparameter::epidist_db(
  disease = "SARS",
  epi_dist = "offspring_distribution",
  subset = sample_size > 40
)
#> Returning 1 results that match the criteria (1 are parameterised).
#> Use subset to filter by entry variables or single_epidist to return a single entry.
#> To retrieve the short citation for each use the 'get_citation' function
#> Disease: SARS
#> Pathogen: SARS-Cov-1
#> Epi Distribution: offspring distribution
#> Study: Lloyd-Smith J, Schreiber S, Kopp P, Getz W (2005). "Superspreading and
#> the effect of individual variation on disease emergence." _Nature_.
#> doi:10.1038/nature04153 .
#> Distribution: nbinom
#> Parameters:
#>   mean: 1.630
#>   dispersion: 0.160

# code 2
epiparameter::epidist_db(
  disease = "covid",
  epi_dist = "incubation",
  author = "McAloon"
)
#> Returning 2 results that match the criteria (2 are parameterised).
#> Use subset to filter by entry variables or single_epidist to return a single entry.
#> To retrieve the short citation for each use the 'get_citation' function
#> [[1]]
#> Disease: COVID-19
#> Pathogen: SARS-CoV-2
#> Epi Distribution: incubation period
#> Study: McAloon C, Collins Á, Hunt K, Barber A, Byrne A, Butler F, Casey M,
#> Griffin J, Lane E, McEvoy D, Wall P, Green M, O'Grady L, More S (2020).
#> "Incubation period of COVID-19: a rapid systematic review and
#> meta-analysis of observational research." _BMJ Open_.
#> doi:10.1136/bmjopen-2020-039652
#> .
#> Distribution: lnorm
#> Parameters:
#>   meanlog: 1.660
#>   sdlog: 0.480
#>
#> [[2]]
#> Disease: COVID-19
#> Pathogen: SARS-CoV-2
#> Epi Distribution: incubation period
#> Study: McAloon C, Collins Á, Hunt K, Barber A, Byrne A, Butler F, Casey M,
#> Griffin J, Lane E, McEvoy D, Wall P, Green M, O'Grady L, More S (2020).
#> "Incubation period of COVID-19: a rapid systematic review and
#> meta-analysis of observational research." _BMJ Open_.
#> doi:10.1136/bmjopen-2020-039652
#> .
#> Distribution: lnorm
#> Parameters:
#>   meanlog: 1.630
#>   sdlog: 0.500
#>
#> attr(,"class")
#> [1] "multi_epidist"

# code 3
epiparameter::epidist_db(
  disease = "covid",
  epi_dist = "incubation",
  author = "McAloon",
  subset = sample_size > 10
)
#> Error in epiparameter::epidist_db(disease = "covid", epi_dist = "incubation", : object 'sample_size' not found
```

Created on 2023-11-30 with [reprex v2.0.2](https://reprex.tidyverse.org)

joshwlambert commented 7 months ago

@avallecam thanks for raising this. I've fixed the problem that prevented you from subsetting by both the `author` argument and the `subset` argument (in this case `sample_size`). Your original code should work as expected once #226 is merged.

To answer your other question:

> Is there a way to see what variables are available for a multi_epidist object?

There is no standard method to see which variables can be used to subset. We provide the most common subsetting types as arguments to `epidist_db()`, and the `subset` argument can be used for more unusual subsetting. The `<epidist>` structure is consistent, so the best way to see which variables are available to subset by is to use `$` to look into an `<epidist>` object.
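As a rough illustration of looking into an object with `names()` and `$`, here is a minimal base R sketch. It uses a hypothetical stand-in list (`mock_entry`, with made-up field names and values), not a real `<epidist>` object, so the fields shown are assumptions and may not match the package's actual structure.

```r
# Hypothetical stand-in for one entry of a <multi_epidist>; the field names
# and values are invented for illustration and are not the real <epidist> layout.
mock_entry <- list(
  disease = "disease A",
  epi_dist = "incubation period",
  metadata = list(sample_size = 50, region = "region X")
)

# Top-level fields that `$` can reach
top_level <- names(mock_entry)
print(top_level)

# Nested fields one level down
print(names(mock_entry$metadata))

# Flatten the nested list to get every field path in one vector
field_paths <- names(unlist(mock_entry))
print(field_paths)
```

With a real result from `epidist_db()` that returns several entries, the same calls would presumably be applied to a single element, for example `names(x[[1]])`, assuming `x` holds the returned list.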

If you think of a better way to inform users on subsetting options please let me know and I'll happily implement it into the package.

avallecam commented 7 months ago

> If you think of a better way to inform users on subsetting options please let me know and I'll happily implement it into the package.

I would not say that this is a better way, but it is a flexible one. As a user, I liked the possibility of using `epiparam()` + `dplyr::filter()` as a combo to explore the whole database before using `epidist_db()`, or at least to inherit the class until broken intentionally by the user.

I raised #224 and #225 mostly because `epiparam()` was no longer exported to the namespace. If you can point me to the key discussions that led to this decision in #197, that would be highly informative to me, because I still think that keeping it would preserve the previous flexibility.

This was previously used in:

100
111
191

joshwlambert commented 4 months ago

@avallecam I cannot remember whether there was any recorded discussion before the changes to remove `<epiparam>` from {epiparameter}. I will search and, if I find any, I will link to it in this issue. Some initial comments about package complexity were raised in #151, and these led to the major refactor in #197. There were certain limitations with storing the data in a tabular form, both when saved as a csv and when read into R as a `<data.frame>`.

However, it seems clear that there is a need for, and benefit in, having an easily accessible tabular form of the epidemiological parameter library, so I propose adding an `as.data.frame()` method for `<multi_epidist>` to easily enable this transformation.
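To make the proposal concrete, here is a minimal sketch of what an `as.data.frame()` S3 method for a list-of-entries class could look like. It is written against a hypothetical class (`mock_multi`) with made-up fields and values, and is not the implementation in {epiparameter}.

```r
# Hypothetical list-of-entries class standing in for <multi_epidist>;
# field names and values are made up for illustration.
entries <- structure(
  list(
    list(disease = "disease A", epi_dist = "incubation period", sample_size = 12),
    list(disease = "disease A", epi_dist = "serial interval", sample_size = 55),
    list(disease = "disease B", epi_dist = "incubation period", sample_size = 230)
  ),
  class = "mock_multi"
)

# S3 method: one row per entry, one column per field
as.data.frame.mock_multi <- function(x, ...) {
  rows <- lapply(unclass(x), function(entry) {
    as.data.frame(entry, stringsAsFactors = FALSE)
  })
  do.call(rbind, rows)
}

param_table <- as.data.frame(entries)
print(param_table)

# Once the data are tabular, the available variables are the column names
print(names(param_table))

# Ordinary data frame tools can then be used to filter
large_studies <- subset(param_table, sample_size > 40)
print(large_studies)
```

A tabular form like this would also answer the original question directly, since the variables available for filtering are visible as columns.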

avallecam commented 4 months ago

> However, it seems clear that the need/benefit of having an easily accessible tabular form to the epidemiological parameter library is there so I propose adding an as.data.frame() method for <multi_epidist> to easily enable this transformation.

Thank you for sharing the context of the limitations that guided that removal. I think you did get a sense of my main intention from #225 and this issue. If {epiparameter} can list literature-reviewed distributions for me, I would also want to be able to access any of them, apart from the ease (of following the selection criteria) that `single_epidist` provides. Different users may prefer different ways to access them, so keeping that gate open may be a sensible decision.

For future plans, if {epiparameter} ends up being an API to an online database server, then {epiparameter} may be flexible enough to apply SQL commands before importing the data into an R session (like the backends I have seen listed in {dbplyr}).
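As a sketch of that idea, the following uses {dbplyr} with an in-memory SQLite database standing in for a remote server. The table name (`parameters`), its columns, and its values are all made up for illustration; nothing here reflects how {epiparameter} stores or serves its data. It assumes the {dplyr}, {dbplyr}, {DBI} and {RSQLite} packages are installed.

```r
library(dplyr)

# Made-up table standing in for a remote parameter database
params <- data.frame(
  disease = c("disease A", "disease A", "disease B"),
  epi_dist = c("incubation period", "serial interval", "incubation period"),
  sample_size = c(12, 55, 230),
  stringsAsFactors = FALSE
)

# In-memory SQLite database used in place of an online server
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
DBI::dbWriteTable(con, "parameters", params)

# Lazy reference to the table: no rows are pulled into R yet
remote <- tbl(con, "parameters")
query <- remote %>% filter(sample_size > 40)

# Show the SQL that {dbplyr} generates from the filter
show_query(query)

# Only the matching rows are imported into the R session
result <- collect(query)
print(result)

DBI::dbDisconnect(con)
```

The filter is translated to SQL and run by the database, so only the rows that match are transferred to R.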
