epiverse-trace / epiparameter

R package with library of epidemiological parameters for infectious diseases and functions and classes for working with parameters

how to know which variables are available to subset by in a `multi_epidist` object? #224

Closed joshwlambert closed 4 months ago

joshwlambert commented 7 months ago

Discussed in https://github.com/epiverse-trace/epiparameter/discussions/217

Originally posted by **avallecam** November 30, 2023

From the examples, we can filter by `sample_size` (code 1). In a `multi_epidist` object from `epidist_db()` of two elements (code 2), I tried to apply that filter but `object 'sample_size' not found` (code 3). Is there a way to see what variables are available for a `multi_epidist` object?

``` r
# code 1
epiparameter::epidist_db(
  disease = "SARS",
  epi_dist = "offspring_distribution",
  subset = sample_size > 40
)
#> Returning 1 results that match the criteria (1 are parameterised).
#> Use subset to filter by entry variables or single_epidist to return a single entry.
#> To retrieve the short citation for each use the 'get_citation' function
#> Disease: SARS
#> Pathogen: SARS-Cov-1
#> Epi Distribution: offspring distribution
#> Study: Lloyd-Smith J, Schreiber S, Kopp P, Getz W (2005). "Superspreading and
#> the effect of individual variation on disease emergence." _Nature_.
#> doi:10.1038/nature04153 .
#> Distribution: nbinom
#> Parameters:
#> mean: 1.630
#> dispersion: 0.160

# code 2
epiparameter::epidist_db(
  disease = "covid",
  epi_dist = "incubation",
  author = "McAloon"
)
#> Returning 2 results that match the criteria (2 are parameterised).
#> Use subset to filter by entry variables or single_epidist to return a single entry.
#> To retrieve the short citation for each use the 'get_citation' function
#> [[1]]
#> Disease: COVID-19
#> Pathogen: SARS-CoV-2
#> Epi Distribution: incubation period
#> Study: McAloon C, Collins Á, Hunt K, Barber A, Byrne A, Butler F, Casey M,
#> Griffin J, Lane E, McEvoy D, Wall P, Green M, O'Grady L, More S (2020).
#> "Incubation period of COVID-19: a rapid systematic review and
#> meta-analysis of observational research." _BMJ Open_.
#> doi:10.1136/bmjopen-2020-039652
#> .
#> Distribution: lnorm
#> Parameters:
#> meanlog: 1.660
#> sdlog: 0.480
#>
#> [[2]]
#> Disease: COVID-19
#> Pathogen: SARS-CoV-2
#> Epi Distribution: incubation period
#> Study: McAloon C, Collins Á, Hunt K, Barber A, Byrne A, Butler F, Casey M,
#> Griffin J, Lane E, McEvoy D, Wall P, Green M, O'Grady L, More S (2020).
#> "Incubation period of COVID-19: a rapid systematic review and
#> meta-analysis of observational research." _BMJ Open_.
#> doi:10.1136/bmjopen-2020-039652
#> .
#> Distribution: lnorm
#> Parameters:
#> meanlog: 1.630
#> sdlog: 0.500
#>
#> attr(,"class")
#> [1] "multi_epidist"

# code 3
epiparameter::epidist_db(
  disease = "covid",
  epi_dist = "incubation",
  author = "McAloon",
  subset = sample_size > 10
)
#> Error in epiparameter::epidist_db(disease = "covid", epi_dist = "incubation", : object 'sample_size' not found
```

Created on 2023-11-30 with [reprex v2.0.2](https://reprex.tidyverse.org)

joshwlambert commented 7 months ago

@avallecam thanks for raising this. I've fixed the bug that prevented you from combining the `author` argument with the `subset` argument (in this case filtering on `sample_size`). Your original code should work as expected once #226 is merged.

To answer your other question:

> Is there a way to see what variables are available for a multi_epidist object?

There is no standard method for seeing which variables can be used to subset. We provide the most common subsetting types as arguments to `epidist_db()`, and the `subset` argument can be used for more unusual subsetting. The `<epidist>` structure is consistent, so the best way to see which variables are available to subset by is to use `$` to look into an `<epidist>` object.
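
A minimal sketch of that kind of inspection, using a toy list as a stand-in for an `<epidist>` object so that it runs without the package installed. The element names (`disease`, `epi_dist`, `metadata`, `sample_size`, `region`) and all values are assumptions for illustration and may not match the real class; with a real object, the same `names()` and `$` calls should apply.

```r
# Toy stand-in for a single <epidist> entry. Hypothetical structure with
# made-up values; it is not created by {epiparameter}.
toy_epidist <- list(
  disease = "disease A",
  epi_dist = "offspring distribution",
  metadata = list(sample_size = 100, region = "region X")
)

# Top-level elements that can be reached with `$`
names(toy_epidist)

# Nested elements, e.g. the variables stored under `metadata`
names(toy_epidist$metadata)

# A toy <multi_epidist>-like list: inspect the first entry in the same way
toy_multi <- list(toy_epidist, toy_epidist)
names(toy_multi[[1]]$metadata)
```

Because every entry is assumed to share the same structure, looking at the first element of the list is enough to see which names a `subset` expression could refer to.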

If you think of a better way to inform users on subsetting options please let me know and I'll happily implement it into the package.

avallecam commented 7 months ago

> If you think of a better way to inform users on subsetting options please let me know and I'll happily implement it into the package.

I would not say that this is a better way, but it is a flexible one. As a user, I liked being able to combine `epiparam()` with `dplyr::filter()` to explore the whole database before using `epidist_db()`, or at least to inherit the class until the user intentionally breaks it.

I raised #224 and #225 mostly because `epiparam()` was no longer exported to the namespace. If you can point me to the key discussions that led to this decision in #197, that would be very informative for me, because I still think that keeping it would preserve the previous flexibility.

This was previously used in:

joshwlambert commented 4 months ago

@avallecam I cannot remember whether there was any recorded discussion before the changes to remove `<epiparam>` from {epiparameter}. I will search and, if I find any, I will link to it in this issue. Some initial comments about package complexity were raised in #151, and these led to the major refactor in #197. There were certain limitations with storing the data in a tabular form, both when saved as a CSV and when read into R as a `<data.frame>`.

However, it seems clear that there is a need for, and a benefit to, an easily accessible tabular form of the epidemiological parameter library, so I propose adding an `as.data.frame()` method for `<multi_epidist>` to enable this transformation.
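
As a rough sketch of what such a method might look like, here is a toy version written against a stand-in class rather than the real `<multi_epidist>`. The class name `toy_multi_epidist`, the field names, and the values are all hypothetical; the actual method would have to handle nested fields that this flat example ignores.

```r
# Toy entries standing in for <epidist> objects. Hypothetical structure and
# made-up values; the real class may store these fields differently.
toy_multi <- structure(
  list(
    list(disease = "disease A", epi_dist = "incubation period", sample_size = 25),
    list(disease = "disease B", epi_dist = "serial interval", sample_size = 120)
  ),
  class = "toy_multi_epidist"
)

# Hypothetical S3 method: one row per entry, one column per field
as.data.frame.toy_multi_epidist <- function(x, ...) {
  rows <- lapply(unclass(x), function(entry) {
    as.data.frame(entry, stringsAsFactors = FALSE)
  })
  do.call(rbind, rows)
}

tbl <- as.data.frame(toy_multi)

# The column names show which variables are available to filter on
names(tbl)

# Tabular filtering, comparable to dplyr::filter(tbl, sample_size > 40)
subset(tbl, sample_size > 40)
```

Once the entries are in a data frame, the column names double as the list of variables that can be filtered on, which is the question this issue started from.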

avallecam commented 4 months ago

> However, it seems clear that the need/benefit of having an easily accessible tabular form to the epidemiological parameter library is there so I propose adding an as.data.frame() method for <multi_epidist> to easily enable this transformation.

Thank you for sharing the context of the limitations that guided that removal. I think you did get a sense of my main intention from #225 and this issue. If {epiparameter} can list literature-reviewed distributions for me, I would also want to be able to access any of them, beyond the convenience (of following the selection criteria) that `single_epidist` provides. Different users may prefer different ways of accessing them, so keeping that door open may be a sensible decision.

For future plans, if {epiparameter} ends up being an API to an online database server, it may be flexible enough to apply SQL commands before importing the data into an R session (like the backends I have seen listed in {dbplyr}).