Add `as.data.frame()` methods for `<epidist>` & `<multi_epidist>`

joshwlambert commented 8 months ago

This PR addresses #249 by adding an as.data.frame() method for the <epidist> class and, by looping over that method, for the <multi_epidist> class.

This provides {epiparameter} data in tabular form so that it is easier to use in data analysis pipelines, e.g. Tidyverse pipelines.
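The dispatch pattern described above can be sketched in base R. This is only an illustration under stated assumptions, not the code in this PR: the `toy_epidist` and `toy_multi_epidist` classes, their fields, and the `new_toy_epidist()` constructor are hypothetical stand-ins for the real classes.

```r
# Hypothetical stand-in for an <epidist> object: a classed list.
new_toy_epidist <- function(disease, epi_dist, prob_dist, params, citation) {
  structure(
    list(
      disease = disease,
      epi_dist = epi_dist,
      prob_dist = prob_dist,
      params = params,
      citation = citation
    ),
    class = "toy_epidist"
  )
}

# S3 method: as.data.frame() dispatches here for <toy_epidist> objects.
as.data.frame.toy_epidist <- function(x, ...) {
  df <- data.frame(
    disease = x$disease,
    epi_dist = x$epi_dist,
    prob_dist = x$prob_dist,
    stringsAsFactors = FALSE
  )
  # Non-scalar elements go into list columns, one list element per row.
  df$params <- list(x$params)
  df$citation <- list(x$citation)
  df
}

# The multi-object method loops over the single-object method and
# row-binds the one-row results.
as.data.frame.toy_multi_epidist <- function(x, ...) {
  rows <- lapply(x, as.data.frame)
  do.call(rbind, rows)
}

# Usage: two identical toy entries produce a two-row data.frame.
entry <- new_toy_epidist(
  disease = "COVID-19",
  epi_dist = "incubation period",
  prob_dist = "lnorm",
  params = c(meanlog = 1.611, sdlog = 0.472),
  citation = "placeholder citation"
)
multi <- structure(list(entry, entry), class = "toy_multi_epidist")
df <- as.data.frame(multi)
```

Keeping the multi-object method as a thin loop means any change to the single-object method carries over automatically.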

joshwlambert commented 8 months ago

> Just thinking through the use cases, it might be useful to provide the citation in a flat format, as the class (surprisingly) doesn't appear to support [[, making it more difficult to filter a data.frame of parameters by year or author, which might be relevant to comparative analyses.

By flat do you mean one <data.frame> column per <bibentry> element, or just unclass the <bibentry> when putting into the list column so it is just a list?

pratikunterwegs commented 8 months ago

> > Just thinking through the use cases, it might be useful to provide the citation in a flat format, as the class (surprisingly) doesn't appear to support [[, making it more difficult to filter a data.frame of parameters by year or author, which might be relevant to comparative analyses.
>
> By flat do you mean one <data.frame> column per <bibentry> element, or just unclass the <bibentry> when putting into the list column so it is just a list?

I meant the former, but the latter might also be viable if users are mostly using the citation data rather than the overall object and any related methods.
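The two options discussed here can be sketched with base R's `utils::bibentry()`. This is a sketch under assumptions: `flatten_citation()` is a hypothetical helper, and the citations are placeholders rather than entries from the {epiparameter} database.

```r
# Placeholder citations built with base R; all field values are made up.
bib_a <- utils::bibentry(
  bibtype = "Article",
  title = "An example study",
  author = utils::person(given = "A", family = "Author"),
  journal = "Example Journal",
  year = "2020"
)
bib_b <- utils::bibentry(
  bibtype = "Article",
  title = "Another example study",
  author = utils::person(given = "B", family = "Writer"),
  journal = "Example Journal",
  year = "2022"
)

# Option 1 (hypothetical helper): one data.frame column per citation field.
flatten_citation <- function(bib) {
  data.frame(
    author = paste(unlist(bib$author$family), collapse = "; "),
    year = as.integer(bib$year),
    title = bib$title,
    journal = bib$journal,
    stringsAsFactors = FALSE
  )
}

citations <- do.call(rbind, lapply(list(bib_a, bib_b), flatten_citation))

# Flat columns can be filtered with ordinary data.frame subsetting.
recent <- citations[citations$year == 2020, ]

# Option 2: unclass the <bibentry> so the fields form a plain list.
fields <- unclass(bib_a)[[1]]
```

Option 1 makes year and author directly filterable, while option 2 keeps every field but leaves the extraction to the user.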

joshwlambert commented 8 months ago

> We talked about taking on {tibble} as a dependency yesterday since this implementation creates data.frames with list columns, and I think that that is the right way to go - perhaps even in this PR. The function name might need changing in that case.
>
> A tibble with list-columns will be more recognisable to users with Tidyverse training than a data.frame implementing the same. The printing of list-columns and large strings (such as notes) is better in a tibble too. Users would also probably use Tidyverse packages such as {purrr} to work with the list columns, which reduces the user-side cost of taking on {tibble} in {epiparameter}.

This is a nice idea, but I don't think it should be part of the as.data.frame() method. Instead, I propose a future PR that adds an as_tibble() method for <epidist>, which would provide the formatting benefits you mentioned.

avallecam commented 8 months ago

> whether the list columns are hard for users to understand and use?

My assessment is that the <named list> output was initially difficult for me to work with, specifically when trying to extract information for later use in, for example, subset = sample_size == xx within epidist_db(). I needed to explore for a while how to unnest the data using {tidyr}, because the function we mostly use for this, tidyr::unnest(), did not work for all columns. While trying this I discovered tidyr::unnest_wider(), which unnested the data within, e.g., the metadata column into new columns for each observation (or row), making the sample_size column visible. ~~However, this method does not work for citation and prob_distribution with two different error messages.~~ (solved)
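The unnest_wider() step described above can be sketched as follows. This is a minimal illustration, assuming a table with a `metadata` list column of named lists; the `region` field and its values are hypothetical, and the table is built by hand instead of coming from the as.data.frame() method.

```r
library(tibble)
library(tidyr)

# Hand-built table with a list column of named lists (illustrative only).
params <- tibble(
  disease = c("COVID-19", "COVID-19"),
  metadata = list(
    list(sample_size = 158, region = "placeholder A"),
    list(sample_size = 52, region = "placeholder B")
  )
)

# unnest_wider() turns each named element of the list column into its own
# column, so sample_size becomes directly visible and filterable.
wide <- unnest_wider(params, metadata)
selected <- wide[wide$sample_size == 158, ]
```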

~~Here is the code from my exploration:~~ (edit: moved reprex to review comment)

avallecam commented 8 months ago

With respect to the data requirement that triggered my feature request, the work going on in the parameter_tbl branch targets this issue nicely. I'm able to select any combination of authors and subset parameters, which could also help with iterative calculations, potentially using {purrr}.

```r
# remotes::install_github("epiverse-trace/epiparameter@parameter_tbl")

library(epiparameter)
library(tidyverse)

epidist_db(
  disease = "COVID-19",
  epi_dist = "incubation period"
) %>%
  epiparameter::distribution_tbl()
#> Returning 15 results that match the criteria (11 are parameterised).
#> Use subset to filter by entry variables or single_epidist to return a single entry.
#> To retrieve the short citation for each use the 'get_citation' function
#> # Distribution table:
#> # A data frame:       15 × 7
#>    disease  pathogen epi_distribution prob_distribution author  year sample_size
#>    <chr>    <chr>    <chr>            <chr>             <chr>  <dbl>       <dbl>
#>  1 COVID-19 SARS-Co… incubation peri… <NA>              Men e…  2020          59
#>  2 COVID-19 SARS-Co… incubation peri… <NA>              Rai e…  2022        6241
#>  3 COVID-19 SARS-Co… incubation peri… <NA>              Alene…  2021        1453
#>  4 COVID-19 SARS-Co… incubation peri… weibull           Yang …  2020         178
#>  5 COVID-19 SARS-Co… incubation peri… <NA>              Elias…  2021       28675
#>  6 COVID-19 SARS-Co… incubation peri… weibull           Bui e…  2020          19
#>  7 COVID-19 SARS-Co… incubation peri… lnorm             McAlo…  2020        1357
#>  8 COVID-19 SARS-Co… incubation peri… lnorm             McAlo…  2020        1269
#>  9 COVID-19 SARS-Co… incubation peri… lnorm             Linto…  2020          52
#> 10 COVID-19 SARS-Co… incubation peri… lnorm             Linto…  2020         158
#> 11 COVID-19 SARS-Co… incubation peri… lnorm             Linto…  2020          52
#> 12 COVID-19 SARS-Co… incubation peri… lnorm             Lauer…  2020         181
#> 13 COVID-19 SARS-Co… incubation peri… lnorm             Lauer…  2020          99
#> 14 COVID-19 SARS-Co… incubation peri… lnorm             Lauer…  2020         108
#> 15 COVID-19 SARS-Co… incubation peri… lnorm             Lauer…  2020          73

epidist_db(
  disease = "COVID-19",
  epi_dist = "incubation period",
  author = "Linton",
  subset = sample_size == 158
)
#> Returning 1 results that match the criteria (1 are parameterised).
#> Use subset to filter by entry variables or single_epidist to return a single entry.
#> To retrieve the short citation for each use the 'get_citation' function
#> Disease: COVID-19
#> Pathogen: SARS-CoV-2
#> Epi Distribution: incubation period
#> Study: Linton N, Kobayashi T, Yang Y, Hayashi K, Akhmetzhanov A, Jung S, Yuan
#> B, Kinoshita R, Nishiura H (2020). "Incubation Period and Other
#> Epidemiological Characteristics of 2019 Novel Coronavirus Infections
#> with Right Truncation: A Statistical Analysis of Publicly Available
#> Case Data." _Journal of Clinical Medicine_. doi:10.3390/jcm9020538
#> <https://doi.org/10.3390/jcm9020538>.
#> Distribution: lnorm
#> Parameters:
#>   meanlog: 1.611
#>   sdlog: 0.472
```
<sup>Created on 2024-03-08 with reprex v2.1.0</sup>

joshwlambert commented 6 months ago

Thanks for the comments. I will now merge this PR. Some of the later comments are more a response to the developments on the parameter_tbl branch, so I will revisit them once a PR is open to merge that branch into main.

epiverse-trace / epiparameter

Add `as.data.frame()` methods for `<epidist>` & `<multi_epidist>` #254