Closed joshwlambert closed 6 months ago
Just thinking through the use cases, it might be useful to provide the citation in a flat format, as the class (surprisingly) doesn't appear to support [[, making it more difficult to filter a data.frame of parameters by year or author, which might be relevant to comparative analyses.
By flat do you mean one <data.frame> column per <bibentry> element, or just unclass the <bibentry> when putting into the list column so it is just a list?
I meant the former, but the latter might also be viable if users are mostly using the citation data rather than the overall object and any related methods.
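To make the two options concrete, here is a minimal base R sketch. The two bibentry objects are made-up placeholders, not entries from the {epiparameter} database, and the column names are assumptions chosen for illustration only.

```r
library(utils)

# Two hypothetical citations (placeholder values, not real references)
bib_a <- bibentry(
  bibtype = "Article",
  author = person("Jane", "Example"),
  title = "A hypothetical study",
  journal = "Journal of Examples",
  year = "2020"
)
bib_b <- bibentry(
  bibtype = "Article",
  author = person("John", "Sample"),
  title = "Another hypothetical study",
  journal = "Journal of Examples",
  year = "2022"
)
bibs <- list(bib_a, bib_b)

# Option 1 (flat): one data.frame column per bibentry element,
# which makes filtering by year or author a plain subset
flat <- data.frame(
  author = vapply(bibs, function(b) b$author$family, character(1)),
  year = vapply(bibs, function(b) as.numeric(b$year), numeric(1))
)
flat_2020 <- flat[flat$year == 2020, ]

# Option 2 (unclassed): keep a list column, but store plain lists
# instead of <bibentry> objects
nested <- data.frame(id = 1:2)
nested$citation <- I(lapply(bibs, function(b) unclass(b)[[1]]))
first_year <- nested$citation[[1]]$year
```

The flat form trades the bibentry methods (printing, formatting) for easy subsetting; the unclassed form keeps every field but still needs list indexing to reach them.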
We talked about taking on {tibble} as a dependency yesterday since this implementation creates data.frames with list columns, and I think that that is the right way to go - perhaps even in this PR. The function name might need changing in that case.
A tibble with list-columns will be more recognisable to users with Tidyverse training than a data.frame implementing the same. The printing of list-columns and large strings (such as notes) is better in a tibble too. Users would also probably use Tidyverse packages such as {purrr} to work with the list columns, which reduces the user-side cost of taking on {tibble} in {epiparameter}.
This is a nice idea, but I don't think it should be part of the as.data.frame() method; instead I propose making a future PR adding an as_tibble() method for <epidist>, which would provide the formatting benefits you mentioned.
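As a rough sketch of what such a method could look like, the snippet below uses a stand-in class, toy_epidist, rather than the real <epidist> class; its fields and both methods are assumptions for illustration and are not the {epiparameter} implementation.

```r
library(tibble)

# Stand-in object with a hypothetical class name and fields
toy <- structure(
  list(disease = "COVID-19", params = c(meanlog = 1.611, sdlog = 0.472)),
  class = "toy_epidist"
)

# Hypothetical as.data.frame() method producing a list column
as.data.frame.toy_epidist <- function(x, ...) {
  df <- data.frame(disease = x$disease)
  df$params <- list(x$params)
  df
}

# The as_tibble() method can simply delegate to as.data.frame()
as_tibble.toy_epidist <- function(x, ...) {
  as_tibble(as.data.frame(x), ...)
}

toy_tbl <- as_tibble(toy)
```

Keeping the tibble conversion in a separate method means as.data.frame() stays free of the {tibble} dependency, while users who want tibble printing can opt in.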
Are the list columns hard for users to understand and use?
My assessment is that the <named list> output was initially difficult for me to work with; specifically, working out how to extract information that I could later use in, for example, subset = sample_size == xx within epidist_db(). I needed to explore for a while how to unnest the data using {tidyr}, because tidyr::unnest(), which is what we mostly use for this, did not work for all columns. While trying this I discovered tidyr::unnest_wider(), which unnested the data within, e.g., the metadata column into new columns for each observation (or row), making the sample_size column visible. However, this method does not work for (solved) citation and prob_distribution, which fail with two different error messages.
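For readers unfamiliar with tidyr::unnest_wider(), here is a small self-contained illustration on a toy tibble. The data and the region field are invented; only the metadata and sample_size names mirror the columns discussed above.

```r
library(tibble)
library(tidyr)

# Toy table with a list column of named lists (invented values)
toy <- tibble(
  author = c("Author A", "Author B"),
  metadata = list(
    list(sample_size = 59, region = "X"),
    list(sample_size = 6241, region = "Y")
  )
)

# Spread each named element of `metadata` into its own column
wide <- unnest_wider(toy, metadata)

# sample_size is now an ordinary column and can be used for filtering
large <- wide[wide$sample_size > 1000, ]
```

This works here because every list element is a named list of scalars; columns holding more complex objects may need different handling.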
Here is the code from my exploration: (edit: moved reprex to review comment)
With respect to the data requirement that triggered my feature request, the work going on in the parameter_tbl branch targets this issue nicely. I'm able to select any combination of authors and subset parameters, which could also help with iterative calculations, potentially using {purrr}.
# remotes::install_github("epiverse-trace/epiparameter@parameter_tbl")
library(epiparameter)
library(tidyverse)
epidist_db(
  disease = "COVID-19",
  epi_dist = "incubation period"
) %>%
  epiparameter::distribution_tbl()
#> Returning 15 results that match the criteria (11 are parameterised).
#> Use subset to filter by entry variables or single_epidist to return a single entry.
#> To retrieve the short citation for each use the 'get_citation' function
#> # Distribution table:
#> # A data frame: 15 × 7
#> disease pathogen epi_distribution prob_distribution author year sample_size
#> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl>
#> 1 COVID-19 SARS-Co… incubation peri… <NA> Men e… 2020 59
#> 2 COVID-19 SARS-Co… incubation peri… <NA> Rai e… 2022 6241
#> 3 COVID-19 SARS-Co… incubation peri… <NA> Alene… 2021 1453
#> 4 COVID-19 SARS-Co… incubation peri… weibull Yang … 2020 178
#> 5 COVID-19 SARS-Co… incubation peri… <NA> Elias… 2021 28675
#> 6 COVID-19 SARS-Co… incubation peri… weibull Bui e… 2020 19
#> 7 COVID-19 SARS-Co… incubation peri… lnorm McAlo… 2020 1357
#> 8 COVID-19 SARS-Co… incubation peri… lnorm McAlo… 2020 1269
#> 9 COVID-19 SARS-Co… incubation peri… lnorm Linto… 2020 52
#> 10 COVID-19 SARS-Co… incubation peri… lnorm Linto… 2020 158
#> 11 COVID-19 SARS-Co… incubation peri… lnorm Linto… 2020 52
#> 12 COVID-19 SARS-Co… incubation peri… lnorm Lauer… 2020 181
#> 13 COVID-19 SARS-Co… incubation peri… lnorm Lauer… 2020 99
#> 14 COVID-19 SARS-Co… incubation peri… lnorm Lauer… 2020 108
#> 15 COVID-19 SARS-Co… incubation peri… lnorm Lauer… 2020 73
epidist_db(
  disease = "COVID-19",
  epi_dist = "incubation period",
  author = "Linton",
  subset = sample_size == 158
)
#> Returning 1 results that match the criteria (1 are parameterised).
#> Use subset to filter by entry variables or single_epidist to return a single entry.
#> To retrieve the short citation for each use the 'get_citation' function
#> Disease: COVID-19
#> Pathogen: SARS-CoV-2
#> Epi Distribution: incubation period
#> Study: Linton N, Kobayashi T, Yang Y, Hayashi K, Akhmetzhanov A, Jung S, Yuan
#> B, Kinoshita R, Nishiura H (2020). "Incubation Period and Other
#> Epidemiological Characteristics of 2019 Novel Coronavirus Infections
#> with Right Truncation: A Statistical Analysis of Publicly Available
#> Case Data." _Journal of Clinical Medicine_. doi:10.3390/jcm9020538
#> <https://doi.org/10.3390/jcm9020538>.
#> Distribution: lnorm
#> Parameters:
#> meanlog: 1.611
#> sdlog: 0.472
Created on 2024-03-08 with reprex v2.1.0
Thanks for the comments. I will now merge this PR. Some of the later comments are more a response to the developments on the parameter_tbl branch, so I will revisit them once a PR is open to merge that branch into main.
This PR addresses #249 by providing an as.data.frame() method for the <epidist> class and, by looping over the <epidist> method, for the <multi_epidist> class. This is to provide {epiparameter} data in a tabular form to make it more easily applicable to pipelines, e.g. Tidyverse data analysis pipelines.
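The dispatch pattern described above can be sketched in base R with stand-in classes. The names toy_epidist and toy_multi_epidist, their fields, and the parameter values are assumptions for illustration and do not reproduce the actual {epiparameter} methods.

```r
# Hypothetical constructor for a single-entry stand-in object
new_toy_epidist <- function(disease, epi_dist, params) {
  structure(
    list(disease = disease, epi_dist = epi_dist, params = params),
    class = "toy_epidist"
  )
}

# Single-object method: one row, with parameters kept in a list column
as.data.frame.toy_epidist <- function(x, row.names = NULL,
                                      optional = FALSE, ...) {
  data.frame(
    disease = x$disease,
    epi_distribution = x$epi_dist,
    params = I(list(x$params))
  )
}

# Multi-object method: loop over the single-object method and bind rows
as.data.frame.toy_multi_epidist <- function(x, row.names = NULL,
                                            optional = FALSE, ...) {
  rows <- lapply(unclass(x), as.data.frame, ...)
  do.call(rbind, rows)
}

multi <- structure(
  list(
    new_toy_epidist("disease A", "incubation period", c(meanlog = 1.5, sdlog = 0.5)),
    new_toy_epidist("disease B", "serial interval", c(shape = 2, scale = 3))
  ),
  class = "toy_multi_epidist"
)
multi_df <- as.data.frame(multi)
```

Because the multi-object method only delegates and binds rows, any change to the single-object method's columns carries over automatically.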