Closed jamesmbaazam closed 1 year ago
Since an epiparam object is a data.frame under the hood, we can convert it with dplyr::as_tibble() before using other dplyr functions:
library(epiparameter)
library(tidyverse)
a <- epiparameter::epiparam()
class(a)
#> [1] "epiparam" "data.frame"
a %>%
# warning: this step breaks the epiparam class object
as_tibble() %>%
dplyr::distinct(disease)
#> # A tibble: 24 × 1
#> disease
#> <chr>
#> 1 Adenovirus
#> 2 Chikungunya
#> 3 COVID-19
#> 4 Dengue
#> 5 Ebola Virus Disease
#> 6 Hantavirus Pulmonary Syndrome
#> 7 Human Coronavirus
#> 8 Influenza
#> 9 Japanese Encephalitis
#> 10 Marburg Virus Disease
#> # … with 14 more rows
Created on 2023-03-11 with reprex v2.0.2
However, in #111 I used dplyr::filter() directly on the epiparam() output and it worked nicely. By contrast, dplyr::select() fails with an error saying that the epiparam class object must keep all of its original columns. dplyr::filter() works because it only drops rows and leaves every column in place. I think this check keeps the communication with epidist() secure.
library(epiparameter)
library(tidyverse)
eparams <- epiparam()
eparams %>%
filter(disease=="Influenza")
#> Epiparam object
#> Number of distributions in library: 17
#> Number of diseases: 1
#> Number of delay distributions: 17
#> Number of offspring distributions: 0
#> Number of studies in library: 10
#> <Head of library>
#> disease epi_distribution prob_distribution
#> 1 Influenza generation_time weibull
#> 2 Influenza incubation_period gamma
#> 3 Influenza incubation_period lnorm
#> 4 Influenza incubation_period lnorm
#> 5 Influenza incubation_period lnorm
#> 6 Influenza incubation_period lnorm
#> <11 more rows & 53 more cols not shown>
eparams %>%
select(disease)
#> Error in validate_epiparam(NextMethod()): epiparam object does not contain the correct columns
Created on 2023-03-11 with reprex v2.0.2
As a user, the take-home message for me is to break the epiparam class object with dplyr::as_tibble() to explore the data as freely as I need. After identifying my specific set of filters, I can then apply them directly to the epiparam() output to keep the connection with epidist().
Also related, I just encountered the Editorial decisions of the Epi R Handbook.
Subject | Considered | Outcome | Brief rationale
-- | -- | -- | --
General coding approach | tidyverse, data.table, base | tidyverse, with a page on data.table, and mentions of base alternatives for readers with no internet | tidyverse readability, universality, most-taught

We can discuss whether these decisions can also apply to package documentation, and whether to record them in the blueprints, also in table format as a summary.
For this issue specifically, we could generate tidyverse-friendly intermediate outputs, or show them in the documentation as visible alternatives for tidyverse users. Relatedly, I also proposed this for {finalsize} https://github.com/epiverse-trace/finalsize/issues/138 and it could be applied across packages.
Thanks both for input on this topic. I agree with both of your points.
@jamesmbaazam the reason your example causes an error is the combination of the <epiparam> class and dplyr::distinct(). Currently the <epiparam> class is set up to error when class invariants (i.e. certain columns) are removed. You can see this implementation in the [.epiparam function. Also, dplyr::distinct() in your example returns only a single column because .keep_all is FALSE by default. This also explains why @avallecam did not have this problem when applying dplyr::filter() to <epiparam> objects.
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
df <- data.frame(a = 1:5, b = 11:15, c = 21:25)
df
#> a b c
#> 1 1 11 21
#> 2 2 12 22
#> 3 3 13 23
#> 4 4 14 24
#> 5 5 15 25
df %>% dplyr::distinct(a, .keep_all = FALSE)
#> a
#> 1 1
#> 2 2
#> 3 3
#> 4 4
#> 5 5
df %>% dplyr::distinct(a, .keep_all = TRUE)
#> a b c
#> 1 1 11 21
#> 2 2 12 22
#> 3 3 13 23
#> 4 4 14 24
#> 5 5 15 25
Created on 2023-03-21 with reprex v2.0.2
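To illustrate the class-invariant check described above, here is a minimal base R sketch of a subsetting method that validates its result and errors when a required column is missing. The class name `strict_param`, the `required_cols` vector and the validator are hypothetical stand-ins, not the actual epiparameter code; the real `[.epiparam` may differ.

```r
# Hypothetical toy class: names here are illustrative, not from epiparameter
required_cols <- c("disease", "epi_distribution")

validate_strict_param <- function(x) {
  # Error if any required column is missing after subsetting
  if (!all(required_cols %in% names(x))) {
    stop("strict_param object does not contain the correct columns",
         call. = FALSE)
  }
  x
}

# Subsetting method: delegate to the data.frame method, then validate
`[.strict_param` <- function(x, ...) {
  out <- NextMethod()
  validate_strict_param(out)
}

toy <- data.frame(
  disease = c("Influenza", "Influenza", "Measles"),
  epi_distribution = c("generation_time", "incubation_period",
                       "incubation_period"),
  stringsAsFactors = FALSE
)
class(toy) <- c("strict_param", "data.frame")

# Row subsetting keeps every column, so validation passes
rows <- toy[toy$disease == "Influenza", ]

# Column subsetting drops a required column, so validation fails;
# the error message is captured instead of stopping the script
cols_error <- tryCatch(
  toy["disease"],
  error = function(e) conditionMessage(e)
)
```

Row operations pass because they never change the set of columns, which is consistent with dplyr::filter() working while single-column results fail.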
However, I do agree that this feature may unnecessarily impede users from inspecting the data and using a range of tidyverse operations.
@avallecam I like the idea of converting to a tibble, but ideally we wouldn't take on the dependency just to operate on <epiparam> objects.
Therefore, I propose another solution. In the subsetting method, instead of failing when class invariants are removed, drop the epiparam class and return a data.frame with a message to the user. This will have consequences for the conversion to <epidist>, but the conversion functions can have strict input checking to make sure there are no issues.
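The proposal above could look roughly like the following base R sketch. It uses a hypothetical toy class, `lenient_param`, with an assumed `required_cols` vector; it is not the code from the eventual PR, only an illustration of downgrading to a plain data.frame with a message instead of erroring.

```r
# Hypothetical toy class: names here are illustrative, not from epiparameter
required_cols <- c("disease", "epi_distribution")

# Subsetting method: if a required column was removed, drop the custom
# class and return a plain data.frame with a message instead of an error
`[.lenient_param` <- function(x, ...) {
  out <- NextMethod()
  if (is.data.frame(out) && !all(required_cols %in% names(out))) {
    message("Removing crucial column in `<lenient_param>` returning `<data.frame>`")
    class(out) <- "data.frame"
  }
  out
}

toy <- data.frame(
  disease = c("Influenza", "Influenza", "Measles"),
  epi_distribution = c("generation_time", "incubation_period",
                       "incubation_period"),
  stringsAsFactors = FALSE
)
class(toy) <- c("lenient_param", "data.frame")

# Dropping a required column downgrades the result to a data.frame
# (the message is suppressed here to keep the output quiet)
sub_cols <- suppressMessages(toy["disease"])

# Row subsetting keeps all columns, so the custom class is retained
sub_rows <- toy[toy$disease == "Influenza", ]
```

With this design, any downstream conversion function would need to check `inherits(x, "lenient_param")` itself, since a downgraded object no longer carries the class.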
Sounds good.
This issue is tackled by PR #125 and will be closed when it is merged.
To clarify how the new changes will impact the above code chunks:
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
a <- epiparameter::epiparam()
a %>% dplyr::distinct(disease)
#> Removing crucial column in `<epiparam>` returning `<data.frame>`
#> Removing crucial column in `<epiparam>` returning `<data.frame>`
#> disease
#> 1 Adenovirus
#> 2 Chikungunya
#> 3 COVID-19
#> 4 Dengue
#> 5 Ebola Virus Disease
#> 6 Hantavirus Pulmonary Syndrome
#> 7 Human Coronavirus
#> 8 Influenza
#> 9 Japanese Encephalitis
#> 10 Marburg Virus Disease
#> 11 Measles
#> 12 MERS
#> 13 Monkeypox
#> 14 Mpox
#> 15 Parainfluenza
#> 16 Pneumonic Plague
#> 17 Rhinovirus
#> 18 Rift Valley Fever
#> 19 RSV
#> 20 SARS
#> 21 Smallpox
#> 22 West Nile Fever
#> 23 Yellow Fever
#> 24 Zika Virus Disease
a %>% dplyr::filter(disease == "Influenza")
#> Epiparam object
#> Number of distributions in library: 17
#> Number of diseases: 1
#> Number of delay distributions: 17
#> Number of offspring distributions: 0
#> Number of studies in library: 10
#> <Head of library>
#> disease epi_distribution prob_distribution
#> 1 Influenza generation_time weibull
#> 2 Influenza incubation_period gamma
#> 3 Influenza incubation_period lnorm
#> 4 Influenza incubation_period lnorm
#> 5 Influenza incubation_period lnorm
#> 6 Influenza incubation_period lnorm
#> <11 more rows & 53 more cols not shown>
a %>% dplyr::select(disease)
#> Removing crucial column in `<epiparam>` returning `<data.frame>`
#> disease
#> 1 Adenovirus
#> 2 Chikungunya
#> 3 COVID-19
#> 4 COVID-19
#> 5 COVID-19
#> 6 COVID-19
#> 7 COVID-19
#> 8 COVID-19
#> 9 COVID-19
#> 10 COVID-19
#> 11 COVID-19
#> 12 COVID-19
#> 13 COVID-19
#> 14 COVID-19
#> 15 COVID-19
#> 16 COVID-19
#> 17 COVID-19
#> 18 COVID-19
#> 19 COVID-19
#> 20 COVID-19
#> 21 COVID-19
#> 22 COVID-19
#> 23 COVID-19
#> 24 COVID-19
#> 25 COVID-19
#> 26 Dengue
#> 27 Dengue
#> 28 Dengue
#> 29 Dengue
#> 30 Dengue
#> 31 Ebola Virus Disease
#> 32 Ebola Virus Disease
#> 33 Ebola Virus Disease
#> 34 Ebola Virus Disease
#> 35 Ebola Virus Disease
#> 36 Ebola Virus Disease
#> 37 Ebola Virus Disease
#> 38 Ebola Virus Disease
#> 39 Ebola Virus Disease
#> 40 Ebola Virus Disease
#> 41 Ebola Virus Disease
#> 42 Ebola Virus Disease
#> 43 Ebola Virus Disease
#> 44 Ebola Virus Disease
#> 45 Ebola Virus Disease
#> 46 Ebola Virus Disease
#> 47 Ebola Virus Disease
#> 48 Hantavirus Pulmonary Syndrome
#> 49 Human Coronavirus
#> 50 Influenza
#> 51 Influenza
#> 52 Influenza
#> 53 Influenza
#> 54 Influenza
#> 55 Influenza
#> 56 Influenza
#> 57 Influenza
#> 58 Influenza
#> 59 Influenza
#> 60 Influenza
#> 61 Influenza
#> 62 Influenza
#> 63 Influenza
#> 64 Influenza
#> 65 Influenza
#> 66 Influenza
#> 67 Japanese Encephalitis
#> 68 Marburg Virus Disease
#> 69 Marburg Virus Disease
#> 70 Marburg Virus Disease
#> 71 Marburg Virus Disease
#> 72 Marburg Virus Disease
#> 73 Measles
#> 74 MERS
#> 75 MERS
#> 76 MERS
#> 77 MERS
#> 78 MERS
#> 79 MERS
#> 80 MERS
#> 81 MERS
#> 82 Monkeypox
#> 83 Mpox
#> 84 Mpox
#> 85 Mpox
#> 86 Mpox
#> 87 Parainfluenza
#> 88 Pneumonic Plague
#> 89 Rhinovirus
#> 90 Rift Valley Fever
#> 91 RSV
#> 92 RSV
#> 93 RSV
#> 94 SARS
#> 95 SARS
#> 96 SARS
#> 97 Smallpox
#> 98 Smallpox
#> 99 Smallpox
#> 100 Smallpox
#> 101 West Nile Fever
#> 102 West Nile Fever
#> 103 West Nile Fever
#> 104 Yellow Fever
#> 105 Yellow Fever
#> 106 Zika Virus Disease
Created on 2023-04-03 with reprex v2.0.2
dplyr::distinct() prints the message twice because it calls dplyr::dplyr_col_select() twice: https://github.com/tidyverse/dplyr/blob/main/R/distinct.R#L138-L142
Changes were merged in #125. Closing.
I imagine that users would sometimes want to manipulate the columns of epiparam objects, so consider allowing the epiparam class to be downgraded.
Additional context
Here is an example where I was trying to find the unique diseases in the database, but got an error.
Created on 2023-03-01 with reprex v2.0.2