epiverse-trace / epiparameter

R package with a library of epidemiological parameters for infectious diseases, plus functions and classes for working with those parameters
https://epiverse-trace.github.io/epiparameter

Make it possible to operate on columns: `epiparam` -> `data.frame` downgrading #100

Closed jamesmbaazam closed 1 year ago

jamesmbaazam commented 1 year ago

I imagine that users will sometimes want to operate on the columns of epiparam objects, so consider allowing the epiparam class to be downgraded to a data.frame.

Additional context

Here is an example where I was trying to find the unique diseases in the database, but got an error.

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
a <- epiparameter::epiparam()
a %>% dplyr::distinct(disease)
#> Error in validate_epiparam(NextMethod()): epiparam object does not contain the correct columns

Created on 2023-03-01 with reprex v2.0.2

avallecam commented 1 year ago

Since the epiparam object is of class data.frame in the background, we can use dplyr::as_tibble() before using other dplyr functions:

library(epiparameter)
library(tidyverse)
a <- epiparameter::epiparam()
class(a)
#> [1] "epiparam"   "data.frame"
a %>% 
  # warning: this step breaks the epiparam class object
  as_tibble() %>%  
  dplyr::distinct(disease)
#> # A tibble: 24 × 1
#>    disease                      
#>    <chr>                        
#>  1 Adenovirus                   
#>  2 Chikungunya                  
#>  3 COVID-19                     
#>  4 Dengue                       
#>  5 Ebola Virus Disease          
#>  6 Hantavirus Pulmonary Syndrome
#>  7 Human Coronavirus            
#>  8 Influenza                    
#>  9 Japanese Encephalitis        
#> 10 Marburg Virus Disease        
#> # … with 14 more rows

Created on 2023-03-11 with reprex v2.0.2

avallecam commented 1 year ago

That said, in #111 I used dplyr::filter() directly on the epiparam() output and it worked nicely.

dplyr::select(), on the other hand, fails with an error saying that all the columns of the original epiparam object must be kept. dplyr::filter() works because it only removes rows and leaves every column in place. I think this restriction keeps the connection with epidist() safe.

library(epiparameter)
library(tidyverse)

eparams <- epiparam()

eparams %>% 
  filter(disease=="Influenza")
#> Epiparam object
#> Number of distributions in library: 17
#> Number of diseases: 1
#> Number of delay distributions: 17
#> Number of offspring distributions: 0
#> Number of studies in library: 10
#> <Head of library>
#>     disease  epi_distribution prob_distribution
#> 1 Influenza   generation_time           weibull
#> 2 Influenza incubation_period             gamma
#> 3 Influenza incubation_period             lnorm
#> 4 Influenza incubation_period             lnorm
#> 5 Influenza incubation_period             lnorm
#> 6 Influenza incubation_period             lnorm
#> <11 more rows & 53 more cols not shown>

eparams %>% 
  select(disease)
#> Error in validate_epiparam(NextMethod()): epiparam object does not contain the correct columns

Created on 2023-03-11 with reprex v2.0.2

As a user, the take-home message for me is to drop the epiparam class with dplyr::as_tibble() so that I can explore the data as freely as I need. Once I have identified the filters I want, I apply them directly to the epiparam() output so that it can still be passed on to epidist().
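
The same explore-then-filter workflow can be sketched in base R, without tibble. This is only an illustration on a hypothetical `toy_param` object standing in for an epiparam object. It relies on the base `as.data.frame()` method for data frames dropping any classes listed before `"data.frame"`, and it assumes the real class does not override that method, which I have not checked.

```r
# Hypothetical stand-in for an <epiparam>-like object (not the real class)
toy <- structure(
  data.frame(
    disease = c("Influenza", "Influenza", "Dengue"),
    epi_distribution = c(
      "generation_time", "incubation_period", "incubation_period"
    )
  ),
  class = c("toy_param", "data.frame")
)

# 1. Explore freely on a plain data.frame copy
plain <- as.data.frame(toy)
class(plain)
#> [1] "data.frame"
unique(plain$disease)
#> [1] "Influenza" "Dengue"

# 2. Apply the chosen filter to the original object so its class is kept
flu <- toy[toy$disease == "Influenza", ]
class(flu)
#> [1] "toy_param"  "data.frame"
```

Row filtering keeps every column, so the subclass survives in step 2, while the plain copy from step 1 can be reshaped without any class checks getting in the way.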

avallecam commented 1 year ago

Also related: I just came across the editorial decisions of the Epi R Handbook.

| Subject | Considered | Outcome | Brief rationale |
| -- | -- | -- | -- |
| General coding approach | tidyverse, data.table, base | tidyverse, with a page on data.table, and mentions of base alternatives for readers with no internet | tidyverse readability, universality, most-taught |

We can discuss whether these decisions should also apply to package documentation, and whether to record them in the blueprints as a summary table.

For this issue specifically, we could either generate tidyverse-friendly intermediate outputs or show them in the documentation as visible alternatives for tidyverse users. Relatedly, I proposed the same for {finalsize} (https://github.com/epiverse-trace/finalsize/issues/138), and it could be applied across packages.

joshwlambert commented 1 year ago

Thanks both for input on this topic. I agree with both of your points.

@jamesmbaazam the reason your example causes an error is a combination of the <epiparam> class and dplyr::distinct(). Currently the <epiparam> class is set up to error when class invariants (i.e. certain columns) are removed. You can see this in the [.epiparam function. In addition, dplyr::distinct() in your example returns only a single column because .keep_all is FALSE by default. This also explains why @avallecam did not have this problem when applying dplyr::filter() to <epiparam> objects.

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
df <- data.frame(a = 1:5, b = 11:15, c = 21:25)
df
#>   a  b  c
#> 1 1 11 21
#> 2 2 12 22
#> 3 3 13 23
#> 4 4 14 24
#> 5 5 15 25
df %>% dplyr::distinct(a, .keep_all = FALSE)
#>   a
#> 1 1
#> 2 2
#> 3 3
#> 4 4
#> 5 5
df %>% dplyr::distinct(a, .keep_all = TRUE)
#>   a  b  c
#> 1 1 11 21
#> 2 2 12 22
#> 3 3 13 23
#> 4 4 14 24
#> 5 5 15 25

Created on 2023-03-21 with reprex v2.0.2
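
To illustrate the first point (a class that errors when its invariant columns are removed), here is a minimal sketch using a hypothetical `strict_param` class. The names `strict_param`, `validate_strict()` and `required_cols` are made up for this example, and this is not the actual `[.epiparam` code, only the same general pattern of validating the result of `NextMethod()`.

```r
# Columns the hypothetical class insists on keeping
required_cols <- c("disease", "epi_distribution")

# Error if any required column is missing, otherwise return the input
validate_strict <- function(x) {
  if (!all(required_cols %in% names(x))) {
    stop(
      "strict_param object does not contain the correct columns",
      call. = FALSE
    )
  }
  x
}

# Subset with the data.frame method, then validate the result
`[.strict_param` <- function(x, ...) validate_strict(NextMethod())

x <- structure(
  data.frame(
    disease = c("Influenza", "Dengue"),
    epi_distribution = c("generation_time", "incubation_period"),
    mean = c(1, 2)
  ),
  class = c("strict_param", "data.frame")
)

# Row subsetting keeps every column, so validation passes
kept <- x[x$disease == "Influenza", ]

# Column subsetting drops a required column, so validation errors
err <- tryCatch(x["disease"], error = function(e) conditionMessage(e))
```

In this sketch `kept` is still a `strict_param`, while `err` holds the error message, which mirrors filter() working and select() failing above.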

joshwlambert commented 1 year ago

However, I do agree that this feature may unnecessarily impede users from inspecting the data and using a range of tidyverse operations.

@avallecam I like the idea of converting to a tibble, but ideally we wouldn't take on the tibble dependency just to operate on <epiparam> objects.

Therefore, I propose another solution. In the subsetting method, instead of failing when class invariants are removed, drop the epiparam class and return a data.frame with a message to the user. This will have consequences for the conversion to <epidist>, but the conversion functions can use strict input checking to make sure there are no issues.
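
A rough sketch of what this could look like, written against a hypothetical `toy_param` class rather than the real `<epiparam>` code. The names `toy_param`, `new_toy_param()` and `required_cols` are assumptions made for illustration, and the actual implementation may well differ.

```r
# Columns treated as class invariants in this toy example
required_cols <- c("disease", "epi_distribution")

# Constructor for the hypothetical class
new_toy_param <- function(df) {
  stopifnot(is.data.frame(df), all(required_cols %in% names(df)))
  class(df) <- c("toy_param", "data.frame")
  df
}

`[.toy_param` <- function(x, ...) {
  # Let the data.frame method do the actual subsetting
  out <- NextMethod()
  keeps_invariants <- is.data.frame(out) &&
    all(required_cols %in% names(out))
  if (!keeps_invariants) {
    # Downgrade instead of erroring, and tell the user about it
    message("Removing crucial column in `<toy_param>` returning `<data.frame>`")
    if (is.data.frame(out)) class(out) <- "data.frame"
  }
  out
}

x <- new_toy_param(data.frame(
  disease = c("Influenza", "Dengue"),
  epi_distribution = c("generation_time", "incubation_period"),
  mean = c(1, 2)
))

kept <- x[x$disease == "Influenza", ]  # all columns kept: still a toy_param
dropped <- x["disease"]                # invariant lost: plain data.frame
```

A conversion function can then check `inherits(input, "toy_param")` up front and refuse anything that has been downgraded.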

jamesmbaazam commented 1 year ago

Sounds good.

joshwlambert commented 1 year ago

This issue is tackled by PR #125 and will be closed when it is merged.

To clarify how the new changes will impact the above code chunks:

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
a <- epiparameter::epiparam()
a %>% dplyr::distinct(disease)
#> Removing crucial column in `<epiparam>` returning `<data.frame>`
#> Removing crucial column in `<epiparam>` returning `<data.frame>`
#>                          disease
#> 1                     Adenovirus
#> 2                    Chikungunya
#> 3                       COVID-19
#> 4                         Dengue
#> 5            Ebola Virus Disease
#> 6  Hantavirus Pulmonary Syndrome
#> 7              Human Coronavirus
#> 8                      Influenza
#> 9          Japanese Encephalitis
#> 10         Marburg Virus Disease
#> 11                       Measles
#> 12                          MERS
#> 13                     Monkeypox
#> 14                          Mpox
#> 15                 Parainfluenza
#> 16              Pneumonic Plague
#> 17                    Rhinovirus
#> 18             Rift Valley Fever
#> 19                           RSV
#> 20                          SARS
#> 21                      Smallpox
#> 22               West Nile Fever
#> 23                  Yellow Fever
#> 24            Zika Virus Disease
a %>% dplyr::filter(disease == "Influenza")
#> Epiparam object
#> Number of distributions in library: 17
#> Number of diseases: 1
#> Number of delay distributions: 17
#> Number of offspring distributions: 0
#> Number of studies in library: 10
#> <Head of library>
#>     disease  epi_distribution prob_distribution
#> 1 Influenza   generation_time           weibull
#> 2 Influenza incubation_period             gamma
#> 3 Influenza incubation_period             lnorm
#> 4 Influenza incubation_period             lnorm
#> 5 Influenza incubation_period             lnorm
#> 6 Influenza incubation_period             lnorm
#> <11 more rows & 53 more cols not shown>
a %>% dplyr::select(disease)
#> Removing crucial column in `<epiparam>` returning `<data.frame>`
#>                           disease
#> 1                      Adenovirus
#> 2                     Chikungunya
#> 3                        COVID-19
#> 4                        COVID-19
#> 5                        COVID-19
#> 6                        COVID-19
#> 7                        COVID-19
#> 8                        COVID-19
#> 9                        COVID-19
#> 10                       COVID-19
#> 11                       COVID-19
#> 12                       COVID-19
#> 13                       COVID-19
#> 14                       COVID-19
#> 15                       COVID-19
#> 16                       COVID-19
#> 17                       COVID-19
#> 18                       COVID-19
#> 19                       COVID-19
#> 20                       COVID-19
#> 21                       COVID-19
#> 22                       COVID-19
#> 23                       COVID-19
#> 24                       COVID-19
#> 25                       COVID-19
#> 26                         Dengue
#> 27                         Dengue
#> 28                         Dengue
#> 29                         Dengue
#> 30                         Dengue
#> 31            Ebola Virus Disease
#> 32            Ebola Virus Disease
#> 33            Ebola Virus Disease
#> 34            Ebola Virus Disease
#> 35            Ebola Virus Disease
#> 36            Ebola Virus Disease
#> 37            Ebola Virus Disease
#> 38            Ebola Virus Disease
#> 39            Ebola Virus Disease
#> 40            Ebola Virus Disease
#> 41            Ebola Virus Disease
#> 42            Ebola Virus Disease
#> 43            Ebola Virus Disease
#> 44            Ebola Virus Disease
#> 45            Ebola Virus Disease
#> 46            Ebola Virus Disease
#> 47            Ebola Virus Disease
#> 48  Hantavirus Pulmonary Syndrome
#> 49              Human Coronavirus
#> 50                      Influenza
#> 51                      Influenza
#> 52                      Influenza
#> 53                      Influenza
#> 54                      Influenza
#> 55                      Influenza
#> 56                      Influenza
#> 57                      Influenza
#> 58                      Influenza
#> 59                      Influenza
#> 60                      Influenza
#> 61                      Influenza
#> 62                      Influenza
#> 63                      Influenza
#> 64                      Influenza
#> 65                      Influenza
#> 66                      Influenza
#> 67          Japanese Encephalitis
#> 68          Marburg Virus Disease
#> 69          Marburg Virus Disease
#> 70          Marburg Virus Disease
#> 71          Marburg Virus Disease
#> 72          Marburg Virus Disease
#> 73                        Measles
#> 74                           MERS
#> 75                           MERS
#> 76                           MERS
#> 77                           MERS
#> 78                           MERS
#> 79                           MERS
#> 80                           MERS
#> 81                           MERS
#> 82                      Monkeypox
#> 83                           Mpox
#> 84                           Mpox
#> 85                           Mpox
#> 86                           Mpox
#> 87                  Parainfluenza
#> 88               Pneumonic Plague
#> 89                     Rhinovirus
#> 90              Rift Valley Fever
#> 91                            RSV
#> 92                            RSV
#> 93                            RSV
#> 94                           SARS
#> 95                           SARS
#> 96                           SARS
#> 97                       Smallpox
#> 98                       Smallpox
#> 99                       Smallpox
#> 100                      Smallpox
#> 101               West Nile Fever
#> 102               West Nile Fever
#> 103               West Nile Fever
#> 104                  Yellow Fever
#> 105                  Yellow Fever
#> 106            Zika Virus Disease

Created on 2023-04-03 with reprex v2.0.2

joshwlambert commented 1 year ago

dplyr::distinct() prints the message twice because it calls dplyr::dplyr_col_select() twice: https://github.com/tidyverse/dplyr/blob/main/R/distinct.R#L138-L142

joshwlambert commented 1 year ago

Changes were merged in #125. Closing.