epiverse-trace / datatagr

Tagging, validating, and safeguarding data to help harden data pipelines.
https://epiverse-trace.github.io/datatagr/

Why is there no `datatagr::tags_df()` function? #47

Open avallecam opened 6 days ago

avallecam commented 6 days ago

Is your feature request related to a problem? Please describe.

linelist::tags_df() is not comparable to datatagr::labels_df()

Can we have a function in datatagr that keeps the power of tagging columns, so that we get a validated set of them for secure downstream analysis? Can datatagr::make_datatagr() create a tagged data frame? If this has been discussed elsewhere, I am happy to read it.

In the reprex below I compare package features.

library(datatagr)
library(linelist)
library(labelled)
library(dplyr)

# linelist ----------------------------------------------------------------

dataset <- outbreaks::mers_korea_2015$linelist

dataset %>% 
  dplyr::as_tibble() %>% 
  linelist::make_linelist(
    location = "place_infect",
    date_onset = "dt_onset"
  ) %>% 
  linelist::validate_linelist() %>% 
  linelist::tags_df()
#> # A tibble: 162 × 2
#>    date_onset location           
#>    <date>     <fct>              
#>  1 2015-05-11 Middle East        
#>  2 2015-05-18 Outside Middle East
#>  3 2015-05-20 Outside Middle East
#>  4 2015-05-25 Outside Middle East
#>  5 2015-05-25 Outside Middle East
#>  6 2015-05-24 Outside Middle East
#>  7 2015-05-21 Outside Middle East
#>  8 2015-05-26 Outside Middle East
#>  9 NA         Outside Middle East
#> 10 2015-05-21 Outside Middle East
#> # ℹ 152 more rows

# datatagr ----------------------------------------------------------------

datatagr_out <- cars %>% 
  dplyr::as_tibble() %>% 
  # Create a datatagr object
  datatagr::make_datatagr(
    speed = 'Miles per hour'
  ) %>% 
  # Validate the data are of a specific type
  datatagr::validate_datatagr(
    speed = 'numeric'
  ) %>% 
  # extract dataframe of labelled variables
  datatagr::labels_df()

datatagr_out
#> # A tibble: 50 × 2
#>    `Miles per hour`  dist
#>               <dbl> <dbl>
#>  1                4     2
#>  2                4    10
#>  3                7     4
#>  4                7    22
#>  5                8    16
#>  6                9    10
#>  7               10    18
#>  8               10    26
#>  9               10    34
#> 10               11    17
#> # ℹ 40 more rows

# The action below may not be expected to be done in an analysis pipeline

datatagr_out %>% 
  # standardize column names of a data frame
  cleanepi::standardize_column_names()
#> # A tibble: 50 × 2
#>    miles_per_hour  dist
#>             <dbl> <dbl>
#>  1              4     2
#>  2              4    10
#>  3              7     4
#>  4              7    22
#>  5              8    16
#>  6              9    10
#>  7             10    18
#>  8             10    26
#>  9             10    34
#> 10             11    17
#> # ℹ 40 more rows

# labelled ----------------------------------------------------------------

var_label(cars) <- list(
  speed = 'Miles per hour'
)

cars %>% 
  labelled::var_label()
#> $speed
#> [1] "Miles per hour"
#> 
#> $dist
#> NULL

Created on 2024-10-08 with reprex v2.1.1

chartgerink commented 6 days ago

In direct response to the issue title: there is no tags_df() because the "tags" naming has been dropped throughout the package (pending the rename of the package).

What remains is labels_df(), and it is good to hear feedback on how it does or does not work for you 😊 We will not reintroduce tags_df(), as the name no longer fits, but I am happy to consider your second suggested change ("get only the labelled columns"). It may make sense for labels_df() to return only the labelled and validated columns. To make that comparison, could you add a direct comparison between linelist and datatagr on the same data?
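For discussion, here is a minimal base R sketch of the "get only the labelled columns" idea. It assumes labels live in each column's "label" attribute (the convention used by labelled::var_label()); whether datatagr stores them the same way is an assumption here. The helper name labelled_only_df() is hypothetical and not part of datatagr.

```r
# Hypothetical helper (not part of datatagr): keep only the columns that
# carry a "label" attribute, as set by labelled::var_label() and similar tools.
labelled_only_df <- function(x) {
  has_label <- vapply(
    x,
    function(col) !is.null(attr(col, "label", exact = TRUE)),
    logical(1)
  )
  # List-style column selection keeps the column attributes intact
  x[has_label]
}

# Label one column of the built-in cars data by hand
dat <- cars
attr(dat$speed, "label") <- "Miles per hour"

out <- labelled_only_df(dat)
names(out)
```

Here only speed is returned, with its label still attached, and the unlabelled dist column is dropped.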

I am not sure about your third proposed change ("get standardised column names"). Wrangling variable names into a prettier format is outside the package scope. In your example, renaming speed to miles_per_hour does not necessarily make the output of labels_df() more usable if we also retain the labels. It may make sense to drop the label attribute in labels_df() and put the label information in the variable name (formatted as snake_case), but not to do both. Would you be okay with dropping the labels, and with them the interoperability with labelled, in that scenario?
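To make the "label in the name, but not both" option concrete, here is a rough base R sketch. It again assumes labels are stored in the "label" attribute. Both to_snake_case() and labels_to_names() are hypothetical helpers written for illustration, not datatagr or cleanepi functions, and the snake_case conversion is deliberately simplistic.

```r
# Hypothetical helper: lower-case a string and collapse anything that is not
# a letter or digit into single underscores.
to_snake_case <- function(x) {
  x <- tolower(x)
  x <- gsub("[^a-z0-9]+", "_", x)
  gsub("^_+|_+$", "", x)
}

# Hypothetical helper: for every labelled column, move the label into the
# column name and drop the "label" attribute, so the information lives in
# one place only.
labels_to_names <- function(x) {
  for (i in seq_along(x)) {
    label <- attr(x[[i]], "label", exact = TRUE)
    if (!is.null(label)) {
      names(x)[i] <- to_snake_case(label)
      attr(x[[i]], "label") <- NULL
    }
  }
  x
}

dat <- cars
attr(dat$speed, "label") <- "Miles per hour"

renamed <- labels_to_names(dat)
names(renamed)
```

After this step the labels are gone, so labelled::var_label() would have nothing to report on the result, which is the trade-off raised in the question above.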