darwin-eu-dev / omopgenerics

https://darwin-eu-dev.github.io/omopgenerics/
Apache License 2.0

Assertion on 'name' failed: Contains missing values (element 1). #395

Closed mvankessel-EMC closed 1 month ago

mvankessel-EMC commented 2 months ago

I have a cohort table that is derived from a generated cohort table and adjusted with various mutate(), inner_join(), and group_by() calls (it is ungrouped at the end).

cdm$stage3_treatments_adjusted

#> # Source:   table<og_031_1720533176> [5 x 4]
#> # Database: DuckDB v0.10.2 [mvankessel@Windows 10 x64:R 4.4.0/D:\R-Study-Packages\some_study\dev\test.duckdb]
#>   cohort_definition_id subject_id cohort_start_date cohort_end_date
#>                  <int>      <dbl> <date>            <date>         
#> 1                   14         13 2020-06-15        2022-12-30     
#> 2                    7          9 2019-06-11        2024-04-07     
#> 3                   14          9 2019-06-11        2024-04-07     
#> 4                   14         17 2021-06-15        2022-12-30     
#> 5                   14         11 2020-08-18        2023-02-03   

If I try to make an actual cohort table out of it - required by CohortCharacteristics::summariseLargeScaleCharacteristics() - I get the following error:

cdm$stage3_treatments_adjusted <- omopgenerics::newCohortTable(
  table = cdm$stage3_treatments_adjusted
)

#> Error in insertTable.db_cdm(cdm = tableSource(table), name = name, table = cohortSetRef,  : 
#>  Assertion on 'name' failed: Contains missing values (element 1).

If I collect() the table, insert it into the CDM, and make a cohort table out of it, it works. But that seems a rather tacky work-around, as I have to pull the entire cohort table into memory.

my_cohort_table <- cdm$stage3_treatments_adjusted %>% collect()

cdm <- CDMConnector::insertTable(
  cdm = cdm,
  name = "my_cohort_table",
  table = my_cohort_table
)

cdm$my_cohort_table <- omopgenerics::newCohortTable(table = cdm$my_cohort_table)

omopgenerics::settings(cdm$my_cohort_table)
#> # A tibble: 2 × 2
#>  cohort_definition_id cohort_name
#>                  <int> <chr>      
#> 1                    7 cohort_7   
#> 2                   14 cohort_14  

The classes of the table that I want to make a cohort table out of:

class(cdm$stage3_treatments_adjusted)
#> [1] "cdm_table"             "GeneratedCohortSet"    "tbl_duckdb_connection"
#> [4] "tbl_dbi"               "tbl_sql"               "tbl_lazy"             
#> [7] "tbl"  

Am I just missing something?

catalamarti commented 2 months ago

That's weird. Can you reproduce it in your environment? Maybe we can set up a call to investigate where the error is, @mvankessel-EMC.

mvankessel-EMC commented 2 months ago

This is a full reprex. I will send you the files that I'm using in this example.

library(CDMConnector)
#> Warning: package 'CDMConnector' was built under R version 4.4.1
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(omopgenerics)
#> Warning: package 'omopgenerics' was built under R version 4.4.1
#> 
#> Attaching package: 'omopgenerics'
#> The following objects are masked from 'package:CDMConnector':
#> 
#>     cdmName, recordCohortAttrition, uniqueTableName
#> The following object is masked from 'package:stats':
#> 
#>     filter
library(CohortCharacteristics)
#> Warning: package 'CohortCharacteristics' was built under R version 4.4.1
library(PatientProfiles)
#> Warning: package 'PatientProfiles' was built under R version 4.4.1

dbPath <- "./test.duckdb"
cohortPath <- "./cohorts-treatment_patterns/"
tnmDir <- "./TNM_concepts/"

con <- DBI::dbConnect(
  drv = duckdb::duckdb(),
  server = dbPath,
  dbdir = dbPath
)

cdm <- CDMConnector::cdmFromCon(
  con = con,
  cdmSchema = "main",
  writeSchema = "main"
)
#> Note: method with signature 'DBIConnection#Id' chosen for function 'dbExistsTable',
#>  target signature 'duckdb_connection#Id'.
#>  "duckdb_connection#ANY" would also be valid
#> ! cdm name not specified and could not be inferred from the cdm source table

cohortSet <- readCohortSet(path = cohortPath)
cdm <- generateCohortSet(
  cdm = cdm,
  cohortSet = cohortSet,
  name = "dummy_cohort_table"
)
#> ℹ Generating 15 cohorts
#> ✔ Generating cohort (1/15) - atezolizumab [1.4s]
#> ✔ Generating cohort (2/15) - carboplatin [1.3s]
#> ✔ Generating cohort (3/15) - cemiplimab [235ms]
#> ✔ Generating cohort (4/15) - cisplatin [2s]
#> ✔ Generating cohort (5/15) - docetaxel [1.4s]
#> ✔ Generating cohort (6/15) - durvalumab [1s]
#> ✔ Generating cohort (7/15) - gemcitabine [1s]
#> ✔ Generating cohort (8/15) - ipilimumab [1.5s]
#> ✔ Generating cohort (9/15) - nivolumab [1.6s]
#> ✔ Generating cohort (10/15) - paclitaxel [1.6s]
#> ✔ Generating cohort (11/15) - pembrolizumab [1.5s]
#> ✔ Generating cohort (12/15) - pemetrexed [1s]
#> ✔ Generating cohort (13/15) - stage_3b_4_2m [1.2s]
#> ✔ Generating cohort (14/15) - stage_3b_4_2m_prior_lung_cancer_allowed [3.2s]
#> ✔ Generating cohort (15/15) - vinorelbine [700ms]

getEventCohorts <- function(cohortSet) {
  cohortSet %>%
    dplyr::filter(!startsWith(.data$cohort_name, "stage_")) %>%
    dplyr::select("cohort_definition_id", "cohort_name") %>%
    dplyr::rename(cohortId = "cohort_definition_id", cohortName = "cohort_name") %>%
    dplyr::mutate(type = "event")
}

getTargetCohorts <- function(events, cohortSet) {
  cohortSet %>%
    dplyr::filter(!.data$cohort_definition_id %in% events$cohortId) %>%
    dplyr::select("cohort_definition_id", "cohort_name") %>%
    dplyr::rename(cohortId = "cohort_definition_id", cohortName = "cohort_name") %>%
    dplyr::mutate(type = "target")
}

eventCohorts <- cohortSet %>%
  getEventCohorts()

targetCohorts <- cohortSet %>%
  getTargetCohorts(events = eventCohorts)

cohortSet <- dplyr::bind_rows(
  eventCohorts,
  targetCohorts
) %>%
  as.data.frame()

names(cohortSet) <- tolower(names(cohortSet))

cdm <- CDMConnector::insertTable(
  cdm = cdm,
  name = "cohort_set",
  table = cohortSet
)

tnmConceptTable <- lapply(list.files(tnmDir, full.names = TRUE), function(file) {
  tbl <- read.csv(file)
  tbl <- tbl[, c("Id", "Code")]
  tbl$tnm_type <- strsplit(basename(file), "\\.")[[1]][1]
  return(tbl)
}) |>
  dplyr::bind_rows() |>
  dplyr::rename(concept_id = "Id", code = "Code")

cdm <- CDMConnector::insertTable(
  cdm = cdm,
  name = "tnm_concept_table",
  table = tnmConceptTable
)

cdm$dummy_cohort_table <- cdm$dummy_cohort_table %>%
  dplyr::inner_join(cdm$cohort_set, dplyr::join_by(cohort_definition_id == cohortid)) %>%
  dplyr::compute()

cdm$nsclc_cohort_table <- cdm$dummy_cohort_table %>%
  dplyr::filter(.data$type == "target") %>%
  dplyr::compute()

cdm$treatment_cohort_table <- cdm$dummy_cohort_table %>%
  dplyr::filter(.data$type == "event") %>%
  dplyr::compute()

updateTreatmentDates <- function(
    cdm,
    cohortId,
    treatmentCohortTableName,
    TNMs = c("TNM-M0", "TNM-M1", "TNM-N2", "TNM-N3", "TNM-T3_t4")) {
  cdm[[treatmentCohortTableName]] %>%
    dplyr::filter(.data$cohort_definition_id == cohortId) %>%
    dplyr::inner_join(cdm$treatment_cohort_table, dplyr::join_by(subject_id == subject_id)) %>%
    dplyr::select("cohort_definition_id.y", "subject_id", "cohort_start_date.y", "cohort_end_date.y") %>%
    dplyr::rename(
      cohort_definition_id = "cohort_definition_id.y",
      cohort_start_date = "cohort_start_date.y",
      cohort_end_date = "cohort_end_date.y"
    ) %>%
    dplyr::inner_join(cdm$measurement, dplyr::join_by(subject_id == person_id)) %>%
    dplyr::inner_join(cdm$tnm_concept_table, dplyr::join_by(measurement_concept_id == concept_id)) %>%
    dplyr::filter(.data$tnm_type %in% TNMs) %>%
    dplyr::mutate(date_diff = !!CDMConnector::datediff(end = "measurement_date", "cohort_start_date")) %>%
    dplyr::group_by(.data$cohort_definition_id, .data$subject_id) %>%
    dplyr::filter(
      .data$date_diff == min(.data$date_diff, na.rm = TRUE),
      row_number() == 1
    ) %>%
    dplyr::mutate(new_cohort_start_date = dplyr::case_when(
      .data$date_diff <= 0 ~ as.Date(.data$measurement_date)
    )) %>%
    dplyr::select("cohort_definition_id", "subject_id", "new_cohort_start_date", "cohort_end_date") %>%
    dplyr::rename(cohort_start_date = "new_cohort_start_date") %>%
    dplyr::ungroup()
}

cdm$stage3_treatments_adjusted <- cdm %>%
  updateTreatmentDates(
    cohortId = 19,
    treatmentCohortTableName = "nsclc_cohort_table",
    TNMs = c("TNM-M0", "TNM-N2", "TNM-N3", "TNM-T3_t4")
  ) %>%
  dplyr::compute()

tryCatch({
  CohortCharacteristics::summariseLargeScaleCharacteristics(
    cohort = cdm$stage3_treatments_adjusted,
    eventInWindow = c("drug_exposure")
  )
}, error = function(e) {
  print(e)
})
#> ℹ Summarising large scale characteristics 
#>  - getting characteristics from table drug_exposure (1 of 1)                                                             
#> <simpleError in UseMethod("settings"): no applicable method for 'settings' applied to an object of class "c('cdm_table', 'GeneratedCohortSet', 'tbl_duckdb_connection', 'tbl_dbi', 'tbl_sql', 'tbl_lazy', 'tbl')">

tryCatch({
  cdm$stage3_treatments_adjusted <- omopgenerics::newCohortTable(
    table = cdm$stage3_treatments_adjusted
  )
}, error = function(e) {
  print(e)
})
#> <simpleError in insertTable.db_cdm(cdm = tableSource(table), name = name, table = cohortSetRef,     overwrite = TRUE): Assertion on 'name' failed: Contains missing values (element 1).>

class(cdm$stage3_treatments_adjusted)
#> [1] "cdm_table"             "GeneratedCohortSet"    "tbl_duckdb_connection"
#> [4] "tbl_dbi"               "tbl_sql"               "tbl_lazy"             
#> [7] "tbl"

Created on 2024-07-12 with reprex v2.1.0

Session info

``` r
sessioninfo::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#>  setting  value
#>  version  R version 4.4.0 (2024-04-24 ucrt)
#>  os       Windows 11 x64 (build 22631)
#>  system   x86_64, mingw32
#>  ui       RTerm
#>  language (EN)
#>  collate  Dutch_Netherlands.utf8
#>  ctype    Dutch_Netherlands.utf8
#>  tz       Europe/Amsterdam
#>  date     2024-07-12
#>  pandoc   3.1.11 @ C:/Program Files/RStudio/resources/app/bin/quarto/bin/tools/ (via rmarkdown)
#>
#> ─ Packages ───────────────────────────────────────────────────────────────────
#>  ! package               * version date (UTC) lib source
#>    backports               1.5.0   2024-05-23 [1] CRAN (R 4.4.0)
#>    blob                    1.2.4   2023-03-17 [1] CRAN (R 4.4.0)
#>    CDMConnector          * 1.4.0   2024-05-03 [1] CRAN (R 4.4.1)
#>    checkmate               2.3.1   2023-12-04 [1] CRAN (R 4.4.0)
#>    CirceR                  1.3.3   2024-04-18 [1] CRAN (R 4.4.1)
#>    cli                     3.6.3   2024-06-21 [1] CRAN (R 4.4.1)
#>    CohortCharacteristics * 0.2.1   2024-06-04 [1] CRAN (R 4.4.1)
#>    DBI                     1.2.3   2024-06-02 [1] CRAN (R 4.4.1)
#>    dbplyr                  2.5.0   2024-03-19 [1] CRAN (R 4.4.0)
#>    digest                  0.6.36  2024-06-23 [1] CRAN (R 4.4.1)
#>    dplyr                 * 1.1.4   2023-11-17 [1] CRAN (R 4.4.0)
#>    duckdb                  1.0.0-1 2024-07-10 [1] CRAN (R 4.4.1)
#>    evaluate                0.24.0  2024-06-10 [1] CRAN (R 4.4.0)
#>    fansi                   1.0.6   2023-12-08 [1] CRAN (R 4.4.0)
#>    fastmap                 1.2.0   2024-05-15 [1] CRAN (R 4.4.0)
#>    fs                      1.6.4   2024-04-25 [1] CRAN (R 4.4.0)
#>    generics                0.1.3   2022-07-05 [1] CRAN (R 4.4.0)
#>    glue                    1.7.0   2024-01-09 [1] CRAN (R 4.4.0)
#>    hms                     1.1.3   2023-03-21 [1] CRAN (R 4.4.0)
#>    htmltools               0.5.8.1 2024-04-04 [1] CRAN (R 4.4.0)
#>    jsonlite                1.8.8   2023-12-04 [1] CRAN (R 4.4.0)
#>    knitr                   1.47    2024-05-29 [1] CRAN (R 4.4.0)
#>    lifecycle               1.0.4   2023-11-07 [1] CRAN (R 4.4.0)
#>    magrittr                2.0.3   2022-03-30 [1] CRAN (R 4.4.0)
#>    omopgenerics          * 0.2.2   2024-06-19 [1] CRAN (R 4.4.1)
#>    PatientProfiles       * 1.1.0   2024-06-11 [1] CRAN (R 4.4.1)
#>    pillar                  1.9.0   2023-03-22 [1] CRAN (R 4.4.0)
#>    pkgconfig               2.0.3   2019-09-22 [1] CRAN (R 4.4.0)
#>    purrr                   1.0.2   2023-08-10 [1] CRAN (R 4.4.0)
#>    R.cache                 0.16.0  2022-07-21 [3] CRAN (R 4.4.0)
#>    R.methodsS3             1.8.2   2022-06-13 [3] CRAN (R 4.4.0)
#>    R.oo                    1.26.0  2024-01-24 [3] CRAN (R 4.4.0)
#>    R.utils                 2.12.3  2023-11-18 [3] CRAN (R 4.4.0)
#>    R6                      2.5.1   2021-08-19 [1] CRAN (R 4.4.0)
#>    readr                   2.1.5   2024-01-10 [1] CRAN (R 4.4.0)
#>    reprex                  2.1.0   2024-01-11 [1] CRAN (R 4.4.1)
#>  D rJava                   1.0-11  2024-01-26 [1] CRAN (R 4.4.0)
#>    rlang                   1.1.4   2024-06-04 [1] CRAN (R 4.4.0)
#>    rmarkdown               2.27    2024-05-17 [1] CRAN (R 4.4.0)
#>    rstudioapi              0.16.0  2024-03-24 [1] CRAN (R 4.4.0)
#>    sessioninfo             1.2.2   2021-12-06 [1] CRAN (R 4.4.0)
#>    snakecase               0.11.1  2023-08-27 [1] CRAN (R 4.4.0)
#>    SqlRender               1.18.0  2024-05-30 [1] CRAN (R 4.4.1)
#>    stringi                 1.8.4   2024-05-06 [1] CRAN (R 4.4.0)
#>    stringr                 1.5.1   2023-11-14 [1] CRAN (R 4.4.0)
#>    styler                  1.10.3  2024-04-07 [3] CRAN (R 4.4.0)
#>    tibble                  3.2.1   2023-03-20 [1] CRAN (R 4.4.0)
#>    tidyr                   1.3.1   2024-01-24 [1] CRAN (R 4.4.0)
#>    tidyselect              1.2.1   2024-03-11 [1] CRAN (R 4.4.0)
#>    tzdb                    0.4.0   2023-05-12 [1] CRAN (R 4.4.0)
#>    utf8                    1.2.4   2023-10-22 [1] CRAN (R 4.4.0)
#>    vctrs                   0.6.5   2023-12-01 [1] CRAN (R 4.4.0)
#>    withr                   3.0.0   2024-01-16 [1] CRAN (R 4.4.0)
#>    xfun                    0.44    2024-05-15 [1] CRAN (R 4.4.0)
#>    yaml                    2.3.8   2023-12-11 [1] CRAN (R 4.4.0)
#>
#>  [1] C:/Users/mvankessel/AppData/Local/R/cache/R/renv/library/P2C3003Characterisation-9d338b7a/windows/R-4.4/x86_64-w64-mingw32
#>  [2] C:/Users/mvankessel/AppData/Local/R/cache/R/renv/sandbox/windows/R-4.4/x86_64-w64-mingw32/88979f7b
#>  [3] C:/R/R-4.4.0/library
#>
#>  D ── DLL MD5 mismatch, broken installation.
#>
#> ──────────────────────────────────────────────────────────────────────────────
```

ablack3 commented 2 months ago

I think this bit of code in omopgenerics needs a second look:

populateCohortSet <- function(table, cohortSetRef) {
  if (is.null(cohortSetRef)) {
    cohortSetRef <- defaultCohortSet(table)
  } else {
    cohortSetRef <- cohortSetRef |> dplyr::collect()
  }
  cohortName <- tableName(table)
  assertClass(cohortSetRef, "data.frame", null = TRUE)
  cohortSetRef <- dplyr::as_tibble(cohortSetRef)
  name <- ifelse(is.na(cohortName), cohortName, paste0(cohortName, "_set"))
  cohortSetRef <- insertTable(
    cdm = tableSource(table), name = name, table = cohortSetRef,
    overwrite = TRUE
  )
  return(cohortSetRef)
}

If cohortName is NA, it is still passed to insertTable(): the ifelse() returns cohortName unchanged in that case, so name ends up NA.
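To make the point concrete, here is a minimal base R sketch that evaluates the naming expression from populateCohortSet() in isolation. The wrapper name cohortSetName is hypothetical and exists only for this illustration; the expression inside it is copied from the snippet above.

```r
# Hypothetical wrapper around the naming expression used in populateCohortSet().
cohortSetName <- function(cohortName) {
  ifelse(is.na(cohortName), cohortName, paste0(cohortName, "_set"))
}

# A named table gets the "_set" suffix.
named <- cohortSetName("my_cohort_table")

# A missing name is returned unchanged, so NA would reach insertTable().
unnamed <- cohortSetName(NA_character_)

print(named)
print(is.na(unnamed))
```

The second call is the case that would trigger the assertion on 'name'.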

What is cohortName when someone calls compute() with temporary = TRUE on a cdm table? NA_character_.

Maybe we give an error if the table name is NA (indicating a temp table).
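A rough sketch of what such a check could look like, in base R only. The function name assertTableHasName and the wording of the message are assumptions made for illustration; this is not existing omopgenerics code.

```r
# Hypothetical guard, not part of omopgenerics: fail early with a readable
# message when the table name is missing (for example for a temporary table).
assertTableHasName <- function(tableName) {
  if (length(tableName) != 1 || is.na(tableName)) {
    stop(
      "The table has no name (tbl_name is NA), which suggests a temporary table. ",
      "Write it to a named table before creating a cohort table.",
      call. = FALSE
    )
  }
  invisible(tableName)
}

# A missing name now produces a descriptive error instead of the assertion on 'name'.
msg <- tryCatch(
  assertTableHasName(NA_character_),
  error = function(e) conditionMessage(e)
)
print(msg)

# A valid name passes through unchanged.
ok <- assertTableHasName("my_cohort_table")
```

Calling something like this before the insertTable() step would surface the actual cause to the user, under the assumption that an NA name always means a temporary table.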

library(CDMConnector)
con <- DBI::dbConnect(duckdb::duckdb(), eunomia_dir())
cdm <- cdm_from_con(con, "main", "main")

cs <- read_cohort_set(system.file("cohorts2", package = "CDMConnector"))

cdm <- generate_cohort_set(cdm, cs)

tbl <- cdm$cohort %>% 
  dplyr::filter(subject_id %in% c(951L, 2164L)) %>% 
  compute(temporary = T) 

attr(tbl, "tbl_name")
#> [1] NA

class(attr(tbl, "tbl_name"))
#> [1] "character"

is.na(attr(tbl, "tbl_name"))
#> [1] TRUE

tbl %>% 
  record_cohort_attrition("reason")
#> Error in insertTable.db_cdm(cdm = tableSource(table), name = name, table = cohortSetRef, : Assertion on 'name' failed: Contains missing values (element 1).

cdmDisconnect(cdm)

Created on 2024-07-16 with reprex v2.1.0

The error occurs because the tbl_name attribute is NA.

mvankessel-EMC commented 2 months ago

A workaround for this would be:

# Strip "GeneratedCohortSet" from class attribute
class(cdm$my_cohort_table) <- c("cdm_table", "tbl_duckdb_connection", "tbl_dbi", "tbl_sql", "tbl_lazy", "tbl")

# Make new cohort table
cdm$my_cohort_table <- omopgenerics::newCohortTable(table = cdm$my_cohort_table)

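The class vector above is hard-coded for a DuckDB connection. A slightly more general variant of the same idea, sketched here on a mock object, removes only the one class and keeps whatever else is present. The helper dropClass and the mock object are hypothetical and only mirror the workaround posted above; whether this is appropriate for other backends is an assumption.

```r
# Mock object standing in for the lazy table; only its class vector matters here.
mockTable <- structure(
  list(),
  class = c(
    "cdm_table", "GeneratedCohortSet", "tbl_duckdb_connection",
    "tbl_dbi", "tbl_sql", "tbl_lazy", "tbl"
  )
)

# Hypothetical helper: drop one class and keep the others in their original order.
dropClass <- function(x, what) {
  class(x) <- setdiff(class(x), what)
  x
}

stripped <- dropClass(mockTable, "GeneratedCohortSet")
print(class(stripped))
```

On a real table the call would be applied to cdm$my_cohort_table before passing it to omopgenerics::newCohortTable(), as in the workaround.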
catalamarti commented 2 months ago

https://github.com/darwin-eu-dev/omopgenerics/issues/413