generateConceptCohortSet behavior with limit = "first"

raivokolde commented 2 months ago

In Atlas, when specifying we want a first event as index date and required observation period at least 1 year before, we get all the people that had the index event at least 1 year after the start of observation period AND did not have the same event happening in the 1st year of observation. This is incredibly useful, as in most analyses we do not want to mix incident and prevalent cases.

To my surprise, generateConceptCohortSet did not behave this way when setting limit = "first". Instead, it just starts the cohort on the first suitable event after the required observation period WITHOUT checking whether the event had already happened before. IMHO this is the wrong behaviour. If you are coming from Atlas cohorts you do not expect it, and it's pretty difficult to detect as well.

I would appreciate it if the default behaviour were the same as in Atlas, or at least if there were an option to get similar behaviour.

ablack3 commented 1 month ago

The intention is indeed to match Atlas behavior. I want to do some testing though because I thought that limit="first" in Atlas matches the first event after the required observation time which could in some cases not be the first event in a person's history. There is an attribute that can be added in Atlas to require the event is the first in the person's history.

If you have a test example please post it.

ablack3 commented 1 month ago

> In Atlas, when specifying we want a first event as index date and required observation period at least 1 year before, we get all the people that had the index event at least 1 year after the start of observation period AND did not have the same event happening in the 1st year of observation.

In the example below I found a person with an event that occurs on 1965-06-23 and again on 1966-10-03. Their observation time starts on 1963-12-31.

Counting from the start of the observation period, the event first happens on day 540 and then again on day 1007. In Atlas, if I have a cohort that requires 730 days of prior observation and uses limit = "first", I will capture the event starting on day 1007 even though there is a prior event. So I don't think Atlas guarantees that the event did not happen during the required observation time unless you add that requirement as an inclusion criterion.
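The day counts above can be turned into a small, database-free sketch of the two possible readings of "first". It uses the dates for person 6 from this thread. The variable names are made up for illustration, and the `>=` comparison for the prior-observation check is an assumption about how the boundary is handled, not something verified against Circe.

```r
# Dates for person 6 and concept 372328, taken from this thread.
obs_start  <- as.Date("1963-12-31")
events     <- as.Date(c("1965-06-23", "1966-10-03", "1969-12-20"))
prior_days <- 730

# Days from observation period start to each event (540, 1007, ...).
days_in <- as.integer(events - obs_start)

# Reading 1: drop events without enough prior observation,
# then take the earliest of the remaining events.
qualifying <- events[days_in >= prior_days]
index_first_qualifying <- if (length(qualifying) > 0) min(qualifying) else as.Date(NA)

# Reading 2: take the earliest event in the person's history,
# then keep it only if it has enough prior observation.
first_ever <- min(events)
index_first_in_history <- if (as.integer(first_ever - obs_start) >= prior_days) {
  first_ever
} else {
  as.Date(NA)
}

print(index_first_qualifying)
print(index_first_in_history)
```

Under reading 1 the person enters on the second occurrence; under reading 2 the person does not enter at all, because their first-ever event falls inside the 730-day window.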

Here is a code example.

library(CDMConnector)
library(dplyr)
con <- DBI::dbConnect(duckdb::duckdb(), eunomia_dir())
cdm <- cdm_from_con(con, "main", "main", cdm_name = "test", .soft_validation = TRUE)

cdm$person %>% 
  dplyr::filter(person_id == 6)
#> # Source:   SQL [1 x 18]
#> # Database: DuckDB v1.0.0 [root@Darwin 23.1.0:R 4.3.3//private/var/folders/2j/8z0yfn1j69q8sxjc7vj9yhz40000gp/T/Rtmp4tWcbH/file107bc2706194c.duckdb]
#>   person_id gender_concept_id year_of_birth month_of_birth day_of_birth
#>       <int>             <int>         <int>          <int>        <int>
#> 1         6              8532          1963             12           31
#> # ℹ 13 more variables: birth_datetime <dttm>, race_concept_id <int>,
#> #   ethnicity_concept_id <int>, location_id <int>, provider_id <int>,
#> #   care_site_id <int>, person_source_value <chr>, gender_source_value <chr>,
#> #   gender_source_concept_id <int>, race_source_value <chr>,
#> #   race_source_concept_id <int>, ethnicity_source_value <chr>,
#> #   ethnicity_source_concept_id <int>

# This person's observation time starts on 1963-12-31 
cdm$observation_period %>% 
  dplyr::filter(person_id == 6)
#> # Source:   SQL [1 x 5]
#> # Database: DuckDB v1.0.0 [root@Darwin 23.1.0:R 4.3.3//private/var/folders/2j/8z0yfn1j69q8sxjc7vj9yhz40000gp/T/Rtmp4tWcbH/file107bc2706194c.duckdb]
#>   observation_period_id person_id observation_period_st…¹ observation_period_e…²
#>                   <int>     <int> <date>                  <date>                
#> 1                     6         6 1963-12-31              2007-02-06            
#> # ℹ abbreviated names: ¹observation_period_start_date,
#> #   ²observation_period_end_date
#> # ℹ 1 more variable: period_type_concept_id <int>

cdm$condition_occurrence %>% 
  dplyr::filter(person_id == 6, condition_concept_id == 372328) %>% 
  dplyr::collect() %>% 
  dplyr::arrange(condition_start_date)
#> # A tibble: 3 × 16
#>   condition_occurrence_id person_id condition_concept_id condition_start_date
#>                     <int>     <int>                <int> <date>              
#> 1                     144         6               372328 1965-06-23          
#> 2                     138         6               372328 1966-10-03          
#> 3                     139         6               372328 1969-12-20          
#> # ℹ 12 more variables: condition_start_datetime <dttm>,
#> #   condition_end_date <date>, condition_end_datetime <dttm>,
#> #   condition_type_concept_id <int>, condition_status_concept_id <int>,
#> #   stop_reason <chr>, provider_id <int>, visit_occurrence_id <int>,
#> #   visit_detail_id <int>, condition_source_value <chr>,
#> #   condition_source_concept_id <int>, condition_status_source_value <chr>

# 372328 occurs on 1965-06-23 and again on 1966-10-03
# days from observation period start to each of these occurrences:
difftime(as.Date("1965-06-23"), as.Date("1963-12-31"))
#> Time difference of 540 days
difftime(as.Date("1966-10-03"), as.Date("1963-12-31"))
#> Time difference of 1007 days

# let's require 2 years of prior observation time and use limit = "first"

# https://atlas-demo.ohdsi.org/#/cohortdefinition/1790987

{json <- c('
{
  "ConceptSets": [
    {
      "id": 0,
      "name": "Otitis media",
      "expression": {
        "items": [
          {
            "concept": {
              "CONCEPT_CLASS_ID": "Disorder",
              "CONCEPT_CODE": "65363002",
              "CONCEPT_ID": 372328,
              "CONCEPT_NAME": "Otitis media",
              "DOMAIN_ID": "Condition",
              "INVALID_REASON": "V",
              "INVALID_REASON_CAPTION": "Valid",
              "STANDARD_CONCEPT": "S",
              "STANDARD_CONCEPT_CAPTION": "Standard",
              "VOCABULARY_ID": "SNOMED"
            }
          }
        ]
      }
    }
  ],
  "PrimaryCriteria": {
    "CriteriaList": [
      {
        "ConditionOccurrence": {
          "CodesetId": 0
        }
      }
    ],
    "ObservationWindow": {
      "PriorDays": 730,
      "PostDays": 0
    },
    "PrimaryCriteriaLimit": {
      "Type": "First"
    }
  },
  "QualifiedLimit": {
    "Type": "First"
  },
  "ExpressionLimit": {
    "Type": "First"
  },
  "InclusionRules": [],
  "CensoringCriteria": [],
  "CollapseSettings": {
    "CollapseType": "ERA",
    "EraPad": 0
  },
  "CensorWindow": {},
  "cdmVersionRange": ">=5.0.0"
}         
')
}

# write this to a temp folder and generate it using Circe (Atlas SQL)

cohort_folder <- tempfile()

dir.create(cohort_folder)

readr::write_file(json, file.path(cohort_folder, "cohort.json"))
list.files(cohort_folder)
#> [1] "cohort.json"

cohort_set <- read_cohort_set(cohort_folder)

cdm <- generate_cohort_set(cdm, cohort_set, name = "cohort1")
#> ℹ Generating 1 cohort
#> ℹ Generating cohort (1/1) - cohort
#> ✔ Generating cohort (1/1) - cohort [175ms]
#> 
#> Warning: ! 5 casted column in cohort1 (cohort_attrition) as do not match expected column
#>   type:
#> • `number_records` from numeric to integer
#> • `number_subjects` from numeric to integer
#> • `reason_id` from numeric to integer
#> • `excluded_records` from numeric to integer
#> • `excluded_subjects` from numeric to integer
#> Warning: ! 1 column in cohort1 do not match expected column type:
#> • `subject_id` is numeric but expected integer

# look at the cohort table

cdm$cohort1 %>% 
  dplyr::filter(subject_id == 6)
#> # Source:   SQL [1 x 4]
#> # Database: DuckDB v1.0.0 [root@Darwin 23.1.0:R 4.3.3//private/var/folders/2j/8z0yfn1j69q8sxjc7vj9yhz40000gp/T/Rtmp4tWcbH/file107bc2706194c.duckdb]
#>   cohort_definition_id subject_id cohort_start_date cohort_end_date
#>                  <int>      <dbl> <date>            <date>         
#> 1                    1          6 1966-10-03        2007-02-06

# so as you can see the cohort picks up the second occurrence of 372328 on 1966-10-03

# Now let's compare with generate_concept_cohort_set

# in the code below, limit refers to the "PrimaryCriteriaLimit": {"Type": "First"} in the Circe json
cdm <- generate_concept_cohort_set(
  cdm, 
  name = "cohort2",
  concept_set = list(cohort2 = 372328),
  required_observation = c(730, 0),
  limit = "first")
#> Warning: ! 3 casted column in cohort2 (cohort_attrition) as do not match expected column
#>   type:
#> • `reason_id` from numeric to integer
#> • `excluded_records` from numeric to integer
#> • `excluded_subjects` from numeric to integer
#> Warning: ! 1 casted column in cohort2 (cohort_codelist) as do not match expected column
#>   type:
#> • `concept_id` from numeric to integer

cdm$cohort2 %>% 
  filter(subject_id == 6)
#> # Source:   SQL [1 x 4]
#> # Database: DuckDB v1.0.0 [root@Darwin 23.1.0:R 4.3.3//private/var/folders/2j/8z0yfn1j69q8sxjc7vj9yhz40000gp/T/Rtmp4tWcbH/file107bc2706194c.duckdb]
#>   cohort_definition_id subject_id cohort_start_date cohort_end_date
#>                  <int>      <int> <date>            <date>         
#> 1                    1          6 1966-10-03        2007-02-06

# This gives the same result

cdm_disconnect(cdm)

Created on 2024-10-16 with reprex v2.1.1

Session info

``` r
sessioninfo::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#>  setting  value
#>  version  R version 4.3.3 (2024-02-29)
#>  os       macOS Sonoma 14.1
#>  system   aarch64, darwin20
#>  ui       X11
#>  language (EN)
#>  collate  en_US.UTF-8
#>  ctype    en_US.UTF-8
#>  tz       Europe/Amsterdam
#>  date     2024-10-16
#>  pandoc   3.1.11 @ /Applications/RStudio.app/Contents/Resources/app/quarto/bin/tools/aarch64/ (via rmarkdown)
#>
#> ─ Packages ───────────────────────────────────────────────────────────────────
#>  package      * version date (UTC) lib source
#>  backports      1.5.0   2024-05-23 [1] CRAN (R 4.3.3)
#>  blob           1.2.4   2023-03-17 [1] CRAN (R 4.3.0)
#>  CDMConnector * 1.5.0   2024-07-16 [1] CRAN (R 4.3.3)
#>  checkmate      2.3.2   2024-07-29 [1] CRAN (R 4.3.3)
#>  CirceR         1.3.3   2024-04-18 [1] CRAN (R 4.3.1)
#>  cli            3.6.3   2024-06-21 [1] CRAN (R 4.3.3)
#>  DBI            1.2.3   2024-06-02 [1] CRAN (R 4.3.3)
#>  dbplyr         2.5.0   2024-03-19 [1] CRAN (R 4.3.1)
#>  digest         0.6.37  2024-08-19 [1] CRAN (R 4.3.3)
#>  dplyr        * 1.1.4   2023-11-17 [1] CRAN (R 4.3.1)
#>  duckdb         1.0.0-2 2024-07-19 [1] CRAN (R 4.3.3)
#>  evaluate       1.0.0   2024-09-17 [1] CRAN (R 4.3.3)
#>  fansi          1.0.6   2023-12-08 [1] CRAN (R 4.3.1)
#>  fastmap        1.2.0   2024-05-15 [1] CRAN (R 4.3.3)
#>  fs             1.6.4   2024-04-25 [1] CRAN (R 4.3.1)
#>  generics       0.1.3   2022-07-05 [1] CRAN (R 4.3.0)
#>  glue           1.8.0   2024-09-30 [1] CRAN (R 4.3.3)
#>  hms            1.1.3   2023-03-21 [1] CRAN (R 4.3.0)
#>  htmltools      0.5.8.1 2024-04-04 [1] CRAN (R 4.3.1)
#>  jsonlite       1.8.9   2024-09-20 [1] CRAN (R 4.3.3)
#>  knitr          1.48    2024-07-07 [1] CRAN (R 4.3.3)
#>  lifecycle      1.0.4   2023-11-07 [1] CRAN (R 4.3.1)
#>  magrittr       2.0.3   2022-03-30 [1] CRAN (R 4.3.0)
#>  omopgenerics   0.3.1   2024-09-21 [1] CRAN (R 4.3.3)
#>  pillar         1.9.0   2023-03-22 [1] CRAN (R 4.3.0)
#>  pkgconfig      2.0.3   2019-09-22 [1] CRAN (R 4.3.0)
#>  purrr          1.0.2   2023-08-10 [1] CRAN (R 4.3.0)
#>  R6             2.5.1   2021-08-19 [1] CRAN (R 4.3.0)
#>  readr          2.1.5   2024-01-10 [1] CRAN (R 4.3.1)
#>  reprex         2.1.1   2024-07-06 [1] CRAN (R 4.3.3)
#>  rJava          1.0-11  2024-01-26 [1] CRAN (R 4.3.1)
#>  rlang          1.1.4   2024-06-04 [1] CRAN (R 4.3.3)
#>  rmarkdown      2.28    2024-08-17 [1] CRAN (R 4.3.3)
#>  rstudioapi     0.16.0  2024-03-24 [1] CRAN (R 4.3.1)
#>  sessioninfo    1.2.2   2021-12-06 [1] CRAN (R 4.3.0)
#>  snakecase      0.11.1  2023-08-27 [1] CRAN (R 4.3.0)
#>  SqlRender      1.18.1  2024-08-21 [1] CRAN (R 4.3.3)
#>  stringi        1.8.4   2024-05-06 [1] CRAN (R 4.3.1)
#>  stringr        1.5.1   2023-11-14 [1] CRAN (R 4.3.1)
#>  tibble         3.2.1   2023-03-20 [1] CRAN (R 4.3.0)
#>  tidyr          1.3.1   2024-01-24 [1] CRAN (R 4.3.1)
#>  tidyselect     1.2.1   2024-03-11 [1] CRAN (R 4.3.1)
#>  tzdb           0.4.0   2023-05-12 [1] CRAN (R 4.3.0)
#>  utf8           1.2.4   2023-10-22 [1] CRAN (R 4.3.1)
#>  vctrs          0.6.5   2023-12-01 [1] CRAN (R 4.3.1)
#>  withr          3.0.1   2024-07-31 [1] CRAN (R 4.3.3)
#>  xfun           0.48    2024-10-03 [1] CRAN (R 4.3.3)
#>  yaml           2.3.10  2024-07-26 [1] CRAN (R 4.3.3)
#>
#>  [1] /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/library
#>
#> ──────────────────────────────────────────────────────────────────────────────
```

@raivokolde what do you think of the code example above? Am I missing something?

Here is what the cohort definition looks like in Atlas.

You might be thinking of the attribute that requires the first occurrence in a person's history.
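For reference, here is a rough sketch of how that attribute might appear in the cohort JSON. It assumes Circe serializes it as a boolean field named `"First"` on the criterion itself. That matches Atlas exports as far as I know, but treat the field name as an assumption and check it against a definition exported from your own Atlas instance. The `primary` string is just a cut-down fragment of the primary criterion used in this thread.

```r
# Cut-down fragment of the primary criterion from the cohort definition in this thread.
primary <- '"ConditionOccurrence": {
  "CodesetId": 0
}'

# Assumption: the "first occurrence in the person's history" attribute is stored
# as a boolean field called "First" next to the CodesetId.
primary_first <- sub(
  '"CodesetId": 0',
  '"CodesetId": 0,\n  "First": true',
  primary,
  fixed = TRUE
)

cat(primary_first, "\n")
```

If the field works as assumed, a person whose first-ever occurrence falls inside the required prior-observation window would be expected to drop out of the cohort rather than enter on a later occurrence.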

This attribute is not implemented in generate_concept_cohort_set, although it would probably be useful. Other packages that implement code-based cohorts are https://ohdsi.github.io/CohortConstructor/reference/index.html and https://ohdsi.github.io/Capr/
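Until something like that exists, one possible workaround is to filter the generated cohort after the fact and drop anyone with an earlier record of the same concept. The sketch below shows the idea on small in-memory data frames in base R. The tables are toy stand-ins (subject 6 uses the dates from this thread, subject 7 is hypothetical), and on a real CDM the same logic would more likely be written as a database-side join.

```r
# Toy cohort table, as produced with limit = "first" and 730 days of prior observation.
cohort <- data.frame(
  subject_id = c(6L, 7L),
  cohort_start_date = as.Date(c("1966-10-03", "1970-01-15"))
)

# Toy condition records for the same concept.
events <- data.frame(
  person_id = c(6L, 6L, 6L, 7L),
  condition_start_date = as.Date(
    c("1965-06-23", "1966-10-03", "1969-12-20", "1970-01-15")
  )
)

# For each cohort row, check whether the person has any record of the concept
# strictly before the cohort start date.
has_prior_event <- vapply(seq_len(nrow(cohort)), function(i) {
  any(events$person_id == cohort$subject_id[i] &
        events$condition_start_date < cohort$cohort_start_date[i])
}, logical(1))

# Keep only entries where the index event is the first in the person's history.
incident_only <- cohort[!has_prior_event, ]
print(incident_only)
```

Here subject 6 is removed because of the 1965-06-23 record, while subject 7 is kept. Note that this only sees events recorded in the data, so anything that happened before the observation period started remains invisible.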

ablack3 commented 4 weeks ago

I think the reprex above shows that generateConceptCohortSet works the same way that Atlas does with respect to limit. @raivokolde You can reopen if you disagree. Just double-check your expectation that when you limit initial events to "earliest event" and require 365 days of prior observation, you will never capture index dates where people have the event in the year prior. I think you need to add that logic as an inclusion criterion.

darwin-eu / CDMConnector

generateConceptCohortSet behavior with limit = "first" #27