OHDSI / CohortGenerator

Cohort Generation for the OMOP Common Data Model
https://ohdsi.github.io/CohortGenerator/
11 stars 10 forks source link

Unable to subset cohort based on 'during target cohort' #175

Open chrisknoll opened 1 month ago

chrisknoll commented 1 month ago

The current cohort subset logic provides a time window logic like:

.priorEpilepsy <- CohortGenerator::createCohortSubset(
  name = "Prior Epilepsy",
  cohortIds = 12403, #use epilepsy cohort
  startWindow = CohortGenerator::createSubsetCohortWindow(
    startDay = -365,
    endDay = 30,
    targetAnchor = "cohortStart"
  ),
  endWindow = CohortGenerator::createSubsetCohortWindow(
    startDay = -9999,
    endDay = 9999,
    targetAnchor = "cohortStart"
  ),
  cohortCombinationOperator = "all",
  negate = F
)

This says the subset cohort starts within 365d of target cohort start date, and the subset cohort ends any time relative to the target's cohort start.

But how would we implment the logic: 'The subset cohort starts between 0 and 9999 days of the target cohort start , and the subset cohort starts between -9999 and 0 days of the target cohort end'?

The issue is that the start window talks about the subset cohort start date, the end window talks about the subset cohort end date, but how do you talk about the subset cohort start date twice (once about the cohort start and once about the cohort end).

The limitation seems to prevent us from doing any rule that says 'subset cohort starts during the target cohort'.

anthonysena commented 1 month ago

'subset cohort starts during the target cohort'.

Using your example from above, wouldn't this work whereby we'd want to find the subset cohort that starts on or after the target cohort start and also ends on after the target cohort start:

.priorEpilepsy <- CohortGenerator::createCohortSubset(
  name = "Prior Epilepsy",
  cohortIds = 12403, #use epilepsy cohort
  startWindow = CohortGenerator::createSubsetCohortWindow(
    startDay = 0,
    endDay = 9999,
    targetAnchor = "cohortStart"
  ),
  endWindow = CohortGenerator::createSubsetCohortWindow(
    startDay = 0,
    endDay = 9999,
    targetAnchor = "cohortStart"
  ),
  cohortCombinationOperator = "all",
  negate = F
)

Let me know if I've misunderstood the use-case here.

chrisknoll commented 1 month ago

No, your example is 'The target cohort starts anytime after the subset cohort starts and the target cohort starts anytime after subset cohort ends. So this would pass:

Subset Cohort:                               |--------------------|
Target Cohort:    |------------|

This is what we want to identify:

Subset Cohort:             |--------------------|
Target Cohort:    |------------|

The statement: The subset starts between 0 and 999 days after Target start and the subset starts between -999 and 0 of the target end.

The key is being able to make 2 statements about the subset cohort start, but the way the API works, your startWindow talks about the subset start date, and the endWindow talks about the subset end date.

Of course this isn't possible because startWindow can only be specified once, but the point is I need to say that the subset cohort's start is after the target's start and before the target's end. (2 statements about subset cohort start date, not one statement about the start and one statement about the end).

In the above example, it's a little confusing to say "Prior Epilepsy" but if we said 'Epilepsy During' it might make more sense: Epilepsy During means that you saw the start of the epilepsy after your target's start date, and you also saw the epilepsy before the target's end.

Note, with the current API, you could do this:

CohortGenerator::createCohortSubset(
  name = "Prior Epilepsy",
  cohortIds = 12403, #use epilepsy cohort
  startWindow = CohortGenerator::createSubsetCohortWindow(
    startDay = 0,
    endDay = 9999,
    targetAnchor = "cohortStart"
  ),
  startWindow= CohortGenerator::createSubsetCohortWindow(
    startDay = -9999,
    endDay = 0,
    targetAnchor = "cohortEnd"
  ),
  cohortCombinationOperator = "all",
  negate = F
)

But that would only match on this:

Subset Cohort:             |----------|
Target Cohort:    |-------------------------------|

Where the subset starts after T, but also ends before T's end...but that's not what we really want here. That's because this would not match (from the first example):

Subset Cohort:             |--------------------|
Target Cohort:    |------------|
anthonysena commented 1 month ago

Thanks @chrisknoll for this detail it helps to illustrate the problem. You are right that we cannot model the 'subset cohort starts during the target cohort' with the current API. We can do one of:

Subset Cohort:             |----------|
Target Cohort:    |-------------------------------|

OR

Subset Cohort:             |--------------------|
Target Cohort:    |------------|

But not both in the same cohort subset definition. Here is a reprex "sandbox" that shows the 2nd case here whereby the subset overlaps the target cohort end:

connectionDetails <- Eunomia::getEunomiaConnectionDetails()
conn <- DatabaseConnector::connect(connectionDetails)
#> Connecting using SQLite driver

sql <- "
  DELETE FROM main.cohort;

  INSERT INTO main.cohort (COHORT_DEFINITION_ID, SUBJECT_ID, COHORT_START_DATE, COHORT_END_DATE) 
  SELECT 1, 1, strftime('%s', '2000-01-01'), strftime('%s', '2001-12-31');

  INSERT INTO main.cohort (COHORT_DEFINITION_ID, SUBJECT_ID, COHORT_START_DATE, COHORT_END_DATE) 
  SELECT 1, 2, strftime('%s', '2000-01-01'), strftime('%s', '2001-12-31');

  INSERT INTO main.cohort (COHORT_DEFINITION_ID, SUBJECT_ID, COHORT_START_DATE, COHORT_END_DATE) 
  SELECT 1, 3, strftime('%s', '2000-01-01'), strftime('%s', '2001-12-31');

  -- Starts in target but ends after
  INSERT INTO main.cohort (COHORT_DEFINITION_ID, SUBJECT_ID, COHORT_START_DATE, COHORT_END_DATE) 
  SELECT 2, 1, strftime('%s', '2001-01-01'), strftime('%s', '2002-12-31');

  -- No overlap with target
  INSERT INTO main.cohort (COHORT_DEFINITION_ID, SUBJECT_ID, COHORT_START_DATE, COHORT_END_DATE) 
  SELECT 2, 2, strftime('%s', '2003-01-01'), strftime('%s', '2004-12-31');

  -- Subsumed by target
  INSERT INTO main.cohort (COHORT_DEFINITION_ID, SUBJECT_ID, COHORT_START_DATE, COHORT_END_DATE) 
  SELECT 2, 3, strftime('%s', '2000-02-01'), strftime('%s', '2001-01-01');
"

DatabaseConnector::executeSql(
  connection = conn,
  sql = sql
)
#> Executing SQL took 0.15 secs

sql <- "SELECT * FROM main.cohort ORDER BY subject_id, cohort_definition_id;"
rs <- DatabaseConnector::querySql(
  connection = conn,
  sql = sql
)
rs
#>   COHORT_DEFINITION_ID SUBJECT_ID COHORT_START_DATE COHORT_END_DATE
#> 1                    1          1        2000-01-01      2001-12-31
#> 2                    2          1        2001-01-01      2002-12-31
#> 3                    1          2        2000-01-01      2001-12-31
#> 4                    2          2        2003-01-01      2004-12-31
#> 5                    1          3        2000-01-01      2001-12-31
#> 6                    2          3        2000-02-01      2001-01-01

cs <- CohortGenerator::createCohortSubsetDefinition(
  name = "Test Cohort Subset",
  definitionId = 1,
  subsetOperators = list(
    CohortGenerator::createCohortSubset(
      name = "Overlap End Date",
      cohortIds = 2,
      startWindow = CohortGenerator::createSubsetCohortWindow(
        startDay = -99999,
        endDay = 0,
        targetAnchor = "cohortEnd"
      ),
      endWindow = CohortGenerator::createSubsetCohortWindow(
        startDay = 0,
        endDay = 99999,
        targetAnchor = "cohortEnd"
      ),
      cohortCombinationOperator = "all",
      negate = F
    )
  )
)

sql <- cs$getSubsetQuery(targetOutputPair = c(1, 1001))
sql <- SqlRender::render(
  sql = sql, 
  cohort_database_schema = "main",
  cohort_table = "cohort")
sql <- SqlRender::translate(
  sql = sql,
  targetDialect = "sqlite"
)
DatabaseConnector::executeSql(
  connection = conn,
  sql = sql
)
#> Executing SQL took 0.02 secs

# Write SQL to file to view
# SqlRender::writeSql(
#   sql = sql,
#   targetFile = "subsetQuery.sql"
# )

sql <- "SELECT * FROM main.cohort ORDER BY subject_id, cohort_definition_id;"
rs <- DatabaseConnector::querySql(
  connection = conn,
  sql = sql
)
rs
#>   COHORT_DEFINITION_ID SUBJECT_ID COHORT_START_DATE COHORT_END_DATE
#> 1                    1          1        2000-01-01      2001-12-31
#> 2                 1001          1        2000-01-01      2001-12-31
#> 3                    2          1        2001-01-01      2002-12-31
#> 4                    1          2        2000-01-01      2001-12-31
#> 5                    2          2        2003-01-01      2004-12-31
#> 6                    1          3        2000-01-01      2001-12-31
#> 7                    2          3        2000-02-01      2001-01-01

DatabaseConnector::disconnect(conn)

Created on 2024-08-06 with reprex v2.1.0

As shown above, subject_id == 3 is not in the subset cohort with cohort_definition_id == 1001 since it doesn't overlap which is what you had pointed out (but in the reverse in this example).

We'd have to break the API to allow for specifying the subset start/end as part of this cohort subset definition, best I can tell, unless you have a suggestion on how to change this and maintain backwards compatibility?

Tagging @azimov for awareness.

chrisknoll commented 1 month ago

Yes, in order to keep backwards compatabiilty, we need to leave the existing stuff alone and let new stuff come in without disturbing the existing.

here's a thought:

As part of the subset operator, we add a field called windows that can be a collection of the cohort windows, and we convert those to a series of ANDs in the final SQL:

CohortGenerator::createCohortSubset(
  name = "Incident Epilepsy",
  cohortIds = 12403, #use epilepsy cohort
  windows = list(
    CohortGenerator::createSubsetCohortWindow(
      startDay = 0,
      endDay = 9999,
      targetAnchor = "cohortStart",
      anchor = "cohortStart"
    ),
    CohortGenerator::createSubsetCohortWindow(
      startDay = -9999,
      endDay = 0,
      targetAnchor = "cohortEnd",
      anchor = "cohortStart"
    )
  ),
  cohortCombinationOperator = "all",
  negate = F
)

So the list of elements in windows gets converted to the date diff + start/end date of the target and subset cohort.

The chagne would be createSubsetCohortWindow should take an additional parameter for specifying the subsetCohort's anchor.

I think this could be an easy modification where int he cases where you're reading the window from startWindow and endWindow, you force the 'anchor' to be cohrotStart or cohrotEnd (to keep the current behavior). Else just use the windows specified in the windows list to create the WHERE clause. And this doesn't need to be an else, the logic of the SQL builder could build what's found in startWindow/endWindow and then add the other ANDs based on the list that is found in windows. I am not saying that people would or should specify things that way, just that it would work in a backwards compatible way.