OHDSI / CommonDataModel

Definition and DDLs for the OMOP Common Data Model (CDM)
https://ohdsi.github.io/CommonDataModel

Possibly simplify drug_era SQL #464

Open ablack3 opened 2 years ago

ablack3 commented 2 years ago

I've been looking at the drug_era SQL and may have found a small simplification that always gives the same results. The query I'm looking at is the one that identifies the end dates of each ingredient exposure. I have compared my new query to the existing query on every case of two ingredient exposures that I can think of. I would like to be able to claim that if the query gives correct results for any set of two ingredient exposures then it will work for any number of ingredient exposures, but I don't know if I can make that claim. Since the entire query is partitioned by ingredient_id and person_id, I don't think multiple ingredients or multiple persons will alter the correctness of the results. The algorithm runs on person-ingredient combinations and each person-ingredient combo is independent of all the others. Some assumptions I'm making about the input are that start dates and end dates are populated and end date is at least 30 days after start date. I think these assumptions are guaranteed by the SQL statement that precedes the end date identification.

Here is my reprex:

library(dplyr)
library(RSQLite)

new_query <- function(ingredient_exposure){
  con <- dbConnect(SQLite(), ":memory:")
  ingredient_exposure <- as.data.frame(ingredient_exposure)
  dbWriteTable(con, DBI::SQL("temp.cteDrugTarget"), ingredient_exposure)

  sql <- "
  SELECT PERSON_ID
  ,INGREDIENT_CONCEPT_ID
  ,DATEADD(day, - 30, EVENT_DATE) AS END_DATE -- unpad the end date
  --INTO #cteEndDates
  FROM (
      SELECT E1.PERSON_ID
          ,E1.INGREDIENT_CONCEPT_ID
          ,E1.EVENT_DATE
          ,MAX(E2.START_ORDINAL) AS START_ORDINAL
          ,E1.OVERALL_ORD
      FROM (
          SELECT PERSON_ID
              ,INGREDIENT_CONCEPT_ID
              ,EVENT_DATE
              ,ROW_NUMBER() OVER (
                  PARTITION BY PERSON_ID
                  ,INGREDIENT_CONCEPT_ID ORDER BY EVENT_DATE
                  ) AS OVERALL_ORD -- this numbers the inner UNION so all rows are numbered ordered by the event date
          FROM (
              SELECT PERSON_ID
                  ,INGREDIENT_CONCEPT_ID
                  ,DRUG_EXPOSURE_START_DATE AS EVENT_DATE
              FROM #cteDrugTarget

              UNION ALL

              SELECT PERSON_ID
                  ,INGREDIENT_CONCEPT_ID
                  ,DATEADD(day, 30, DRUG_EXPOSURE_END_DATE)
              FROM #cteDrugTarget
              ) RAWDATA
          ) E1
      INNER JOIN (
          SELECT PERSON_ID
              ,INGREDIENT_CONCEPT_ID
              ,DRUG_EXPOSURE_START_DATE AS EVENT_DATE
              ,ROW_NUMBER() OVER (
                  PARTITION BY PERSON_ID
                  ,INGREDIENT_CONCEPT_ID ORDER BY DRUG_EXPOSURE_START_DATE
                  ) AS START_ORDINAL
          FROM #cteDrugTarget
          ) E2 ON E1.PERSON_ID = E2.PERSON_ID
          AND E1.INGREDIENT_CONCEPT_ID = E2.INGREDIENT_CONCEPT_ID
          AND E2.EVENT_DATE <= E1.EVENT_DATE
      GROUP BY E1.PERSON_ID
          ,E1.INGREDIENT_CONCEPT_ID
          ,E1.EVENT_DATE
          ,E1.OVERALL_ORD
      ) E
  WHERE 2 * E.START_ORDINAL - E.OVERALL_ORD = 0;
  "

  sql <- SqlRender::translate(sql, "sqlite")

  result <- dbGetQuery(con, sql) %>%
    mutate_at(vars(matches("DATE")), ~as.Date(., origin = "1970-01-01"))
  dbDisconnect(con)
  result
}

current_query <- function(ingredient_exposure){
  con <- dbConnect(SQLite(), ":memory:")
  ingredient_exposure <- as.data.frame(ingredient_exposure)
  dbWriteTable(con, DBI::SQL("temp.cteDrugTarget"), ingredient_exposure)

  sql <- "
  SELECT PERSON_ID
  ,INGREDIENT_CONCEPT_ID
  ,DATEADD(day, - 30, EVENT_DATE) AS END_DATE -- unpad the end date
  --INTO #cteEndDates
  FROM (
    SELECT E1.PERSON_ID
    ,E1.INGREDIENT_CONCEPT_ID
    ,E1.EVENT_DATE
    ,COALESCE(E1.START_ORDINAL, MAX(E2.START_ORDINAL)) AS START_ORDINAL
    ,E1.OVERALL_ORD
    FROM (
      SELECT PERSON_ID
      ,INGREDIENT_CONCEPT_ID
      ,EVENT_DATE
      ,EVENT_TYPE
      ,START_ORDINAL
      ,ROW_NUMBER() OVER (
        PARTITION BY PERSON_ID
        ,INGREDIENT_CONCEPT_ID ORDER BY EVENT_DATE
        ,EVENT_TYPE
      ) AS OVERALL_ORD -- this re-numbers the inner UNION so all rows are numbered ordered by the event date
      FROM (
        -- select the start dates, assigning a row number to each
        SELECT PERSON_ID
        ,INGREDIENT_CONCEPT_ID
        ,DRUG_EXPOSURE_START_DATE AS EVENT_DATE
        ,0 AS EVENT_TYPE
        ,ROW_NUMBER() OVER (
          PARTITION BY PERSON_ID
          ,INGREDIENT_CONCEPT_ID ORDER BY DRUG_EXPOSURE_START_DATE
        ) AS START_ORDINAL
        FROM #cteDrugTarget

        UNION ALL

        -- add the end dates with NULL as the row number, padding the end dates by 30 to allow a grace period for overlapping ranges.
        SELECT PERSON_ID
        ,INGREDIENT_CONCEPT_ID
        ,DATEADD(day, 30, DRUG_EXPOSURE_END_DATE)
        ,1 AS EVENT_TYPE
        ,NULL
        FROM #cteDrugTarget
      ) RAWDATA
    ) E1
    INNER JOIN (
      SELECT PERSON_ID
      ,INGREDIENT_CONCEPT_ID
      ,DRUG_EXPOSURE_START_DATE AS EVENT_DATE
      ,ROW_NUMBER() OVER (
        PARTITION BY PERSON_ID
        ,INGREDIENT_CONCEPT_ID ORDER BY DRUG_EXPOSURE_START_DATE
      ) AS START_ORDINAL
      FROM #cteDrugTarget
    ) E2 ON E1.PERSON_ID = E2.PERSON_ID
    AND E1.INGREDIENT_CONCEPT_ID = E2.INGREDIENT_CONCEPT_ID
    AND E2.EVENT_DATE <= E1.EVENT_DATE
    GROUP BY E1.PERSON_ID
    ,E1.INGREDIENT_CONCEPT_ID
    ,E1.EVENT_DATE
    ,E1.START_ORDINAL
    ,E1.OVERALL_ORD
  ) E
  WHERE 2 * E.START_ORDINAL - E.OVERALL_ORD = 0;"

  sql <- SqlRender::translate(sql, "sqlite")

  result <- dbGetQuery(con, sql) %>%
    mutate_at(vars(matches("DATE")), ~as.Date(., origin = "1970-01-01"))
  dbDisconnect(con)
  result
}

# Case 1: start 1 < start 2 < end 1 < end 2
case1 <- tribble(
  ~PERSON_ID, ~INGREDIENT_CONCEPT_ID, ~DRUG_EXPOSURE_START_DATE, ~DRUG_EXPOSURE_END_DATE,
  1,          1,                      "2020-01-01",             "2020-01-31",
  1,          1,                      "2020-01-02",             "2020-02-01") %>% 
  mutate_at(vars(matches("DATE")), as.Date)

current_query(case1)
#>   PERSON_ID INGREDIENT_CONCEPT_ID   END_DATE
#> 1         1                     1 2020-02-01
new_query(case1)
#>   PERSON_ID INGREDIENT_CONCEPT_ID   END_DATE
#> 1         1                     1 2020-02-01

# Case 2: start 1 < start 2 < end 2 < end 1 
case2 <- tribble(
  ~PERSON_ID, ~INGREDIENT_CONCEPT_ID, ~DRUG_EXPOSURE_START_DATE, ~DRUG_EXPOSURE_END_DATE,
  1,          1,                      "2020-01-01",             "2020-02-28",
  1,          1,                      "2020-01-02",             "2020-02-01") %>% 
  mutate_at(vars(matches("DATE")), as.Date)

current_query(case2)
#>   PERSON_ID INGREDIENT_CONCEPT_ID   END_DATE
#> 1         1                     1 2020-02-28
new_query(case2)
#>   PERSON_ID INGREDIENT_CONCEPT_ID   END_DATE
#> 1         1                     1 2020-02-28

# Case 3: start 1 < end 1 < start 2 < end 2
case3 <- tribble(
  ~PERSON_ID, ~INGREDIENT_CONCEPT_ID, ~DRUG_EXPOSURE_START_DATE, ~DRUG_EXPOSURE_END_DATE,
  1,          1,                      "2020-01-01",             "2020-01-31",
  1,          1,                      "2020-02-01",             "2020-03-01") %>% 
  mutate_at(vars(matches("DATE")), as.Date)

current_query(case3)
#>   PERSON_ID INGREDIENT_CONCEPT_ID   END_DATE
#> 1         1                     1 2020-03-01
new_query(case3)
#>   PERSON_ID INGREDIENT_CONCEPT_ID   END_DATE
#> 1         1                     1 2020-03-01

# Case 4: start 1 == start 2 < end 1 < end 2
case4 <- tribble(
  ~PERSON_ID, ~INGREDIENT_CONCEPT_ID, ~DRUG_EXPOSURE_START_DATE, ~DRUG_EXPOSURE_END_DATE,
  1,          1,                      "2020-01-01",             "2020-03-01",
  1,          1,                      "2020-01-01",             "2020-02-01") %>% 
  mutate_at(vars(matches("DATE")), as.Date)

current_query(case4)
#>   PERSON_ID INGREDIENT_CONCEPT_ID   END_DATE
#> 1         1                     1 2020-03-01
new_query(case4)
#>   PERSON_ID INGREDIENT_CONCEPT_ID   END_DATE
#> 1         1                     1 2020-03-01

# Case 5: start 1 == start 2 < end 1 == end 2
case5 <- tribble(
  ~PERSON_ID, ~INGREDIENT_CONCEPT_ID, ~DRUG_EXPOSURE_START_DATE, ~DRUG_EXPOSURE_END_DATE,
  1,          1,                      "2020-01-01",             "2020-03-01",
  1,          1,                      "2020-01-01",             "2020-03-01") %>% 
  mutate_at(vars(matches("DATE")), as.Date)

current_query(case5)
#>   PERSON_ID INGREDIENT_CONCEPT_ID   END_DATE
#> 1         1                     1 2020-03-01
new_query(case5)
#>   PERSON_ID INGREDIENT_CONCEPT_ID   END_DATE
#> 1         1                     1 2020-03-01

# Case 6: start 1 < start 2 < end 1 == end 2
case6 <- tribble(
  ~PERSON_ID, ~INGREDIENT_CONCEPT_ID, ~DRUG_EXPOSURE_START_DATE, ~DRUG_EXPOSURE_END_DATE,
  1,          1,                      "2020-01-01",             "2020-03-01",
  1,          1,                      "2020-02-01",             "2020-03-01") %>% 
  mutate_at(vars(matches("DATE")), as.Date)

current_query(case6)
#>   PERSON_ID INGREDIENT_CONCEPT_ID   END_DATE
#> 1         1                     1 2020-03-01
new_query(case6)
#>   PERSON_ID INGREDIENT_CONCEPT_ID   END_DATE
#> 1         1                     1 2020-03-01

Created on 2021-11-27 by the reprex package (v2.0.1)

Is it possible for these two queries to give different results as long as the assumptions about the input are correct?

ablack3 commented 1 year ago

@pbr6cornell - Here is a possible simplification I proposed to the era SQL logic. I finally closed it after >6 months with no interest. Could be that my code was incorrect though. I'm not sure if anyone looked at it.

pbr6cornell commented 1 year ago

Thanks @ablack3, really glad to have your critical thinking on this, and sorry that I missed your earlier post. If there's a more efficient approach to building drug_eras then I think we should all be eager to take advantage of that. It's a bit unclear to me procedurally where we'd want to share these insights. Looking to others for help. @clairblacketer @MelaniePhilofsky @chrisknoll . My immediate thought, which I'm not wedded to, is that this may be the jurisdiction of the newly re-emerged THEMIS workgroup, who could decide on a community convention for how to build eras, and it'd seem reasonable to recommend a codebase in the CDM repo as a best practice. I don't think there's a need to enforce that people use the same code to build eras, though clearly there should be a strong incentive if there's already efficient code with nice unit tests available; I'm not sure why anyone would willingly want to re-invent the wheel (though I know some ETL processes are more conducive to processing eras one-person-at-a-time rather than in bulk, so some modifications may need to be made).

chrisknoll commented 1 year ago

I'm sorry I didn't see this either; it is possible I'm only subscribed to CommonDataModel issues where I am tagged.

I looked it over a few times but I'm having trouble seeing the difference in the code. However, there have been some changes to the 'era builder' logic that Patrick came up with, which we applied in the CohortIncidence module; you can see it here.

Explanation:

The current form you have uses an INNER JOIN and the calculation of WHERE 2 * E.START_ORDINAL - E.OVERALL_ORD = 0 to determine the end dates. Patrick noticed that we can simply find the end dates by using the event type (-1 for starts, 1 for ends) to do a SUM() over the prior records that subtracts 1 for each start and adds 1 for each end...leading to the case where, when your SUM(prior) = 0, you've 'closed' all the prior starts. This removes the need for the start_ord / overall_ord logic, which incurs the cost of a join; instead we just take all starts and ends, line them up, and add -1 or +1 for each row to find the 'true ends' (sum() = 0).

So, while the form linked above leverages CTEs, it can be written to use subqueries but I think the modification Patrick came up with simplifies it very nicely.
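To make that concrete, here is a rough, untested sketch of the end-date step as described above, applied to the same #cteDrugTarget input used in the queries earlier in this issue (an illustration of the idea only, not the actual CohortIncidence code):

SELECT PERSON_ID, INGREDIENT_CONCEPT_ID, EVENT_DATE AS END_DATE
FROM (
    SELECT PERSON_ID, INGREDIENT_CONCEPT_ID, EVENT_DATE,
        -- running total of -1 (starts) and +1 (padded ends); 0 means every prior start has been closed
        SUM(EVENT_TYPE) OVER (
            PARTITION BY PERSON_ID, INGREDIENT_CONCEPT_ID
            ORDER BY EVENT_DATE, EVENT_TYPE
            ROWS UNBOUNDED PRECEDING) AS INTERVAL_STATUS
    FROM (
        SELECT PERSON_ID, INGREDIENT_CONCEPT_ID,
            DRUG_EXPOSURE_START_DATE AS EVENT_DATE, -1 AS EVENT_TYPE
        FROM #cteDrugTarget

        UNION ALL

        SELECT PERSON_ID, INGREDIENT_CONCEPT_ID,
            DATEADD(day, 30, DRUG_EXPOSURE_END_DATE), 1 AS EVENT_TYPE
        FROM #cteDrugTarget
        ) RAWDATA
    ) E
WHERE INTERVAL_STATUS = 0;

The -30 day unpad and the collapse of start dates to these end dates would then follow as in the existing query.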

ablack3 commented 1 year ago

Oh nice! It has been a while since I've looked at this. I think I was able to remove one subquery and was thinking it might extend to other places where the era logic is used in the OHDSI codebase. Patrick's simplification looks better though at first glance. I'll need to spend some time to look over it.

I've also been thinking about how to test this type of query to guarantee it always produces correct results on all dbms. I recently added union and intersection functions (written with dplyr) to the Darwin R package I'm working on here and tests here. FYI.

ablack3 commented 1 year ago

And sorry for not tagging someone earlier. I'm not always sure where to put things. The OHDSI/Odysseus communication surface area is quite large!

ablack3 commented 1 year ago

Playing around with this idea for union... Seems promising but there are two cases where it fails ("starts" and "finished by"). Are analytic functions (lag and lead) more efficient than union all?

union <- function(con, rel) {
  intervals <- tibble::tribble( 
    ~relationship,  ~definition_id,        ~person_id,  ~start_dt,             ~end_date,
    "reference",    1,                     1,           "2022-01-05",       "2022-01-10",
    "precedes",     1,                     1,           "2022-01-01",       "2022-01-03",
    "meets",        1,                     1,           "2022-01-01",       "2022-01-04",
    "overlaps",     1,                     1,           "2022-01-01",       "2022-01-05",
    "finished_by",  1,                     1,           "2022-01-01",       "2022-01-10",
    "contains",     1,                     1,           "2022-01-01",       "2022-01-15",
    "starts",       1,                     1,           "2022-01-05",       "2022-01-07",
    "equals",       1,                     1,           "2022-01-05",       "2022-01-10",
    "reference",    1,                     2,           "2022-01-05",       "2022-01-10",
    "precedes",     1,                     2,           "2022-01-01",       "2022-01-03",
    "meets",        1,                     2,           "2022-01-01",       "2022-01-04",
    "overlaps",     1,                     2,           "2022-01-01",       "2022-01-05",
    "finished_by",  1,                     2,           "2022-01-01",       "2022-01-10",
    "contains",     1,                     2,           "2022-01-01",       "2022-01-15",
    "starts",       1,                     2,           "2022-01-05",       "2022-01-07",
    "equals",       1,                     2,           "2022-01-05",       "2022-01-10",
    "reference",    2,                     1,           "2022-01-05",       "2022-01-10",
    "precedes",     2,                     1,           "2022-01-01",       "2022-01-03",
    "meets",        2,                     1,           "2022-01-01",       "2022-01-04",
    "overlaps",     2,                     1,           "2022-01-01",       "2022-01-05",
    "finished_by",  2,                     1,           "2022-01-01",       "2022-01-10",
    "contains",     2,                     1,           "2022-01-01",       "2022-01-15",
    "starts",       2,                     1,           "2022-01-05",       "2022-01-07",
    "equals",       2,                     1,           "2022-01-05",       "2022-01-10",
    "reference",    2,                     2,           "2022-01-05",       "2022-01-10",
    "precedes",     2,                     2,           "2022-01-01",       "2022-01-03",
    "meets",        2,                     2,           "2022-01-01",       "2022-01-04",
    "overlaps",     2,                     2,           "2022-01-01",       "2022-01-05",
    "finished_by",  2,                     2,           "2022-01-01",       "2022-01-10",
    "contains",     2,                     2,           "2022-01-01",       "2022-01-15",
    "starts",       2,                     2,           "2022-01-05",       "2022-01-07",
    "equals",       2,                     2,           "2022-01-05",       "2022-01-10") |> 
    dplyr::mutate(dplyr::across(dplyr::matches("date"), as.Date)) |> 
    dplyr::filter(relationship %in% c("reference", rel)) 

  DBI::dbWriteTable(con, "intervals", intervals, overwrite = T)

  print(dplyr::tbl(con, "intervals"))

  print(tibble::tibble(DBI::dbGetQuery(con, "
  WITH cte AS (
  SELECT DISTINCT person_id, start_dt, end_date,
    CASE WHEN 
        (start_dt NOT BETWEEN COALESCE( lag(start_dt) OVER(partition by person_id order by end_date), end_date + 1) AND COALESCE( lag(end_date) OVER(partition by person_id order by end_date), end_date + 1)) AND 
        (start_dt NOT BETWEEN COALESCE(lead(start_dt) OVER(partition by person_id order by end_date), end_date + 1) AND COALESCE(lead(end_date) OVER(partition by person_id order by end_date), end_date + 1)) AND
        (start_dt NOT BETWEEN COALESCE( lag(start_dt) OVER(partition by person_id order by start_dt), end_date + 1) AND COALESCE( lag(end_date) OVER(partition by person_id order by start_dt), end_date + 1)) AND 
        (start_dt NOT BETWEEN COALESCE(lead(start_dt) OVER(partition by person_id order by start_dt), end_date + 1) AND COALESCE(lead(end_date) OVER(partition by person_id order by start_dt), end_date + 1))
    THEN start_dt ELSE NULL END AS valid_start,
    CASE WHEN 
        (end_date NOT BETWEEN COALESCE( lag(start_dt) OVER(partition by person_id order by end_date), end_date + 1) AND COALESCE( lag(end_date) OVER(partition by person_id order by end_date), end_date + 1)) AND 
        (end_date NOT BETWEEN COALESCE(lead(start_dt) OVER(partition by person_id order by end_date), end_date + 1) AND COALESCE(lead(end_date) OVER(partition by person_id order by end_date), end_date + 1)) AND
        (end_date NOT BETWEEN COALESCE( lag(start_dt) OVER(partition by person_id order by start_dt), end_date + 1) AND COALESCE( lag(end_date) OVER(partition by person_id order by start_dt), end_date + 1)) AND 
        (end_date NOT BETWEEN COALESCE(lead(start_dt) OVER(partition by person_id order by start_dt), end_date + 1) AND COALESCE(lead(end_date) OVER(partition by person_id order by start_dt), end_date + 1))
    THEN end_date ELSE NULL END AS valid_end from (SELECT DISTINCT person_id, start_dt, end_date FROM intervals)
  ) 
  SELECT DISTINCT a.person_id, a.valid_start AS start_dt, MIN(b.valid_end) AS end_date
  FROM cte a JOIN cte b on a.person_id = b.person_id
  WHERE a.valid_start <= b.valid_end
  GROUP BY a.person_id, a.valid_start
  ")))
}

con <- DBI::dbConnect(duckdb::duckdb())
purrr::walk(c("precedes", "meets", "overlaps", "finished_by", "contains", "starts", "equals"), ~union(con, .))
#> # Source:   table<intervals> [8 x 5]
#> # Database: DuckDB 0.8.0 [root@Darwin 21.6.0:R 4.2.2/:memory:]
#>   relationship definition_id person_id start_dt   end_date  
#>   <chr>                <dbl>     <dbl> <chr>      <date>    
#> 1 reference                1         1 2022-01-05 2022-01-10
#> 2 precedes                 1         1 2022-01-01 2022-01-03
#> 3 reference                1         2 2022-01-05 2022-01-10
#> 4 precedes                 1         2 2022-01-01 2022-01-03
#> 5 reference                2         1 2022-01-05 2022-01-10
#> 6 precedes                 2         1 2022-01-01 2022-01-03
#> 7 reference                2         2 2022-01-05 2022-01-10
#> 8 precedes                 2         2 2022-01-01 2022-01-03
#> # A tibble: 4 × 3
#>   person_id start_dt   end_date  
#>       <dbl> <chr>      <date>    
#> 1         2 2022-01-01 2022-01-03
#> 2         2 2022-01-05 2022-01-10
#> 3         1 2022-01-01 2022-01-03
#> 4         1 2022-01-05 2022-01-10
#> # Source:   table<intervals> [8 x 5]
#> # Database: DuckDB 0.8.0 [root@Darwin 21.6.0:R 4.2.2/:memory:]
#>   relationship definition_id person_id start_dt   end_date  
#>   <chr>                <dbl>     <dbl> <chr>      <date>    
#> 1 reference                1         1 2022-01-05 2022-01-10
#> 2 meets                    1         1 2022-01-01 2022-01-04
#> 3 reference                1         2 2022-01-05 2022-01-10
#> 4 meets                    1         2 2022-01-01 2022-01-04
#> 5 reference                2         1 2022-01-05 2022-01-10
#> 6 meets                    2         1 2022-01-01 2022-01-04
#> 7 reference                2         2 2022-01-05 2022-01-10
#> 8 meets                    2         2 2022-01-01 2022-01-04
#> # A tibble: 4 × 3
#>   person_id start_dt   end_date  
#>       <dbl> <chr>      <date>    
#> 1         2 2022-01-01 2022-01-04
#> 2         2 2022-01-05 2022-01-10
#> 3         1 2022-01-01 2022-01-04
#> 4         1 2022-01-05 2022-01-10
#> # Source:   table<intervals> [8 x 5]
#> # Database: DuckDB 0.8.0 [root@Darwin 21.6.0:R 4.2.2/:memory:]
#>   relationship definition_id person_id start_dt   end_date  
#>   <chr>                <dbl>     <dbl> <chr>      <date>    
#> 1 reference                1         1 2022-01-05 2022-01-10
#> 2 overlaps                 1         1 2022-01-01 2022-01-05
#> 3 reference                1         2 2022-01-05 2022-01-10
#> 4 overlaps                 1         2 2022-01-01 2022-01-05
#> 5 reference                2         1 2022-01-05 2022-01-10
#> 6 overlaps                 2         1 2022-01-01 2022-01-05
#> 7 reference                2         2 2022-01-05 2022-01-10
#> 8 overlaps                 2         2 2022-01-01 2022-01-05
#> # A tibble: 2 × 3
#>   person_id start_dt   end_date  
#>       <dbl> <chr>      <date>    
#> 1         2 2022-01-01 2022-01-10
#> 2         1 2022-01-01 2022-01-10
#> # Source:   table<intervals> [8 x 5]
#> # Database: DuckDB 0.8.0 [root@Darwin 21.6.0:R 4.2.2/:memory:]
#>   relationship definition_id person_id start_dt   end_date  
#>   <chr>                <dbl>     <dbl> <chr>      <date>    
#> 1 reference                1         1 2022-01-05 2022-01-10
#> 2 finished_by              1         1 2022-01-01 2022-01-10
#> 3 reference                1         2 2022-01-05 2022-01-10
#> 4 finished_by              1         2 2022-01-01 2022-01-10
#> 5 reference                2         1 2022-01-05 2022-01-10
#> 6 finished_by              2         1 2022-01-01 2022-01-10
#> 7 reference                2         2 2022-01-05 2022-01-10
#> 8 finished_by              2         2 2022-01-01 2022-01-10
#> # A tibble: 0 × 3
#> # ℹ 3 variables: person_id <dbl>, start_dt <chr>, end_date <date>
#> # Source:   table<intervals> [8 x 5]
#> # Database: DuckDB 0.8.0 [root@Darwin 21.6.0:R 4.2.2/:memory:]
#>   relationship definition_id person_id start_dt   end_date  
#>   <chr>                <dbl>     <dbl> <chr>      <date>    
#> 1 reference                1         1 2022-01-05 2022-01-10
#> 2 contains                 1         1 2022-01-01 2022-01-15
#> 3 reference                1         2 2022-01-05 2022-01-10
#> 4 contains                 1         2 2022-01-01 2022-01-15
#> 5 reference                2         1 2022-01-05 2022-01-10
#> 6 contains                 2         1 2022-01-01 2022-01-15
#> 7 reference                2         2 2022-01-05 2022-01-10
#> 8 contains                 2         2 2022-01-01 2022-01-15
#> # A tibble: 2 × 3
#>   person_id start_dt   end_date  
#>       <dbl> <chr>      <date>    
#> 1         2 2022-01-01 2022-01-15
#> 2         1 2022-01-01 2022-01-15
#> # Source:   table<intervals> [8 x 5]
#> # Database: DuckDB 0.8.0 [root@Darwin 21.6.0:R 4.2.2/:memory:]
#>   relationship definition_id person_id start_dt   end_date  
#>   <chr>                <dbl>     <dbl> <chr>      <date>    
#> 1 reference                1         1 2022-01-05 2022-01-10
#> 2 starts                   1         1 2022-01-05 2022-01-07
#> 3 reference                1         2 2022-01-05 2022-01-10
#> 4 starts                   1         2 2022-01-05 2022-01-07
#> 5 reference                2         1 2022-01-05 2022-01-10
#> 6 starts                   2         1 2022-01-05 2022-01-07
#> 7 reference                2         2 2022-01-05 2022-01-10
#> 8 starts                   2         2 2022-01-05 2022-01-07
#> # A tibble: 0 × 3
#> # ℹ 3 variables: person_id <dbl>, start_dt <chr>, end_date <date>
#> # Source:   table<intervals> [8 x 5]
#> # Database: DuckDB 0.8.0 [root@Darwin 21.6.0:R 4.2.2/:memory:]
#>   relationship definition_id person_id start_dt   end_date  
#>   <chr>                <dbl>     <dbl> <chr>      <date>    
#> 1 reference                1         1 2022-01-05 2022-01-10
#> 2 equals                   1         1 2022-01-05 2022-01-10
#> 3 reference                1         2 2022-01-05 2022-01-10
#> 4 equals                   1         2 2022-01-05 2022-01-10
#> 5 reference                2         1 2022-01-05 2022-01-10
#> 6 equals                   2         1 2022-01-05 2022-01-10
#> 7 reference                2         2 2022-01-05 2022-01-10
#> 8 equals                   2         2 2022-01-05 2022-01-10
#> # A tibble: 2 × 3
#>   person_id start_dt   end_date  
#>       <dbl> <chr>      <date>    
#> 1         2 2022-01-05 2022-01-10
#> 2         1 2022-01-05 2022-01-10
DBI::dbDisconnect(con, shutdown = T)

Created on 2023-06-12 with reprex v2.0.2

chrisknoll commented 1 year ago

Sorry, I am not clear on what you mean by UNION ALL vs. window functions....

Looking at your code, I'm seeing several different sort orders, while the one Patrick has only sorts a set of days once (after extracting all starts and ends). I'm fairly certain that issuing those different sorts across the window functions will be slower, but the only way to know is to test.

ablack3 commented 1 year ago

Just wondering if it's possible to have some intuition about which query is faster. I'll try testing them. I really only need two sorts (by start date and end date) for this algorithm. Not sure if there's a better way to write it in SQL.

Here's some advice from the bigquery documentation. Not sure if this applies broadly to all dbms though.

[image: excerpt from the BigQuery documentation]

ablack3 commented 1 year ago

Some initial comparison results.

SQL Server

[image: query timing comparison on SQL Server]

UNION ALL looks considerably faster on SQL Server.

Redshift

[image: query timing comparison on Redshift]

Less of a difference on Redshift.

Here's the query I'm running, which probably still needs some tweaking. I'm purposely not grouping by drug ingredient right now because I simply want to see whether the time interval union operation can be made faster using this approach. The time interval union operation is a fundamental building block.

WITH cte1 AS (
    SELECT DISTINCT 
      person_id, 
      MIN(drug_exposure_start_date) OVER (PARTITION BY person_id, drug_exposure_end_date) AS start_dt,
      MAX(coalesce(drug_exposure_end_date, drug_exposure_start_date)) OVER (PARTITION BY person_id, drug_exposure_start_date) AS end_date
    FROM {cdm_schema}.drug_exposure
    WHERE drug_exposure_start_date IS NOT NULL
  ),
  cte2 AS (
    SELECT DISTINCT person_id, start_dt, end_date,
      CASE WHEN 
          (start_dt NOT BETWEEN COALESCE( lag(start_dt) OVER(PARTITION BY person_id ORDER BY end_date), DATEADD(day, 1, end_date)) AND COALESCE( lag(end_date) OVER(PARTITION BY person_id ORDER BY end_date), DATEADD(day, 1, end_date))) AND 
          (start_dt NOT BETWEEN COALESCE(lead(start_dt) OVER(PARTITION BY person_id ORDER BY end_date), DATEADD(day, 1, end_date)) AND COALESCE(lead(end_date) OVER(PARTITION BY person_id ORDER BY end_date), DATEADD(day, 1, end_date))) AND
          (start_dt NOT BETWEEN COALESCE( lag(start_dt) OVER(PARTITION BY person_id ORDER BY start_dt), DATEADD(day, 1, end_date)) AND COALESCE( lag(end_date) OVER(PARTITION BY person_id ORDER BY start_dt), DATEADD(day, 1, end_date))) AND 
          (start_dt NOT BETWEEN COALESCE(lead(start_dt) OVER(PARTITION BY person_id ORDER BY start_dt), DATEADD(day, 1, end_date)) AND COALESCE(lead(end_date) OVER(PARTITION BY person_id ORDER BY start_dt), DATEADD(day, 1, end_date)))
      THEN start_dt ELSE NULL END AS valid_start,
      CASE WHEN 
          (end_date NOT BETWEEN COALESCE( lag(start_dt) OVER(PARTITION BY person_id ORDER BY end_date), DATEADD(day, 1, end_date)) AND COALESCE( lag(end_date) OVER(PARTITION BY person_id ORDER BY end_date), DATEADD(day, 1, end_date))) AND 
          (end_date NOT BETWEEN COALESCE(lead(start_dt) OVER(PARTITION BY person_id ORDER BY end_date), DATEADD(day, 1, end_date)) AND COALESCE(lead(end_date) OVER(PARTITION BY person_id ORDER BY end_date), DATEADD(day, 1, end_date))) AND
          (end_date NOT BETWEEN COALESCE( lag(start_dt) OVER(PARTITION BY person_id ORDER BY start_dt), DATEADD(day, 1, end_date)) AND COALESCE( lag(end_date) OVER(PARTITION BY person_id ORDER BY start_dt), DATEADD(day, 1, end_date))) AND 
          (end_date NOT BETWEEN COALESCE(lead(start_dt) OVER(PARTITION BY person_id ORDER BY start_dt), DATEADD(day, 1, end_date)) AND COALESCE(lead(end_date) OVER(PARTITION BY person_id ORDER BY start_dt), DATEADD(day, 1, end_date)))
      THEN end_date ELSE NULL END AS valid_end
    FROM cte1 
  ), 
  cte3 AS (
   -- could also use a join instead of a subquery. Not sure which is better.
    SELECT DISTINCT 
      person_id, 
      MAX(valid_start) OVER (partition by person_id ORDER BY valid_start ROWS UNBOUNDED PRECEDING) AS start_date,
      valid_end AS end_date
    FROM cte2
  ) 
  SELECT * FROM cte3 WHERE end_date IS NOT NULL and start_date is not null -- is there a way to remove this last subquery?
ablack3 commented 1 year ago

I looked it over a few times but I'm having trouble seeing the difference in the code.

Here is a diff with the updated code on the left and current code on the right.

[image: side-by-side diff with the updated code on the left and the current code on the right]

chrisknoll commented 1 year ago

Ok, I wouldn't look at that old form; I'd use the new form described above that does the sum(event_type) to keep track of opens and closes.

I'm not sure (in the new form) where the 'self join' lies...we need to capture the set of start and end dates from the rows to era-fy and then sum up the event_type to identify which dates are the 'ends' (ie: sum(event_type) = 0). Then, we need to take the min(start_date) for the rows that are grouped by the end dates and that gives us the eras. But to do that, we do need to group the start dates to the earliest end date that occurs after each start, and then once we have that, collapse the rows to a start-end using min(start_date) group by end_date.

But I think the part of the code you are focusing on is to determine the 'end dates'.

I'm not sure how a lead/lag window function helps us here because an era could be built across an arbitrary number of rows... We are using a window function (via the SUM() over (order by event_date)), but the UNION ALL step is used to combine start dates and end dates into a single list.

To illustrate the 'across an arbitrary number of rows' case:

|-------------------------------------------------|
    |====|
                  |====|
                               |====|
                                               |====|

The figure is ordered by start_date. The first row keeps the remaining 4 rows open even though the remaining 4 don't overlap each other. Maybe there is a way to avoid the step of merging starts/ends together by looking at the raw event rows and doing some sort of logic with the start/end, but I'm pretty sure I came to the conclusion that merging start/end dates handled all cases. We even see this logic in treatment pathways.

ablack3 commented 1 year ago

Yea I see. I think you might be right. I was curious about trying to find the end_dates without needing the "union all" step. But the example you showed is a counter-example so I'll need to think about it some more. My bigger question is if we can "prove" that the algorithm will always produce the correct result. Like write a test that covers all possible cases.

For the drug_era query it seems like we can simplify it using Patrick's improvement. I'll take a stab at it.

ablack3 commented 1 year ago

@chrisknoll would this query work for getting start and end dates? It removes the join and union all. How would I test all possible cases?

with cte1 as (
  select distinct 
    person_id, 
    min(start_dt) over (partition by person_id, end_date) as start_dt,
    max(end_date) over (partition by person_id, start_dt) as end_date,
    min(start_dt) over (partition by person_id order by start_dt rows between unbounded preceding and 1 preceding) as previous_start_dt,
    max(end_date) over (partition by person_id order by start_dt rows between unbounded preceding and 1 preceding) as previous_end_date
  from intervals
),
cte2 as(
  select person_id, end_date,
    case when previous_start_dt is not null and start_dt between previous_start_dt and previous_end_date 
      then previous_start_dt else start_dt end as valid_start
  from cte1
)
select person_id, valid_start, max(end_date) as valid_end 
from cte2
group by person_id, valid_start
chrisknoll commented 1 year ago

Hi, @ablack3,

@chrisknoll would this query work for getting start and end dates? It removes the join and union all.

I'm not sure. I'm having a hard time locking into the logic that yields the end dates from your code. Maybe if you provided a pseudo-code type of description about what's happening? I am a little suspicious of the different partitions being applied on the different columns, but I think it's core to the concept you're presenting, I just don't follow it.

For example, the way I'd describe the 'end dates' determination from the original query is: create the list of event dates by extracting start_date as event_date, -1 as event_type from the starts and end_date as event_date, 1 as event_type from the ends, and sort that list by event_date, event_type, causing 'start' event dates to come before ends when they are on the same date. This has the effect of causing start dates to be accumulated in the SUM() before ends.
End dates are those dates where the sum() results in zero, ie: the -1 from the starts is matched to the +1 of the ends, and when the sum is zero, then all starts have been closed up to that point in time. These dates are 'era-end-dates'.
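As a made-up illustration of that description: two overlapping exposures 2020-01-01..2020-02-01 and 2020-01-02..2020-01-31 (ignoring the 30-day pad) produce

event_date    event_type    sum(event_type)
2020-01-01    -1 (start)    -1
2020-01-02    -1 (start)    -2
2020-01-31    +1 (end)      -1
2020-02-01    +1 (end)       0   <- era end date

and 2020-02-01 is the only date where the running sum returns to zero, so it becomes the single era's end date.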

The next part of the algorithm (which I don't think you're interested in here, but for completeness): once you know the end dates, you can determine a start-end duration by taking the minimum era_end_date among the end dates that occur after each start date, and from this result you take the min(start_date) group by era_end_date, to collapse the intermediate rows into the 'era-fied rows'.

So I think you're just focused on determining the end_dates of the eras, but I'm not clear on how the partitioning strategy you used works, so if you could explain the logic, that would help me. I may not need to know, tho, if we can create a comprehensive set of test cases that demonstrate the robustness.

How would I test all possible cases?

It would depend on what technology you want to use to run tests. On the Java side, I have a nice library that is designed for DB unit tests (DBUnit) which has some useful utility classes to populate a test db schema with test data, run your tests to generate results, and then use assert functions to determine if the expected results match the actual results.

If you're doing this in R, then I think a combination of DatabaseConnector's loadTable() can be used to insert test cases, then you execute the test query to get a results dataframe, and then you can use core R functions to get a diff between dataframes (there is probably a library that can do this). What I like about DBUnit is that the comparisons are implemented as asserts, and if an assert fails it tells you which row and column differ and how the data is different between the expected and actual results. Maybe in R, we would fire up a duckdb instance to host the tests and run the SQL. I wouldn't use SQLite because I think its date column support is unsatisfactory for what you're doing here.

As far as test cases? I can give them to you visually:

Case 1: Simple

|=====|

Results:

|=====|

Case 2: same starts, different ends

|---|
|------|
|---------|
|------------|
|---------------|

Results:

|---------------|

Case 3: duplicate rows should combine into a single row

|--------|
|--------|
|--------|

Results:

|--------|

Case 4: Era 'carried forward' by inner interval

|---------|
   |---------|
      |---------|
          |----------------------------------------|
                                             |---------|
                                                   |---------|
                                                         |---------|

Results:

|--------------------------------------------------------------------|

Admittedly, it's probably easier to provide test cases in tabular/csv form where you have the start-end dates in the input and the output start-end after collapse, but these are the types of tests I'm thinking about.

ablack3 commented 1 year ago

Wow, thanks for this detailed guidance Chris. I'll try to write up a complete response in the next few days.

ablack3 commented 1 year ago

Hi @chrisknoll,

I want to work through three topics: 1. intuition, 2. correctness, and 3. performance of the algorithm. I am indeed interested in getting both start and end dates. This is an interval union operation which forms the basis for the era-fy operation used in the drug era queries as well as circe cohort definitions. Needless to say this SQL gets executed a lot. I would say this interval union operation is more fundamental than era-fy because era-fy can be defined as pad_end_dates(gap) |> interval_union() |> unpad_end_dates(gap)

This is why I like the "functional SQL" approach taken by dplyr in R and funSQL in Julia. We can define these useful primitive operations and then compose them (|> denotes the composition operation).

Intuition

This algorithm comes from my days programming in SAS when I had to think in rowwise operations. Generally every data manipulation in SAS requires one pass through each row of the data. This makes sort order very important, and in SAS you generally want to minimize the number of times sorting is performed. SQL is (supposed to be) declarative, which means I just tell the database what I want as a result and don't think about how that result is obtained. However, I believe in practice there are optimizations that SQL programmers use to get better performance from their queries. This knowledge has always seemed quite cryptic to me and probably highly database specific since different database engines store the data in very different ways. Regardless, my instinct is that eliminating a join and union-all should lead to a performance gain, but we'll see.

Pseudo-pseudocode

First let's deal with intervals where the start_dates are equal. To do this, just group by person_id, cohort_id (or drug_ingredient_id) and start_date. For each group set end_date = max(end_date).

Next, deal with intervals where the end_dates are equal. Group by person_id, cohort_id and end_date. For each group set start_date = min(start_date).

Keep only distinct combinations of person_id, cohort_id, start_date, end_date.

Now we know that we don't have duplicates or intervals where start_date or end_dates are equal.

Sort the data by start_date. Proceed top to bottom through the dataset one row at a time. On the first row, set valid_start = start_date and move to the second row. On every other row, if the start_date is between the previous cumulative_min(start_date) and the previous cumulative_max(end_date), then keep valid_start the same (we don't have a new interval); else set valid_start = start_date (we have a new interval).

Now we have identified all the "valid" start_dates.

To wrap up just group by person_id, cohort/ingredient_id, valid_start_date and assign valid_end_date = max(end_date)

And voila. I think that's it.

This query is my attempt to implement that algorithm in SQL. Pretty concise actually. Who knows how the database actually performs the operation though.

with cte1 as (
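  -- cte1: resolve equal start/end dates (min start per shared end_date, max end per shared start_dt), de-duplicate,
  --       and carry the cumulative min(start)/max(end) of all earlier rows (ordered by start_dt)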
  select distinct 
    person_id, 
    min(start_dt) over (partition by person_id, end_date) as start_dt,
    max(end_date) over (partition by person_id, start_dt) as end_date,
    min(start_dt) over (partition by person_id order by start_dt rows between unbounded preceding and 1 preceding) as previous_start_dt,
    max(end_date) over (partition by person_id order by start_dt rows between unbounded preceding and 1 preceding) as previous_end_date
  from intervals
),
cte2 as(
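  -- cte2: if this row's start falls inside the span from the previous cumulative min start to the previous
  --       cumulative max end, reuse previous_start_dt as valid_start; otherwise this start begins a new interval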
  select person_id, end_date,
    case when previous_start_dt is not null and start_dt between previous_start_dt and previous_end_date 
      then previous_start_dt else start_dt end as valid_start
  from cte1
)
select person_id, valid_start, max(end_date) as valid_end 
from cte2
group by person_id, valid_start

I have ignored the cohort_id/ingredient_id but that can be easily added as an additional grouping variable.

Correctness

Here is my attempt to exhaustively check the correctness of this algorithm. I'm not convinced I have covered all cases so let me know if you think of anything I've missed. My approach was to check all possible relationships between two intervals as given by Allen's interval algebra and add the cases you suggested.

test_collapse <- function(test_case) {
  library(dplyr, warn.conflicts = F)

  if (test_case == "simple") {

    intervals <- tibble::tribble(
      ~cohort_definition_id, ~subject_id, ~cohort_start_date, ~cohort_end_date,
      1,                     1,           "2022-01-05",       "2022-01-10") %>%
      dplyr::mutate(dplyr::across(dplyr::matches("date"), as.Date))

  } else if (test_case == "same starts, different ends") {

    intervals <- tibble::tribble(
      ~cohort_definition_id, ~subject_id, ~cohort_start_date, ~cohort_end_date,
      1,                     1,           "2022-01-05",       "2022-01-10",
      1,                     1,           "2022-01-05",       "2022-01-15",
      1,                     1,           "2022-01-05",       "2022-01-17",
      1,                     1,           "2022-01-05",       "2022-01-19",
      1,                     1,           "2022-01-05",       "2022-01-20") %>%
      dplyr::mutate(dplyr::across(dplyr::matches("date"), as.Date))

  } else if (test_case == "duplicate rows") {

    intervals <- tibble::tribble(
      ~cohort_definition_id, ~subject_id, ~cohort_start_date, ~cohort_end_date,
      1,                     1,           "2022-01-05",       "2022-01-10",
      1,                     1,           "2022-01-05",       "2022-01-10",
      1,                     1,           "2022-01-05",       "2022-01-10",
      1,                     1,           "2022-01-05",       "2022-01-10",
      1,                     1,           "2022-01-05",       "2022-01-10") %>%
      dplyr::mutate(dplyr::across(dplyr::matches("date"), as.Date))

  } else if (test_case == "Era carried forward by inner interval") {

    intervals <- tibble::tribble(
      ~cohort_definition_id, ~subject_id, ~cohort_start_date, ~cohort_end_date,
      1,                     1,           "2022-01-05",       "2022-01-08",
      1,                     1,           "2022-01-07",       "2022-01-10",
      1,                     1,           "2022-01-09",       "2022-01-12",
      1,                     1,           "2022-01-11",       "2022-01-21",
      1,                     1,           "2022-01-20",       "2022-01-23",
      1,                     1,           "2022-01-22",       "2022-01-25",
      1,                     1,           "2022-01-24",       "2022-01-27") %>%
      dplyr::mutate(dplyr::across(dplyr::matches("date"), as.Date))

  } else {

    intervals <- tibble::tribble(
      ~relationship,  ~cohort_definition_id, ~subject_id, ~cohort_start_date, ~cohort_end_date,
      "reference",    1,                     1,           "2022-01-05",       "2022-01-10",
      "precedes",     2,                     1,           "2022-01-01",       "2022-01-03",
      "meets",        3,                     1,           "2022-01-01",       "2022-01-04",
      "overlaps",     5,                     1,           "2022-01-01",       "2022-01-05",
      "finished_by",  7,                     1,           "2022-01-01",       "2022-01-10",
      "contains",     8,                     1,           "2022-01-01",       "2022-01-15",
      "starts",       9,                     1,           "2022-01-05",       "2022-01-07",
      "equals",      10,                     1,           "2022-01-05",       "2022-01-10",
      "reference",    1,                     2,           "2022-01-05",       "2022-01-10",
      "precedes",     2,                     2,           "2022-01-01",       "2022-01-03",
      "meets",        3,                     2,           "2022-01-01",       "2022-01-04",
      "overlaps",     5,                     2,           "2022-01-01",       "2022-01-05",
      "finished_by",  7,                     2,           "2022-01-01",       "2022-01-10",
      "contains",     8,                     2,           "2022-01-01",       "2022-01-15",
      "starts",       9,                     2,           "2022-01-05",       "2022-01-07",
      "equals",      10,                     2,           "2022-01-05",       "2022-01-10") %>%
      dplyr::mutate(dplyr::across(dplyr::matches("date"), as.Date)) %>%
      dplyr::filter(.data$relationship %in% c("reference", .env$test_case))
  }

  con <- DBI::dbConnect(duckdb::duckdb())

  DBI::dbWriteTable(con, "intervals", intervals, overwrite = T)

  start_sql <- dbplyr::win_over(sql("min(cohort_start_date)"), partition = c("subject_id", "cohort_end_date"),  con = con)
  end_sql <-   dbplyr::win_over(sql("max(cohort_end_date)"),   partition = c("subject_id", "cohort_start_date"), con = con)

  prev_start_sql <- dbplyr::win_over(sql("min(cohort_start_date)"), partition = "subject_id", frame = c(-Inf, -1), order = "cohort_start_date", con = con)
  prev_end_sql   <- dbplyr::win_over(sql("max(cohort_end_date)"),   partition = "subject_id", frame = c(-Inf, -1), order = "cohort_start_date", con = con)

  df <- tbl(con, "intervals") %>%
    transmute(subject_id,
              start_date = start_sql,
              end_date   = end_sql,
              prev_start = prev_start_sql,
              prev_end = prev_end_sql) %>%
    distinct() %>%
    transmute(subject_id,
              end_date,
              valid_start = ifelse(!is.na(prev_start) & between(start_date, prev_start, prev_end), prev_start, start_date)) %>%
    group_by(subject_id, valid_start) %>%
    summarise(valid_end = max(end_date, na.rm = TRUE), .groups = "drop") %>%
    collect()

  DBI::dbDisconnect(con, shutdown = T)
  df
}

test_cases <- c(
  "simple", "same starts, different ends", "duplicate rows", "Era carried forward by inner interval",
  "precedes", "meets", "overlaps", "finished_by", "contains", "starts", "equals")

purrr::map(test_cases, test_collapse) |> rlang::set_names(test_cases)
#> $simple
#> # A tibble: 1 × 3
#>   subject_id valid_start valid_end 
#>        <dbl> <date>      <date>    
#> 1          1 2022-01-05  2022-01-10
#> 
#> $`same starts, different ends`
#> # A tibble: 1 × 3
#>   subject_id valid_start valid_end 
#>        <dbl> <date>      <date>    
#> 1          1 2022-01-05  2022-01-20
#> 
#> $`duplicate rows`
#> # A tibble: 1 × 3
#>   subject_id valid_start valid_end 
#>        <dbl> <date>      <date>    
#> 1          1 2022-01-05  2022-01-10
#> 
#> $`Era carried forward by inner interval`
#> # A tibble: 1 × 3
#>   subject_id valid_start valid_end 
#>        <dbl> <date>      <date>    
#> 1          1 2022-01-05  2022-01-27
#> 
#> $precedes
#> # A tibble: 4 × 3
#>   subject_id valid_start valid_end 
#>        <dbl> <date>      <date>    
#> 1          2 2022-01-01  2022-01-03
#> 2          2 2022-01-05  2022-01-10
#> 3          1 2022-01-01  2022-01-03
#> 4          1 2022-01-05  2022-01-10
#> 
#> $meets
#> # A tibble: 4 × 3
#>   subject_id valid_start valid_end 
#>        <dbl> <date>      <date>    
#> 1          2 2022-01-01  2022-01-04
#> 2          2 2022-01-05  2022-01-10
#> 3          1 2022-01-01  2022-01-04
#> 4          1 2022-01-05  2022-01-10
#> 
#> $overlaps
#> # A tibble: 2 × 3
#>   subject_id valid_start valid_end 
#>        <dbl> <date>      <date>    
#> 1          2 2022-01-01  2022-01-10
#> 2          1 2022-01-01  2022-01-10
#> 
#> $finished_by
#> # A tibble: 2 × 3
#>   subject_id valid_start valid_end 
#>        <dbl> <date>      <date>    
#> 1          2 2022-01-01  2022-01-10
#> 2          1 2022-01-01  2022-01-10
#> 
#> $contains
#> # A tibble: 2 × 3
#>   subject_id valid_start valid_end 
#>        <dbl> <date>      <date>    
#> 1          2 2022-01-01  2022-01-15
#> 2          1 2022-01-01  2022-01-15
#> 
#> $starts
#> # A tibble: 2 × 3
#>   subject_id valid_start valid_end 
#>        <dbl> <date>      <date>    
#> 1          2 2022-01-05  2022-01-10
#> 2          1 2022-01-05  2022-01-10
#> 
#> $equals
#> # A tibble: 2 × 3
#>   subject_id valid_start valid_end 
#>        <dbl> <date>      <date>    
#> 1          2 2022-01-05  2022-01-10
#> 2          1 2022-01-05  2022-01-10

Created on 2023-06-21 with reprex v2.0.2

I've used dplyr to construct the target SQL query. I plan to test the translation on each OHDSI dbms. OHDSI-SQL should work similarly though. All of these results look correct to me.

Performance

TBD 😄 . Probably best to establish correctness first.

chrisknoll commented 1 year ago

Thanks, I'll try to digest this.

One thing I would ask you to change/elaborate on: could you change cte1 and cte2 into meaningful names? I'd like some moniker that represents what that set of data contains.

This is why I like the "functional SQL" approach taken by dplyr in R and funSQL in Julia. We can define these useful primitive operations and then compose them (|> denotes the composition operation).

I definitely don't want to get into debates over functional programming vs. other things, but I will note that I don't think SQL lends itself well to composition, but rather more towards set-oriented operations... This is coming from a perspective of very shallow usage of functional/compositional paradigms, but to be fair, we're trying to solve this in a SQL space.

So, let me give you my initial thoughts, with a spoiler that I think you may be on to something, which I'll try to explain.

The first part about your pseudo-pseudo code: I understand a bit better how you're trying to start with non-duplicated starts/ends so that the remainder of your process can work. One benefit of the 'existing' algo is that it doesn't depend on that data-prep step, but I do understand now how the partition of start dates and end dates will group those identical start/ends together so you can collapse them into distinct intervals. A few thoughts:

  1. From a performance perspective, I'd count 'group', 'partition', and 'order' operations in a query to understand the computational load. cte1 is:
    select distinct 
    person_id, 
    min(start_dt) over (partition by person_id, end_date) as start_dt,
    max(end_date) over (partition by person_id, start_dt) as end_date,
    min(start_dt) over (partition by person_id order by start_dt rows between unbounded preceding and 1 preceding) as previous_start_dt,
    max(end_date) over (partition by person_id order by start_dt rows between unbounded preceding and 1 preceding) as previous_end_date
    from intervals

So there are partitions on end_date, start_dt, and an order by start_dt (which is used in 2 different window functions, so I think it will only do 1 sort here). All of these will arrange the source rows in intervals a different way, so there's a cost to it, and then you finally have a 'distinct' which has to sweep through the set and remove dupes. So, I think the point of CTE1 is to prepare a set of data that de-dupes but also contains the prior rows' min start and max end (which you call previous_start and previous_end), which you'll use in the second part of your algorithm... An interesting note, tho: the start_dt/end_date referenced in the window functions come from the partitioned set of data from intervals, and not from the start_dt/end_date you are projecting in columns 2 and 3 of CTE1. I'm sure you know that, but just pointing out that it's a little confusing when you yield a column from a window function with the same name as the source column.

Then when you get into CTE2, you need the result of CTE1 in order to have the min/max/previous columns 'yielded' so you can write your case statements in a more readable form, tho without CTEs it's simply:

select person_id, end_date,
  case when previous_start_dt is not null and start_dt between previous_start_dt and previous_end_date 
     then previous_start_dt 
     else start_dt end as valid_start
FROM (
  select distinct 
    person_id, 
    min(start_dt) over (partition by person_id, end_date) as start_dt,
    max(end_date) over (partition by person_id, start_dt) as end_date,
    min(start_dt) over (partition by person_id order by start_dt rows between unbounded preceding and 1 preceding) as previous_start_dt,
    max(end_date) over (partition by person_id order by start_dt rows between unbounded preceding and 1 preceding) as previous_end_date
  from intervals
) cte1

So, from a performance perspective, I'd compare distinct (or group by), partitions and sorts to try to evaluate the computational load difference between them. I'd also rely on dbms-specific tools like explain or show plan to see where the sql engine is performing sorts and memory swaps, etc. That would give you the exact computational load, with the cost of having to digest and interpret the runtime statistics (which may be different based on memory availability, set size, etc)...so sometimes I just look at the logical ops in the query manually.
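For example, something like this (PostgreSQL/DuckDB-style syntax; SQL Server exposes the same information through SET SHOWPLAN_ALL ON or the graphical plan):

EXPLAIN
select distinct
  person_id,
  min(start_dt) over (partition by person_id, end_date) as start_dt,
  max(end_date) over (partition by person_id, start_dt) as end_date
from intervals;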

  2. The algorithm logic

This is where I think you may be onto something. I think the main difference from the original implementation is that the original finds ends by using the sum(event_type) (which sums -1s and +1s until you get to a zero) to identify the actual end dates, while your approach is using a case/if statement to decide if a row should start a new interval when the current start is later than the maximum of the prior ends, ie: when you have a start date that is later than any of your previous end dates, that means you've hit a start date that begins a new interval. That will yield you your 'valid starts', as you say. Then once you group your valid starts to the max(ends) from the prior (including current) rows, you have your eras.

So I think where there might be a possible improvement is this: by using the case logic of 'is this row's start date later than the previous rows' max(end)', where the end dates are always the maximum end_dates from the prior rows (I think you have to include the current row in this), you eliminate the need to 'extract' start/end dates into a single set where you sum() to find which of the dates are actual end dates. This would reduce the algo to just sorting by start_date (within a partition of person_id), and you use the window function's lookback (ROWS UNBOUNDED PRECEDING) to find some value within that partitioned/ordered set. I think it would just be one partition/order operation, which would reduce the computational load by a lot (theoretically).

I think the initial de-duping logic is unnecessary here and in most cases is pointless work: it's not typically the case that you have a ton of matching start and end dates that you would be able to partition together. If you look at a patient profile, events are scattered all over the place, and it's really unlikely that you see things all end on the same date (tho you might see a few things start on the same date because it is possible events originate out of the same healthcare encounter, but then they may have different durations afterwards). It would be interesting to see what your query looks like if you remove the prep work of de-duping starts/ends and just take the raw data, sort it by start date, and use your case/when statement to determine if you've reached a new interval based on the current start_date being later than the latest end date from the prior rows (something like the sketch below).
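A rough, untested sketch of what that single-pass version might look like, using the same intervals(person_id, start_dt, end_date) table as above (a cohort/ingredient id would just be added to the partition):

with flagged as (
  select person_id, start_dt, end_date,
    -- a start later than every end seen so far begins a new era; the first row for a person always does
    case when max(end_date) over (partition by person_id order by start_dt, end_date
                                  rows between unbounded preceding and 1 preceding) is null
           or start_dt > max(end_date) over (partition by person_id order by start_dt, end_date
                                             rows between unbounded preceding and 1 preceding)
         then start_dt end as new_era_start
  from intervals
),
spans as (
  select person_id, end_date,
    -- carry the most recent non-null new_era_start forward to the rows it covers
    max(new_era_start) over (partition by person_id order by start_dt, end_date
                             rows unbounded preceding) as era_start
  from flagged
)
select person_id, era_start, max(end_date) as era_end
from spans
group by person_id, era_start;

It would still need to be run against the test cases above (duplicates, gaps, the 'carried forward' case) and timed against the union-all form.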

  3. Test data and completeness

I see what you're going after by trying to achieve 'completeness' by covering all the cases described in Allen's interval algebra, but having intervals that exhibit those relationships doesn't guarantee you have covered all your cases: the collapsing logic is complicated because of the different ways an era can be 'extended' by the combination of overlapping rows that are formed across multiple records. So while you will certainly see instances of Allen's intervals in your test cases, it's a combination of those intervals in creative ways that will get you to completeness, not that you've put one of those intervals in each case. That's just my gut instinct... but I think to call this solution 'complete' (or full test coverage) we'd want to define the individual cases (by dataset examples) that illustrate a logical challenge for the algorithm, and then demonstrate the correct result. So things like 'duplicate intervals', 'sequential extension', 'one-row-to-rule-them-all' would be some examples of special circumstances that the algorithm should be able to deal with.

Ok, that's all I got at 1:30am! Hope this makes sense, and looking forward to another round of dialog!

ablack3 commented 1 year ago

Thanks Chris. 1:30! Hope you're sleeping in this morning!!

There's a lot to respond to but I found something I wanted to run past you. I'm trying to implement the current query so I can make a comparison and found something very weird. It seems to give non-deterministic results! At least on duckdb.

Here is my reprex. I ran a simple example 1000 times. ~33% of the time it gives the incorrect result! what the?

Am I doing something wrong here?

library(dplyr, warn.conflicts = F)

test_current_query <- function(x) {

  raw_data <- tibble::tribble(
    ~person_id,  ~start_date,    ~end_date,
    46,         "1991-11-29", "1991-11-29",
    46,         "1991-11-29", "2006-01-27",
    46,         "1994-12-16", "1994-12-16"
  ) %>% dplyr::mutate(dplyr::across(dplyr::matches("date"), as.Date))

  con <- DBI::dbConnect(duckdb::duckdb())

  DBI::dbWriteTable(con, "raw_data", raw_data)

  result <- DBI::dbGetQuery(con, "
  with cteEndDates (person_id, end_date) AS
  (
    SELECT
        person_id,
        event_date as end_date
    FROM
    (
        SELECT
            person_id,
            event_date,
            SUM(event_type) OVER (PARTITION BY person_id ORDER BY event_date ROWS UNBOUNDED PRECEDING) AS interval_status
        FROM
        (
            SELECT
                person_id,
                start_date AS event_date,
                -1 AS event_type
            FROM raw_data

            UNION ALL

            SELECT
                person_id,
                end_date AS event_date,
                1 AS event_type
            FROM raw_data
        ) RAWDATA
    ) e
    WHERE interval_status = 0
  ),
  cteEnds (person_id, start_date, end_date) AS
  (
    SELECT
        c.person_id,
        c.start_date,
        MIN(e.end_date) AS end_date
    FROM raw_data c
    INNER JOIN cteEndDates e ON c.person_id = e.person_id
        AND e.end_date >= c.start_date
    GROUP BY
        c.person_id,
        c.start_date
  )
  select person_id, min(start_date) as start_date, end_date
  from cteEnds
  group by person_id, end_date
  ;") %>% dplyr::tibble()

  DBI::dbDisconnect(con, shutdown = TRUE)
  return(result)
}

# run the test 1000 times
results <- tibble(run = 1:1000, df = purrr::map(1:1000, test_current_query))

# summarize test results
summary <- results %>%
  count(df)

summary
#> # A tibble: 2 × 2
#>   df                   n
#>   <list>           <int>
#> 1 <tibble [1 × 3]>   679
#> 2 <tibble [2 × 3]>   321

summary$df[[1]]
#> # A tibble: 1 × 3
#>   person_id start_date end_date  
#>       <dbl> <date>     <date>    
#> 1        46 1991-11-29 2006-01-27

summary$df[[2]]
#> # A tibble: 2 × 3
#>   person_id start_date end_date  
#>       <dbl> <date>     <date>    
#> 1        46 1991-11-29 1991-11-29
#> 2        46 1994-12-16 1994-12-16

Created on 2023-06-22 with reprex v2.0.2

Opened issue on duckdb https://github.com/duckdb/duckdb/issues/8041

ablack3 commented 1 year ago

Here's the same thing on Postgres.

library(dplyr, warn.conflicts = F)
library(DatabaseConnector)
cd <- createConnectionDetails("postgresql", user = "postgres", password = "", server = "localhost/covid")
con <- connect(cd)
#> Connecting using PostgreSQL driver

raw_data <- tibble::tribble(
  ~person_id,  ~start_date,    ~end_date,
  46,         "1991-11-29", "1991-11-29",
  46,         "1991-11-29", "2006-01-27",
  46,         "1994-12-16", "1994-12-16"
) %>% dplyr::mutate(dplyr::across(dplyr::matches("date"), as.Date))

DBI::dbWriteTable(con, "scratch.raw_data", raw_data, overwrite = T)
#> Inserting data took 0.0251 secs

test_current_query <- function(x) {
  result <- DBI::dbGetQuery(con, "
  with cteEndDates (person_id, end_date) AS
  (
    SELECT
        person_id,
        event_date as end_date
    FROM
    (
        SELECT
            person_id,
            event_date,
            SUM(event_type) OVER (PARTITION BY person_id ORDER BY event_date ROWS UNBOUNDED PRECEDING) AS interval_status
        FROM
        (
            SELECT
                person_id,
                start_date AS event_date,
                -1 AS event_type
            FROM scratch.raw_data

            UNION ALL

            SELECT
                person_id,
                end_date AS event_date,
                1 AS event_type
            FROM scratch.raw_data
        ) RAWDATA
    ) e
    WHERE interval_status = 0
  ),
  cteEnds (person_id, start_date, end_date) AS
  (
    SELECT
        c.person_id,
        c.start_date,
        MIN(e.end_date) AS end_date
    FROM scratch.raw_data c
    INNER JOIN cteEndDates e ON c.person_id = e.person_id
        AND e.end_date >= c.start_date
    GROUP BY
        c.person_id,
        c.start_date
  )
  select person_id, min(start_date) as start_date, end_date
  from cteEnds
  group by person_id, end_date
  ;") %>% dplyr::tibble()

  return(result)
}

# run the test 1000 times
results <- tibble(run = 1:1000, df = purrr::map(1:1000, test_current_query))

# summarize test results
summary <- results %>%
  count(df)

summary
#> # A tibble: 1 × 2
#>   df                   n
#>   <list>           <int>
#> 1 <tibble [1 × 3]>  1000

summary$df[[1]]
#> # A tibble: 1 × 3
#>   person_id start_date end_date  
#>       <dbl> <date>     <date>    
#> 1        46 1991-11-29 2006-01-27

Created on 2023-06-22 with reprex v2.0.2

So maybe the issue was duckdb. Weird.

chrisknoll commented 1 year ago

The part that is non-deterministic is this:

 SUM(event_type) OVER (PARTITION BY person_id ORDER BY event_date ROWS UNBOUNDED PRECEDING) AS interval_status

If you don't include event_type in the sum()'s order by, then when a start and an end fall on the same event_date it may sum the -1 first in one run and the +1 first in another run. We want to ensure that -1 comes before +1 in the row order.

Ie:

 SUM(event_type) OVER (PARTITION BY person_id ORDER BY event_date, event_type ROWS UNBOUNDED PRECEDING) AS interval_status

If it's not doing that in the CohortIncidence script, it's a bug that should be fixed.

Update: looks like it is wrong; I'll apply a fix to this.

Update 2: I published a hotfix; this is the specific change that addresses it.

chrisknoll commented 1 year ago

Another note on the algorithm comments I made: if I understand correctly that starts are those start_dates that are greater than any of the prior end dates (not including the current row), then ROWS UNBOUNDED PRECEDING does include the current row, and you'll want to adjust that ROWS clause to consider all prior rows but not the current row. This is done with:

rows between unbounded preceding and 1 preceding
chrisknoll commented 1 year ago

@ablack3 ,

I'm running a very large scale CohortIncidence for HowOften. 99% of the work is being done in creating the eras of TAR and Excluded Time, so I've decided to revisit this issue and get to Era-fy SQL 2.0. Here's what I have come up with.

select person_id, min(start_date) as era_start, max(end_date) as era_end, count(*)
from (
  select id, person_id, start_date, end_date, sum(is_start) over (partition by person_id order by start_date, id rows unbounded preceding) group_idx
  from (
    select id, person_id, start_date, end_date, 
      max(end_date) over (partition by person_id order by start_date, id rows between unbounded preceding and 1 preceding) latest_end,
      case when max(end_date) over (partition by person_id order by start_date, id rows between unbounded preceding and 1 preceding) >= start_date then 0 else 1 end is_start
    from #RAW_EVENTS
  ) E
) G
GROUP BY person_id, group_idx

Simpler, eh?

Here's the logic:

  1. (inner query E): determine whether a row is a start by partitioning the raw data by person_id and ordering by start_date. A row is a start when it is the first row for the person (the max(end_date) ... rows between unbounded preceding and 1 preceding window will be null for the first row) OR when the current row's start date is later than the maximum end_date of the preceding rows. This query yields is_start = 1 for starts and 0 otherwise.
  2. (outer query G): by ordering again by start_date, we calculate a cumulative sum of is_start to create a 'grouping index' (group_idx) that brings together all the records that belong to the same era.
  3. (final query): we take min(start_date) and max(end_date) grouped by the grouping index to find our eras (see the worked trace below).
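To make the grouping concrete, here's a worked trace on a small hypothetical input for one person (my own made-up rows, not one of the test cases below):

 id  start_date  end_date    latest_end  is_start  group_idx
 1   2001-01-01  2001-01-10  (null)      1         1
 2   2001-01-05  2001-01-20  2001-01-10  0         1
 3   2001-02-01  2001-02-05  2001-01-20  1         2
 4   2001-02-03  2001-02-04  2001-02-05  0         2

Group 1 collapses to era (2001-01-01, 2001-01-20) and group 2 to (2001-02-01, 2001-02-05).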

Some notes on this approach: we have a row id to 'break ties' when there are duplicate start dates. Whenever we have that sort of thing in the data, we run into the non-determinism problem you hit with DuckDB. The alternative is to remove dupes with the technique you used earlier: group by start_date and take max(end_date) to bring events starting on the same date together (sketched below). However, even with the non-deterministic nature of it, I don't think it matters in this case: the inner query's latest_end is just a debugging column, and we only ran into a determinism problem when the sort order of one partition needed to line up with the sort order of another. In the final query I'd drop the latest_end column.
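For reference, that de-dup alternative might look something like this (just a sketch against the #RAW_EVENTS test table; once start_date is unique per person, the row id tie-breaker isn't needed):

select person_id, start_date, max(end_date) as end_date
from #RAW_EVENTS
group by person_id, start_date;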

In this new approach, we only need one partition/sort order to find the rows that represent starts, and a separate, unrelated partition/sort order to sum up the group indexes. (Side note: I attempted to do both operations in one query context, but it required 'nesting window functions', which is not allowed; see the sketch below.) In addition, there are zero joins in this approach at all, though depending on how the DBMS implements subqueries, there could be 'intermediate' results for subqueries E and G; a smart optimizer may be able to combine those so that there's just a single intermediate result.
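For illustration, this is roughly the shape of the nesting that gets rejected (sketch only; one window function used inside another window function's expression):

select person_id, start_date, end_date,
  sum(case when max(end_date) over (partition by person_id order by start_date, id rows between unbounded preceding and 1 preceding) >= start_date
           then 0 else 1 end)
    over (partition by person_id order by start_date, id rows unbounded preceding) as group_idx
from #RAW_EVENTS; -- engines generally reject this because window function calls cannot be nested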

So, this form reduces partition/sorts and joins, and so should technically run much more efficiently. Would you mind taking this for a spin? Below is my test code covering a few test cases:

-- DROP TABLE #RAW_EVENTS;
CREATE TABLE #RAW_EVENTS (
  id bigint not null,
  person_id bigint not null,
  start_date date not null,
  end_date date not null)
 DISTKEY (person_id);

TRUNCATE TABLE #RAW_EVENTS;

-- TEST 1
INSERT INTO #RAW_EVENTS values
(1, 1, '2001-01-01', '2001-01-15'),
(2, 1, '2001-01-01', '2001-02-15'),
(3, 1, '2001-01-13', '2001-01-13'),
(4, 1, '2001-01-14', '2001-02-03'),
(5, 1, '2001-02-01', '2001-04-01'),
(6, 1, '2001-05-01', '2001-05-15'),
(7, 1, '2001-05-05', '2001-05-10'),
(7, 1, '2001-01-01', '2001-06-01')
;

-- Test 2
INSERT INTO #RAW_EVENTS values
(1, 1, '2001-01-01', '2001-01-15'),
(2, 1, '2001-01-01', '2001-01-15'),
(3, 1, '2001-01-01', '2001-01-16'),
(4, 1, '2001-01-01', '2001-01-15')
;

-- Test 3
INSERT INTO #RAW_EVENTS values
(1, 1, '2001-01-01', '2001-01-02'),
(2, 1, '2001-01-02', '2001-01-03'),
(3, 1, '2001-01-03', '2001-01-04'),
(4, 1, '2001-01-04', '2001-01-05')
;

select * from #RAW_EVENTS;

select person_id, min(start_date) as era_start, max(end_date) as era_end, count(*)
from (
  select id, person_id, start_date, end_date, sum(is_start) over (partition by person_id order by start_date, id rows unbounded preceding) group_idx
  from (
    select id, person_id, start_date, end_date, 
      max(end_date) over (partition by person_id order by start_date, id rows between unbounded preceding and 1 preceding) latest_end,
      case when max(end_date) over (partition by person_id order by start_date, id rows between unbounded preceding and 1 preceding) >= start_date then 0 else 1 end is_start
    from #RAW_EVENTS
  ) E
) G
GROUP BY person_id, group_idx
;
ablack3 commented 1 year ago

Whoa, yes, very nice Chris! So my next task on this issue is to write a test that I want to run on all OHDSI databases. I'll try to get this done before the symposium. It's a busy time but it's on my list. Yeah, I really like this new query. It's easier to read, anyway.