Open ablack3 opened 2 years ago
@pbr6cornell - Here is a possible simplification I proposed to the era SQL logic. I finally closed it after >6 months with no interest. Could be that my code was incorrect though. I'm not sure if anyone looked at it.
Thanks @ablack3, really glad to have your critical thinking on this, and sorry that I missed your earlier post. If there's a more efficient approach to building drug_eras then I think we should all be eager to take advantage of that. It's a bit unclear to me procedurally where we'd want to share this insights. Looking to others for help. @clairblacketer @MelaniePhilofsky @chrisknoll . My immediate thought, that I'm not wedded to, is this may be the jurisdiction of the newly re-emerged THEMIS workgroup, who could decide on community convention for how to build eras, and it'd seem reasonable to recommend a codebase in the CDM repo as a best practice. I don't think there's a need to enforce that people use the same code to build eras, though clearly there should be a strong incentive if there's already efficient code with nice unit tests that is available, not sure why anyone would willingly want to re-invent the wheel (though I know some ETL processes are more conducive to processing eras one-person-at-a-time rather than bulk, so some modifications may need to be made).
I'm sorry I didn't see this either, it is possibly i'm only subscribed to CommonDataModel issues where I am tagged.
I looked it over a few times but I'm having trouble seeing the difference in the code. However, there has been some changes to the 'era builder' logic that Patrick came up with that we applied into the CohortIncidence module, you can see it here.
Explanation:
The current form you have uses a INNER JOIN and the calculation of WHERE 2 * E.START_ORDINAL - E.OVERALL_ORD = 0
to determine the end dates. Patrick noticed that we can simply find the end dates by using the event type (-1 for starts, 1 for end) to do a SUM() of the prior records that will subtract 1 for the start, and increment for the ends...leading to the case where when your SUM(prior) = 0, you've 'closed' all the prior starts. This prevents the need to yield the start_ord / overall_ord logic which incurs this cost of a join, and instead just take all starts and ends, line them up and then add -1 or +1 for each row to find the 'true ends' (sum() = 0).
So, while the form linked above leverages CTEs, it can be written to use subqueries but I think the modification Patrick came up with simplifies it very nicely.
Oh nice! It has been a while since I've look at this. I think I was able to remove one subquery and was thinking it might extend to other places where the era logic is used in the OHDSI codebase. Patrick's simplification looks better though at first glance. I'll need to spend some time to look over it.
I've also been thinking about how to test this type of query to guarantee it always produces correct results on all dbms. I recently added union and intersection functions (written with dplyr) to the Darwin R package I'm working on here and tests here. FYI.
And sorry for not tagging someone earlier. I'm not always sure where to put things. The OHDSI/Odysseus communication surface area is quite large!
Playing around with this idea for union
... Seems promising but there are two cases where it fails ("starts" and "finished by"). Are analytic functions (lag and lead) more efficient than union all?
union <- function(con, rel) {
intervals <- tibble::tribble(
~relationship, ~definition_id, ~person_id, ~start_dt, ~end_date,
"reference", 1, 1, "2022-01-05", "2022-01-10",
"precedes", 1, 1, "2022-01-01", "2022-01-03",
"meets", 1, 1, "2022-01-01", "2022-01-04",
"overlaps", 1, 1, "2022-01-01", "2022-01-05",
"finished_by", 1, 1, "2022-01-01", "2022-01-10",
"contains", 1, 1, "2022-01-01", "2022-01-15",
"starts", 1, 1, "2022-01-05", "2022-01-07",
"equals", 1, 1, "2022-01-05", "2022-01-10",
"reference", 1, 2, "2022-01-05", "2022-01-10",
"precedes", 1, 2, "2022-01-01", "2022-01-03",
"meets", 1, 2, "2022-01-01", "2022-01-04",
"overlaps", 1, 2, "2022-01-01", "2022-01-05",
"finished_by", 1, 2, "2022-01-01", "2022-01-10",
"contains", 1, 2, "2022-01-01", "2022-01-15",
"starts", 1, 2, "2022-01-05", "2022-01-07",
"equals", 1, 2, "2022-01-05", "2022-01-10",
"reference", 2, 1, "2022-01-05", "2022-01-10",
"precedes", 2, 1, "2022-01-01", "2022-01-03",
"meets", 2, 1, "2022-01-01", "2022-01-04",
"overlaps", 2, 1, "2022-01-01", "2022-01-05",
"finished_by", 2, 1, "2022-01-01", "2022-01-10",
"contains", 2, 1, "2022-01-01", "2022-01-15",
"starts", 2, 1, "2022-01-05", "2022-01-07",
"equals", 2, 1, "2022-01-05", "2022-01-10",
"reference", 2, 2, "2022-01-05", "2022-01-10",
"precedes", 2, 2, "2022-01-01", "2022-01-03",
"meets", 2, 2, "2022-01-01", "2022-01-04",
"overlaps", 2, 2, "2022-01-01", "2022-01-05",
"finished_by", 2, 2, "2022-01-01", "2022-01-10",
"contains", 2, 2, "2022-01-01", "2022-01-15",
"starts", 2, 2, "2022-01-05", "2022-01-07",
"equals", 2, 2, "2022-01-05", "2022-01-10") |>
dplyr::mutate(dplyr::across(dplyr::matches("date"), as.Date)) |>
dplyr::filter(relationship %in% c("reference", rel))
DBI::dbWriteTable(con, "intervals", intervals, overwrite = T)
print(dplyr::tbl(con, "intervals"))
print(tibble::tibble(DBI::dbGetQuery(con, "
WITH cte AS (
SELECT DISTINCT person_id, start_dt, end_date,
CASE WHEN
(start_dt NOT BETWEEN COALESCE( lag(start_dt) OVER(partition by person_id order by end_date), end_date + 1) AND COALESCE( lag(end_date) OVER(partition by person_id order by end_date), end_date + 1)) AND
(start_dt NOT BETWEEN COALESCE(lead(start_dt) OVER(partition by person_id order by end_date), end_date + 1) AND COALESCE(lead(end_date) OVER(partition by person_id order by end_date), end_date + 1)) AND
(start_dt NOT BETWEEN COALESCE( lag(start_dt) OVER(partition by person_id order by start_dt), end_date + 1) AND COALESCE( lag(end_date) OVER(partition by person_id order by start_dt), end_date + 1)) AND
(start_dt NOT BETWEEN COALESCE(lead(start_dt) OVER(partition by person_id order by start_dt), end_date + 1) AND COALESCE(lead(end_date) OVER(partition by person_id order by start_dt), end_date + 1))
THEN start_dt ELSE NULL END AS valid_start,
CASE WHEN
(end_date NOT BETWEEN COALESCE( lag(start_dt) OVER(partition by person_id order by end_date), end_date + 1) AND COALESCE( lag(end_date) OVER(partition by person_id order by end_date), end_date + 1)) AND
(end_date NOT BETWEEN COALESCE(lead(start_dt) OVER(partition by person_id order by end_date), end_date + 1) AND COALESCE(lead(end_date) OVER(partition by person_id order by end_date), end_date + 1)) AND
(end_date NOT BETWEEN COALESCE( lag(start_dt) OVER(partition by person_id order by start_dt), end_date + 1) AND COALESCE( lag(end_date) OVER(partition by person_id order by start_dt), end_date + 1)) AND
(end_date NOT BETWEEN COALESCE(lead(start_dt) OVER(partition by person_id order by start_dt), end_date + 1) AND COALESCE(lead(end_date) OVER(partition by person_id order by start_dt), end_date + 1))
THEN end_date ELSE NULL END AS valid_end from (SELECT DISTINCT person_id, start_dt, end_date FROM intervals)
)
SELECT DISTINCT a.person_id, a.valid_start AS start_dt, MIN(b.valid_end) AS end_date
FROM cte a JOIN cte b on a.person_id = b.person_id
WHERE a.valid_start <= b.valid_end
GROUP BY a.person_id, a.valid_start
")))
}
con <- DBI::dbConnect(duckdb::duckdb())
purrr::walk(c("precedes", "meets", "overlaps", "finished_by", "contains", "starts", "equals"), ~union(con, .))
#> # Source: table<intervals> [8 x 5]
#> # Database: DuckDB 0.8.0 [root@Darwin 21.6.0:R 4.2.2/:memory:]
#> relationship definition_id person_id start_dt end_date
#> <chr> <dbl> <dbl> <chr> <date>
#> 1 reference 1 1 2022-01-05 2022-01-10
#> 2 precedes 1 1 2022-01-01 2022-01-03
#> 3 reference 1 2 2022-01-05 2022-01-10
#> 4 precedes 1 2 2022-01-01 2022-01-03
#> 5 reference 2 1 2022-01-05 2022-01-10
#> 6 precedes 2 1 2022-01-01 2022-01-03
#> 7 reference 2 2 2022-01-05 2022-01-10
#> 8 precedes 2 2 2022-01-01 2022-01-03
#> # A tibble: 4 × 3
#> person_id start_dt end_date
#> <dbl> <chr> <date>
#> 1 2 2022-01-01 2022-01-03
#> 2 2 2022-01-05 2022-01-10
#> 3 1 2022-01-01 2022-01-03
#> 4 1 2022-01-05 2022-01-10
#> # Source: table<intervals> [8 x 5]
#> # Database: DuckDB 0.8.0 [root@Darwin 21.6.0:R 4.2.2/:memory:]
#> relationship definition_id person_id start_dt end_date
#> <chr> <dbl> <dbl> <chr> <date>
#> 1 reference 1 1 2022-01-05 2022-01-10
#> 2 meets 1 1 2022-01-01 2022-01-04
#> 3 reference 1 2 2022-01-05 2022-01-10
#> 4 meets 1 2 2022-01-01 2022-01-04
#> 5 reference 2 1 2022-01-05 2022-01-10
#> 6 meets 2 1 2022-01-01 2022-01-04
#> 7 reference 2 2 2022-01-05 2022-01-10
#> 8 meets 2 2 2022-01-01 2022-01-04
#> # A tibble: 4 × 3
#> person_id start_dt end_date
#> <dbl> <chr> <date>
#> 1 2 2022-01-01 2022-01-04
#> 2 2 2022-01-05 2022-01-10
#> 3 1 2022-01-01 2022-01-04
#> 4 1 2022-01-05 2022-01-10
#> # Source: table<intervals> [8 x 5]
#> # Database: DuckDB 0.8.0 [root@Darwin 21.6.0:R 4.2.2/:memory:]
#> relationship definition_id person_id start_dt end_date
#> <chr> <dbl> <dbl> <chr> <date>
#> 1 reference 1 1 2022-01-05 2022-01-10
#> 2 overlaps 1 1 2022-01-01 2022-01-05
#> 3 reference 1 2 2022-01-05 2022-01-10
#> 4 overlaps 1 2 2022-01-01 2022-01-05
#> 5 reference 2 1 2022-01-05 2022-01-10
#> 6 overlaps 2 1 2022-01-01 2022-01-05
#> 7 reference 2 2 2022-01-05 2022-01-10
#> 8 overlaps 2 2 2022-01-01 2022-01-05
#> # A tibble: 2 × 3
#> person_id start_dt end_date
#> <dbl> <chr> <date>
#> 1 2 2022-01-01 2022-01-10
#> 2 1 2022-01-01 2022-01-10
#> # Source: table<intervals> [8 x 5]
#> # Database: DuckDB 0.8.0 [root@Darwin 21.6.0:R 4.2.2/:memory:]
#> relationship definition_id person_id start_dt end_date
#> <chr> <dbl> <dbl> <chr> <date>
#> 1 reference 1 1 2022-01-05 2022-01-10
#> 2 finished_by 1 1 2022-01-01 2022-01-10
#> 3 reference 1 2 2022-01-05 2022-01-10
#> 4 finished_by 1 2 2022-01-01 2022-01-10
#> 5 reference 2 1 2022-01-05 2022-01-10
#> 6 finished_by 2 1 2022-01-01 2022-01-10
#> 7 reference 2 2 2022-01-05 2022-01-10
#> 8 finished_by 2 2 2022-01-01 2022-01-10
#> # A tibble: 0 × 3
#> # ℹ 3 variables: person_id <dbl>, start_dt <chr>, end_date <date>
#> # Source: table<intervals> [8 x 5]
#> # Database: DuckDB 0.8.0 [root@Darwin 21.6.0:R 4.2.2/:memory:]
#> relationship definition_id person_id start_dt end_date
#> <chr> <dbl> <dbl> <chr> <date>
#> 1 reference 1 1 2022-01-05 2022-01-10
#> 2 contains 1 1 2022-01-01 2022-01-15
#> 3 reference 1 2 2022-01-05 2022-01-10
#> 4 contains 1 2 2022-01-01 2022-01-15
#> 5 reference 2 1 2022-01-05 2022-01-10
#> 6 contains 2 1 2022-01-01 2022-01-15
#> 7 reference 2 2 2022-01-05 2022-01-10
#> 8 contains 2 2 2022-01-01 2022-01-15
#> # A tibble: 2 × 3
#> person_id start_dt end_date
#> <dbl> <chr> <date>
#> 1 2 2022-01-01 2022-01-15
#> 2 1 2022-01-01 2022-01-15
#> # Source: table<intervals> [8 x 5]
#> # Database: DuckDB 0.8.0 [root@Darwin 21.6.0:R 4.2.2/:memory:]
#> relationship definition_id person_id start_dt end_date
#> <chr> <dbl> <dbl> <chr> <date>
#> 1 reference 1 1 2022-01-05 2022-01-10
#> 2 starts 1 1 2022-01-05 2022-01-07
#> 3 reference 1 2 2022-01-05 2022-01-10
#> 4 starts 1 2 2022-01-05 2022-01-07
#> 5 reference 2 1 2022-01-05 2022-01-10
#> 6 starts 2 1 2022-01-05 2022-01-07
#> 7 reference 2 2 2022-01-05 2022-01-10
#> 8 starts 2 2 2022-01-05 2022-01-07
#> # A tibble: 0 × 3
#> # ℹ 3 variables: person_id <dbl>, start_dt <chr>, end_date <date>
#> # Source: table<intervals> [8 x 5]
#> # Database: DuckDB 0.8.0 [root@Darwin 21.6.0:R 4.2.2/:memory:]
#> relationship definition_id person_id start_dt end_date
#> <chr> <dbl> <dbl> <chr> <date>
#> 1 reference 1 1 2022-01-05 2022-01-10
#> 2 equals 1 1 2022-01-05 2022-01-10
#> 3 reference 1 2 2022-01-05 2022-01-10
#> 4 equals 1 2 2022-01-05 2022-01-10
#> 5 reference 2 1 2022-01-05 2022-01-10
#> 6 equals 2 1 2022-01-05 2022-01-10
#> 7 reference 2 2 2022-01-05 2022-01-10
#> 8 equals 2 2 2022-01-05 2022-01-10
#> # A tibble: 2 × 3
#> person_id start_dt end_date
#> <dbl> <chr> <date>
#> 1 2 2022-01-05 2022-01-10
#> 2 1 2022-01-05 2022-01-10
DBI::dbDisconnect(con, shutdown = T)
Created on 2023-06-12 with reprex v2.0.2
Sorry, I am not clear on what you mean by UNION ALL vs. window functions....
Looking at your code, I'm seeing several different sort orders while the one Patrick has only sorts a set of days once (after extracting all starts and ends). I'm fairly certain that issuing those different sorcs across the window functions will be slower, but the only way to know is to test.
Just wondering if it's possible to have some intuition about which query is faster. I'll try testing them. I really only need two sorts (by start date and end date) for this algorithm. Not sure if there's a better way to write it in SQL.
Here's some advice from the bigquery documentation. Not sure if this applies broadly to all dbms though.
Some initial comparison results.
SQL Server
Union All looks considerably faster on sql server
Redshift
less of a difference on redshift.
Here's the query I'm running which probably still needs some tweaking. I'm purposely not grouping by drug ingredient right now because I want to simply see if the time interval union operation can be made faster using this approach. The time interval union operation is a fundamental building block.
WITH cte1 AS (
SELECT DISTINCT
person_id,
MIN(drug_exposure_start_date) OVER (PARTITION BY person_id, drug_exposure_end_date) AS start_dt,
MAX(coalesce(drug_exposure_end_date, drug_exposure_start_date)) OVER (PARTITION BY person_id, drug_exposure_start_date) AS end_date
FROM {cdm_schema}.drug_exposure
WHERE drug_exposure_start_date IS NOT NULL
),
cte2 AS (
SELECT DISTINCT person_id, start_dt, end_date,
CASE WHEN
(start_dt NOT BETWEEN COALESCE( lag(start_dt) OVER(PARTITION BY person_id ORDER BY end_date), DATEADD(day, 1, end_date)) AND COALESCE( lag(end_date) OVER(PARTITION BY person_id ORDER BY end_date), DATEADD(day, 1, end_date))) AND
(start_dt NOT BETWEEN COALESCE(lead(start_dt) OVER(PARTITION BY person_id ORDER BY end_date), DATEADD(day, 1, end_date)) AND COALESCE(lead(end_date) OVER(PARTITION BY person_id ORDER BY end_date), DATEADD(day, 1, end_date))) AND
(start_dt NOT BETWEEN COALESCE( lag(start_dt) OVER(PARTITION BY person_id ORDER BY start_dt), DATEADD(day, 1, end_date)) AND COALESCE( lag(end_date) OVER(PARTITION BY person_id ORDER BY start_dt), DATEADD(day, 1, end_date))) AND
(start_dt NOT BETWEEN COALESCE(lead(start_dt) OVER(PARTITION BY person_id ORDER BY start_dt), DATEADD(day, 1, end_date)) AND COALESCE(lead(end_date) OVER(PARTITION BY person_id ORDER BY start_dt), DATEADD(day, 1, end_date)))
THEN start_dt ELSE NULL END AS valid_start,
CASE WHEN
(end_date NOT BETWEEN COALESCE( lag(start_dt) OVER(PARTITION BY person_id ORDER BY end_date), DATEADD(day, 1, end_date)) AND COALESCE( lag(end_date) OVER(PARTITION BY person_id ORDER BY end_date), DATEADD(day, 1, end_date))) AND
(end_date NOT BETWEEN COALESCE(lead(start_dt) OVER(PARTITION BY person_id ORDER BY end_date), DATEADD(day, 1, end_date)) AND COALESCE(lead(end_date) OVER(PARTITION BY person_id ORDER BY end_date), DATEADD(day, 1, end_date))) AND
(end_date NOT BETWEEN COALESCE( lag(start_dt) OVER(PARTITION BY person_id ORDER BY start_dt), DATEADD(day, 1, end_date)) AND COALESCE( lag(end_date) OVER(PARTITION BY person_id ORDER BY start_dt), DATEADD(day, 1, end_date))) AND
(end_date NOT BETWEEN COALESCE(lead(start_dt) OVER(PARTITION BY person_id ORDER BY start_dt), DATEADD(day, 1, end_date)) AND COALESCE(lead(end_date) OVER(PARTITION BY person_id ORDER BY start_dt), DATEADD(day, 1, end_date)))
THEN end_date ELSE NULL END AS valid_end
FROM cte1
),
cte3 AS (
-- could also use a join instead of a subquery. Not sure which is better.
SELECT DISTINCT
person_id,
MAX(valid_start) OVER (partition by person_id ORDER BY valid_start ROWS UNBOUNDED PRECEDING) AS start_date,
valid_end AS end_date
FROM cte2
)
SELECT * FROM cte3 WHERE end_date IS NOT NULL and start_date is not null -- is there a way to remove this last subquery?
I looked it over a few times but I'm having trouble seeing the difference in the code.
Here is a diff with the updated code on the left and current code on the right.
Ok, i wouldn't look at that old form, I'd use the new form described above that does the sum(event_type) to keep track of opens and closes.
I'm not sure (in the new form) where the 'self join' lies...we need to capture the set of start and end dates from the rows to erafy and then sum up the event_type to identify which dates are the 'ends' (ie: sum(event_type) = 0). Then, we need to take the min(start_date) for the rows that are grouped by the end dates and that gives us the eras. But to do that, we do need to group the start dates to the earliest end date that occurrs after each start, and then once we have that, to collapse the rows to a start-end using min(start_date) group by end_date.
But I think the part of the code you are focusing on is to determine the 'end dates'.
I'm not sure how a lead/lag window function helps us here because a era could be built across an arbitrary number of rows....we are using a window function (via the SUM() over (order by event_date)) but the step in the UNION ALL is used to combine start dates and end dates into a single list.
To illustrate the 'acoss arbitrary number of rows' case:
|-------------------------------------------------|
|====|
|====|
|====|
|====|
The figure is 'order by start_date) The first row keeps the remaining 4 rows open even tho the remaining 4 don't overlap each other. Maybe there is a way to avoid the step of merging start/end together and looking at the raw event rows and doing some sort of logic with the start/end, but I'm pretty sure I came to the conclusion of merging start/end dates handled all cases. We even see this in logic in treatment pathways.
Yea I see. I think you might be right. I was curious about trying to find the end_dates without needing the "union all" step. But the example you showed is a counter-example so I'll need to think about it some more. My bigger question is if we can "prove" that the algorithm will always produce the correct result. Like write a test that covers all possible cases.
For the drug_era query it seems like we can simplify it using Patrick's improvement. I'll take a stab at it.
@chrisknoll would this query work for getting start and end dates? It removes the join and union all. How would I test all possible cases?
with cte1 as (
select distinct
person_id,
min(start_dt) over (partition by person_id, end_date) as start_dt,
max(end_date) over (partition by person_id, start_dt) as end_date,
min(start_dt) over (partition by person_id order by start_dt rows between unbounded preceding and 1 preceding) as previous_start_dt,
max(end_date) over (partition by person_id order by start_dt rows between unbounded preceding and 1 preceding) as previous_end_date
from intervals
),
cte2 as(
select person_id, end_date,
case when previous_start_dt is not null and start_dt between previous_start_dt and previous_end_date
then previous_start_dt else start_dt end as valid_start
from cte1
)
select person_id, valid_start, max(end_date) as valid_end
from cte2
group by person_id, valid_start
Hi, @ablack3,
@chrisknoll would this query work for getting start and end dates? It removes the join and union all.
I'm not sure. I'm having a hard time locking into the logic that yields the end dates from your code. Maybe if you provided a pseudo-code type of description about what's happening? I am a little suspicious of the different partitions being applied on the different columns, but I think it's core to the concept you're presenting, I just don't follow it.
For example, the way I'd describe the 'end dates' determination from the original query, it is:
Create the list of event dates by extracting start_date as event _date, -1 as event_type
from starts, end_date as event date, 1 as event-type
from ends, cnd sort that list by event_date, event_type, causing 'start' event dates to come before ends when they are on the same date. This has the effect of causing the starting date being accumulated in the SUM() before ends.
End dates are those dates where the sum() results in zero, ie: the -1 from the starts is matched to the +1 of the ends, and when the sum is zero, then all starts have been closed up to that point in time. These dates are 'era-end-dates'.
The next part of the algorithm (which I don't think you're interested here, but for completeness): once you know the end dates, then you can determine a start-end duration by taking the minimum (era_end_date) for any end date that occurs after the start dates, and from this result you take the min(start_date) group by era_end_date, to collapse the intermediate rows into the 'era-fied rows'.
So I think you're just focused on determining the end_dates of the eras, but I'm not clear on how the partitioning strategy you used works, so if you could explain the logic, that would help me. I may not need to know, tho, if we can create a comprehensive set of test cases that demonstrate the robustness.
How would I test all possible cases?
It would depend on what technology you want to use to run tests. On the Java side, I have a nice library that is designed for DB Unit Tests (DBUnit) which has some usefull utility classes to populate a test db schema with test data, run your tests to generate results, and then assert
functions to determine if the expected results match the actual results.
If you're doing this in R, then I think a combination of DatabaseConnector loadTable()
can be used to insert test cases, then you execute the test query to get a results dataframe, and then you can use core R functions to get a diff between dataframes (there is probably a library that can do this). What I like about DBUnit is that the comparisons are implements in asserts, and if the assert fails it tells you which row, column and how the data is different between the expected and actual results. Maybe in R, we would fire up a duckdb instance to host the tests and run the SQL. I wouldn't use SqlLite because I think date column support is unsatisfactory for what you're doing here.
As far as test cases? I can give it to you visually"
|=====|
Results:
|=====|
|---|
|------|
|---------|
|------------|
|---------------|
Results:
|---------------|
|--------|
|--------|
|--------|
Results:
|--------|
|---------|
|---------|
|---------|
|----------------------------------------|
|---------|
|---------|
|---------|
Results:
|--------------------------------------------------------------------|
Admittedly, probably easier to provide test cases in tabular/csv form where you have the start-end dates in the input and the output start-end after collapse, but These are the types of tests I'm thinking about.
Wow, thanks for this detailed guidance Chris. I'll try to write up a complete response in the next few days.
Hi @chrisknoll,
I want to work through three topics: 1. intuition, 2. correctness, and 3. performance of the algorithm. I am indeed interested in getting both start and end dates. This is an interval union operation which forms the basis for the era-fy operation used in the drug era queries as well as circe cohort definitions. Needless to say this SQL gets executed a lot. I would say this interval union operation is more fundamental than era-fy because era-fy can be defined as
pad_end_dates(gap) |> interval_union() |> unpad_end_dates(gap)
This is why I like the "functional SQL" approach taken by dplyr in R and funSQL in Julia. We can define these useful primitive operations and then compose them (|>
denotes the composition operation).
This algorithm comes from my days programming in SAS when I had to think in rowwise operations. Generally every data manipulation in SAS requires one pass through each row of the data. This makes sort order very important and in SAS you generally want to minimize the number of times sorting is performed. SQL is (supposed to be) declarative which means I just tell the database what I want as a result and don't think about how that result is obtained. However I believe in practice there are optimizations that SQL programmers use to get better performance from their queries. This knowledge has always seemed quite cryptic to me and probably highly database specific since different database engines store the data in very different ways. Regardless, my instinct is that that eliminating a join and union-all should lead to a performance gain, but we'll see.
Pseudo-pseudocode First let's deal with intervals where the start_dates are equal. To do this just group by person_id, cohort_id (or drug_ingredient_id) and start_date. For each group set the end_date = max(end_date)
Next deal with intervals where the end_dates are equal Group by person_id, cohort_id and end_date. For each group set start_date = min(start_date)
Keep only distinct combinations of person_id, cohort_id, start_date, end_date.
Now we know that we don't have duplicates or intervals where start_date or end_dates are equal.
Sort the data by start_date. Proceed top to bottom through the dataset one row at a time. On the first row set valid_start = start_date and move to the second row. On every other row, if the start_date is between the previous cumulative_min(start_date) and previous cumulative_max(end_date) then keep valid_start the same (we don't have a new interval) else set valid_start = start_date(we have a new interval)
Now we have identified all the "valid" start_dates.
To wrap up just group by person_id, cohort/ingredient_id, valid_start_date and assign valid_end_date = max(end_date)
And voila. I think that's it.
This query is my attempt to implement that algorithm in SQL. Pretty concise actually. Who know's how the database actually performs the operation though.
with cte1 as (
select distinct
person_id,
min(start_dt) over (partition by person_id, end_date) as start_dt,
max(end_date) over (partition by person_id, start_dt) as end_date,
min(start_dt) over (partition by person_id order by start_dt rows between unbounded preceding and 1 preceding) as previous_start_dt,
max(end_date) over (partition by person_id order by start_dt rows between unbounded preceding and 1 preceding) as previous_end_date
from intervals
),
cte2 as(
select person_id, end_date,
case when previous_start_dt is not null and start_dt between previous_start_dt and previous_end_date
then previous_start_dt else start_dt end as valid_start
from cte1
)
select person_id, valid_start, max(end_date) as valid_end
from cte2
group by person_id, valid_start
I have ignored the cohort_id/ingredient_id but that can be easily added as an additional grouping variable.
Here is my attempt to exhaustively check the correctness of this algorithm. I'm not convinced I have covered all cases so let me know if you think of anything I've missed. My approach was to check all possible relationships between two intervals as given by Allen's interval algebra and add the cases you suggested.
test_collapse <- function(test_case) {
library(dplyr, warn.conflicts = F)
if (test_case == "simple") {
intervals <- tibble::tribble(
~cohort_definition_id, ~subject_id, ~cohort_start_date, ~cohort_end_date,
1, 1, "2022-01-05", "2022-01-10") %>%
dplyr::mutate(dplyr::across(dplyr::matches("date"), as.Date))
} else if (test_case == "same starts, different ends") {
intervals <- tibble::tribble(
~cohort_definition_id, ~subject_id, ~cohort_start_date, ~cohort_end_date,
1, 1, "2022-01-05", "2022-01-10",
1, 1, "2022-01-05", "2022-01-15",
1, 1, "2022-01-05", "2022-01-17",
1, 1, "2022-01-05", "2022-01-19",
1, 1, "2022-01-05", "2022-01-20") %>%
dplyr::mutate(dplyr::across(dplyr::matches("date"), as.Date))
} else if (test_case == "duplicate rows") {
intervals <- tibble::tribble(
~cohort_definition_id, ~subject_id, ~cohort_start_date, ~cohort_end_date,
1, 1, "2022-01-05", "2022-01-10",
1, 1, "2022-01-05", "2022-01-10",
1, 1, "2022-01-05", "2022-01-10",
1, 1, "2022-01-05", "2022-01-10",
1, 1, "2022-01-05", "2022-01-10") %>%
dplyr::mutate(dplyr::across(dplyr::matches("date"), as.Date))
} else if (test_case == "Era carried forward by inner interval") {
intervals <- tibble::tribble(
~cohort_definition_id, ~subject_id, ~cohort_start_date, ~cohort_end_date,
1, 1, "2022-01-05", "2022-01-08",
1, 1, "2022-01-07", "2022-01-10",
1, 1, "2022-01-09", "2022-01-12",
1, 1, "2022-01-11", "2022-01-21",
1, 1, "2022-01-20", "2022-01-23",
1, 1, "2022-01-22", "2022-01-25",
1, 1, "2022-01-24", "2022-01-27") %>%
dplyr::mutate(dplyr::across(dplyr::matches("date"), as.Date))
} else {
intervals <- tibble::tribble(
~relationship, ~cohort_definition_id, ~subject_id, ~cohort_start_date, ~cohort_end_date,
"reference", 1, 1, "2022-01-05", "2022-01-10",
"precedes", 2, 1, "2022-01-01", "2022-01-03",
"meets", 3, 1, "2022-01-01", "2022-01-04",
"overlaps", 5, 1, "2022-01-01", "2022-01-05",
"finished_by", 7, 1, "2022-01-01", "2022-01-10",
"contains", 8, 1, "2022-01-01", "2022-01-15",
"starts", 9, 1, "2022-01-05", "2022-01-07",
"equals", 10, 1, "2022-01-05", "2022-01-10",
"reference", 1, 2, "2022-01-05", "2022-01-10",
"precedes", 2, 2, "2022-01-01", "2022-01-03",
"meets", 3, 2, "2022-01-01", "2022-01-04",
"overlaps", 5, 2, "2022-01-01", "2022-01-05",
"finished_by", 7, 2, "2022-01-01", "2022-01-10",
"contains", 8, 2, "2022-01-01", "2022-01-15",
"starts", 9, 2, "2022-01-05", "2022-01-07",
"equals", 10, 2, "2022-01-05", "2022-01-10") %>%
dplyr::mutate(dplyr::across(dplyr::matches("date"), as.Date)) %>%
dplyr::filter(.data$relationship %in% c("reference", .env$test_case))
}
con <- DBI::dbConnect(duckdb::duckdb())
DBI::dbWriteTable(con, "intervals", intervals, overwrite = T)
start_sql <- dbplyr::win_over(sql("min(cohort_start_date)"), partition = c("subject_id", "cohort_end_date"), con = con)
end_sql <- dbplyr::win_over(sql("max(cohort_end_date)"), partition = c("subject_id", "cohort_start_date"), con = con)
prev_start_sql <- dbplyr::win_over(sql("min(cohort_start_date)"), partition = "subject_id", frame = c(-Inf, -1), order = "cohort_start_date", con = con)
prev_end_sql <- dbplyr::win_over(sql("max(cohort_end_date)"), partition = "subject_id", frame = c(-Inf, -1), order = "cohort_start_date", con = con)
df <- tbl(con, "intervals") %>%
transmute(subject_id,
start_date = start_sql,
end_date = end_sql,
prev_start = prev_start_sql,
prev_end = prev_end_sql) %>%
distinct() %>%
transmute(subject_id,
end_date,
valid_start = ifelse(!is.na(prev_start) & between(start_date, prev_start, prev_end), prev_start, start_date)) %>%
group_by(subject_id, valid_start) %>%
summarise(valid_end = max(end_date, na.rm = TRUE), .groups = "drop") %>%
collect()
DBI::dbDisconnect(con, shutdown = T)
df
}
test_cases <- c(
"simple", "same starts, different ends", "duplicate rows", "Era carried forward by inner interval",
"precedes", "meets", "overlaps", "finished_by", "contains", "starts", "equals")
purrr::map(test_cases, test_collapse) |> rlang::set_names(test_cases)
#> $simple
#> # A tibble: 1 × 3
#> subject_id valid_start valid_end
#> <dbl> <date> <date>
#> 1 1 2022-01-05 2022-01-10
#>
#> $`same starts, different ends`
#> # A tibble: 1 × 3
#> subject_id valid_start valid_end
#> <dbl> <date> <date>
#> 1 1 2022-01-05 2022-01-20
#>
#> $`duplicate rows`
#> # A tibble: 1 × 3
#> subject_id valid_start valid_end
#> <dbl> <date> <date>
#> 1 1 2022-01-05 2022-01-10
#>
#> $`Era carried forward by inner interval`
#> # A tibble: 1 × 3
#> subject_id valid_start valid_end
#> <dbl> <date> <date>
#> 1 1 2022-01-05 2022-01-27
#>
#> $precedes
#> # A tibble: 4 × 3
#> subject_id valid_start valid_end
#> <dbl> <date> <date>
#> 1 2 2022-01-01 2022-01-03
#> 2 2 2022-01-05 2022-01-10
#> 3 1 2022-01-01 2022-01-03
#> 4 1 2022-01-05 2022-01-10
#>
#> $meets
#> # A tibble: 4 × 3
#> subject_id valid_start valid_end
#> <dbl> <date> <date>
#> 1 2 2022-01-01 2022-01-04
#> 2 2 2022-01-05 2022-01-10
#> 3 1 2022-01-01 2022-01-04
#> 4 1 2022-01-05 2022-01-10
#>
#> $overlaps
#> # A tibble: 2 × 3
#> subject_id valid_start valid_end
#> <dbl> <date> <date>
#> 1 2 2022-01-01 2022-01-10
#> 2 1 2022-01-01 2022-01-10
#>
#> $finished_by
#> # A tibble: 2 × 3
#> subject_id valid_start valid_end
#> <dbl> <date> <date>
#> 1 2 2022-01-01 2022-01-10
#> 2 1 2022-01-01 2022-01-10
#>
#> $contains
#> # A tibble: 2 × 3
#> subject_id valid_start valid_end
#> <dbl> <date> <date>
#> 1 2 2022-01-01 2022-01-15
#> 2 1 2022-01-01 2022-01-15
#>
#> $starts
#> # A tibble: 2 × 3
#> subject_id valid_start valid_end
#> <dbl> <date> <date>
#> 1 2 2022-01-05 2022-01-10
#> 2 1 2022-01-05 2022-01-10
#>
#> $equals
#> # A tibble: 2 × 3
#> subject_id valid_start valid_end
#> <dbl> <date> <date>
#> 1 2 2022-01-05 2022-01-10
#> 2 1 2022-01-05 2022-01-10
Created on 2023-06-21 with reprex v2.0.2
I've used dplyr to construct the target SQL query. I plan to test the translation on each OHDSI dbms. OHDSI-SQL should work similarly though. All of these results look correct to me.
TBD 😄 . Probably best to establish correctness first.
Thanks, I'll try to digest this.
One thing I would ask you could change/elaborate on: could you change cte1
and cte2
into meaningful names? I'd like some moniker that would represent what that set of data represents.
This is why I like the "functional SQL" approach taken by dplyr in R and funSQL in Julia. We can define these useful primitive operations and then compose them (|> denotes the composition operation).
I definitely don't want to get into debates over functional programming vs. other things, but I will note that I don't think SQL lends itself well to composition, but rather more towards set-oriented operations....but this is coming from a perspective of very shallow usage of functional/compositional paradigms, but to be fair, we're trying to solve this in a SQL-space.
So, let me give you my initial thoughts, with a spoiler that I think you may be on to something, which I'll try to explain.
The first part about your pseudo-pseudo code: I understand a bit better how you're trying to start with non-duplicated starts/ends so that the remainder of your process can work. One benefit of the 'existing' algo is that it doesn't depend on that data-prep step, but I do understand now how the partition of start dates and end dates will group those identical start/ends together so you can collapse them into distinct intervals. A few thoughts:
select distinct
person_id,
min(start_dt) over (partition by person_id, end_date) as start_dt,
max(end_date) over (partition by person_id, start_dt) as end_date,
min(start_dt) over (partition by person_id order by start_dt rows between unbounded preceding and 1 preceding) as previous_start_dt,
max(end_date) over (partition by person_id order by start_dt rows between unbounded preceding and 1 preceding) as previous_end_date
from intervals
So there's partitions on end_date
, start_dt
, and an order by start_dt
(which is used in 2 different window functions, so I think it only will do 1 sort here). All of these will arrange the source rows in intervals
a different way, so there's a cost to it, and then you finally have a 'distinct' which has to sweep through the set and remove dupes. So, I think the point of CTE1 is to just prepare a set of data that de-dupes but also contains the prior row's min start and max end (which you call previous_start and previous end) which you'll use in the second part of your algorithm...interesting note tho, that the reference to start_dt/end_date that is used in the window functions are coming from the portioned set of data from intervals, and not from the as start_dt
that you are projecting from the column 2 and 3 of CTE1. I'm sure you know that, but just pointing out that it's a little confusing when you yield a column from a window function that is the same name as the source column.
Then when you get into CTE2, you need the result of CTE1 in order to have the min/max/previous set 'yielded' so you can do your case statements around a more readable form, tho without ctes, it's simply:
select person_id, end_date,
case when previous_start_dt is not null and start_dt between previous_start_dt and previous_end_date
then previous_start_dt
else start_dt end as valid_start
FROM (
select distinct
person_id,
min(start_dt) over (partition by person_id, end_date) as start_dt,
max(end_date) over (partition by person_id, start_dt) as end_date,
min(start_dt) over (partition by person_id order by start_dt rows between unbounded preceding and 1 preceding) as previous_start_dt,
max(end_date) over (partition by person_id order by start_dt rows between unbounded preceding and 1 preceding) as previous_end_date
from intervals
) cte1
So, from a performance perspective, I'd compare distinct (or group by), partitions and sorts to try to evaluate the computational load difference between them. I'd also rely on dbms-specific tools like explain
or show plan
to see where the sql engine is performing sorts and memory swaps, etc. That would give you the exact computational load, with the cost of having to digest and interpret the runtime statistics (which may be different based on memory availability, set size, etc)...so sometimes I just look at the logical ops in the query manually.
This is where I think you may be onto something. I think the main difference between the original implementation is the original finds ends by using the sum(event_type) (which sums -1s and +1s until you get to a zero) to identify the actual end dates, while your approach is using a case/if statement to decide if a row should start a new interval when the current start is later than the maximum prior ends ie: when you have a start date that is later than any of your previous end dates, that means you've hit a start date that is a new interval. That will yield you your 'valid starts' as you say. Then once you group your valid starts to the max(ends) from the prior (including current) rows, you have your eras.
So I think where it might be a possible improvement is that by using the case logic with 'is this row's start date later than previous row's max(end) and the end dates are always the maximum end_dates from the prior rows (I think you have to include the current row in this), then that eliminates the need to 'extract' start/end dates into a single set where you sum() to find which of the dates are actual end dates. This would reduce the algo to just sort by start_date (within a partition of person_id), and you use the window function's lookback (rows unbound preceding) to find some value within that partition/ordered set. I think it would just be one partition/order operation, which would reduce the computational load by a lot (theoretically).
I think the initial de-duping logic is unnecessary here and in most cases is pointless work: it's not typically the case that you have a ton of matching start and end dates that you would be able to partition together: if you look at a patient profile, events are scattered all over the place, and it's really unlikely that you see things end all at the same date (tho you might see a few things start on the same date because it is possible events originate out of the same healthcare encounter, but then they may have different durations afterwards. It would be interesting what your query looks like if you remove the prep work of de-duping starts/ends and just take the raw data, sort it by start date, and use your case/when statement to determine if you've reached an interval based on the current start_date being later than the latest end date from the prior rows.
I see what you're going after by tying to achieve 'completeness' by covering all the cases described in Allen's interval algebra, but having intervals that exhibit those relationships doesn't guarantee you have covered all your cases: the collapsing logic is complicated because of the different ways an era can be 'extended' by the combination of overlapping rows that are formed across multiple records. So while you will certainly see instances of Allen's intervals in yoru test cases, it's a combination of those intervals in creative ways that will get you to completeness, not that you've put one of those intervals in each case. That's just my gut instinct...but I think to call this solution 'complete' (or full test coverage) we'd want to define the individual cases (by dataset examples) that illustrate a logical challenge for the algorithm, and then demonstrate the correct result. So things like 'duplicate intervals', 'sequential extension', 'one-row-to-rule-them-all' would be some examples of special circumstances that the algorithm should be able to deal with.
Ok, that's all I got at 1:30am! Hope this makes sense, and looking forward to another round of dialog!
Thanks Chris. 1:30! Hope you're sleeping in this morning!!
There's a lot to respond to but I found something I wanted to run past you. I'm trying to implement the current query so I can make a comparison and found something very weird. It seems to give non-deterministic results! At least on duckdb.
Here is my reprex. I ran a simple example 1000 times. ~33% of the time it gives the incorrect result! what the?
Am I doing something wrong here?
library(dplyr, warn.conflicts = F)
test_current_query <- function(x) {
raw_data <- tibble::tribble(
~person_id, ~start_date, ~end_date,
46, "1991-11-29", "1991-11-29",
46, "1991-11-29", "2006-01-27",
46, "1994-12-16", "1994-12-16"
) %>% dplyr::mutate(dplyr::across(dplyr::matches("date"), as.Date))
con <- DBI::dbConnect(duckdb::duckdb())
DBI::dbWriteTable(con, "raw_data", raw_data)
result <- DBI::dbGetQuery(con, "
with cteEndDates (person_id, end_date) AS
(
SELECT
person_id,
event_date as end_date
FROM
(
SELECT
person_id,
event_date,
SUM(event_type) OVER (PARTITION BY person_id ORDER BY event_date ROWS UNBOUNDED PRECEDING) AS interval_status
FROM
(
SELECT
person_id,
start_date AS event_date,
-1 AS event_type
FROM raw_data
UNION ALL
SELECT
person_id,
end_date AS event_date,
1 AS event_type
FROM raw_data
) RAWDATA
) e
WHERE interval_status = 0
),
cteEnds (person_id, start_date, end_date) AS
(
SELECT
c.person_id,
c.start_date,
MIN(e.end_date) AS end_date
FROM raw_data c
INNER JOIN cteEndDates e ON c.person_id = e.person_id
AND e.end_date >= c.start_date
GROUP BY
c.person_id,
c.start_date
)
select person_id, min(start_date) as start_date, end_date
from cteEnds
group by person_id, end_date
;") %>% dplyr::tibble()
DBI::dbDisconnect(con, shutdown = TRUE)
return(result)
}
# run the test 1000 times
results <- tibble(run = 1:1000, df = purrr::map(1:1000, test_current_query))
# summarize test results
summary <- results %>%
count(df)
summary
#> # A tibble: 2 × 2
#> df n
#> <list> <int>
#> 1 <tibble [1 × 3]> 679
#> 2 <tibble [2 × 3]> 321
summary$df[[1]]
#> # A tibble: 1 × 3
#> person_id start_date end_date
#> <dbl> <date> <date>
#> 1 46 1991-11-29 2006-01-27
summary$df[[2]]
#> # A tibble: 2 × 3
#> person_id start_date end_date
#> <dbl> <date> <date>
#> 1 46 1991-11-29 1991-11-29
#> 2 46 1994-12-16 1994-12-16
Created on 2023-06-22 with reprex v2.0.2
Opened issue on duckdb https://github.com/duckdb/duckdb/issues/8041
Here's the similar thing on postgres.
library(dplyr, warn.conflicts = F)
library(DatabaseConnector)
cd <- createConnectionDetails("postgresql", user = "postgres", password = "", server = "localhost/covid")
con <- connect(cd)
#> Connecting using PostgreSQL driver
raw_data <- tibble::tribble(
~person_id, ~start_date, ~end_date,
46, "1991-11-29", "1991-11-29",
46, "1991-11-29", "2006-01-27",
46, "1994-12-16", "1994-12-16"
) %>% dplyr::mutate(dplyr::across(dplyr::matches("date"), as.Date))
DBI::dbWriteTable(con, "scratch.raw_data", raw_data, overwrite = T)
#> Inserting data took 0.0251 secs
test_current_query <- function(x) {
result <- DBI::dbGetQuery(con, "
with cteEndDates (person_id, end_date) AS
(
SELECT
person_id,
event_date as end_date
FROM
(
SELECT
person_id,
event_date,
SUM(event_type) OVER (PARTITION BY person_id ORDER BY event_date ROWS UNBOUNDED PRECEDING) AS interval_status
FROM
(
SELECT
person_id,
start_date AS event_date,
-1 AS event_type
FROM scratch.raw_data
UNION ALL
SELECT
person_id,
end_date AS event_date,
1 AS event_type
FROM scratch.raw_data
) RAWDATA
) e
WHERE interval_status = 0
),
cteEnds (person_id, start_date, end_date) AS
(
SELECT
c.person_id,
c.start_date,
MIN(e.end_date) AS end_date
FROM scratch.raw_data c
INNER JOIN cteEndDates e ON c.person_id = e.person_id
AND e.end_date >= c.start_date
GROUP BY
c.person_id,
c.start_date
)
select person_id, min(start_date) as start_date, end_date
from cteEnds
group by person_id, end_date
;") %>% dplyr::tibble()
return(result)
}
# run the test 1000 times
results <- tibble(run = 1:1000, df = purrr::map(1:1000, test_current_query))
# summarize test results
summary <- results %>%
count(df)
summary
#> # A tibble: 1 × 2
#> df n
#> <list> <int>
#> 1 <tibble [1 × 3]> 1000
summary$df[[1]]
#> # A tibble: 1 × 3
#> person_id start_date end_date
#> <dbl> <date> <date>
#> 1 46 1991-11-29 2006-01-27
Created on 2023-06-22 with reprex v2.0.2
So maybe the issue was duckdb. Weird.
The part that is non-deterministic is this:
SUM(event_type) OVER (PARTITION BY person_id ORDER BY event_date ROWS UNBOUNDED PRECEDING) AS interval_status
If you don't include event_type
in the sum() order by, then it may sum a -1 first in one run, and a +1 in another run. We want to ensure that -1 comes before +1 in the row order.
Ie:
SUM(event_type) OVER (PARTITION BY person_id ORDER BY event_date, event_type ROWS UNBOUNDED PRECEDING) AS interval_status
If it's not doing that in the CohortIncidence script, its' a bug that should be fixed.
Update, looks like it is wrong, I'll apply a fix to this.
Update 2: I published a hotfix, this is the specific change that addresses it.
Another note on the algo comments I made: if I understand correctly that starts are those start_dates where it is greater than any of the prior end dates (not including the current row), then ROWS UNBOUND PRECEEDING does include the current row, and you'll want to adjust that ROWS statement to use all prior but not current row. This is done with:
rows between unbounded preceding and 1 preceding
@ablack3 ,
I'm running a very large scale CohortIncidence for HowOften. 99% of the work is being done in creating the eras of TAR and Excluded Time, so I've decided to revisit this issue and get to Era-fy SQL 2.0. Here's what I have come up with.
select person_id, min(start_date) as era_start, max(end_date) as era_end, count(*)
from (
select id, person_id, start_date, end_date, sum(is_start) over (partition by person_id order by start_date, id rows unbounded preceding) group_idx
from (
select id, person_id, start_date, end_date,
max(end_date) over (partition by person_id order by start_date, id rows between unbounded preceding and 1 preceding) latest_end,
case when max(end_date) over (partition by person_id order by start_date, id rows between unbounded preceding and 1 preceding) >= start_date then 0 else 1 end is_start
from #RAW_EVENTS
) E
) G
GROUP BY person_id, group_idx
Simpler, eh?
Here's the logic:
max(end_date) ... rows between unbounded preceding and 1 preceding
will be null for the first row) OR if the current row's start date is later than the maximum end_date of the preceding rows. This query yields the is_start
= 0 or 1 depending on this result.is_start
to create a 'grouping index' (group_idx) that will bring all the records together that belong to the same era. min(start_date)
and max(end_date)
grouped by the grouping index to find our eras.Some notes on this approach:
We have a row ID to 'break ties when there are duplicate start dates. Whenever we have that sort of issue in data, we run into the 'indeterministic problem' you ran into with DuckDB. The alternative is to remove dupes by doing the technique you had earlier to group by start_date, max(end_date) to bring events starting on same date together. However, even with the non-deterministic nature of it, I don't think it matters in this case: The inner query's 'latest end' is just a debugging column and we ran into a deterministic issue when we needed the sort order of one partition to line up with a sort order of another. In the final query I'd drop the latest_end
column.
In this new approach, we only need the one partition/sort order to find the rows that represent starts, and a separate, unrelated partition/sort order to sum up the group indexes. (Side note: I attempted to both operations in one query context, but it required 'nesting window functions' which is not allowed.) In addition, there are zero joins in this approach at all, but depending on how the DBMS implements subqueries, there could be 'intermediate' results from subquery E and G, but a smart optimizer may be able to combine those together so that there's just a single intermediate result.
So, this form reduces partition/sorts and joins, so technically should run much more efficiently. Would you mind taking this for a spin? Below is my test code that tests a few test cases:
-- DROP TABLE #RAW_EVENTS;
CREATE TABLE #RAW_EVENTS (
id bigint not null,
person_id bigint not null,
start_date date not null,
end_date date not null)
DISTKEY (person_id);
TRUNCATE TABLE #RAW_EVENTS;
-- TEST 1
INSERT INTO #RAW_EVENTS values
(1, 1, '2001-01-01', '2001,01-15'),
(2, 1, '2001-01-01', '2001,02-15'),
(3, 1, '2001-01-13', '2001,01-13'),
(4, 1, '2001-01-14', '2001,02-03'),
(5, 1, '2001-02-01', '2001,04-01'),
(6, 1, '2001-05-01', '2001,05-15'),
(7, 1, '2001-05-05', '2001,05-10'),
(7, 1, '2001-01-01', '2001,06-01')
;
-- Test 2
INSERT INTO #RAW_EVENTS values
(1, 1, '2001-01-01', '2001,01-15'),
(2, 1, '2001-01-01', '2001,01-15'),
(3, 1, '2001-01-01', '2001,01-16'),
(4, 1, '2001-01-01', '2001,01-15')
;
-- Test 3
INSERT INTO #RAW_EVENTS values
(1, 1, '2001-01-01', '2001,01-02'),
(2, 1, '2001-01-02', '2001,01-03'),
(3, 1, '2001-01-03', '2001,01-04'),
(4, 1, '2001-01-04', '2001,01-05')
;
select * from #RAW_EVENTS;
select person_id, min(start_date) as era_start, max(end_date) as era_end, count(*)
from (
select id, person_id, start_date, end_date, sum(is_start) over (partition by person_id order by start_date, id rows unbounded preceding) group_idx
from (
select id, person_id, start_date, end_date,
max(end_date) over (partition by person_id order by start_date, id rows between unbounded preceding and 1 preceding) latest_end,
case when max(end_date) over (partition by person_id order by start_date, id rows between unbounded preceding and 1 preceding) >= start_date then 0 else 1 end is_start
from #RAW_EVENTS
) E
) G
GROUP BY person_id, group_idx
;
Woah yes very nice Chris! So my next task on this issue is to write a test that I want to run on all OHDSI databases. I'll try to get this done before symposium. It's a busy time but it's on my list. Yea I really like this new query. It is easier to read anyway.
I've been looking at the drug_era SQL and may have found a small simplification that always gives the same results. The query I'm looking at is the one that identifies the end dates of each ingredient exposure. I have compared my new query to the existing query on every case of two ingredient exposures that I can think of. I would like to be able to claim that if the query gives correct results for any set of two ingredient exposures then it will work for any number of ingredient exposures but I don't know if I can make that claim. Since the entire query is partitioned by ingredient_id and person_id I don't multiple ingredients or multiple persons will alter the correctness of the results. The algorithm runs on person-ingredient combinations and each person-ingredient combo is independent of all the others. Some assumptions I'm making about the input are that start date and end dates are populated and end date is at least 30 days after start date. I think these assumptions are guaranteed by the SQL statement that precedes the end date identification.
Here is my reprex:
Created on 2021-11-27 by the reprex package (v2.0.1)
Is it possible for these two queries to give different results as long as the assumptions about the input are correct?