OHDSI / circe-be

CIRCE is a cohort definition and syntax compiler tool for OMOP CDMv5
Apache License 2.0

Avoid subqueries for primary_events #192

Closed egillax closed 1 year ago

egillax commented 1 year ago

We are using postgres 12. When executing a certain cohort (attached), the query never finishes (we gave up after 17 hours). This is fixed by first computing the primary_events, storing them in a temporary table, and then running the qualified_events part; it then runs in 10 seconds. On Redshift this change gives the same performance as before.

Applying the suggested fix from #101 (which seems to have been ignored) also helps the optimizer: with it, the original version takes 40 seconds (on postgres). So using an explicit temp table for primary_events helps the optimizer even more.

Is there a reason primary_events needs to be in a subquery? Would it be possible to first extract it into a temporary table before running the qualified events?

depressionCohort.txt

chrisknoll commented 1 year ago

This should be addressed in version 1.10.1, since we have removed CTEs in favor of temp tables in this PR. Which version are you using (Atlas/WebAPI or otherwise)?

egillax commented 1 year ago

Thanks @chrisknoll for the speedy response. The queries I tried were from Atlas/Webapi 2.13. I think that's with the improvements from the mentioned PR?

Looking at the code in generateCohort.sql, the changes I'm requesting are something like:

original:

SELECT event_id, person_id, start_date, end_date, op_start_date, op_end_date, visit_occurrence_id
INTO #qualified_events
FROM 
(
  select pe.event_id, pe.person_id, pe.start_date, pe.end_date, pe.op_start_date, pe.op_end_date, row_number() over (partition by pe.person_id order by pe.start_date @QualifiedEventSort) as ordinal, cast(pe.visit_occurrence_id as bigint) as visit_occurrence_id
  FROM (@primaryEventsQuery) pe
  @additionalCriteriaQuery
) QE
@QualifiedLimitFilter
;

suggestion to help postgres:

SELECT *
INTO #primary_events
FROM (@primaryEventsQuery) P;

SELECT event_id, person_id, start_date, end_date, op_start_date, op_end_date, visit_occurrence_id
INTO #qualified_events
FROM 
(
  select pe.event_id, pe.person_id, pe.start_date, pe.end_date, pe.op_start_date, pe.op_end_date, row_number() over (partition by pe.person_id order by pe.start_date @QualifiedEventSort) as ordinal, cast(pe.visit_occurrence_id as bigint) as visit_occurrence_id
  FROM #primary_events pe
  @additionalCriteriaQuery
) QE
@QualifiedLimitFilter
;
chrisknoll commented 1 year ago

So, yes, 2.13 does have the changes. The change was that there used to be a CTE called qualified_events, which we now put into a temp table. The problem on platforms like Redshift is that they see a CTE and make a temp table out of it, but the 'distribution' might be something like RANDOM or ALL (which would copy the temp table across all the nodes of the cluster). It would also lose any indexing or table statistics related to the underlying queries (in the case of postgres).
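
To illustrate the change, here is a simplified sketch (not the actual generated SQL; CONDITION_OCCURRENCE in a cdm schema stands in for the parameterized event queries):

-- old form: qualified events expressed as a CTE, which some platforms materialize poorly
WITH qualified_events AS (
  select person_id, condition_start_date as start_date, condition_end_date as end_date
  from cdm.condition_occurrence
)
select * from qualified_events;

-- current form: qualified events materialized explicitly as a temp table (CTAS)
select person_id, condition_start_date as start_date, condition_end_date as end_date
INTO #qualified_events
from cdm.condition_occurrence;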

The new form puts the primary events query into a subquery. The reason is that people sometimes use 'all events' as their primary events, and in that case we wouldn't want to copy the entire VISIT_OCCURRENCE table into a temp table, which would be bad.

I looked at your cohort definition, and it is extremely simple: it queries the CONDITION_OCCURRENCE table for codeset 6, then applies an inclusion criterion using codeset 7. My impression is that you have an environment problem: either you do not have indexes set up on these tables to be able to find rows by concept_id, or you may not have the memory configuration (like work_mem) set up properly to handle the workload. You may also need to update statistics on the tables or run VACUUM.

Based on our testing, we are fairly confident that the structure of the queries we have in this case should work properly and leverage the underlying table statistics to optimize the fetches from the table. I understand that it did run faster when you restructured the code to create the first temp table, but at the same time things may have been cached differently, or the sub-query form may have crossed a query-optimizer threshold such that it fell back to a brute-force but slow approach to fetching records (although there are really only 2 or 3 tables involved in this...so I'm not sure).

I would go into Atlas and export the PostgreSQL SQL from the UI (it's under the Export tab in the cohort definition). Then put that into a pgAdmin session and execute the initial queries using EXPLAIN, so that you can see whether it is using indexes on concept_id and sort orders on start dates. I think it will report whether it's using local memory or needs to swap out...you want to avoid swaps by increasing work_mem. There are a number of online resources that can assist you with tuning your environment. Hosting a CDM on a PostgreSQL instance is not the same as hosting a chat or forum application! Typically there is far more data being searched by date ranges, plus internal sorting, so the memory and IO constraints will be different from the normal case.
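
As a rough sketch of that kind of check (the statements to EXPLAIN are really the ones in your exported cohort SQL; the table and column names below are just standard CDM names, and Codesets is the codeset table the script populates):

-- check and, for the session, raise working memory (the value is only an example)
SHOW work_mem;
SET work_mem = '256MB';

-- inspect the plan for a representative piece of the cohort SQL, e.g. the codeset join
EXPLAIN (ANALYZE, BUFFERS)
SELECT co.person_id, co.condition_start_date
FROM cdm.condition_occurrence co
JOIN Codesets cs ON cs.concept_id = co.condition_concept_id
WHERE cs.codeset_id = 6;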

egillax commented 1 year ago

@chrisknoll,

Triggered by this issue, we had already modified the settings quite a bit, which sped up all the other queries we tried significantly. I'll explore whether I can get this query to run better by adjusting the settings based on what EXPLAIN tells me.

chrisknoll commented 1 year ago

Thanks. Let me know how it goes; if the evidence is overwhelming, we'll try to make the changes to address your findings.

chrisknoll commented 1 year ago

Any update on this, @egillax? We'd really like to avoid the approach of turning every subquery into a temp table, but I am also sensitive to the fact that this could be a large performance gain for you. Please let us know your current findings.

egillax commented 1 year ago

hi @chrisknoll ,

Thanks for the follow-up. I've tried a few things. First I upgraded the database to postgres 15, hoping for improvements to the query optimizer. Then I tried increasing the statistics target to the maximum on the columns in the query. Neither of those allowed the qualified_events query to run.

But as noted before, adding an ANALYZE Codesets after inserting into that table does allow the query to run. So I created an EXPLAIN ANALYZE query plan for that case, which can be seen here. It can be compared to the EXPLAIN output for the case that doesn't run here.

This is just for the qualified_events part of the query.

I'm not an expert, but it seems the planner is favoring nested loop joins at some points because it underestimates the number of rows.

Maybe a solution could be to add that ANALYZE for the codesets so the query at least runs. But maybe that's an issue for SqlRender? It seems here that it often adds an ANALYZE when creating temp tables.
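
For reference, the manual workaround described above is a single statement, run after the codeset inserts and before the qualified_events step:

-- refresh planner statistics on the populated codeset table so the optimizer
-- stops underestimating its row count
ANALYZE Codesets;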

chrisknoll commented 1 year ago

Yeah, SqlRender does seem to add ANALYZE to the statement when the form is create #table as select..., but not if you create the table first and then perform inserts into it (which is what we do when we create the codeset temp table).

If you look at this export on the Atlas demo site for this cohort definition, you can see that there is an ANALYZE qualified_events call, but not one for codesets, because qualified_events is CTAS while codesets is create-then-insert.

I think I can change the codesets table to be a CTAS with a select...union all...select, as it currently appears to be an INSERT INTO...select...union all...select. However, I thought this was a CTAS once upon a time and it was changed back into CREATE/INSERT INTO...but I can try a branch where it makes it a CTAS (and therefore adds the ANALYZE to it) and see if that works better for you.
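
To make the distinction concrete, here is a simplified sketch of the two forms (the column types and concept ids are only illustrative; the real codeset query is the UNION ALL shown later in this thread):

-- create-then-insert (current codesets form): SqlRender does not append an ANALYZE
CREATE TABLE #Codesets (codeset_id int NOT NULL, concept_id bigint NOT NULL);
INSERT INTO #Codesets (codeset_id, concept_id)
SELECT 6 as codeset_id, concept_id FROM cdm.CONCEPT WHERE concept_id in (440383);

-- create-table-as-select (the qualified_events form): on PostgreSQL, SqlRender appends
-- an ANALYZE for the new temp table
SELECT 6 as codeset_id, concept_id
INTO #Codesets
FROM cdm.CONCEPT
WHERE concept_id in (440383);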

chrisknoll commented 1 year ago

Hi @egillax ,

I was fortunate to find that SqlRender will convert UPDATE STATISTICS into ANALYZE, so I've applied an ANALYZE call after the codeset table is created.

I'm not sure how you'd like to test this in your environment: I can create a branch over at CirceR where you can invoke the SQL generation to get the SQL from your cohort expression; the alternative is to make some changes in your WebAPI build to point to the custom branch of circe-be and load it into your WebAPI. Which would be easier for you?

I can also render the updated SQL from a cohort definition that you have and send it back to you, to see how it runs.

Effectively, the main change is to add ANALYZE #codesets to the output SQL, which, based on your report above, will result in a different join strategy that seems to work more efficiently.
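
In OHDSI SQL terms, the added statement looks roughly like the line below, placed after the codeset inserts (SqlRender handles the translation for PostgreSQL as described above):

-- emitted after the codeset table is populated; on PostgreSQL this renders as: ANALYZE Codesets;
UPDATE STATISTICS #Codesets;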

Let me know how you'd like to proceed.

egillax commented 1 year ago

Hi @chrisknoll ,

Thanks! For me, getting the updated SQL from you and running it would be fine. If you think more testing is better, the CirceR option would be quicker for me.

chrisknoll commented 1 year ago

Ok, I pushed up a CirceR branch; you can install it with:

 remotes::install_github("ohdsi/CirceR", ref="circe-issue-192")

This ref has the circe branch with the SQL optimization. Let me know if you need help reading a JSON and loading it into the package.

egillax commented 1 year ago

Hi @chrisknoll ,

I'm getting a syntax error when running the query generated with that branch:

ERROR: syntax error at or near "ANALYZE"

Here's the sql error report output:

DBMS:
postgresql

Error:
org.postgresql.util.PSQLException: ERROR: syntax error at or near "ANALYZE"
  Position: 3769

SQL:
INSERT INTO Codesets (codeset_id, concept_id)
SELECT 6 as codeset_id, c.concept_id FROM (select distinct I.concept_id FROM
( 
  select concept_id from cdm.CONCEPT where concept_id in (440383,4175329,40546087,4191716,4212469)
UNION  select c.concept_id
  from cdm.CONCEPT c
  join cdm.CONCEPT_ANCESTOR ca on c.concept_id = ca.descendant_concept_id
  and ca.ancestor_concept_id in (440383,4175329,40546087,4191716,4212469)
  and c.invalid_reason is null
) I
LEFT JOIN
(
  select concept_id from cdm.CONCEPT where concept_id in (436665,442306,36684319,438727,40481798,443864,4239471,377527,435520,4224940,379784,433440)
UNION  select c.concept_id
  from cdm.CONCEPT c
  join cdm.CONCEPT_ANCESTOR ca on c.concept_id = ca.descendant_concept_id
  and ca.ancestor_concept_id in (436665,442306,36684319,438727,40481798,443864,4239471,377527,435520,4224940,379784,433440)
  and c.invalid_reason is null
) E ON I.concept_id = E.concept_id
WHERE E.concept_id is null
) C UNION ALL 
SELECT 7 as codeset_id, c.concept_id FROM (select distinct I.concept_id FROM
( 
  select concept_id from cdm.CONCEPT where concept_id in (436665)
UNION  select c.concept_id
  from cdm.CONCEPT c
  join cdm.CONCEPT_ANCESTOR ca on c.concept_id = ca.descendant_concept_id
  and ca.ancestor_concept_id in (436665)
  and c.invalid_reason is null
) I
) C UNION ALL 
SELECT 8 as codeset_id, c.concept_id FROM (select distinct I.concept_id FROM
( 
  select concept_id from cdm.CONCEPT where concept_id in (4286201)
UNION  select c.concept_id
  from cdm.CONCEPT c
  join cdm.CONCEPT_ANCESTOR ca on c.concept_id = ca.descendant_concept_id
  and ca.ancestor_concept_id in (4286201)
  and c.invalid_reason is null
) I
) C UNION ALL 
SELECT 9 as codeset_id, c.concept_id FROM (select distinct I.concept_id FROM
( 
  select concept_id from cdm.CONCEPT where concept_id in (433734,435235,433742,435783,440077)
UNION  select c.concept_id
  from cdm.CONCEPT c
  join cdm.CONCEPT_ANCESTOR ca on c.concept_id = ca.descendant_concept_id
  and ca.ancestor_concept_id in (433734,435235,433742,435783,440077)
  and c.invalid_reason is null
) I
LEFT JOIN
(
  select concept_id from cdm.CONCEPT where concept_id in (434318)
UNION  select c.concept_id
  from cdm.CONCEPT c
  join cdm.CONCEPT_ANCESTOR ca on c.concept_id = ca.descendant_concept_id
  and ca.ancestor_concept_id in (434318)
  and c.invalid_reason is null
) E ON I.concept_id = E.concept_id
WHERE E.concept_id is null
) C UNION ALL 
SELECT 10 as codeset_id, c.concept_id FROM (select distinct I.concept_id FROM
( 
  select concept_id from cdm.CONCEPT where concept_id in (4182210,372608,4043378,373179,4104700,435088,4168666,380701,374009,4009705)
UNION  select c.concept_id
  from cdm.CONCEPT c
  join cdm.CONCEPT_ANCESTOR ca on c.concept_id = ca.descendant_concept_id
  and ca.ancestor_concept_id in (4182210,4043378,373179)
  and c.invalid_reason is null
) I
LEFT JOIN
(
  select concept_id from cdm.CONCEPT where concept_id in (4152048,376095,37116464,36717598,377788,42535731,37311999,372610)
UNION  select c.concept_id
  from cdm.CONCEPT c
  join cdm.CONCEPT_ANCESTOR ca on c.concept_id = ca.descendant_concept_id
  and ca.ancestor_concept_id in (4152048,376095,37116464,36717598,377788,42535731,37311999,372610)
  and c.invalid_reason is null
) E ON I.concept_id = E.concept_id
WHERE E.concept_id is null
) C UNION ALL 
SELECT 11 as codeset_id, c.concept_id FROM (select distinct I.concept_id FROM
( 
  select concept_id from cdm.CONCEPT where concept_id in (436073)
UNION  select c.concept_id
  from cdm.CONCEPT c
  join cdm.CONCEPT_ANCESTOR ca on c.concept_id = ca.descendant_concept_id
  and ca.ancestor_concept_id in (436073)
  and c.invalid_reason is null
) I
) C
ANALYZE Codesets

R version:
R version 4.1.2 (2021-11-01)

Platform:
x86_64-pc-linux-gnu

Attached base packages:
- stats
- graphics
- grDevices
- utils
- datasets
- methods
- base

Other attached packages:
- CirceR (1.3.1) 

Could there be a missing ";" before the ANALYZE Codesets?

chrisknoll commented 1 year ago

Yes, let me figure out where that went wrong. I caught that typo before but thought I had addressed it.

chrisknoll commented 1 year ago

Ok, I verified that the tests on the Circe Java side were properly constructing the SQL, so I did a complete refresh of the libraries in the CirceR package and re-pushed. When it comes to Java-based packages in R, things can get tricky: when you pull an update, try to rebuild the package from source in a completely new R session (RStudio tends to hold onto the Java package in a way that won't let you update the package without a fresh R session).

To test, once you've pulled the package branch and done a Build -> Clean Install in RStudio, you can verify the SQL is rendering correctly by modifying a test and running devtools::test().

In test-CohortGenerateSql.R, modify the test that does the JSON-to-SQL conversion:

test_that("cohortSql is generated", {
  cohortJson <- paste(readLines("resources/simpleCohort.json"),collapse="\n");
  cohortExpressionObj <- cohortExpressionFromJson(cohortJson);
  options <- createGenerateOptions(cohortId = 1, cdmSchema = "CDMSchema", targetTable = "cohort", resultSchema = "ResultSchema");
  cohortSql <- buildCohortQuery(cohortExpressionObj, options);
  browser()  # this drops into the debugger when run under devtools::test()

  expect_true(nchar(cohortSql, keepNA=TRUE) > 0)
})

When you run devtools::test(), you'll hit a breakpoint, and you can cat(cohortSql) to see what the output is. The INSERTs should be separated from the UPDATE STATISTICS by a semicolon.

Also remember that if you are running this against your DBMS, you need to run SqlRender::translate() in order to translate the OHDSI SQL into your target dialect.

Edit: I realize that you aren't doing a build from source, but rather using remotes::install_github to install; that should be fine, but remember to do it in a fresh R session. If possible, try to remove the existing install from your packages first. If it is still giving you problems, maybe try to build from source (but I hope I just messed up a push, and doing an install_github again will get you the correct code).

chrisknoll commented 1 year ago

Last update: things should be working. I had a bad commit where the java directory was deleted (probably when I was trying to force a refresh of the libs), and for about 15 minutes there was a pushed commit that had no JARs, but I fixed this in the past 5 minutes. So if you had problems with devtools::install(), it should all be working now: I removed my local package, installed via devtools, and ran some test code that works properly. Hopefully it will work for you now.

egillax commented 1 year ago

Hi @chrisknoll ,

The query works now!

chrisknoll commented 1 year ago

Thanks for verifying...does it give the proper execution plan?

egillax commented 1 year ago

It gives the same execution plan as when I added the manual ANALYZE Codesets above. It converts some nested loop joins to merge joins, so it runs in about 40 minutes instead of not running at all.