OHDSI / circe-be

CIRCE is a cohort definition and syntax compiler tool for OMOP CDMv5
Apache License 2.0
9 stars 13 forks source link

Cohort definitions group criteria combination shows unexpected behaviour (with Oracle) #209

Closed thisismexp closed 5 months ago

thisismexp commented 5 months ago

The following problem occurs when defining cohort definitions with nested criteria having only one criterion in a group.

Expected behavior

Selecting any vs. all in the "having any/all of the following criteria" should not affect the generated cohort if only one sub-criterion is used within that criteria group: Screenshot from 2024-05-22 11-54-34

Actual behavior

In our example, selecting all generates a cohort with 3 people included, whereas selecting any leads to 940 persons being included. This problem also affects groups in inclusion criteria definitions.

Steps to reproduce behavior

We use Oracle 19c as DBMS (the issue seems not be present with other dialects, at least not with Postgres), find a exemplary cohort definition attached (test_cohort_definition.json).

Further information

The only difference in the generated SQL statements is the following: Screenshot from 2024-05-22 11-52-35

If this is an issue of WebAPI, pls feel free to close this one and I will go ahead and open another one in the WebAPI repo.

Thanks for your help :)

chrisknoll commented 5 months ago

This would be an issue in Circe, but, I'd be interested to know why you're getting this result on oracle...the count() expresison there is counting the index_id (which if it was multiple expressions in the group you'd have index_id = 1, index_id = 2, etc for each group. The ALL means that you have to find a count that matches the number of expresisons (so it would be = 2 if it was an AND between 2 criteria expressions. ANY mean you have to meet at least 1 (ie: count() > 0).

Since you have the sql, could you narrow down why you only find 3 in the one case vs 940 in the other? ALL of the one thing vs ANY of the one thing is equivalent logic, so I would be extremely concerned if it did not produce the same result.

thisismexp commented 5 months ago

We discovered our mistake and I will go ahead and close this issue, sry for opening it in the first place.

We had issues with our OBSERVATION_PERIOD entries (duplicates, overlapping ones) that, due to the contained JOIN operation on OBSERVATION_PERIOD, let to the unexpected number of matched expressions. Although this should be prevented by the CDM definition itself, using count(DISTINCT index_id) would prevent this from happening at all and might be useful to include.

Thank you so much for your help, your comment ultimately pointed me to our issue.

chrisknoll commented 5 months ago

Although this should be prevented by the CDM definition itself, using count(DISTINCT index_id) would prevent this from happening at all and might be useful to include.

But then you wouldn't have discovered the OP issue you had and other analyses would break because of it. So we'll leave it as is and leave this as a cautionary tale to follow the CDM specs.