OHDSI / circe-be

CIRCE is a cohort definition and syntax compiler tool for OMOP CDMv5
Apache License 2.0
9 stars 13 forks source link

Window Criteria date offsets are not handling inclusive dates properly #87

Closed chrisknoll closed 5 years ago

chrisknoll commented 5 years ago

Window Criteria (those criteria where you look for events between X days before/after index start/end and Y days before/after index start/end are using the following SQL (for example: 0 days before to 7 days after: AND A.START_DATE >= DATEADD(day,0,P.START_DATE) AND A.START_DATE <= DATEADD(day,7,P.START_DATE)

The problem is that if you are saying you want to find an event with the 7 day timespan, (ie, the 7 day interval after 1/1/2010 would go from 1/1/2010 to 1/8/2010 (1/1/2010 + 7d = 1/8/2010), but you do not include 1/8/2010, since the interval ends at 1/8/2010. So, logically the clause should read:

AND A.START_DATE >= DATEADD(day,0,P.START_DATE) AND A.START_DATE < DATEADD(day,7,P.START_DATE)

However, things get confusing if you want to talk about events that happen on the same day as index. For example, it's a common expression to say 'occurs between 0 days before and 0 days after index', which, if we fixed it as the above, we would get: AND A.START_DATE >= DATEADD(day,0,P.START_DATE) AND A.START_DATE < DATEADD(day,0,P.START_DATE)

Which would not match any dates because if your P.start Date is 1/1/2010, there are no dates that are >= 1/1/2010 and < 1/1/2010. Instead, the correct way to express the date range is to specify 0 days before to 1 day after:

AND A.START_DATE >= DATEADD(day,0,P.START_DATE) AND A.START_DATE < DATEADD(day,1,P.START_DATE) Which, if our P.START_DATE = 1/1/2010, then it will match any date/time between 1/1/2010 and up to, but not including 1/2/2010.

The issue here isn't so much a technical one, but rather a user experience/training one: the guidance that has been given to users so far (and we have real-world examples for this) is that if you're looking for an event on index, you say 'occurs between 0 days before and 0 days after index'. With the current logic of 0<=x<=0, day - will match. but the new logic will mean 0<=x<0, and no value of X will match. To handle this, if the expression states 0 days before and 0 days after, should that automatically adjust to 0 days before and 1 day after? And, should this be generalted that if they say X days before and X days after, we adjust the case where X=X to be X days before and X+1 days after, such that "between 7 days before and 7 days before" becomes:

A.START_DATE >= DATEADD(day,-7,P.START_DATE) AND A.START_DATE < DATEADD(day,(-7+1),P.START_DATE) Creating the range that goes 7 days before to 6 days before.

I'm not sure if automatically fixing this for the user is the right approach, as it can lead to the tool 'thinking' for the user, and the user should be in control of those decisions. This is the same case as writing a query in SQL as : where 0=1, which never returns rows, but behaves logically consistent and the results can be explained easily.

clarkevans commented 5 years ago

There is a related 2015 forum thread

I would also look into how the nomenclature used could be brought into alignment with HL7's Clinical Quality Language, in particular timing phrases and interval operators and interval calculations.

chrisknoll commented 5 years ago

Thanks for the tip, @clarkevans . The sections under 5.3 seems to be relevant, but one thing that CQL seems to try to handle that I don't think I will ever do is the notion of uncertainty when trying to derive a difference in dates where one is granularity of a month while the other is a day. I'm not going to try to incorporate that complexity into circe.

The area where I think there is some agreement is how date differences are calculated: their example says:

days between Date(2014, 1, 15) and Date(2014, 2, 1) // 17 days
days between Date(2014, 1, 15) and Date(2014, 2, 28) // 44 days

Which is exactly what the sql select datediff(d, datefromparts(2014,1,15), datefromparts(2014,2,1)) yields.

So, the main difference is that CIRCE works in the context of specifying time windows. CQL uses syntax like 'same day or before X' or 'before X', while CIRCE works in the form from X days before/after event to Y days before/after event. Since we're using dateadd() (which results in the same logic as datediff()) when we say 5 days before index, we're saying the start of the interval is index - 5 days. Likewise when we are specifying the end interval to be 10 days after index, it is index + 10 days. So, if we had an index on 1/10/2010, and applied the calculation to this interval [1/5/2010 , 1/20/2010], The question is: should an event found on 1/20/2010 be included (ie: is the upper bound inclusive)? If we use datediff on an event of 1/20/2010, we get datediff(d, datefromparts(2010,1,10), datefromparts(2010,1,20)) = 10 so an event that is 10 days apart should be included. So, I think that means it is actually correct to say

AND A.START_DATE >= DATEADD(day,-5,P.START_DATE) AND A.START_DATE <= DATEADD(day,10,P.START_DATE)

So, the current implementation may have been correct all along! If I thought about this in terms of hours, and I went 24 hours ahead from today at 3pm, i would wind up at 3pm tomorrow...3pm tomrrow is within 24 hours of 3pm today. But 3:01 pm tomorrow is more than 24 hours from 3pm today. So, with that logic, it feels like it is the correct behavior to use start >= x <= end. This also lets things like 1/1/2001 <= x <= 1/1/2001 to return when x = 1/1/2001.

Make sense?