OHDSI / Atlas

ATLAS is an open source software tool for researchers to conduct scientific analyses on standardized observational data
http://atlas-demo.ohdsi.org/
Apache License 2.0

Quantitative relations between events for cohort definitions #902

Open · eldarallakhverdiiev opened this issue 6 years ago

eldarallakhverdiiev commented 6 years ago

I have run into the problem of filtering persons in cohort definitions using not only timing relations between events but quantitative ones as well. The most recent use case: knowing that the data contains Neutrophil and Lymphocyte counts, find persons with a Neutrophil/Lymphocyte ratio > 5. Currently it is only possible to relate a value to the reference ranges of a given measurement, but not to relate two measurements to each other.

My current solution is to compute such derived values during the ETL process, which is, in my opinion, not the best one, for at least two reasons: we are duplicating data, and we need to know all research questions before data conversion (and I don't like result-oriented data manipulation).
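For illustration, a minimal sketch of what that ETL workaround could look like: a query computing the derived ratio rows that would then be loaded into MEASUREMENT under some new concept. The @... parameters are placeholders (not verified vocabulary concept ids), and the same-day pairing rule is only an assumption:

select n.person_id,
       n.measurement_date,
       n.value_as_number / l.value_as_number as nl_ratio
from measurement n
join measurement l
  on l.person_id = n.person_id
 and l.measurement_date = n.measurement_date   -- pair counts taken on the same day
where n.measurement_concept_id = @neutrophilConceptId
  and l.measurement_concept_id = @lymphocyteConceptId
  and l.value_as_number > 0;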

I wonder whether there are similar use cases and if this can (and should) be implemented in the Cohort Definition tab. I am thinking about implementing such an option as part of Nested Criteria. @chrisknoll , @pavgra , any thoughts about it?

pavgra commented 6 years ago

@eldarallakhverdiiev, interesting use-case! I believe that we could combine this thing with the functionality discussed in Stage 3 of https://github.com/OHDSI/WebAPI/issues/495#issuecomment-412667979 and cover them together

chrisknoll commented 6 years ago

I have thought about this case: I think it might just be a matter of adding a new criteria option for Measurements to look for a relative increase/decrease. If you set that option, it will just do a sub-join in the query that looks at the measurement table again, joins on the measurement_concept_id and the date (to look before or after), and does the math to compare the first value to the subsequent/prior value. If that ratio meets a threshold, you are in.

The consideration here is that it will have to look for the exact same concept and assume the exact same units (or we can enforce that the units match when looking for the other record):

FROM Measurement M
-- if we are looking for an increase/decrease
WHERE exists (
  select 1
  from MEASUREMENT rel
  where rel.person_id = M.person_id                         -- same person
  and rel.measurement_concept_id = M.measurement_concept_id
  and rel.unit_concept_id = M.unit_concept_id
  and rel.measurement_id <> M.measurement_id                -- don't look at the same measurement
  and rel.value_as_number / M.value_as_number > @targetRatio
  and rel.measurement_date between @windowStart and @windowEnd -- the prior/post window to look for the other measurement
)

Something like that. It's like a nested criteria, but a very specific application of it. I don't think nested criteria would be the easier approach, but if we extended the entire criteria capability to allow arbitrary joins between records, then we could do it that way too; at that point, though, we're increasing the complexity of the UI to be more like a SQL builder. The way I propose would have a simple UI element:

[x] having [increase|decrease] of measurement value within [prior|post] [days] days.
eldarallakhverdiiev commented 6 years ago

Your approach seems pretty good. The reason why I think about nested criteria is the comparison between different measurement_concept_ids in my problem (here I can add blood albumin/globulin and direct/indirect bilirubin as potential use cases). I might sound too pedantic, but we may then also have the issue of comparing measurement values with observation values or specimen quantities.

In this case the UI would have to capture a lot of options:
- the same or different measurements;
- the same or different units (and if we need to convert to standard units, this headache goes to vocabulary support as well);
- increase or decrease;
- absolute or relative (subtraction or division);
- value range (equal to / not more than / not less than / between);
- time frame (or, a more terrible thing, use the earliest/latest/any measurement within a given period);
- limit to the same visit occurrence.

Such complexity (if it's required) made me think about creating a separate event block.

chrisknoll commented 6 years ago

The reason why I think about nested criteria is the comparison between different measurement_concept_ids in my problem (here I can add blood albumin/globulin and direct/indirect bilirubin as potential use cases)

We should explore that use case further. It sounds to me like the metaphorical 'apples to oranges' comparison if you allow comparison between different measurement concept_ids, but I'm happy to see a real use case.

Re: units: I've long criticized the CDM's lack of standardization on units. Yes, there is only one representation of 'kilogram' or 'gram' or 'milligram' as a unit in the vocabulary, so those are the 'standard concepts', but you can't do anything in a study where you are looking to compare weights when sometimes it's in grams, others in kilos, others in pounds, other times in stones. So, until that's solved, the only reasonable thing to do is just compare apples to apples by looking at the same concept and the same unit. I wouldn't be supportive of doing on-the-fly unit standardization during the cohort generation process.

Such complexity (if it's required) turned me to think about creation of separate event block

Yes, this is another option: to introduce a new criteria construct specifically for complex rules around 'changes over time', such as being able to calculate a moving average of the past 5 measurements, or finding the values in a given window, fitting a least-squares line to the data, and using the slope of the line to determine a 'rate of change'...
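For example, the moving-average part could be expressed with a window function. This is a hedged sketch only (the criteria construct itself does not exist yet, and @conceptId is a placeholder):

select m.person_id,
       m.measurement_date,
       m.value_as_number,
       avg(m.value_as_number) over (
         partition by m.person_id, m.measurement_concept_id
         order by m.measurement_date
         rows between 4 preceding and current row   -- the past 5 measurements
       ) as moving_avg_5
from measurement m
where m.measurement_concept_id = @conceptId;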

But my suggestion was just measurement-focused because that's where values within the CDM should be found (I thought that observation + value = measurement?). But we should discuss further.

eldarallakhverdiiev commented 6 years ago

@chrisknoll, the Neutrophil/Lymphocyte count ratio and the Albumin/Globulin ratio are real use cases. You can also find a use case raised in the forum. The point is: if the purpose of a lab test was exactly to define a generic value based on an 'apples to oranges' comparison (and some calculation needs to be done), then it is probably present in the results as a separate value. But if not (only preliminary tests were done for a high-level evaluation), then the clinician will do the calculation himself. So in practice 'apples' are often compared to 'oranges': blood pressure to heart rate for defining circulatory shock, the difference between systolic and diastolic blood pressures for heart disease risk, etc. We can find most of these as ready-made concepts, but the problem is that they are not always present in the source data and there is a need to calculate them.

chrisknoll commented 6 years ago

I see, yes, I wasn't thinking about ratio construction (where the numerator unit will most definitely be a different fruit than the denominator). So if I understand you correctly, you're describing a need where a value is inferred by taking 2 different records of data and producing a new value that is one value divided by the other (in the case of a ratio)?

I asked Doctor Google about the Neutrophil/Lymphocyte count ratio and he referred me to Professor Wiki, where they describe it as:

It is calculated by dividing the number of neutrophils by number of lymphocytes, usually from peripheral blood sample

So, apples and oranges here, but both the apple and the orange agree on a unit type? Just the count of things (number of X / number of Y)? Not that it really matters: if you want to create a number like an acceleration, you could take 30 meters per second / 10 seconds and come up with an acceleration of 3 meters per second per second... But my point on units is that the number you get is going to be based on the units you find in the data (feet per second vs meters per second), so somewhere we should account for the units in play.
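For concreteness, extending the earlier EXISTS sketch to two different measurement concepts might look something like this. It is only a sketch: the @... parameters are placeholders, and the same-day pairing (rather than a prior/post window) is an assumption:

select M.person_id
from measurement M
where M.measurement_concept_id = @numeratorConceptId
  and exists (
    select 1
    from measurement rel
    where rel.person_id = M.person_id
      and rel.measurement_concept_id = @denominatorConceptId
      and rel.measurement_date = M.measurement_date   -- or a prior/post window
      and rel.value_as_number > 0                     -- avoid division by zero
      and M.value_as_number / rel.value_as_number > @targetRatio
  );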

But I see your point about using 2 values of completely different things and making a value out of it, thank you for clarifying that.

eldarallakhverdiiev commented 6 years ago

Well, we can leave it to the user: what the units of the numerator and denominator should be. If he assumes that both values can have different units in the data (Neutrophils in billions/litre and in units/microlitre; Lymphocytes the same pair), and thus billions could be wrongly divided by a raw count, then let him use 'Add Group' and create 'having any' criteria where billions are compared only with billions, raw counts only with raw counts, and billions with raw counts using threshold/1000 (and vice versa).
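To make the threshold/1000 trick concrete: 1 x 10^9/L equals 1,000 cells/microlitre, so if both sides were normalized to the same unit up front the comparison would be uniform. A hedged sketch of that normalization (all @... parameters are placeholders; this is the kind of on-the-fly conversion chrisknoll said he would rather not do during cohort generation):

select n.person_id,
       (case when n.unit_concept_id = @unitBillionsPerLitre
             then n.value_as_number * 1000   -- 10^9/L -> cells per microlitre
             else n.value_as_number end)
       /
       (case when l.unit_concept_id = @unitBillionsPerLitre
             then l.value_as_number * 1000
             else l.value_as_number end) as nl_ratio
from measurement n
join measurement l
  on l.person_id = n.person_id
 and l.measurement_date = n.measurement_date
where n.measurement_concept_id = @neutrophilConceptId
  and l.measurement_concept_id = @lymphocyteConceptId
  and l.value_as_number > 0;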

And meanwhile, let's force vocabulary developers to provide appropriate unit conversions.