OHDSI / circe-be

CIRCE is a cohort definition and syntax compiler tool for OMOP CDMv5
Apache License 2.0
9 stars 13 forks source link

Cohort build - using payer-plan-period #4

Closed gowthamrao closed 6 years ago

gowthamrao commented 7 years ago

OMOP https://github.com/OHDSI/CommonDataModel/blob/master/Documentation/CommonDataModel_Wiki_Files/StandardizedHealthEconomicsDataTables/PAYER_PLAN_PERIOD.md is not a concept_id based table.

Payer-plan-period only has _source_values (payer_source_value, plan_source_value and family_source_value). Can cohort's be built using some form of regular-expression search of _source_values? Alternatively, the _source_value may hold a Json or XML object.

Can we use _source_value XML, Json or regular expression search on string as part of criteriaquery for example?

chrisknoll commented 7 years ago

Few considerations: With very few exceptions, we want to have cohort expressions work with standard concepts so that you can share the definition across a network and all expectations about the meaning of the concepts is understood. IMO, in this case, payer plan period needs to have some notion of standardized elements in it if we want to be able to create standardized definitions based on information in the PAYER_PLAN_PERIOD table. If the CDM working group can facilitate this, then we can definitely extend the cohort expression to leverage it.

If we want to go against this philosophy, then the other consideration is that sql render needs to be able to support the different flavors of xml/json/regex that is available in the different database platforms.

gowthamrao commented 7 years ago

I don't think cdm working group can facilitate this @cgreich - thoughts?

The fields in question are:

These are not coded concepts but are either full local names or local codes for local names in source data. We can't standardize them. Similar fields in omop cdm are the _source_value fields like care_site_source_value, person_source_value, provider_source_value, specimen_source_value. We could add new fields that are just integer representation of the _source_value like payer_id, plan_id similar to care_site_id, person_id, provider_id or specimen_id but both _source_value and _id don't have a semantic meaning and don't represent a concept. Contrast to_source_code, _concept_id : they have semantic meaning that may be standardized.

_source_values cannot be used in standardized/network studies. But they are very valuable to local studies. E.g. care_site_source_value may contain the room in a hospital in the "third floor wing five #3451", which only local hospital staff understand. They want that information in their local studies. I think the _source_values in payer_plan_period are examples of a similar challenge.

In the absence of standardization - are we really going against the philosophy or is this a whole new problem that network research is not interested in solving but local data users are?

gowthamrao commented 7 years ago

@chrisknoll i think it is a whole new problem, and one potential solution is to introduce the construct of Factsets. Factsets are similar to conceptset but instead of collection of concept-id's they are a collection of fact-id's. Fact-id's are the primary-keys of the omop cdm tables (e.g. payer_plan_period_id, visit_occurrence_id, observation_period_id, etc.). We store expressions that generate these primary key's and these expressions are regular-expressions on _source_value.

CREATE TEMP TABLE Factsets  (Factset_id int NOT NULL,
  Fact_id bigint NOT NULL
)
;

INSERT INTO Factsets  (Factset_id , Fact_id )
SELECT 0 as Factset_id , c.Fact_id FROM (select distinct I.Fact_id FROM
( 
  select payer_plan_period_id as Fact_id from @cdm_database_schema.payer_plan_period where <regular_expression on payer_source_value>
) I
) C;

INSERT INTO Factsets  (Factset_id , Fact_id )
SELECT 1 as Factset_id , c.Fact_id FROM (select distinct I.Fact_id FROM
( 
  select visit_occurrence_id as Fact_id from @cdm_database_schema.visit_occurrence_id where <regular_expression on visit_source_value>
) I
) C;

We then use the factset's in @criteriaQueries as follows

  -- Begin Visit Occurrence Criteria
select C.person_id, C.visit_occurrence_id as event_id, C.visit_start_date as start_date, C.visit_end_date as end_date, C.visit_concept_id as TARGET_CONCEPT_ID
from 
(
  select vo.*, ROW_NUMBER() over (PARTITION BY vo.person_id ORDER BY vo.visit_start_date, vo.visit_occurrence_id) as ordinal
  FROM 
(select vo.* from @cdm_database_schema.VISIT_OCCURRENCE vo
left join (select fact_id from Factsets where fact_set_id = 1) as fs
on vo.visit_occurrence_id= fs.fact_id
where fs.fact_id is not null
) as vo
@codesetClause
) C
@joinClause
@whereClause
-- End Visit Occurrence Criteria
  -- Begin Payer Plan Period Criteria
select C.person_id, C.Payer_Plan_Period_id as event_id, C.Payer_Plan_Period_start_date as start_date, C.Payer_Plan_Period_end_date as end_date, null as TARGET_CONCEPT_ID, null as visit_occurrence_id
from 
(
  select pp.*, ROW_NUMBER() over (PARTITION BY pp.person_id ORDER BY pp.payer_plan_period_start_date, pp.payer_plan_period_id) as ordinal, null as visit_occurrence_id
  FROM
 (select pp.* from @cdm_database_schema.PAYER_PLAN_PERIOD pp
left join (select fact_id from Factsets where fact_set_id = 0) as fs
on pp.payer_plan_period_id= fs.fact_id
where fs.fact_id is not null
) as pp
@codesetClause

) C
@joinClause
@whereClause
-- End Payer Plan Period Criteria

Similarly we could do for @QualifiedLimitFilter, @additionalCriteriaQuery, etc

chrisknoll commented 7 years ago

Hi, @gowthamrao. I'll thnk over this, but my instinct tells me that this is leading us to an abstraction that makes the system less well-defined. Today we have different criteria types relegated to the specific domains that they contain. With 'fact sets', what's to say that we just throw all those types away and just refer to everything as 'facts'? Sounds more flexible but my experience has been that it's just less structured (and harder to manage). If I'm looking at this correctly, it looks like fact sets are just a way to alias results coming out of a criteria query? That's a different way to do it but that doesn't really solve our problem of this Issue which is to incorporate payer_plan_period...in that we either get standard concepts to reference in the payer_plan_period table or we introduce some sort of text-mining capablity in sqlRender. The former can be standardized, the later is not standardized and a lot more work to implement (from a sqlrender perspective).

So, I'll continue to think this over, but I don't think this is a good idea.

gowthamrao commented 7 years ago

it looks like fact sets are just a way to alias results coming out of a criteria query?

Can you elaborate this one - maybe we have a disconnect here.

chrisknoll commented 7 years ago
INSERT INTO Factsets  (Factset_id , Fact_id )
SELECT 1 as Factset_id , c.Fact_id FROM (select distinct I.Fact_id FROM
( 
   -- insert any criteria query here, but return the domain's record_id as the fact_id
) I
) C;

And then Primary events query becomes:

select event_id, condition_start_date as start_date .... -- op_end_date, visit_id, etc.
From Condition_occurrence co
join (select fact_id from Factsets where fact_set_id = 1) fs
on co.condition_occurrence_id= fs.fact_id

I'm copying how you said to use the factset table, but changed it from a left join, I 'm not sure what the advantage of left join + where ... is null vs. doing a simple join. But my point of this is: what's the point of creating factsets when you can just refer to the fields that you want directly in the criteria query?

Part of my concerns is not just usability across a network and clearly defined constructs (ie criteria types), but also performance considerations when doing the queries. When you create a factset table, you loose all query optimization and index contexts on the underlying tables you're creating facts from. So let's say you create a fact set that resolves to a billion visits, but then gets widdled down to a few thousand after you have other patient criteria applied...your database performance blows up because it can't optimize the data retrieval from the #factsets table and the underlying datastore. I know I'm getting into the weeds here, but all I can say is I don't think factsets solve a specific problem..it just abstracts things....

I know this is off topic on this issue 'Payer plan periods in cohort defs', but what I think you are circling is that you'd like a way to create 'criteria expressions' that can be reused interchangeably across different parts of a cohort expression. If that's something you are trying to do, the object model in circe-be can be the building blocks, and you just have to create a new structure that sits on top of them to create these re-usable pieces and then materialize them into a circe cohort expression at execution time. This isn't what circe does, it is soemthing that can be built on top of circe, but I think that would be either an extension to the circe-be capablity or a completely external library that references circe-be. If what I've just said isn't something you're tryign to accomplish, please disregard :)

-Chris

gowthamrao commented 7 years ago

@chrisknoll what you said makes sense. Based on your input proposal to CDM WG https://github.com/OHDSI/CommonDataModel/issues/120

gowthamrao commented 6 years ago

@chrisknoll are these the exceptions http://www.ohdsi.org/web/atlas/#/cohortdefinition/1126937

Few considerations: With very few exceptions, we want to have cohort expressions work with standard concepts so that you can share the definition across a network and all expectations about the meaning of the concepts is understood.

WHERE C.stop_reason  like '%eddi%'
AND C.lot_number  like '%ddf%'
-- End Drug Exposure Criteria
chrisknoll commented 6 years ago

Yes, those are the exceptions.

chrisknoll commented 6 years ago

@gowthamrao , if this issue is addressed in the recent PR for the payer plan period criteria, could you close this? Otherwise, please let me know what items remain open that should keep this issue open.

Thank you!