OHDSI / CommonDataModel

Definition and DDLs for the OMOP Common Data Model (CDM)
https://ohdsi.github.io/CommonDataModel
877 stars 448 forks source link

Fact_relationship changes #230

Open gowthamrao opened 5 years ago

gowthamrao commented 5 years ago

We decided to introduce concept_id's for each field in OMOP cdm. e.g. we will be assigning a concept_id for person.person_id, visit_occurrence.visit_occurrence_id etc. To be released in future version of vocabulary. e.g. cost.cost_event_field_concept_id needs concept_id for each identify of the field.

Proposal 1: This proposal is to change FACT_RELATIONSHIP table's domain_concept_id to field_concept_id, where field_concept_id is the identity of the field in omop cdm.

Field Required Type Description
~domain_concept_id_1~ ~Yes~ ~integer~ ~The concept representing the domain of fact one, from which the corresponding table can be inferred.~
field_concept_id_1 Yes integer A foreign key identifier to a concept in the concept table representing the identity of the field of fact one.
fact_id_1 Yes integer The unique identifier in the table corresponding to the domain of fact one.
~domain_concept_id_2~ ~Yes~ ~integer~ ~The concept representing the domain of fact two, from which the corresponding table can be inferred.~
field_concept_id_2 Yes integer A foreign key identifier to a concept in the concept table representing the identity of the field of fact two.
fact_id_2 Yes integer The unique identifier in the table corresponding to the domain of fact two.
relationship_concept_id Yes integer A foreign key to a Standard Concept ID of relationship in the Standardized Vocabularies.

Proposal 1: will help overcome domain ambiguity such as does the domain_id = 'Visit' represent visit_occurrence or visit_detail.

gowthamrao commented 5 years ago

Proposal 2: relationship are generally valid for a certain interval of time, e.g. start_date and end_date. We propose adding two optional fields relationship_start_date_time and relationship_end_date_time to fact_relationship table.

Examples of relationships:

Field Required Type Description
field_concept_id_1 Yes integer A foreign key identifier to a concept in the concept table representing the identity of the field of fact one.
fact_id_1 Yes integer The unique identifier in the table corresponding to the domain of fact one.
field_concept_id_2 Yes integer A foreign key identifier to a concept in the concept table representing the identity of the field of fact two.
fact_id_2 Yes integer The unique identifier in the table corresponding to the domain of fact two.
relationship_concept_id Yes integer A foreign key to a Standard Concept ID of relationship in the Standardized Vocabularies.
relationship_start_datetime No datetime Datetime of start of the relationship between fact one and fact two
relationship_end_datetime No datetime Datetime of end of the relationship between fact one and fact two
gowthamrao commented 5 years ago

There are some fact_relationship like tables in OMOP CDM, that may be subsumed by this proposal. e.g. Location_history. Location_history may be replaced by this new fact_relationship, because location_history is similar to fact_relationship table. See below mapping between location_history to fact_relationship.

location_history (current) fact_relationship (new) fact_relationship
location_id fact_id_2 fact_id_2
relationship_type_concept_id relationship_concept_id relationship_concept_id
domain_id domain_concept_id_2 field_concept_id_2
entity_id fact_id_1 fact_id_1
start_date   relationship_start_datetime
end_date   relationship_end_datetime
  domain_concept_id_1 field_concept_id_1
cgreich commented 5 years ago

Like it. If I knew how I'd put this thumb up emoticon in. But we need a default if the relationship is eternal. Something like in the vocabs: 1-Jan-1970 to 31-Dec-2099

gowthamrao commented 5 years ago

But we need a default if the relationship is eternal. Something like in the vocabs: 1-Jan-1970 to 31-Dec-2099

should datetime's be optional or required? if required, i agree with default.

Alternatively, we can keep them optional and assume during analytic time that if the datetimes are null then the value for the interval is from earliest observation_period_start_date to latest observation_period_end_date (computed during analysis time - similar to how we build cohorts in Atlas, where observation_period_end_date is the default cohort_end_date).

gowthamrao commented 5 years ago

Proposal 3: when we are making changes, why not add _type_concept_id

Field Required Type Description
field_concept_id_1 Yes integer A foreign key identifier to a concept in the concept table representing the identity of the field of fact one.
fact_id_1 Yes integer The unique identifier in the table corresponding to the domain of fact one.
field_concept_id_2 Yes integer A foreign key identifier to a concept in the concept table representing the identity of the field of fact two.
fact_id_2 Yes integer The unique identifier in the table corresponding to the domain of fact two.
relationship_concept_id Yes integer A foreign key to a Standard Concept ID of relationship in the Standardized Vocabularies.
relationship_type_concept_id Yes integer A foreign key identifier to a concept in the CONCEPT table for the provenance or the source of the relationship. E.g. derived during ETL, vs. mapped from source
relationship_start_datetime No datetime Datetime of start of the relationship between fact one and fact two
relationship_end_datetime No datetime Datetime of end of the relationship between fact one and fact two

Example:

pavgra commented 5 years ago

Although the proposal 2 itself sounds good and allows to remove location_history table, I would be very careful with the general idea of untypified relations. Having more flexibility, you lose both logical separation of concerns between tables and physical data integrity checks provided by RDBMS (e.g. you know that in location_history you can put only relations of location and 3 domains; while with fact_relationship you can do anything and it is not clear whether it is correct or not). So moving in that direction too fast can easily leave you with RDF and pure ontologies.

rtmill commented 5 years ago

It makes sense and its a clean design. To me its a trade off between neat and compact vs. efficient and segregated. I think there's potential for this table becoming massive but it seems like a universal means to get at these one-to-many issues that need to be addressed (e.g. multiple care sites for a provider, multiple providers for a patient, etc).

pavgra commented 5 years ago

I would propose to treat fact_relationship table as a POC table. So that when you have a new use-case which requires some new relations between tables (but those relations don't exist yet), you could use fact_relationship to model the relations w/o a need to modify CDM schema. But the approach has some drawbacks:

So, once a POC has been validated and common use-case was figured out, a person can propose to create permanent tables so that we:

So, following the idea, location_history looks like such typical use-case and is a valid standalone entity, while there are some other cases where fact_relationship should be enough for now.

cgreich commented 5 years ago

@pavgra: Point taken. The FACT_RELATIONSHIP_TABLE should really be the last resort and contain the relationships for which creating a correctly modeled concrete RDBMS solution with FKs is just not worth it, because we will need it like 21 times for the 1.9B patients. All others should be modeled properly.

gowthamrao commented 5 years ago

@cgreich @pavgra like everyone else, what i like about the fact_relationship table is that it is flexible. what i dont like is that it is flexible. Flexibility allows us to use one structure for an infinite number of use cases, but these uses are hard to solve in a fact_relationship table (see points that @pavgra made). Also, I like fact_realtionship table because ETL can be done in a generic way - and we dont have to revise our ETL every time for POC'ing a new use cases.

I like @pavgra comment of using fact_relationship table as the source for a POC table. Right now, like in location_history, we recommend new tables in CDM WG based on expert opinion - and not a POC. If we take a POC approach, then one requirement before approving a new table (especially that related two facts together) is a POC.

using the generic fact_relationship table, we could POC three new tables that follow proper RDBMS referential integrity - all derived from fact_relationship table. i.e. instead of location_history (that has location_id with referential integrity, but entity_id has no referential integrity), we could create person_location_history, provider_location_history, care_site_location_history each with RDBMS referential integrity - test it out in POC, build software in dev brach, validate it: and then ask the community to ratify validated tables person_location_history, provider_location_history etc. Plus these tables will be smaller in size and not go to billions of rows (i.e. performance)

In short - i like @pavgra comment of using fact_relationship as the POC.

pavgra commented 5 years ago

Right now, like in location_history, we recommend new tables in CDM WG based on expert opinion - and not a POC. If we take a POC approach, then one requirement before approving a new table (especially that related two facts together) is a POC.

Gowtham, that would be so great!

rtmill commented 5 years ago

@gowthamrao @pavgra I get that its preferable to maintain referential integrity but at what cost? A query to check the validity of these relationships would be simple enough.

Point being, if we keep going down this path we're going to end up with an excessive number of tables. e.g. person_location_history provider_location_history care_site_location_history person_care_site_history person_provider_history provider_care_site_history

And if we want to be consistent then we need to remove similar approaches in other tables. OBSERVATION_EVENT_ID and OBS_EVENT_FIELD_ID for example, would then require additional tables: observation_procedure_history observation_measurement_history observation_diagnosis_history etc

Is it worth it?

pavgra commented 5 years ago

@rtmill , for me it sounds that you go to extremum, why don't you want to end up then with a single table with three field - "subject", "relationship", "object" (which is RDF)? It would cover the whole CDM.

Although I'd like to have a relations table per each domain pair, I have to admit that it will expand CDM too much and too fast for regular business users. So, I just stated that the hybrid approach, where subject is typified (e.g. location_history.location_id is a concrete FK), relation is described via the table itself (location_history - relation of some domain with location) and object (person / care_site / ...) is dynamic, is the least evil between two extremums (until there is a chance to have normalized relational CDM and DW CDM).

AEW0330 commented 5 years ago

I think this is such an important structural issue that it deserves a wider community conversation. I like the balance @pavgra proposes and agree we don't want fact_relationship to ingest and annotate the rest of the CDM. But I also don't know the answer to the question implicit in @rtmill 's comments: how much expansion can the CDM's primary relational structure support? And that seems to be an increasingly pressing issue.

gowthamrao commented 5 years ago

@aew0330

Welcome to OHDSI! Maybe you already have, if not, could you please introduce yourself here http://forums.ohdsi.org/t/welcome-to-ohdsi-please-introduce-yourself/704

AEW0330 commented 5 years ago

AEW0330 is my (Andrew Williams) github ID. You know me :)

mgurley commented 5 years ago

I like @pavgra description of FACT_RELATIONSHIP as a POC for adding foreign key relationships. In that vein, what is the recommend/conventional relationship to use in FACT_RELATIONSHIP to represent a 'has many/belongs to' relationship?

cgreich commented 5 years ago

Friends:

Two things:

  1. Not sure about the POC. We cannot jerk everybody around all the time. The CDM has to move slowly. Unless by "POC" you mean something we never standardize or document in the main model. Which one?
  2. The timing for the FR: I first like the idea, then realized all the Facts already have a timing. Except PERSON, PROVIDER and CARE_SITE. Do you have these specificially in mind, @gowthamrao?
dimshitc commented 5 years ago

Where are with this? FACT_RELATIONSHIP still has these ugly domain_concept_ids

clairblacketer commented 5 years ago

Agree with @dimshitc - @gowthamrao, @cgreich is this proposal ready to discuss?

cgreich commented 5 years ago

Yes, Ma'am.