Open gowthamrao opened 6 years ago
Proposal 2: relationship are generally valid for a certain interval of time, e.g. start_date and end_date. We propose adding two optional fields relationship_start_date_time and relationship_end_date_time to fact_relationship table.
Examples of relationships:
Field | Required | Type | Description |
---|---|---|---|
field_concept_id_1 | Yes | integer | A foreign key identifier to a concept in the concept table representing the identity of the field of fact one. |
fact_id_1 | Yes | integer | The unique identifier in the table corresponding to the domain of fact one. |
field_concept_id_2 | Yes | integer | A foreign key identifier to a concept in the concept table representing the identity of the field of fact two. |
fact_id_2 | Yes | integer | The unique identifier in the table corresponding to the domain of fact two. |
relationship_concept_id | Yes | integer | A foreign key to a Standard Concept ID of relationship in the Standardized Vocabularies. |
relationship_start_datetime | No | datetime | Datetime of start of the relationship between fact one and fact two |
relationship_end_datetime | No | datetime | Datetime of end of the relationship between fact one and fact two |
There are some fact_relationship like tables in OMOP CDM, that may be subsumed by this proposal. e.g. Location_history. Location_history may be replaced by this new fact_relationship, because location_history is similar to fact_relationship table. See below mapping between location_history to fact_relationship.
location_history | (current) fact_relationship | (new) fact_relationship |
---|---|---|
location_id | fact_id_2 | fact_id_2 |
relationship_type_concept_id | relationship_concept_id | relationship_concept_id |
domain_id | domain_concept_id_2 | field_concept_id_2 |
entity_id | fact_id_1 | fact_id_1 |
start_date | relationship_start_datetime | |
end_date | relationship_end_datetime | |
domain_concept_id_1 | field_concept_id_1 |
Like it. If I knew how I'd put this thumb up emoticon in. But we need a default if the relationship is eternal. Something like in the vocabs: 1-Jan-1970 to 31-Dec-2099
But we need a default if the relationship is eternal. Something like in the vocabs: 1-Jan-1970 to 31-Dec-2099
should datetime's be optional or required? if required, i agree with default.
Alternatively, we can keep them optional and assume during analytic time that if the datetimes are null then the value for the interval is from earliest observation_period_start_date to latest observation_period_end_date (computed during analysis time - similar to how we build cohorts in Atlas, where observation_period_end_date is the default cohort_end_date).
Proposal 3: when we are making changes, why not add _type_concept_id
Field | Required | Type | Description |
---|---|---|---|
field_concept_id_1 | Yes | integer | A foreign key identifier to a concept in the concept table representing the identity of the field of fact one. |
fact_id_1 | Yes | integer | The unique identifier in the table corresponding to the domain of fact one. |
field_concept_id_2 | Yes | integer | A foreign key identifier to a concept in the concept table representing the identity of the field of fact two. |
fact_id_2 | Yes | integer | The unique identifier in the table corresponding to the domain of fact two. |
relationship_concept_id | Yes | integer | A foreign key to a Standard Concept ID of relationship in the Standardized Vocabularies. |
relationship_type_concept_id | Yes | integer | A foreign key identifier to a concept in the CONCEPT table for the provenance or the source of the relationship. E.g. derived during ETL, vs. mapped from source |
relationship_start_datetime | No | datetime | Datetime of start of the relationship between fact one and fact two |
relationship_end_datetime | No | datetime | Datetime of end of the relationship between fact one and fact two |
Example:
Although the proposal 2 itself sounds good and allows to remove location_history
table, I would be very careful with the general idea of untypified relations. Having more flexibility, you lose both logical separation of concerns between tables and physical data integrity checks provided by RDBMS (e.g. you know that in location_history you can put only relations of location and 3 domains; while with fact_relationship you can do anything and it is not clear whether it is correct or not). So moving in that direction too fast can easily leave you with RDF and pure ontologies.
It makes sense and its a clean design. To me its a trade off between neat and compact vs. efficient and segregated. I think there's potential for this table becoming massive but it seems like a universal means to get at these one-to-many issues that need to be addressed (e.g. multiple care sites for a provider, multiple providers for a patient, etc).
I would propose to treat fact_relationship
table as a POC table. So that when you have a new use-case which requires some new relations between tables (but those relations don't exist yet), you could use fact_relationship
to model the relations w/o a need to modify CDM schema. But the approach has some drawbacks:
So, once a POC has been validated and common use-case was figured out, a person can propose to create permanent tables so that we:
So, following the idea, location_history
looks like such typical use-case and is a valid standalone entity, while there are some other cases where fact_relationship should be enough for now.
@pavgra: Point taken. The FACT_RELATIONSHIP_TABLE should really be the last resort and contain the relationships for which creating a correctly modeled concrete RDBMS solution with FKs is just not worth it, because we will need it like 21 times for the 1.9B patients. All others should be modeled properly.
@cgreich @pavgra like everyone else, what i like about the fact_relationship table is that it is flexible. what i dont like is that it is flexible. Flexibility allows us to use one structure for an infinite number of use cases, but these uses are hard to solve in a fact_relationship table (see points that @pavgra made). Also, I like fact_realtionship table because ETL can be done in a generic way - and we dont have to revise our ETL every time for POC'ing a new use cases.
I like @pavgra comment of using fact_relationship table as the source for a POC table. Right now, like in location_history, we recommend new tables in CDM WG based on expert opinion - and not a POC. If we take a POC approach, then one requirement before approving a new table (especially that related two facts together) is a POC.
using the generic fact_relationship table, we could POC three new tables that follow proper RDBMS referential integrity - all derived from fact_relationship table. i.e. instead of location_history (that has location_id with referential integrity, but entity_id has no referential integrity), we could create person_location_history, provider_location_history, care_site_location_history each with RDBMS referential integrity - test it out in POC, build software in dev brach, validate it: and then ask the community to ratify validated tables person_location_history, provider_location_history etc. Plus these tables will be smaller in size and not go to billions of rows (i.e. performance)
In short - i like @pavgra comment of using fact_relationship as the POC.
Right now, like in location_history, we recommend new tables in CDM WG based on expert opinion - and not a POC. If we take a POC approach, then one requirement before approving a new table (especially that related two facts together) is a POC.
Gowtham, that would be so great!
@gowthamrao @pavgra I get that its preferable to maintain referential integrity but at what cost? A query to check the validity of these relationships would be simple enough.
Point being, if we keep going down this path we're going to end up with an excessive number of tables. e.g. person_location_history provider_location_history care_site_location_history person_care_site_history person_provider_history provider_care_site_history
And if we want to be consistent then we need to remove similar approaches in other tables. OBSERVATION_EVENT_ID and OBS_EVENT_FIELD_ID for example, would then require additional tables: observation_procedure_history observation_measurement_history observation_diagnosis_history etc
Is it worth it?
@rtmill , for me it sounds that you go to extremum, why don't you want to end up then with a single table with three field - "subject", "relationship", "object" (which is RDF)? It would cover the whole CDM.
Although I'd like to have a relations table per each domain pair, I have to admit that it will expand CDM too much and too fast for regular business users. So, I just stated that the hybrid approach, where subject is typified (e.g. location_history.location_id
is a concrete FK), relation is described via the table itself (location_history
- relation of some domain with location) and object (person / care_site / ...) is dynamic, is the least evil between two extremums (until there is a chance to have normalized relational CDM and DW CDM).
I think this is such an important structural issue that it deserves a wider community conversation. I like the balance @pavgra proposes and agree we don't want fact_relationship to ingest and annotate the rest of the CDM. But I also don't know the answer to the question implicit in @rtmill 's comments: how much expansion can the CDM's primary relational structure support? And that seems to be an increasingly pressing issue.
@aew0330
Welcome to OHDSI! Maybe you already have, if not, could you please introduce yourself here http://forums.ohdsi.org/t/welcome-to-ohdsi-please-introduce-yourself/704
AEW0330 is my (Andrew Williams) github ID. You know me :)
I like @pavgra description of FACT_RELATIONSHIP as a POC for adding foreign key relationships. In that vein, what is the recommend/conventional relationship to use in FACT_RELATIONSHIP to represent a 'has many/belongs to' relationship?
Friends:
Two things:
Where are with this? FACT_RELATIONSHIP still has these ugly domain_concept_ids
Agree with @dimshitc - @gowthamrao, @cgreich is this proposal ready to discuss?
Yes, Ma'am.
We decided to introduce concept_id's for each field in OMOP cdm. e.g. we will be assigning a concept_id for person.person_id, visit_occurrence.visit_occurrence_id etc. To be released in future version of vocabulary. e.g. cost.cost_event_field_concept_id needs concept_id for each identify of the field.
Proposal 1: This proposal is to change FACT_RELATIONSHIP table's domain_concept_id to field_concept_id, where field_concept_id is the identity of the field in omop cdm.
Proposal 1: will help overcome domain ambiguity such as does the domain_id = 'Visit' represent visit_occurrence or visit_detail.