OHDSI / CommonDataModel

Definition and DDLs for the OMOP Common Data Model (CDM)
https://ohdsi.github.io/CommonDataModel
854 stars 442 forks source link

[Proposal] add person_id to the cost table #564

Open ablack3 opened 1 year ago

ablack3 commented 1 year ago

The OMOP CDM is a person centric data model and all the "facts" have an associated person_id.... except for costs. To add up the costs for a person in v5.4 we have to join the cost table to every other domain table in the CDM. I would like to propose adding a person_id column to the cost table.

One use case is to easily add up total costs per person which of course is supported but is a verbose query requiring lots of joins just to get the person id for each cost record. I believe this would be a non breaking change that could be done in a minor version.

clairblacketer commented 1 year ago

Hi I think the v6.0 COST table is a much better representation of the medical costs. It is also not as easy as adding up costs per person because cost is done on a claim level (in the US) and not at the person level. I would suggest we look to incorporating the v6.0 COST table at a later date

ablack3 commented 1 year ago
image

It's not easy to quickly tell what the change in the cost table is from 5.4 to 6.0.

ablack3 commented 1 year ago

Would a single claim ever be associated with more than one person? I think this would only happen in context of a birth right?

My understanding is that most or all US claims are submitted using form CMS1450 (aka UB04) or CMS1500. It looks to me like these forms only allow one person on a claim so every cost record should be attributable to a single person.

image image
cgreich commented 1 year ago

Would a single claim ever be associated with more than one person? I think this would only happen in context of a birth right?

No. The birth goes to the mother. The baby doesn't have a prenatal insurance policy.

The issue is not whether or not a cost record can have more than one person_id. It cannot. The problem is it would violate the so-called 2NF database normalization rule: Eliminate Redundant Data. Right now, every record depends on a single primary key: the combo of cost_event_id + cost_domain_id. Adding person_id is redundant and can create integrity issues. Which means it will.

ablack3 commented 1 year ago

The problem is it would violate the so-called 2NF database normalization rule: Eliminate Redundant Data.

Domain tables like condition_occurrence have foreign keys person_id, visit_occurrence_id, and visit_detail_id. Both the visit_detail and visit_occurrence tables have person_id as well.

In v5.4 the episode and episode_event tables were added. The episode table has person_id and links to itself through the episode_parent_id so it would be possible for a parent episode and child episode to have a conflicting person_id.

The episode table also links to episode_event records which can link to any domain table each with person_id.

Would you say that the CDM v5.4 conforms to the 2NF database normalization rule?

cgreich commented 1 year ago

Yeah, these cross-cutting FKs have nasty side effects. I tend to dislike them and torture people with "give me the use case".

But it still is different: Cost cannot exist without an event (the one that costs the money). But Visits can exist without any condition or procedure or other event. Therefore, it is inevitable to have some 2NF violations.

Episodes: Consider that table a POC. We still need to learn how to fill it in a meaningful way.

ablack3 commented 1 year ago

So costs must have an associated event. A single event can have more than one cost record which is why costs need their own table to support the one-to-many relationship. Within a single event there may be multiple "line items" denoted by revenue codes so as long as revenue codes are included we need a separate cost table to accommodate those 1:many relationships.

Every cost record has one and only one associated person_id.

person_id is required on the v6.0 cost table so clearly this idea had some legs.

Removing person_id from cost in v6.1 will be a breaking change from 6.0 to 6.1.

The CDM has a lot of 2NF violations going on for the sake of query efficiency. For example the era tables are pre-computed analytics derived from domain tables, each resulting in a 2NF violation.

The case for adding person_id to cost in v5.5 or keeping it in v6.1, resulting in a 2NF violation, would be for query efficiency (i.e. costs/person).

Is all that correct?

Also the graphic of CDM 5.4 shows a line from cost to person_id and no connections to the domain tables so that should probably be fixed at some point.

image
cgreich commented 1 year ago

All correct, except the 6.0->6.1 succession. 6.0 is a dead end. We said we will remove it.

Except the era table: It doesn't have the triangle. It is not connected to DRUG_EXPOSURE, even though it is derived from it. We discussed at some point to have a link from _EXPOSURE TO _ERA, but nobody seems to have a use case for that.

Query efficiency: makes sense to consider that, but we probably would then need a DQD test to make sure there is no integrity artifact in the data.

Graph: Correct. Thanks.