OHDSI / CommonDataModel

Definition and DDLs for the OMOP Common Data Model (CDM)
https://ohdsi.github.io/CommonDataModel
877 stars 448 forks source link

Cost table changes (add PERSON_ID, dates and normalize) #81

Closed clairblacketer closed 3 years ago

clairblacketer commented 7 years ago

Please see Gowtham's updated Cost table proposal in the comments below

Cost Table Changes (Add Person_id, Dates and normalize)

Proposal Owner: Gowtham Rao, Chris Knoll, Klaus Bonadt

Discussion: forum post

Cost table description: COST

Proposal overview:

1. Add Person_id to Cost table

2. Add incurred_date, incurred_datetime, billed_datetime, paid_datetime:

Use cases:

Analytic questions:

Importance:

Consequence of doing it:

Consequences of not doing it:

3. Leverage cost_type_concept_id and remove cost type columns

4. Add new field cost_domain_concept_id

gowthamrao commented 7 years ago

~~@cgreich next CDM meeting? ~~

For cost_domain_concept_id - we used the concept-id of the domain that represents the costs. This is more efficient. 8 Visit 10 Procedure 13 Drug 17 Device 19 Condition 21 Measurement 27 Observation 36 Specimen

changed proposal from cost_domain_concept_id to cost_source_table_concept_id

cgreich commented 7 years ago

@gowthamrao:

You mean you are using numeric IDs instead of the string ones to gain performance?

gowthamrao commented 7 years ago

Yes. That's correct. We felt that the Integer representations leads to performance gain, makes joins efficient, and are probably the reason why OMOP CDM is so fast. I don't have benchmarks, but compared to text, pretty sure there is exponential gain.

Any reason to the text? Cost tables can go to millions/billions of rows

abedtashh commented 7 years ago

I agree with using Concept IDs instead of the text due to better performance. It also applies to other columns in the CDM representing an ID, like Relationship_ID, Class_ID, etc. I prefer using integer instead of text as it would follow database design best practices; however, it makes the tables less human readable before joining the tables with the description of the IDs.

cgreich commented 7 years ago

You are totally right. But: We went from numeric IDs to alphanumeric ones for the vocabulary_id, domain_id, relationship_id, concept_class_id. Reason is that it is so much easier to do manual queries (you otherwise have to constantly join the reference tables vocabulary, relationship, concept_class etc.) and these are not that many, so you can just index by them (and essentially turn them internally into sequential numerical IDs).

gowthamrao commented 7 years ago

changed proposal from cost_domain_concept_id to cost_source_table_concept_id

@cgreich I think it is fine for the vocabulary tables because they are relatively small and don't run over billions of rows. But in the clinical tables run over billions of rows, using text as a ID is an inefficient use of the database space. Plus text dont support MPP systems - so you can't shard the data over many nodes. We need to use integers like @abedtashh said.

For the users who want to use 'human readable' alphanumeric ones, they can always do:

NEW

select cost.*
from
       <schema>.cost as cost
inner join
      ( select domain_concept_id
        from <schema>.domain
        where domain_id = 'VISIT'
     ) as domain
on a.cost_domain_id = domain.domain_concept_id
;

so they don't have to memorize the concept-id's, and can retain the human readable.

Current

select cost.*
from
       <schema>.cost as cost
where cost_domain_id = 'VISIT';
gowthamrao commented 7 years ago

@cgreich @abedtashh @fdefalco what are your thoughts on 'cost_type_concept_id and remove cost type columns' From actuarial, economical and financial analytics perspective - this would really help because we can now support incurred_date, paid_date, billed_date AND we can support so many different types of cost_type's using standardized concept_id's.

@cgreich would like to discuss this at next cdm vocab workgroup - if possible.

clairblacketer commented 7 years ago

Hi @gowthamrao you have been added to the agenda for 9/5

cgreich commented 7 years ago

@gowthamrao:

Actually, bring up both: The numerical ID and the cost. Even though I am not getting what you mean by "cost_type_concept_id and remove cost type columns"

gowthamrao commented 7 years ago

Will discuss them in the workgroup, but for those who read in advance.

remove wide columns that represent cost types: the proposal is to convert wide to long tables. i.e. currently we have certain types of costs represented in cost table. total_charge total_cost total_paid paid_by_payer paid_by_patient paid_patient_copay paid_patient_coinsurance paid_patient_deductible paid_by_primary paid_ingredient_cost paid_dispensing_fee amount_allowed some of these are only relevant to drug_occurrence table paid_ingredient_cost paid_dispensing_fee some these are redundant total_cost total_paid paid_by_payer paid_by_primary if not redundant, they are just not clean. We have an opportunity to make it clean. The number of cost types can be very long -- in the world of 'payment innovation', lot of new type of payment systems are happening - including 'care coordination', 'bonus', 'shared savings', etc. How many columns are we going to add?

We cant add more columns -- we need make this table long form, by creating a cost type concept-id. That way we can do have 'source_concept_id' and 'standardized_concept_id' -- and do stanardized/distributed OHDSI cost studies, AND, support local studies.

gowthamrao commented 7 years ago

@cgreich @abedtashh another point in favor of using integers for the *_domain_id is we are using integers for _domain_id in FACT_RELATIONSHIP table

cgreich commented 7 years ago

Why is that any better?

gowthamrao commented 7 years ago

@cgreich why use concept_id vs concept name? because concept_id's are integer, and concept_name are text -- and integers are more efficient

gowthamrao commented 7 years ago

@clairblacketer is there a way for me to edit the proposal?

clairblacketer commented 7 years ago

Hi @gowthamrao unfortunately not within github itself. However, you can either send me your edits and I can add them here or you can create a new issue with your edits referencing this one.

gowthamrao commented 7 years ago

changed proposal from cost_domain_concept_id to cost_source_table_concept_id

@clairblacketer then could you please do the following changes to the proposal

cost_domain_id --> cost_domain_concept_id to clarify based on comments above. Original cost_domain_id is 'character' field, we want a new integer field called cost_domain_concept_id which is an integer that currently may be one of the following.

8 Visit 10 Procedure 13 Drug 17 Device 19 Condition 21 Measurement 27 Observation 36 Specimen

cost_concept_id relies on new to be created concept_id's

clairblacketer commented 7 years ago

@gowthamrao done!

gowthamrao commented 7 years ago

Anybody know the rationale behind leaving revenue_code/drg in cost_table? I think observation table with ability to link to visit tables and cost (via visit table) is the right place for revenue_code/drg

gowthamrao commented 7 years ago

@clairblacketer after discussing wtih @cgreich - want to change the proposal as a follows. Withdraw the field cost_domain_concept_id and replace with cost_record_table_concept_id.

Explanation: _domain_concept_id --> was proposed, but is being withdrawn. This is because it is "dirty", as more than one table may belong to the same domain e.g. drug and drug_era, condition_occurrence and condition_era, or visit_occurrence and visit_detail. We cannot tell if the cost_event_id is referencing the visit_occurrence or visit_detail, condition_occurrence .

But we need an alternative, because current cost CDM uses cost_domain_id which is as dirty as _domain_concept_id but also is inefficienct because it is a text search instead of an efficient integer search.

To solve linkage with clinical-event tables (visit_occurrence_id -- cost_event_id or condition_occurrence_id -- cost_id) we are proposing a new field called cost_source_table_concept_id. This proposal is both a new field in Cost CDM, and also new vocabulary.

New vocabulary: For every table in the OMOP CDM we will need to create a new concept_id. See attached. It will be a one time task of inserting into omop vocabulary with maintenance when a new table is added to the omop cdm. See proposed concepts for event_table_concept_id

Advantages: The cost table will now be linkable to any clinical event table using a combination of cost_source_table_concept_id and cost_event_id - where cost_event_id is the FK to the PK of the table represented in the cost_source_table_concept_id.

This will make the queries faster and cleaner.

There are other advantages - e.g. cohort table. The subject_id in the cohort_table maybe the person_id, provider_id, visit_occurrence_id, visit_detail_id --- as a cohort maybe more than just a person (although we currently use it almost exclusively as a person). Adding this table concept_id (may - not part of this proposal) allow us to create cohorts that using a combination of subject_id and _source_table_concept_id is linkable to the table.

gowthamrao commented 6 years ago

Cost Table Changes (Add Person_id, Dates and normalize)

Proposal Owner: Gowtham Rao, Chris Knoll, Klaus Bonadt

Discussion: forum post

Cost table description: COST

Proposal overview:

1. Add Person_id to Cost table

2. Add billed_date, paid_date:

Add two new date fields in the cost table. billed_date and paid_date Both fields are optional (Required = No), they are date fields Costs are associated with a visit_occurrence, procedure_occurrence, drug_occurrence, observation, device etc. There are generally three types of dates associated with costs. Incurred date, Billed Date and Paid date. Incurred date is the date of the service. They are captured in the respective visit, procedure etc. Billed date and paid date are not captured in OMOP CDM. Use cases for these are listed below

Use cases: Health economics and actuarial analysis use incurred date, billed date and paid date in the formulas Incurred But not reported: https://en.wikipedia.org/wiki/Incurred_but_not_reported amount owed by an insurer to all valid claimants who have had a covered loss but have not yet reported it. Very important for claim reserves estimation (represent the money which should be held by the insurer so as to be able to meet all future claims arising from policies currently in force and policies written in the past.) Completion factor trend analysis

Analytic questions: Trend analysis for operational efficiency - claims adjudication rate, completion factor, IBNR, claims reserve estimation

Importance: New use cases around financial and actuarial departments of organizations This will expand the OHDSI/OMOP footprint

Consequence of doing it: Adoption of Cost table may increase. New use cases that serve the needs of financial entities in an organization will expand the adoption of the OMOP CDM.

Consequences of not doing it: Query optimization is difficult. Analyst has to write complex queries. Development of standardized tools may be delayed

3. Change Cost table structure - from wide to tall by using cost_concept_id and cost_source_concept_id

Proposed structure of new table is below The idea is that instead of making each cost-type a column (i.e. wide representation), lets convert to long representation. We will leverage _conceptid for this. _source_concept_id will represent the source-vocabulary, which will be crosswalked to standard _concept_id for standardized analytics. cost_concept_id will be used to identify standard cost types. This allows to generalize the cost table and we represent arbitrary number of cost types and supports many different cost types by adding concept-id's for cost. We can also represent international and custom use cases. The concept_ids proposed in this spreadsheet

4. Add new field _event_table_conceptid

This field will allow cost table to be linked to any OMOP CDM table for which cost is being represented. cost_table_concept_id will be used to infer the source of the cost information thru vocabulary look-up and then joining by the respective pk. e.g. if the event_table_concept_id points to visit_occurrence table, then cost_event_id = visit_occurrence_id. See attached concepts spreadsheet with proposed new OMOP concepts

Field Required Type Description
cost_id Yes integer A unique identifier for each COST record.
person_id Yes integer A unique identifier for each person.
cost_event_id Yes ? No integer A foreign key identifier to the event (e.g. Measurement, Procedure, Visit, Drug Exposure, etc) record for which cost data are recorded.
cost_domain_id Yes string(50) The concept id representing the domain of the cost event, from which the corresponding table can be inferred that contains the entity for which cost information is recorded.
cost_event_table_concept_id Yes integer A foreign key identifier to a concept in the CONCEPT table representing the identity of the table whose primary key is equal to cost_event_id
cost_concept_id Yes integer A foreign key that refers to a Standard Cost Concept identifier in the Standardized Vocabularies.
cost_type_concept_id Yes integer A foreign key identifier to a concept in the CONCEPT table for the provenance or the source of the COST data: Co-ordination of benefits, Calculated from insurance claim information, provider revenue, calculated from cost-to-charge ratio, reported from accounting database, etc.
cost_source_concept_id Yes integer A foreign key to a Cost Concept that refers to the code used in the source.
currency_concept_id Yes integer A foreign key identifier to the concept representing the 3-letter code used to delineate international currencies, such as USD for US Dollar.
cost Yes float The actual financial cost amount
incurred_date Yes date The first date of service of the clinical event corresponding to the cost as in table capturing the information (e.g. date of visit, date of procedure, date of condition, date of drug etc).
billed_date No date The date a bill was generated for a service or encounter
paid_date No date The date payment was received for a service or encounter
revenue_code_concept_id No integer A foreign key referring to a Standard Concept ID in the Standardized Vocabularies for Revenue codes.
drg_concept_id No integer A foreign key referring to a Standard Concept ID in the Standardized Vocabularies for DRG codes.
revenue_code_source_value No string(50) The source code for the Revenue code as it appears in the source data, stored here for reference.
drg_source_value No string(50) The source code for the 3-digit DRG source code as it appears in the source data, stored here for reference.
payer_plan_period_id No integer A foreign key to the PAYER_PLAN_PERIOD table, where the details of the Payer, Plan and Family are stored. Record the payer_plan_id that relates to the payer who contributed to the paid_by_payer field.
don-torok commented 6 years ago

Whenever possible the description for '_concept_id' fields should include the vocabulary or domain to aid in the ETL getting the correct concepts. For example drg_concept_id: Standard concepts where vocabulary_id = 'DRG' revenue_code_concept_id: Standard concept where vocabulary_id = 'Revenue Code' currency_concept_id: ... vocabulary id = 'Currency'

Description for event_source_table_concept_id should state that this is for a new set of concepts, and since the proposal cannot be accepted with out the concepts being defined can also give the new vocabulary id. BTW, would not 'event_table_concept_id 'be a better name. When I see '_source_concept_id' I then look for a '_concept_id (e.g. drug_concept_id, drug_source_concept_id pairing)

If the cost_concept_id is to in reference to a new set of concept, proposal should say that and again provide the vocabulary name. I also think the new concepts need a definition for those of us that are not so closely tied to the payer space but still need to implement the ETL. It was fairly simple at one time because most of the column name in claim tables matched the names in the cost table, but that does not apply to EHR systems. Also is 'Paid by all payers' a summary, or is this a value that is likely to exist in a claims database?

Are we moving to always use date time?

cost_source_concept_id: I do not understand this column, assuming it is the 'source' that corresponds to the cost_concept_id, what concepts are these going to point to?

In proposed concepts Concepts.xlsx, I think the domain should be 'Metadata' instead of 'Type Concept.

gowthamrao commented 6 years ago

Recap from the Oct 3rd 2017 CDM workgroup, forums and symposium discussion

  1. Presented the need for person_id and incurred_date: discussion held, most in agreement. Details in first post.
  2. Add billed_date and paid_date: the proposed table will have total of three dates (incurred, billed, paid). @cgreich questioned the use-case supporting it. @pbr6cornell supported use-case of financial anlaytics. Details in first post.
  3. Change Cost table structure - to "normalized" version by using cost_concept_id and cost_source_concept_id: we went half-way thru this conversation, till we ran out of time. Discussions continued here . Most liked the pivoting of cost table. Concern 1: There was concern that this would increase the work during ETL, but the new structure allows for more flexibility and is not constrained by the cost-types restricted to legacy claims systems. @don-torok and ericaVoss. Concern 2: There was concern that end-users may add up cost components along with total-costs - leading to erroneous calculation due to double counting. It was proposed to have separate columns for total cost and component costs to avoid that, but that would make standardized analytics difficult. Many data sources do not have both components of cost and total cost for an encounter, and even if they do - many times components don't always add-up to total cost. To avoid ambiguity, the field name was changed from originally proposed 'Total Cost' to just 'Cost'. Follow-up conversations were held, to use the vocabulary to handle the concept-standardization.
gowthamrao commented 6 years ago

@don-torok
for event_table_concept_id please review this spreadsheet for proposed concept_id's. This is a thought -- why not assign every table in OMOP CDM should have a corresponding concept_id. @cgreich @pbr6cornell thoughts? The orange section is the proposed vocabulary, and the green section are the details. Discussion is here . These new concept_id's will be used for the proposed event_table_concept_id, but using this approach, it may also be useful in many visualization applications - where the app only needs to know the concept_id.

gowthamrao commented 6 years ago

@don-torok is this what you were recommending? Cost concepts

don-torok commented 6 years ago

Yes and no (: You have to understand that I am looking at this as someone who will be doing the ETL. So I will be trying to align what I see in the source documentation to columns, or in this case attribute concepts, in the CDM. The first part of the definition for co-payment is helpful

Co-payment is the fixed amount that is paid for a covered health care service. This payment is usually in addition to deductible. Also known as 'copays'. Copayments may differ based on services within the same plan, like drugs, lab tests, and visits to specialists.

It supplies a brief description, gives a couple of options for source of how the data may be described in the source documentation, co-payment or copays and it tells me it is not the same as a deductible.

However, this part of the description offers little help in doing the ETL defining the attribute in the CDM. Generally plans with lower monthly premium have higher copayments, and plans with higher monthly premiums have lower copayments.

Another example of where a detailed definition needed is to distinguish between Pharmacy ingredient (the amount charged by the wholesale distributor or manufacturer ) and Average Wholesale. I would like sufficient information to help me determine if a column labeled 'Wholesale cost' should go into Pharmacy ingredient or Average Wholesale.

gowthamrao commented 6 years ago

Recap from CDM workgroup on Nov 7th 2017

gowthamrao commented 6 years ago

@don-torok please check the updated spreadsheet here

Please provide input on cost concepts being proposed.

don-torok commented 6 years ago

To help me understand how the various concepts relate to the workbook. Can you state what are allowable domains for cost_domain_id, cost_concept_id and cost_type_concept_id. I see three new concepts but only two new domains type_concept and cost.

gowthamrao commented 6 years ago

Cost_domain_id is a string, it is not aconcept_id. It's always been there.. See current cost table.

Cost_concept_id only had one domain. Cost.

Cost_type_concept_id only has one domain. Type Concept

vojtechhuser commented 6 years ago

We may also need cost type level conventions. E.g., to calculate total premium cost per person, do I sum up over time all three types:

Premium, employee contribution
Premium, employer contribution
Premium

Or just Premium (because a convention will tell me that if I ever provide the first two, I have to provide the total in the premium cost type.

What if premiums are paid quarterly, or yearly?

gowthamrao commented 6 years ago

Interesting proposal @vojtechhuser , I will add it to the proposed concepts list. This type-concept (premium) would be unrelated to any visit or other cdm table then correct? The date payment of premium would be incurred, billed, paid - dates. In cases of premium, incurred date would be the date policy becomes active, billed would be the date the insurance company billed for the premium or the day the payment was due, and paid date is the date the premium was paid

Thoughts?

clairblacketer commented 5 years ago

added in v6.0

jenniferduryea commented 3 years ago

I know this thread is old and closed, but I do not see the specs for the updated COST table in the V6.0 documentation, specifically here: https://ohdsi.github.io/CommonDataModel/cdm60.html#COST. @clairblacketer , @gowthamrao - is this just a mistake? Where can I get updated specs for the cost table for v6? Thanks!

cgreich commented 3 years ago

What's missing, @jenniferduryea?

clairblacketer commented 3 years ago

I see the problem @cgreich @jenniferduryea. There is no person_id or dates as the issue suggests. I’ll push a fix in the next few days.

jenniferduryea commented 3 years ago

@clairblacketer and @cgreich both COST tables under the v6.0 and v5.3.1 specs are the same, even though the table has been converted from Wide to Long (unless you kept it wide? which was not my impression). The accompanying wording for ETL convention and description of the table does not reflect the "wide to long" conversion. Also, v6 still shows columns for paid_patient_copay, paid_patient_coinsurance, etc which I believe are now stored under the cost.cost_concept_id field for V6. Basically, the whole COST table has not been updated in the V6 spec from V5.3.1 to reflect all of the changes above.

clairblacketer commented 3 years ago

Labelled as Documentation to be fixed in the 2021 CDM Hackathon.

Final Decision

Fix the COST table in the current v6.0 specs and moving forward with v5.4. The correct specification can be found here.