OHDSI / ETL-CDMBuilder

ETL-CDMBuilder is a repo containing a .NET Core application to perform ETL to OMOP CDM for multiple databases
Apache License 2.0
50 stars 30 forks source link

Mapping to invalid concepts when valid concepts available in MarketScan ETLs #49

Closed schuemie closed 3 years ago

schuemie commented 5 years ago

For example: currently the ICD-10 code T42.6X2D (Poisoning by other antiepileptic and sedative-hypnotic drugs, intentional self-harm, subsequent encounter) is currently mapped to concept ID 35216220, which is invalid. Instead, it should map to 45594639.

Perhaps relevant: the invalid concept has concept code 'T42.6x2D', whereas the valid concept has 'T42.6X2D', so a capital 'D'.

bradanton commented 5 years ago

I have checked and it is mapped to 45594639, for ICD-10 in MarketScan only valid codes are used
(TARGET_INVALID_REASON IS NULL or TARGET_INVALID_REASON = '')

https://github.com/OHDSI/ETL-CDMBuilder/blob/master/bin/TruvenV5/Lookups/Condition_ICD10.sql

35216220 has invalid_reason='D'

schuemie commented 5 years ago

Sorry, when I said 'mapped' I meant that it is assigned this source concept ID. From there it does not map to anything, because that source concept is valid.

I observed this in our internal MDCD database:

SELECT condition_source_concept_id 
FROM cdm.condition_occurrence 
WHERE condition_source_value = 'T426X2D' 
LIMIT 1;

--condition_source_concept_id
----
--35216220

SELECT * FROM cdm._version;
--version_id | version_date
---- | --
--780 | 2018-08-31
bradanton commented 5 years ago

We use different filters for condition_concept_id and condition_source_concept_id.

From specification (https://github.com/OHDSI/ETL-CDMBuilder/blob/master/man/TRUVEN_MDCD/Truven_MDCD_ETL_CDM_V5.2.0.docx) :

CONDITION_CONCEPT_ID - WHERE SOURCE_VOCABULARY_ID IN ('ICD10CM') AND TARGET_STANDARD_CONCEPT IS NOT NULL AND TARGET_INVALID_REASON IS NULL

and

CONDITION_SOURCE_CONCEPT_ID - WHERE SOURCE_VOCABULARY_ID IN ('ICD10CM') AND TARGET_VOCABULARY_ID IN ('ICD10CM')

select distinct condition_concept_id,       
       condition_source_concept_id
from CDM_Truven_MDCD_v780.dbo.condition_occurrence 
WHERE condition_source_value = 'T426X2D' 
condition_concept_id    condition_source_concept_id
443283  35216220
schuemie commented 5 years ago

I'm not sure, but I think we should change that rule to also filter source_concept_ids to valid concepts. Right now I can't use these concepts in ATLAS because the SQL ATLAS generates always adds and invalid_reason is null when building a concept set.

@clairblacketer , @chrisknoll , any thoughts?

chrisknoll commented 5 years ago

My primary thought is that the vocabulary should remove all 'invalid' concepts so that we don't have this confusion.

For this specific case, are they saying they may use an invalid concept to map to a condition_source_concept_id? I agree with you @schuemie, that rule should be changed. We shouldn't find invalid concepts anywhere in our final ETL. That's why atlas doesn't let you pick those (because there's so many invalid concepts in the vocabulary you'd easily make a mistake. See my first comment for my thoughts on that).

cgreich commented 5 years ago

T42.6X2D (Poisoning by other antiepileptic and sedative-hypnotic drugs, intentional self-harm, subsequent encounter) is currently mapped to concept ID 35216220, which is invalid. Instead, it should map to 45594639. the invalid concept has concept code 'T42.6x2D'

Сr***. This thing should not exist. Will figure out how it snuck in again. We have been fighting them for a while now.

All ICD codes have capital letters. If MarketScan gives you erroneous lowercase you have to fix that during ETL. And I really wished you guys would switch on case-sensitivity in your SQL Server installation there. Usually @ericaVoss gives me grief by finding those "issues", when they turn out case sensitivity issues. concept_code values have to be case sensitive.

for ICD-10 in MarketScan only valid codes are used

MarketScan reports ICD-10-CM codes, not ICD-10 codes. And the validity is a function of time. What used to be valid can be invalid now. Truven does not go back to the old records and replaces codes which concluded their life cycle. So, no matter how much we love this product, we need to carry the invalid Concepts forward indefinitely.

SQL ATLAS generates always adds and invalid_reason is null when building a concept set

When you want to build cohorts based on Source Concepts you will have to allow invalid ones. Not sure what Atlas does now, I try avoiding those at all costs anyway. I'd rather fix the mapping.

We shouldn't find invalid concepts anywhere in our final ETL

Again, in the source_concept_id you have to, in the concept_id you never will. All Standard Concepts have to be valid, otherwise they can't be Standard.

dimshitc commented 5 years ago

Сr***. This thing should not exist.

It doesn't exist already.

concept_id concept_name domain_id vocabulary_id concept_class_id standard_concept concept_code valid_start_date valid_end_date invalid_reason
35216220 Duplicate of ICD10CM Concept, do not use, use replacement from CONCEPT_RELATIONSHIP table instead Condition ICD10CM 7-char billing code   35216220 2007-01-01 2017-12-31 D

vocabulary version = '10-JAN-19'

schuemie commented 5 years ago

Thanks. I see you're right. The concept hasn't been removed altogether, but because the concept code is set to the concept ID we would no longer find it.

I agree with @cgreich's point: since the source_concept_id field is meant to just represent whatever is in the source data, we can't be picky and say some things aren't allowed. @chrisknoll : what exactly would be the concern in not filtering these in the Circe SQL? Now it is actually confusing, because ATLAS does show these concepts in the concept sets, but when building the cohort they are filtered out.

cgreich commented 5 years ago

It doesn't exist already.

Thanks, @dimshitc, but please also eradicate the concept_code. Only the concept_id HAS to be perpetuated. I really don't want anybody even with good intentions (or bad database settings) to pull this up any longer.

chrisknoll commented 5 years ago

The issue I have is the confusion with what 'invalid' means. Back in 2013 when we were first using CDMv5 vocabulary,w e had a lengthy discussion without the concepts with 'invalid reason' values that are not-null, and the conclusion at that time was that those concepts with a non-null invalid reason were left behind for debugging purposes. But those concepts wouldn't be used anywhere in the ETL (and this goes for non-standard concepts). there is a valid_start and valid_end of concepts such that you use that information to detemrine if a given record should be mapped to a source concept (only bring it in if the date of the record is withint he valid_start and valid_end of the concept).

So, before letting invalid concepts into a concept set expression's result (and having unknown consequences) , I'd like to get clarification on why these concepts are 'invalid' or if they should have even been marked as invalid.

As an example: consider a non-standard concept for an NDC code that has been reused twice. You would have 2 concept records:

concept_id concept_name domain_id vocabulary_id concept_class_id standard_concept concept_code valid_start_date valid_end_date invalid_reason
10001 NDC From 2001 Drug NDC Branded Drug   NDC001 2001-01-01 2006-12-31
78251 NDC From 2012 Drug NDC Branded Drug   NDC001 2012-01-01 9999-99-99

It feels like the approach is to mark concept ID 10001 (the old NDC code) as invalid. But why? It's a valid cocnept for NDC source code NDC001 from 200-through 2006. It can still be used in mapping, and it's still a good concept code. My arguemnt is to leave both of those in as valid and stop putting invalid concepts into the vocabulary (so we don't have this confusion over and over).

In @dimshitc of the ICD that has a name of 'Duplicate of...', why is that even here? The description says not to use, why would you want to include it in the vocabulary if you are explicitly told not to use it?

cgreich commented 5 years ago

@chrisknoll:

the conclusion at that time was that those concepts with a non-null invalid reason were left behind for debugging purposes.

Not quite. The validity defines whether the code originators maintain the semantic content of a concept. They (which could be us) could deprecate concepts for different reasons:

BTW: the taking of a drug off the market is not a reason for the concept to die, in contrast to what people sometimes believe.

When that happens, we also deprecate a concept. However, till that deprecation the concept was valid, and it was in the data. So, we have to keep it.

letting invalid concepts into a concept set expression's result

If you do the right thing and use Standard Concepts you are safe. We will not have invalid Concepts as Standard. We redirect all mapping traffic from source codes away to other, valid, Standard Concepts when that happens.

However, if you create concept sets of source concepts you will have to allow invalid ones, because they used to be valid back in the time. How else do you want to deal with that problem?

consider a non-standard concept for an NDC code that has been reused twice

Actually, that is not the use case we are covering here. Even though there is a lot of noise about these reused NDCs, in practice this is exceptionally rare (12 weirdo NDCs in 600k proper ones), and is no longer going to happen in future. Right now, we can only see the "latest" reused code. Because within a vocabulary we allow only unique concept_codes (with the exception of DRGs).

the ICD that has a name of 'Duplicate of...', why is that even here?

Because that is an artifact from a violation of the OMOP Standardized Vocabularies rules: within a vocabulary each concept_code must be unique. Of course, if you bring in codes from some fishy source where the concept_code has lower case letters instead of upper case it will escape the QA, but is still wrong. These happened twice in 2012 and 2015 or so. So, we need to kill and burn every one of these pseudo-duplicates, so that an innocent soul like @schuemie cannot bump into them even by chance. But we need to keep the concept_id, in case somebody used it in the past.

clairblacketer commented 5 years ago

@schuemie I agree that the "invalid_reason is null" filter should be removed from the CIRCE sql relating to source_concept_ids. We do not impose this constraint on our source codes when ETL-ing so it stands to reason we should be able to use them in ATLAS.

I think this and the F32.9 issue we say the other day may, in part, be due to our outdated vocabulary. Hopefully once we adopt the new one some of these problems will be solved.

chrisknoll commented 5 years ago

However, if you create concept sets of source concepts you will have to allow invalid ones, because they used to be valid back in the time. How else do you want to deal with that problem?

If they are valid, then they are valid! even tho they are no longer in use, you have a valid_start and valid_end date for every concepts. So, to answer your question, you handle it with looking at the valid_start and valid_end date and act accordingly.

The problem is you are conflating 'concepts that expire' with 'we made a mistake in creating a concept'. Currently, the cohort expression is not going to return any concepts that say 'this was a mistake, due to typo, duplicate, whatever'. that's all rolled up into invalid_reason. but you're saying 'oh but they were valid at one time!'. Well, then they shouldn't be invalidated, and you should leverage valid_start and valid_end to support that.

@schuemie , @clairblacketer , i'm trying to come to a solution that works in all cases. We wouldn't have the ETL problem if there weren't invalid concepts being put into the vocabulary to confuse the ETL. Likewise, you wouldn't need to look for these confusing concepts in circe if they weren't there in the first place. Regardless of making the change in circe, we're still left with this ambiguous idea of 'what is an invalid concept, and why should we be using invalid concepts in any part of our analysis when they are...invalid'

chrisknoll commented 5 years ago

I agree that the "invalid_reason is null" filter should be removed from the CIRCE sql relating to source_concept_ids

@clairblacketer : when creating a concept set, there's nothing in there saying 'this is for source concepts' vs. 'this is for standard concepts'. It's just an expression that yields a list of concept IDs, so I'm not sure how to address your comemnt when you say 'when dealing with source_concept_ids'

cgreich commented 5 years ago

@chrisknoll:

Not sure what you are suggesting or saying.

So, you handle it with looking at the valid_start and valid_end date and act accordingly

Hm. But the invalid_reason is nothing else but a valid_end_date<=31-Dec-2099 (or really, <=SYSDATE of the day of the release). Strictly speaking, invalid_reason is a redundant field and should not be there.

But the concept was valid before. Then, somebody, not us, said "do no longer use it, it's wrong, for reason XYZ", or without giving a reason. We capture that fact. And we stop using it as a Standard Concept and find another one instead. I don't see a problem.

For the cohort definition in ATLAS, use Standard Concepts. Then you don't have any problem. If, however, you insist on using Source Concepts you have to deal with the fact that the Concept maker deprecated the concept at some point in time, and it stopped showing up in the data. Nothing we can do.

The above situation, where we made a mistake and had to invalidate (duplicate ICD10CMs with lower case letters in concept_code), is extremely rare. We don't just invalidate these, we burn them, so they cannot possibly accidentally used anymore. Well, unless some of them slip through, like in the case that @schuemie found.

chrisknoll commented 5 years ago

@cgreich , What I was saying/suggesting is how to interpret the invalid_reason column. My assumption was that concepts are 'usable' as long as invalid_reason is not null. the valid_start_date and valid_end_date was a separate set of information about when the source concept is 'active', such that if I look for source Concept 001 which is valid only in 2001, i'd only find that source concept on records that were dated in 2001.

But, as you've explained, the invalid_reason is just a proxy for the valid_end_date being 'valid as of the date of the vocabulary release'. That doesn't explain how 'D' vs. 'U' values get into the reason, but I don't really want to know. It sounds like I should just ignore invalid_reason because:

  1. For standard concepts, they are always invalid_reason is null.
  2. For everything else, the ETL will use the concept only if the record exists during the valid_start and valid_end of the concept, so regardless of the invalid_reason value, we should look for those concepts.

So, I can remove the is null check from the concept set query, since it doesn't seem to matter what invalid_reason is. If you can explain the use case where you would need to know that invalid_reason is null, I'd like to hear it. It doesn't sound like there's a use case for it.

chrisknoll commented 5 years ago

I was just creating an issue for this in circe-be, and found a prior issue request that led to this thread:

http://forums.ohdsi.org/t/proposal-to-keep-outdated-standard-concepts-active-and-standard/3695/23

@cgreich , were you aware of this decision? Or this this a special case of standard concepts vs. another special case of non-standard concepts?

clairblacketer commented 5 years ago

@chrisknoll this was a CDM workgroup decision specifically for CPT4 codes, not for all invalid concepts

chrisknoll commented 5 years ago

Thanks for the clarification @clairblacketer . Hopefully we can have a consistent handling of these fields in the future.

cgreich commented 5 years ago

@chrisknoll:

Good catch. This is a bandaid. We need to fix it properly. Reason we are having it is because for HCPCS and CPT4 we don't have standard concepts, but those are the standard concepts themselves. When the AMA or CMS decide deprecate one of them, we need to do so as well, making them invalid and non-standard. Which means, these codes cannot be used in the concept_id fields anymore, they have to be set to 0. As a stopgap we created this interim solution. In the long run, @dimshitc's procedure hierarchy will solve it through either remapping them or creating Extension concepts catching the content.

cgreich commented 5 years ago

@chrisknoll - and your question before:

the invalid_reason is just a proxy for the valid_end_date being 'valid as of the date of the vocabulary release.

Correct.

That doesn't explain how 'D' vs. 'U' values get into the reason, but I don't really want to know.

Too late. :) For deprecated ones, if there is a "concept replaced by" relationship to another standard concept the invalid_reason becomes "U" for "updated". Otherwise, "D" for "deprecated". But you could glean that information from the CONCEPT_RELATIONSHIP table just as much.

It sounds like I should just ignore invalid_reason

You could, if you want to get that info from valid_end_date and CONCEPT_RELATIONSHIP. Except, the CPT4 and HCPCS bandaid discussed above.

So: If I were you I would:

if I look for source Concept 001 which is valid only in 2001, i'd only find that source concept on records that were dated in 2001

I would not bank on that. For example, when the CMS deprecates a HCPCS it will take a while till every country yokel doctor's office has realized that and stops using it. We could run a nice query in the data and see the decay in the use of deprecated codes over time. But it is sure as hell not abrupt.

If you can explain the use case where you would need to know that invalid_reason is null, I'd like to hear it

  1. For the understanding of the use of concepts over time. That could have an effect of incidence rates of things.
  2. For picking concepts in ATLAS for conceptsets. You type something in and get 3 SNOMED concepts back, 3 of which might be updated and refer to the third.
  3. The vocabulary team of cause uses this information heavily to manage the concepts and relationships.