OHDSI / OncologyWG

Oncology Working Group Repository
https://ohdsi.github.io/OncologyWG
Apache License 2.0

Ingest ICDO 3.2 into the OMOP vocabulary tables. #84

Open mgurley opened 5 years ago

mgurley commented 5 years ago

Not needed immediately. But entering this to put it on the horizon. See here:

http://www.iacr.com.fr/index.php?option=com_content&view=category&layout=blog&id=100&Itemid=577

mgurley commented 4 years ago

This looks like it has been completed. However, it appears that we did not include all the semantic content possible from the ICD-O-3.2 downloadable file. See comment laying out the issue:

https://github.com/OHDSI/OncologyWG/issues/238#issuecomment-570099519

There are 6 different "levels" or semantic types in the ICD-O-3.2 download: "Preferred", "Synonym", "Related", "1", "2" and "3". It looks like we only ingested the "Preferred" and "Synonym" levels into the OMOP vocabulary. The "Related" level looks like a kind of synonym. I think CONCEPT_SYNONYM is the only logical place for us to put them. "1", "2" and "3" look to be classification concepts. I think having them in the OMOP vocabulary would provide great value for folks wanting to pull groups of histologies into higher-level concepts.

ekorchmar commented 4 years ago

Hello! I have worked on the most recent ICDO3 Vocabulary release.

We indeed have skipped "Related" terms altogether, because they are distinctly not equivalent to terms with the same codes in "Synonym" and "Preferred" level, and are rather subtypes of these ideas. For instance, there is a code 9971/1 with Preferred term "Post-transplant lymphoproliferative disorder, NOS" and Related term "Polymorphic post-transplant lymphoproliferative disorder". They are not fully equivalent. Thus, encountering the code 9971/1 in patient data always means PTLD, but not necessarily Polymorphic PTLD.

If we put Polymorphic PTLD in synonyms with PTLD, it is possible that people looking specifically for Polymorphic PTLD will get generic PTLD instead and use it in their concept sets, unaware of the limitations of the ICD-O-3 coding system.

In some cases Related terms later turn out to be equivalent Synonyms, as is evident from the ICD-O-3 changelogs. But as a rule we don't try to outsmart the source data, so we ingest everything as is.

If it is necessary, we could preserve Related terms as separate concepts with mock codes, and build a special relationship to their original concepts. This way, researchers could find out how their desired concepts would actually be coded. But do we have a valid use-case to do so? The ICD-O-3 implementation is complex enough as is, with combination codes for conditions.

As for classification levels 1-3, we could indeed include them. What should they look like? Classification Condition domain concepts that have all possible conditions as descendants? Or just use them as Observations to group ICDO Histology concepts? It's all down to use-cases.

mgurley commented 4 years ago

@ekorchmar

Yes, this is a conundrum. The fact that ICDO3.2 reuses codes across preferred and related terms is bad. Pathologists will declare and consider them as distinct diagnoses. So I guess having them only as entries in CONCEPT_SYNONYM is inappropriate. So creating them as full-fledged entries in Observation seems to be the right approach. But I know creating duplicate concept_code entries within the same vocabulary is not allowed. So we would need to, as you suggest, create mock codes. I think this is the right approach.

Folks coming from NAACCR tumor registry data will not have any more context to choose than the 'Preferred' term (only having the ICDO3 code) but folks extracting ICDO3 diagnoses from pathology LIMS systems (from discrete CAP eCC data or curated via NLP-aided chart abstraction) will have the context of the fully declared diagnosis to be able to pick the 'Related' term.

For example, at NU we are extracting ICDO3 diagnoses from brain tumor pathology reports via NLP-aided chart abstraction. We have had to manually load the full ICDO 3.2 vocabulary outside of the OMOP vocabulary because the full ICDO 3.2 version is not fully supported in the OMOP vocabulary. Our NLP needs to have the token variations to be able to recognize the 'Related' terms. Plus, investigators want to be able to distinguish the 'Preferred' from the 'Related' diagnoses.

Regarding the classification levels, I think creating them as Classification Observation domain concepts would make the most sense. So just using them to group ICDO histology concepts. We can always join to Condition concepts via CONCEPT_RELATIONSHIP to find the participating 'ICDO Condition' concepts.
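
A minimal sketch of that grouping idea, using Python's built-in sqlite3 module. The table and column names are a small subset of the OMOP CONCEPT and CONCEPT_RELATIONSHIP tables. Everything else is an assumption made for illustration: the concept IDs, the codes GROUP-1 and COND-1, the concept class 'Hypothetical Group', and the relationship names 'Groups histology' and 'Has histology' are placeholders, not existing OMOP content.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE concept (
    concept_id       INTEGER PRIMARY KEY,
    concept_name     TEXT,
    domain_id        TEXT,
    concept_class_id TEXT,
    concept_code     TEXT
);
CREATE TABLE concept_relationship (
    concept_id_1    INTEGER,
    concept_id_2    INTEGER,
    relationship_id TEXT
);
""")

# Placeholder rows: one grouping concept, one histology, one condition.
conn.executemany("INSERT INTO concept VALUES (?, ?, ?, ?, ?)", [
    (1, "Placeholder histology group", "Observation", "Hypothetical Group", "GROUP-1"),
    (2, "Post-transplant lymphoproliferative disorder, NOS", "Observation", "ICDO Histology", "9971/1"),
    (3, "Placeholder condition with histology 9971/1", "Condition", "ICDO Condition", "COND-1"),
])
conn.executemany("INSERT INTO concept_relationship VALUES (?, ?, ?)", [
    (1, 2, "Groups histology"),  # hypothetical: group -> histology
    (3, 2, "Has histology"),     # hypothetical: condition -> histology
])

def conditions_in_group(conn, group_code):
    """Return codes of Condition concepts whose histology belongs to the group."""
    sql = """
        SELECT cond.concept_code
        FROM concept grp
        JOIN concept_relationship g2h
          ON g2h.concept_id_1 = grp.concept_id
         AND g2h.relationship_id = 'Groups histology'
        JOIN concept_relationship c2h
          ON c2h.concept_id_2 = g2h.concept_id_2
         AND c2h.relationship_id = 'Has histology'
        JOIN concept cond
          ON cond.concept_id = c2h.concept_id_1
        WHERE grp.concept_code = ?
          AND cond.domain_id = 'Condition'
        ORDER BY cond.concept_code
    """
    return [row[0] for row in conn.execute(sql, (group_code,))]

print(conditions_in_group(conn, "GROUP-1"))
```

The grouping concept never links to a Condition directly here; it reaches conditions only through the histology it groups, which is the two-step join described in the comment above.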

dimshitc commented 4 years ago

If we want a hierarchy, we need to make these concepts standard. To make them standard, we need to investigate whether they have equivalents in an existing standard terminology (SNOMED). This is time consuming. So, if @mgurley's team needs sets of related descriptions, they can use concept_relationship, something like: ICDO code - 'Has related term' - that term (the concept_code is generated, something like OMOP12345). Note that all these terms will be non-standard.
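
A rough sketch of how such stub concepts and links could be generated, assuming the source rows are available as (code, level, term) tuples. The function name, the dictionary layout, and the numbering scheme are hypothetical; only the 'Has related term' label and the OMOP-prefixed mock code pattern come from the comment above, and the sample rows reuse the 9971/1 example given earlier in the thread.

```python
from itertools import count

def build_related_stubs(rows, first_number=1):
    """Create non-standard stub concepts for Related terms.

    rows: iterable of (icdo_code, level, term) tuples.
    Returns (stubs, links), where links are
    (icdo_code, relationship_id, mock_code) tuples.
    """
    numbers = count(first_number)
    stubs, links = [], []
    for icdo_code, level, term in rows:
        if level != "Related":
            continue  # Preferred and Synonym rows are handled elsewhere
        mock_code = f"OMOP{next(numbers)}"  # generated code, not from ICD-O
        stubs.append({
            "concept_code": mock_code,
            "concept_name": term,
            "standard_concept": None,  # None marks the stub as non-standard
        })
        links.append((icdo_code, "Has related term", mock_code))
    return stubs, links

rows = [
    ("9971/1", "Preferred", "Post-transplant lymphoproliferative disorder, NOS"),
    ("9971/1", "Related", "Polymorphic post-transplant lymphoproliferative disorder"),
]
stubs, links = build_related_stubs(rows)
print(stubs)
print(links)
```

Because the mock code is generated, the stub never collides with the real ICD-O code, which sidesteps the rule against duplicate concept_code entries within one vocabulary.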

cgreich commented 4 years ago

Which is why I am cautious to add in all those vocabs right away. One thing at a time.

ekorchmar commented 4 years ago

I don't think Related concepts would ever need to be standard, since they would only exist as effective synonyms to point researchers to "real" concepts. However, I still can't imagine what the actual use-case is for such concepts in applied use of the OMOP CDM. They would have to be supported from release to release: the changelogs of ICD-O-3 sources are full of precedents where Related terms become Synonyms (if scientists deem them not really that different) or where Related terms become Preferred terms of entirely new codes (if they are different enough to be interesting). All these transitions would need separate logic to ensure consistency between releases. A lot of effort for very unclear use-cases.
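
To make the maintenance cost concrete, here is a small sketch of the kind of release-to-release check that would be needed. It assumes each release is a list of (code, level, term) tuples and that a term string appears at most once per release. The codes X001/3 and X002/3 and the "Placeholder tumor" terms are made up for illustration and are not real ICD-O-3 content.

```python
def related_term_transitions(old_rows, new_rows):
    """Classify what happened to each Related term between two releases."""
    # Assumption: a term string is unique within a release.
    new_by_term = {term: (code, level) for code, level, term in new_rows}
    transitions = {}
    for code, level, term in old_rows:
        if level != "Related":
            continue
        if term not in new_by_term:
            transitions[term] = "dropped"
            continue
        new_code, new_level = new_by_term[term]
        if new_level == "Related" and new_code == code:
            transitions[term] = "unchanged"
        elif new_level == "Synonym" and new_code == code:
            transitions[term] = "became synonym"
        elif new_level == "Preferred" and new_code != code:
            transitions[term] = "became preferred term of another code"
        else:
            transitions[term] = "other"
    return transitions

# Made-up releases covering the two transitions described above.
OLD = [
    ("X001/3", "Preferred", "Placeholder tumor, NOS"),
    ("X001/3", "Related", "Placeholder tumor, variant A"),
    ("X001/3", "Related", "Placeholder tumor, variant B"),
    ("X001/3", "Related", "Placeholder tumor, variant C"),
]
NEW = [
    ("X001/3", "Preferred", "Placeholder tumor, NOS"),
    ("X001/3", "Synonym", "Placeholder tumor, variant A"),
    ("X002/3", "Preferred", "Placeholder tumor, variant B"),
    ("X001/3", "Related", "Placeholder tumor, variant C"),
]

for term, outcome in related_term_transitions(OLD, NEW).items():
    print(f"{term}: {outcome}")
```

Each outcome other than "unchanged" would require its own handling for any stub concept that had been created for the term in an earlier release.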

As for Classification levels of histologies, I see no problem with having them. They are easy to implement, they are provided explicitly, and the use-cases also seem pretty clear: for example, various subtypes of lymphomas are very neatly grouped with different granularity. Should I put it in the backlog for the next vocabulary update?

cgreich commented 4 years ago

I agree with Edik. This is typical stuff for navigation screens. You go to assign a patient a code, and it will suggest things which are slightly right and left of what you are looking at. We don't need those relationships. We do need the concepts, though. And we have them, right, Edik?

ekorchmar commented 4 years ago

Related concepts? Currently they are not implemented at all. We could have them as stub concepts.

mgurley commented 4 years ago

@ekorchmar @cgreich

@ekorchmar said:

"We indeed have skipped "Related" terms altogether, because they are distinctly not equivalent to terms with the same codes in "Synonym" and "Preferred" level, and are rather subtypes of these ideas. " So the "Related" are not the same as the "Preferred". So, what if I want to study the subtypes? If they are present in my data, but I have no place to map them to in OMOP, then I am out of luck. The use case, to me, is rather obvious.

ekorchmar commented 4 years ago

The problem is that the ICDO3 model does not allow exact Related subtypes to be stored in data. It does not matter whether they are in OMOP or not. From the example above (https://github.com/OHDSI/OncologyWG/issues/84#issuecomment-570191908): if source data contains the entry '9971/1', it can originally mean simply "PTLD", or it could mean "Polymorphic PTLD", a subtype of "PTLD". But nothing tells us whether it is a subtype, unless there is another nonconventional field in a particular registry or a free-text diagnosis field.

The only purpose for such concepts would be to serve as waypoints. If I want to research specifically Polymorphic PTLD, the existence of such a waypoint would show me that this diagnosis is known in the ICDO3 vocabulary, but only as a not-really-that-different subtype of PTLD. Knowing this, I would now look for another source of truth to determine which patients were diagnosed with Polymorphic PTLD. Perhaps I would look for codes from other classifications, which can actually be mapped to SNOMED's 42538580 Polymorphic lymphoproliferative disorder following transplant. And this now falls outside the scope of the ICDO3 vocabulary.

So, it's a very limited use-case.

mgurley commented 4 years ago

@ekorchmar ICDO3 is not only useful for ingesting cancer diagnoses from tumor registries. ICDO3 tracks the new morphology codes and terms from the 4th series of the WHO Classification of Tumours (Blue Books). See here:

http://www.iacr.com.fr/index.php?option=com_content&view=article&id=149:icd-o-3-2&catid=80&Itemid=545

So ICDO3 is the only machine-readable vocabulary tracking the latest practice of how working pathologists declare cancer diagnoses on pathology reports in the real world. If you go back to my prior comment https://github.com/OHDSI/OncologyWG/issues/84#issuecomment-570793475, you will see that my use case covers extracting cancer diagnoses from pathology reports via NLP-aided curation. So no discrete ICDO3 codes. Just text from pathology reports. Hence, the NLP/human curation wants to be able to pick the right subtype concepts. We need to think beyond tumor registries.

I think we should bring this issue to the Oncology CDM/Vocabulary subgroup.

cgreich commented 4 years ago

I think we have a similar situation with MedDRA. The LLTs are sometimes synonyms of the PT, and sometimes descendants. Which is why we essentially ignore them (map the PT).

Unless they are really, really important, I would put that on ice for the moment. Create an issue and downprioritize them.

ekorchmar commented 4 years ago

Here is an example from the current ICDO3 release.

Code    Level      Term
9945/3  Preferred  Chronic myelomonocytic leukemia, NOS
9945/3  Related    Chronic myelomonocytic leukemia, type I
9945/3  Related    Chronic myelomonocytic leukemia, type II

Type I and Type II are clinically distinct and may even have different treatments (chapter Types of CMML). We can't just use them as synonyms of the same concept, even if ICDO does not provide separate codes for them.
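
The ambiguity can be shown with a short sketch that indexes the three rows above by code. The function names and the nested dictionary layout are assumptions for illustration only.

```python
ROWS = [
    ("9945/3", "Preferred", "Chronic myelomonocytic leukemia, NOS"),
    ("9945/3", "Related", "Chronic myelomonocytic leukemia, type I"),
    ("9945/3", "Related", "Chronic myelomonocytic leukemia, type II"),
]

def terms_by_code(rows):
    """Index terms as {code: {level: [terms]}}."""
    index = {}
    for code, level, term in rows:
        index.setdefault(code, {}).setdefault(level, []).append(term)
    return index

def codes_with_related_terms(rows):
    """Return codes that carry at least one Related term."""
    index = terms_by_code(rows)
    return sorted(code for code, levels in index.items() if levels.get("Related"))

index = terms_by_code(ROWS)
print(index["9945/3"]["Preferred"])
print(index["9945/3"]["Related"])
print(codes_with_related_terms(ROWS))
```

A record that carries only the code 9945/3 could correspond to any of the three terms, so the code alone cannot say whether type I or type II was meant.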

cgreich commented 4 years ago

Michael:

In principle, you are right. But:

  1. These variants are different, as Edik explained, but they have the same ICD-O-3 code. So, if your data only contains codes you wouldn't know the difference. If it contains the lexical variant we could distinguish. What does it have?
  2. If the latter, I would still suggest we don't do that in V1. Put it in the backlog. Emphasis on "back".
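
The distinction in point 1 can be sketched as a lookup that falls back to the Preferred term when only a code is available. This is an illustration under simple assumptions, not a proposed implementation: the function name is hypothetical, matching is a case-insensitive exact comparison, and the rows are the 9945/3 example quoted earlier in the thread.

```python
ROWS = [
    ("9945/3", "Preferred", "Chronic myelomonocytic leukemia, NOS"),
    ("9945/3", "Related", "Chronic myelomonocytic leukemia, type I"),
    ("9945/3", "Related", "Chronic myelomonocytic leukemia, type II"),
]

def resolve_term(code, source_text, rows):
    """Pick the (level, term) that best describes a source record.

    With only a code, the Preferred term is the best available answer.
    With the lexical variant from the source, an exact match on a
    Related term can be returned instead.
    """
    candidates = [(level, term) for c, level, term in rows if c == code]
    if source_text:
        wanted = source_text.strip().casefold()
        for level, term in candidates:
            if term.casefold() == wanted:
                return level, term
    for level, term in candidates:
        if level == "Preferred":
            return level, term
    return None  # unknown code

# Code only: the subtype cannot be recovered.
print(resolve_term("9945/3", None, ROWS))
# Code plus source text: the Related term can be selected.
print(resolve_term("9945/3", "chronic myelomonocytic leukemia, type II", ROWS))
```

Real pathology text would need far more tolerant matching than an exact string comparison; the sketch only shows where the extra information has to come from.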