OHDSI / Vocabulary-v5.0

Build process for the OHDSI Standardized Vocabularies. Currently not available as independent release.
The Unlicense

Duplicate Standards #202

Closed ericaVoss closed 1 year ago

ericaVoss commented 5 years ago

For Vocabulary v5.0 18-JAN-19

I think I'm seeing duplicate standard codes with the same name.

SELECT CONCEPT_NAME, DOMAIN_ID, COUNT(*)
FROM CONCEPT
WHERE STANDARD_CONCEPT = 'S'
GROUP BY CONCEPT_NAME, DOMAIN_ID
HAVING COUNT(*) > 1

107826 rows return.

Specific example:

SELECT *
FROM CONCEPT
WHERE STANDARD_CONCEPT = 'S'
AND CONCEPT_NAME = 'Epirubicin'

Or is there some way for me to know which one to choose?

aostropolets commented 5 years ago

Right, those shouldn't exist. Thanks for noticing (although some of the hits are "normal" concepts, such as concept.vocabulary_id, which belongs to different CDM versions as represented in concept_code). For the Drug domain, please feel free to choose RxNorm/RxE only. For duplicates inside these vocabularies, take min(concept_id). For other domains, just pick a random one. Once it's fixed, you'll get either a valid concept or a replacement mapping that you can follow.
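The interim rule described above can be sketched roughly as follows. This is only an illustration, not the project's own fix: it assumes "RxE" corresponds to the vocabulary_id 'RxNorm Extension', it uses min(concept_id) as a deterministic stand-in for "pick a random one" in non-drug domains, and the table rows (including the concept IDs and 'HypotheticalVocab') are made up so the query can run against an in-memory SQLite database.

```python
import sqlite3


def make_toy_concept_table():
    """Build an in-memory CONCEPT table with made-up rows (not real OMOP data)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE concept (
            concept_id INTEGER PRIMARY KEY,
            concept_name TEXT,
            domain_id TEXT,
            vocabulary_id TEXT,
            standard_concept TEXT
        )
        """
    )
    conn.executemany(
        "INSERT INTO concept VALUES (?, ?, ?, ?, ?)",
        [
            # Hypothetical rows: IDs and 'HypotheticalVocab' are invented.
            (1, "Epirubicin", "Drug", "HypotheticalVocab", "S"),
            (2, "Epirubicin", "Drug", "RxNorm Extension", "S"),
            (3, "Epirubicin", "Drug", "RxNorm", "S"),
            (4, "Example name", "Observation", "HypotheticalVocab", "S"),
            (5, "Example name", "Observation", "HypotheticalVocab", "S"),
        ],
    )
    return conn


def pick_one_per_name(conn):
    """Return one standard concept_id per (concept_name, domain_id).

    Drug domain: only RxNorm / RxNorm Extension rows are considered.
    Every domain: the lowest concept_id among the remaining rows wins.
    Note: a drug name that exists only outside RxNorm/RxNorm Extension
    is dropped by this filter.
    """
    sql = """
        SELECT concept_name, domain_id, MIN(concept_id) AS chosen_concept_id
        FROM concept
        WHERE standard_concept = 'S'
          AND (domain_id <> 'Drug'
               OR vocabulary_id IN ('RxNorm', 'RxNorm Extension'))
        GROUP BY concept_name, domain_id
        ORDER BY concept_name, domain_id
    """
    return conn.execute(sql).fetchall()


if __name__ == "__main__":
    for row in pick_one_per_name(make_toy_concept_table()):
        print(row)
```

On a real vocabulary download, only the SELECT statement would be needed; the Python wrapper is there to make the sketch self-contained.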

cgreich commented 5 years ago

Looks like we have a QA problem. Let's take a look.

dimshitc commented 5 years ago

There's also a problem with PPI concepts. Their answer concepts are unique for every question (I mean "Are_you_smoking_Yes" and "Are_you_drinking_Yes" are different concepts), but they have the same name in the source. And if such answers don't exist in OMOP, we make all of them standard.

dimshitc commented 5 years ago

Now it's 34658 duplicates. We fixed this GRR thing, and the PPI fix is upcoming. I notice that we have duplicates in the 'Geography' domain, for example: select * from concept where domain_id = 'Geography' and concept_name = 'Centro'; there are 308 Centro. @Alexdavv, do they really have 308 cities/towns called Centro?

Alexdavv commented 5 years ago

@Alexdavv, do they really have 308 cities/towns called Centro?

They are districts/suburbs of different cities, each with a different hierarchy and geographic location. Even the names of the cities/towns are regularly repeated. To exclude the duplicates, we implemented logic that considers both geometry and hierarchy, so there is no way to use the SQL mentioned above to search for Geo duplicates. We thought about modifying the concept names, but the current approach was considered the optimal one.

cgreich commented 5 years ago

Can you show examples, @Alexdavv?

Alexdavv commented 1 year ago

Can you show examples, @Alexdavv?

https://athena.ohdsi.org/search-terms/terms?standardConcept=Standard&page=1&pageSize=15&query=Centro

Currently, we have a QA check that makes sure we don't introduce concepts with the same names. In the OSM, NAACCR, and NCD vocabularies and some other places, same-name concepts are expected. If you find something relevant in the future, please report it.
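A check along these lines might look like the sketch below. It is not the project's actual QA script: it reuses the duplicate query from the top of this thread and skips vocabularies where repeated names are expected. The exact vocabulary_id strings ('OSM', 'NAACCR', 'NCD') are taken from the comment above and are an assumption about how they are spelled in the CONCEPT table; the function names and demo rows are hypothetical.

```python
import sqlite3

# Assumed vocabulary_id values where same-name standard concepts are expected.
EXPECTED_DUPLICATE_VOCABS = ("OSM", "NAACCR", "NCD")


def build_demo_concepts():
    """Create an in-memory CONCEPT table with invented rows for the demo."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE concept (concept_id INTEGER PRIMARY KEY, "
        "concept_name TEXT, domain_id TEXT, vocabulary_id TEXT, "
        "standard_concept TEXT)"
    )
    conn.executemany(
        "INSERT INTO concept VALUES (?, ?, ?, ?, ?)",
        [
            # Hypothetical rows; concept IDs are not real.
            (1, "Centro", "Geography", "OSM", "S"),
            (2, "Centro", "Geography", "OSM", "S"),
            (3, "Epirubicin", "Drug", "RxNorm", "S"),
            (4, "Epirubicin", "Drug", "RxNorm Extension", "S"),
            (5, "Doxorubicin", "Drug", "RxNorm", "S"),
            (6, "Epirubicin", "Drug", "RxNorm", None),  # non-standard, ignored
        ],
    )
    return conn


def find_unexpected_duplicates(conn, allowed=EXPECTED_DUPLICATE_VOCABS):
    """List (name, domain, count) for duplicated standard names outside `allowed`."""
    # Only '?' markers are interpolated; the values are bound as parameters.
    placeholders = ", ".join("?" for _ in allowed)
    sql = f"""
        SELECT concept_name, domain_id, COUNT(*) AS n
        FROM concept
        WHERE standard_concept = 'S'
          AND vocabulary_id NOT IN ({placeholders})
        GROUP BY concept_name, domain_id
        HAVING COUNT(*) > 1
        ORDER BY concept_name, domain_id
    """
    return conn.execute(sql, tuple(allowed)).fetchall()


if __name__ == "__main__":
    for row in find_unexpected_duplicates(build_demo_concepts()):
        print(row)
```

Any row returned by such a query would be a candidate to report, subject to the other expected cases mentioned above.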