How to find the ones we did not map correctly?

PRijnbeek commented 4 years ago

Since we start from the standard concepts we do not see if we made errors in mapping to the wrong drug (with another ingredient).

We can of course string search the source to concept table, but that will not be easy to automate, I think?

Any ideas how we could best tackle that, if at all possible?

anthonysena commented 4 years ago

Thinking about this a bit and have a few thoughts. This idea is related to https://github.com/anthonysena/DrugUtilization/issues/6#issuecomment-571859563 and #9 so just trying to bring these conversations together.

For me, there are 2 scenarios in this area:

1. Source codes that went unmapped and are orphaned (drug_concept_id == 0)
2. Source codes that are mapped to the wrong drug

For 1, I created something to try and do this:

https://github.com/anthonysena/DrugUtilization/blob/master/inst/sql/sql_server/archive/create_drug_vocab_exploration.sql#L82

It is basic but gives us a start. Given a list of ingredients, it will build a wildcard search using the ingredient name (e.g. '%metformin%') and use that to find all concept entries with a mention of that name. I then use that list of concepts to try and identify orphan codes (i.e. mapped to drug_concept_id == 0). It is limited to those drugs that directly mention the ingredient name, so perhaps this could be expanded to use the concept_synonym table as well.
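A minimal sketch of that wildcard idea, written in Python against an in-memory SQLite database rather than the linked SQL Server script. The table layouts are cut-down stand-ins for the OMOP CDM tables, the rows are invented, and `find_orphans` is a hypothetical helper name; it is an illustration of the approach, not the code in the repository.

```python
import sqlite3

# Toy stand-ins for the OMOP CDM tables; all rows are made up for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE concept (concept_id INTEGER, concept_name TEXT);
CREATE TABLE drug_exposure (
    drug_exposure_id INTEGER,
    drug_concept_id INTEGER,
    drug_source_concept_id INTEGER,
    drug_source_value TEXT
);
INSERT INTO concept VALUES
    (101, 'Metformin 500 MG Oral Tablet'),
    (102, 'metformin hydrochloride 850 mg'),
    (103, 'Aspirin 100 MG Oral Tablet');
INSERT INTO drug_exposure VALUES
    (1, 101, 101, 'MET500'),
    (2, 0,   102, 'MET850'),
    (3, 0,   103, 'ASA100');
""")


def find_orphans(conn, ingredient_names):
    """Return unmapped drug_exposure rows whose source concept name mentions an ingredient."""
    orphans = []
    for name in ingredient_names:
        pattern = f"%{name.lower()}%"  # e.g. '%metformin%'
        rows = conn.execute(
            """
            SELECT de.drug_exposure_id, de.drug_source_value, c.concept_name
            FROM drug_exposure de
            JOIN concept c ON c.concept_id = de.drug_source_concept_id
            WHERE de.drug_concept_id = 0
              AND LOWER(c.concept_name) LIKE ?
            ORDER BY de.drug_exposure_id
            """,
            (pattern,),
        ).fetchall()
        orphans.extend(rows)
    return orphans


orphans = find_orphans(conn, ["metformin"])
print(orphans)
```

On the toy data only record 2 is reported: record 1 is mapped, and record 3 is unmapped but does not mention the ingredient.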

For 2, I'm unsure how this would be detected post ETL - I think we'd wind up trying to do some type of term similarity as is done in Usagi. So, if the ETL'ers are using Usagi to do the mapping, my hope would be that they get this right at ETL design time.

We can try things like this out here, but it feels a bit out of scope for where I would like to see this package go, which is more standardized analytics for drug utilization studies. I understand the criticality of getting the mappings correct but think that maybe this should exist as its own set of functions (maybe in Usagi?).

cgreich commented 4 years ago

@anthonysena:

to 1) Wait. If it is not mapped all you have to do is to search for where drug_concept_id=0. Because the fact that there is a record means there is something. Then you can list the Source Values.
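As a rough sketch of that suggestion, the query below lists the source values of unmapped records with their record counts. It runs against an in-memory SQLite table with invented rows; the concept id 9001 is a placeholder, not a real vocabulary id.

```python
import sqlite3

# Toy drug_exposure table; rows and the concept id 9001 are made up.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE drug_exposure (
    drug_exposure_id INTEGER,
    drug_concept_id INTEGER,
    drug_source_value TEXT
);
INSERT INTO drug_exposure VALUES
    (1, 0,    'PARACET 500'),
    (2, 0,    'PARACET 500'),
    (3, 0,    'METF 850'),
    (4, 9001, 'METFORMIN 500');
""")

# Every record with drug_concept_id = 0 is unmapped; group by source value
# so the most frequent unmapped codes come first.
unmapped = conn.execute(
    """
    SELECT drug_source_value, COUNT(*) AS n_records
    FROM drug_exposure
    WHERE drug_concept_id = 0
    GROUP BY drug_source_value
    ORDER BY n_records DESC, drug_source_value
    """
).fetchall()

total_records = conn.execute("SELECT COUNT(*) FROM drug_exposure").fetchone()[0]
n_unmapped = sum(n for _, n in unmapped)
print(unmapped, f"{n_unmapped}/{total_records} records unmapped")
```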

Your search would work for 2), though, and you would pull up records where those ingredients aren't in both the Source Value and Drug Concept. But the problem is different languages (Acetaminophen in German is called Parazetamol, in Dutch Paracetamol) and abbreviations, which are chronic in the Source Values. What will help with the former is including in your search all Concept Synonyms of that ingredient, and the Concept Names and Synonyms of all the Source Concepts that are mapped to this Ingredient. Happy to help with the query. There is no remedy against abbreviations.
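One way this could look, as a sketch only: collect the ingredient's name and its Concept Synonyms as search terms, then flag mapped records whose Source Value mentions a term but whose drug concept does not descend from that ingredient. The tables are simplified SQLite stand-ins for concept, concept_synonym, concept_ancestor and drug_exposure, every id and row is invented, and `suspect_mappings` is a hypothetical helper. The names of Source Concepts mapped to the ingredient are left out to keep it short.

```python
import sqlite3

# Toy vocabulary and exposure tables; all ids and rows are made up.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE concept (concept_id INTEGER, concept_name TEXT);
CREATE TABLE concept_synonym (concept_id INTEGER, concept_synonym_name TEXT);
CREATE TABLE concept_ancestor (ancestor_concept_id INTEGER, descendant_concept_id INTEGER);
CREATE TABLE drug_exposure (
    drug_exposure_id INTEGER,
    drug_concept_id INTEGER,
    drug_source_value TEXT
);
INSERT INTO concept VALUES
    (1, 'Acetaminophen'), (11, 'Acetaminophen 500 MG Oral Tablet'),
    (2, 'Ibuprofen'),     (21, 'Ibuprofen 200 MG Oral Tablet');
INSERT INTO concept_synonym VALUES (1, 'Paracetamol');
INSERT INTO concept_ancestor VALUES (1, 1), (1, 11), (2, 2), (2, 21);
INSERT INTO drug_exposure VALUES
    (1, 11, 'Paracetamol 500mg tablet'),
    (2, 21, 'PARACETAMOL 500 MG'),
    (3, 21, 'Ibuprofen 200mg'),
    (4, 21, 'PCM 500');
""")


def suspect_mappings(conn, ingredient_concept_id):
    """Flag mapped records that mention the ingredient but map outside its descendants."""
    # Search terms: the ingredient's concept name plus all of its synonyms.
    terms = [
        row[0]
        for row in conn.execute(
            """
            SELECT concept_name FROM concept WHERE concept_id = ?
            UNION
            SELECT concept_synonym_name FROM concept_synonym WHERE concept_id = ?
            """,
            (ingredient_concept_id, ingredient_concept_id),
        )
    ]
    suspects = set()
    for term in terms:
        rows = conn.execute(
            """
            SELECT drug_exposure_id, drug_source_value, drug_concept_id
            FROM drug_exposure
            WHERE LOWER(drug_source_value) LIKE ?
              AND drug_concept_id <> 0
              AND drug_concept_id NOT IN (
                  SELECT descendant_concept_id
                  FROM concept_ancestor
                  WHERE ancestor_concept_id = ?
              )
            """,
            (f"%{term.lower()}%", ingredient_concept_id),
        )
        suspects.update(rows)
    return sorted(suspects)


suspects = suspect_mappings(conn, 1)
print(suspects)
```

In the toy data, record 2 is flagged because its Source Value says paracetamol while it is mapped to an ibuprofen product. Record 4 carries only an abbreviation ('PCM 500'), so no term matches and it slips through.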

anthonysena commented 4 years ago

> to 1) Wait. If it is not mapped all you have to do is to search for where drug_concept_id=0. Because the fact that there is a record means there is something. Then you can list the Source Values.

OK - happy to try and do it that way. It is simpler for sure and also has the benefit to see the impact on the overall drug_exposure table.

PRijnbeek commented 4 years ago

Wait :)! Drug_concept_id=0 yes, but then you get all of them, not only those in the list of interest.

Valuable still since we check all in QC of course but not ideal if you like to focus.

We tried just looking at our own mapping to zero for only those of interest (using string search) and this was informative and was reassuring.

anthonysena commented 4 years ago

> Drug_concept_id=0 yes but then you get all not only those in the list of interest.

Agreed that this is a balancing act for sure. We can try the string search approach to start. The way I understood @cgreich's comment is: if we don't review everything, we can't determine what has been missed in the mapping. The string search approach gives us focus but we can still miss things.

Maybe we try the string search approach with consideration of synonyms to start to just see the potential impact?

cgreich commented 4 years ago

I see what you are doing: Looking for false negatives.

The other thing you may want to do: Look for those strings in the DEVICE table. The manual mappers probably won't make that mistake, but the Vocab Team might: Everything that is a drug will be mapped, and the rest will be a device. They might misclassify.
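A small sketch of that check, assuming the device records live in a device_exposure table with a device_source_value column as in the OMOP CDM. The rows and concept ids are invented, and `drugs_in_device_table` is a hypothetical helper name.

```python
import sqlite3

# Toy device_exposure table; rows and concept ids are made up.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE device_exposure (
    device_exposure_id INTEGER,
    device_concept_id INTEGER,
    device_source_value TEXT
);
INSERT INTO device_exposure VALUES
    (1, 0,    'Metformin 500mg tab'),
    (2, 5001, 'Insulin syringe 1ml'),
    (3, 5002, 'METFORMIN HCL 850');
""")


def drugs_in_device_table(conn, terms):
    """Return device records whose source value mentions any of the drug terms."""
    hits = set()
    for term in terms:
        rows = conn.execute(
            """
            SELECT device_exposure_id, device_source_value
            FROM device_exposure
            WHERE LOWER(device_source_value) LIKE ?
            """,
            (f"%{term.lower()}%",),
        )
        hits.update(rows)
    return sorted(hits)


device_hits = drugs_in_device_table(conn, ["metformin"])
print(device_hits)
```

The same list of terms used for the drug search (ingredient names plus synonyms) could be passed in here, so both tables are checked with one vocabulary.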

anthonysena / DrugExposureCharacterization

How to find the ones we did not map correctly? #10