darwin-eu / CodelistGenerator

Identifying relevant concepts from the OMOP CDM vocabularies
https://darwin-eu.github.io/CodelistGenerator/

get_candidate_codes with search_synonyms = TRUE giving unrelated concepts #27

Closed daniellenewby closed 2 years ago

daniellenewby commented 2 years ago

Describe the bug Running the get_candidate_codes function in CodelistGenerator with the keyword "neoplasm of liver" returns concepts related to the uterus when search_synonyms = TRUE.

To Reproduce Run the code below against local Athena vocabularies.

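library(tictoc)             # provides tic() and toc() for timing
library(CodelistGenerator)

# db and vocabulary_database_schema must already be defined:
# they point at the database holding the local Athena vocabularies.
Sys.time()

tic()
liver_codes2 <- get_candidate_codes(
  keywords = "neoplasm of liver",
  domains = "Condition",
  search_synonyms = TRUE,   # the unexpected concepts appear only when TRUE
  fuzzy_match = FALSE,
  exclude = c("risk",
              "fear",
              "benign",
              "screening",
              "suspected",
              "secondary",
              "in situ"),
  include_descendants = TRUE,
  include_ancestor = FALSE,
  db = db,
  vocabulary_database_schema = vocabulary_database_schema
)
toc()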

Expected behavior These uterine codes should not be returned, as they are not related to neoplasm of liver.

Screenshots See the attached output from the code above: liver_cancer_110722.csv

Codes not expected in search
196181 Tumor of uterine body - baby delivered
197049 Tumor of uterine body - baby delivered with postpartum complication
192385 Tumor of uterine body complicating antenatal care, baby not yet delivered
4147336 Tumor of uterine body complicating postpartum care - baby delivered during previous episode of care

Additional context This might be an issue with the concept hierarchy, not necessarily with the package itself.

edward-burn commented 2 years ago

Thanks @daniellenewby for your nicely reproducible example.

Looking at this I see that one of the synonyms found is "tumor of liver", and this in turn matches "Tumor of uterine body - baby delivered", because "liver" appears inside "delivered" (de-liver-ed). So I guess this is not a bug, but it is a nice example of the importance of screening the codes returned.
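The match described above can be reproduced in base R. This is a minimal sketch that assumes the search checks that every word of the keyword appears somewhere in the concept name. That is an assumption for illustration and not the package's confirmed logic. The helper matches_all_words and its whole_words argument are hypothetical names.

```r
# Hypothetical helper, not part of CodelistGenerator.
# Returns TRUE when every word of `keyword` is found in `concept_name`.
matches_all_words <- function(concept_name, keyword, whole_words = FALSE) {
  words <- strsplit(tolower(keyword), "\\s+")[[1]]
  name <- tolower(concept_name)
  all(vapply(words, function(w) {
    if (whole_words) {
      # \\b anchors the word at word boundaries, so "liver" no longer
      # matches inside "delivered"
      grepl(paste0("\\b", w, "\\b"), name, perl = TRUE)
    } else {
      # plain substring search
      grepl(w, name, fixed = TRUE)
    }
  }, logical(1)))
}

uterine <- "Tumor of uterine body - baby delivered"

matches_all_words(uterine, "tumor of liver")                      # substring search
matches_all_words(uterine, "tumor of liver", whole_words = TRUE)  # whole-word search
```

With substring matching, "tumor", "of" and "liver" are all found in the uterine concept name. With whole-word matching the concept is rejected, at the cost of also missing legitimate partial matches, so reviewing the returned codes is still needed either way.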

In this case, you could also add "baby delivered" to exclude.
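As a rough way to check what an exclusion term would remove before re-running the search, the returned codes can be screened in base R. The sketch below uses the four concepts reported in this issue. The helper exclude_by_phrase is hypothetical, and it treats each exclusion as an exact, case-insensitive phrase, which may differ from how the package's exclude argument behaves.

```r
# The four unexpected concepts reported in this issue
candidate_codes <- data.frame(
  concept_id = c(196181, 197049, 192385, 4147336),
  concept_name = c(
    "Tumor of uterine body - baby delivered",
    "Tumor of uterine body - baby delivered with postpartum complication",
    "Tumor of uterine body complicating antenatal care, baby not yet delivered",
    "Tumor of uterine body complicating postpartum care - baby delivered during previous episode of care"
  ),
  stringsAsFactors = FALSE
)

# Hypothetical helper: drop rows whose name contains any exclusion phrase
exclude_by_phrase <- function(codes, phrases) {
  name <- tolower(codes$concept_name)
  excluded <- rep(FALSE, nrow(codes))
  for (p in tolower(phrases)) {
    excluded <- excluded | grepl(p, name, fixed = TRUE)
  }
  codes[!excluded, , drop = FALSE]
}

left_after_phrase <- exclude_by_phrase(candidate_codes, "baby delivered")
left_after_word <- exclude_by_phrase(candidate_codes, "delivered")
```

Under this exact-phrase assumption, "baby delivered" leaves concept 192385 in place because its name reads "baby not yet delivered", whereas the single word "delivered" removes all four.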

daniprietoalhambra commented 2 years ago

I found a similar issue when using the CodeGenerator Shiny app: searching for 'multiple myeloma' with the exclusion 'mgus', and with the option to use synonyms ticked (no fuzzy matching), gave me tons (about 800) of unrelated/irrelevant concepts, most of them related to genetic abnormalities or parasitic infections. Some examples here:

Concept ID | Concept name
-- | --
4021470 | Disease due to Trypanosomatidae
4022818 | Disease due to Schistosomatidae
4035147 | Autosomal dominant hypophosphatemic bone disease
4035149 | Autosomal recessive hypophosphatemic bone disease
4109022 | Autosomal dominant polycystic kidney disease in childhood
37396747 | Autosomal dominant late onset Parkinson disease
37397000 | Autosomal dominant Charcot-Marie-Tooth disease type 2B
37397001 | Autosomal dominant Charcot-Marie-Tooth disease type 2C
37397002 | Autosomal dominant Charcot-Marie-Tooth disease type 2D
37397003 | Autosomal dominant Charcot-Marie-Tooth disease type 2E
37397004 | Autosomal dominant Charcot-Marie-Tooth disease type 2I
37397005 | Autosomal dominant Charcot-Marie-Tooth disease type 2J
37397007 | Autosomal dominant Charcot-Marie-Tooth disease type 2A1
36714330 | Autosomal dominant Charcot-Marie-Tooth disease type 2F
36714331 | Autosomal dominant Charcot-Marie-Tooth disease type 2G
36717726 | Autosomal dominant Charcot-Marie-Tooth disease type 2K
36717089 | Autosomal dominant Charcot-Marie-Tooth disease type 2L
36714332 | Autosomal dominant Charcot-Marie-Tooth disease type 2M
36714333 | Autosomal dominant Charcot-Marie-Tooth disease type 2N
36717210 | Autosomal dominant intermediate Charcot-Marie-Tooth disease type E

I've attached the full downloadable CSV here in case it is useful: CandidateCodes (1).csv

daniprietoalhambra commented 2 years ago

I don't know what you have done, but I have tried again with the Shiny app https://dpa-pde-oxford.shinyapps.io/OmopCodelistGeneratorConditions/ and I no longer get the same problem. I guess we can close this issue, @edward-burn?

edward-burn commented 2 years ago

Great! I updated the app to align with the latest version of the R package, so that's good news.