JULIELab / trec-pm

Support code and resources for participation at the TREC Precision Medicine Track (TREC-PM)
http://trec-cds.appspot.com
MIT License

MML Treatment File #4

Closed michelole closed 5 years ago

michelole commented 5 years ago

Parse the MML treatment file and add it to the ES index.

The format is tab-separated: PMID, UMLS_CUI, preferred_term, e.g.

10  C0012258    digitoxin

Each PMID links to 0:n UMLS CUIs. UMLS CUIs are not unique, even in a single abstract.
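A minimal parsing sketch for the format above, assuming tab-separated columns (which is what the `cut -f 3` call further down in this thread suggests). The function name `parse_treatments` is made up for illustration, and the CUI `C0000001` in the sample is a placeholder, not a real annotation. Duplicate CUIs within one abstract are collapsed, and PMIDs without any CUI are simply absent from the result.

```python
from collections import defaultdict


def parse_treatments(lines):
    """Map each PMID to {CUI: preferred_term}, dropping duplicate CUIs per abstract."""
    treatments = defaultdict(dict)
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) != 3:
            continue  # skip blank or malformed rows
        pmid, cui, term = (p.strip() for p in parts)
        # The same CUI may occur several times in one abstract; keep the first term seen.
        treatments[pmid].setdefault(cui, term)
    return dict(treatments)


# Illustrative rows only: the first is the example from this issue,
# C0000001 is a placeholder CUI.
sample = [
    "10\tC0012258\tdigitoxin\n",
    "10\tC0012258\tdigitoxin\n",
    "10\tC0000001\tinsulin\n",
    "\n",
]
parsed = parse_treatments(sample)
print(parsed["10"])          # annotations for PMID 10
print(parsed.get("11", {}))  # a PMID with zero CUIs
```

Looking up with `.get(pmid, {})` covers the 0:n case without special handling.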

michelole commented 5 years ago

I did some quick analysis on the quality (using quantity as a proxy) of MetaMap annotations.

$ find . -type f | head -n 100 | xargs cut -f 3 | sort | uniq -c | sort -rn | head -20
34258 various
21546 combinations
9516 enzymes
7712 insulin
6796 calcium
6748 level
6088 drug
4398 plasma
4176 iron
3766 mediated
3658 animals
3602 aim
3568 water
2909 liver
2539 drugs
2379 basis
2068 duration
2064 lead
2048 potassium
1971 zinc

It seems we need to filter out some:

Maybe we should redo this analysis looking at top-k docs retrieved by our system.
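A rough sketch of how that re-analysis could look: count preferred terms only for a given set of PMIDs (e.g. the top-k docs for a query) and optionally drop terms from a stop list. The function name `term_counts`, the `stop_terms` parameter, and the sample rows are assumptions for illustration; which terms actually belong on a stop list is still open.

```python
from collections import Counter


def term_counts(rows, pmids=None, stop_terms=frozenset()):
    """Count preferred terms, optionally restricted to a set of PMIDs."""
    counts = Counter()
    for pmid, _cui, term in rows:
        if pmids is not None and pmid not in pmids:
            continue  # not among the retrieved documents
        if term in stop_terms:
            continue  # generic term we decided to ignore
        counts[term] += 1
    return counts


# Illustrative rows (PMID, CUI, term); C0000001 and C0000002 are placeholder CUIs.
rows = [
    ("10", "C0012258", "digitoxin"),
    ("10", "C0000001", "various"),
    ("11", "C0000001", "various"),
    ("12", "C0000002", "insulin"),
]

all_counts = term_counts(rows)
top = term_counts(rows, pmids={"10", "11"}, stop_terms={"various"}).most_common(20)
print(top)
```

With `pmids=None` this is the same count as the shell pipeline above, just without the `head -n 100` file limit.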

michelole commented 5 years ago

This may not be trivial and may need some form of parallelization. Shall we do it in UIMA, @khituras ?

Some stats to aid in the decision process:

Alternatively, we could do it at query time, but that would need some extra RAM/CPU/time to load the file into memory.
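For the index-time route, here is a hedged sketch that streams rows and emits partial-update actions as newline-delimited JSON in the shape the Elasticsearch bulk API expects (one action line, then one `doc` line). It assumes rows arrive grouped by PMID, that the PMID is the document `_id`, and that the index is called `pubmed` with a `treatments` field; none of these are confirmed in this thread. It only builds the payload and does not talk to a cluster.

```python
import json
from itertools import groupby


def bulk_update_lines(rows, index="pubmed", field="treatments"):
    """Yield bulk-API lines that attach treatments to documents, one PMID at a time.

    Assumes rows are (pmid, cui, term) tuples already grouped by PMID;
    index and field names are hypothetical.
    """
    for pmid, group in groupby(rows, key=lambda r: r[0]):
        seen = {}
        for _pmid, cui, term in group:
            seen.setdefault(cui, term)  # drop duplicate CUIs within an abstract
        yield json.dumps({"update": {"_index": index, "_id": pmid}})
        yield json.dumps(
            {"doc": {field: [{"cui": c, "term": t} for c, t in seen.items()]}}
        )


# Illustrative rows; C0000001 is a placeholder CUI.
rows = [
    ("10", "C0012258", "digitoxin"),
    ("10", "C0012258", "digitoxin"),
    ("11", "C0000001", "insulin"),
]
lines = list(bulk_update_lines(rows))
payload = "\n".join(lines) + "\n"  # a bulk request body must end with a newline
print(payload)
```

Because only one PMID group is held in memory at a time, this avoids loading the whole file, and the input could be split by file across workers if parallelization turns out to be needed.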

khituras commented 5 years ago

Sorry, my head is not up to date here. What exactly would you like to do that is hard to do? The numbers don't look too shocking to me at first sight.

michelole commented 5 years ago

Treatments are in the index, closing this now.