michelole closed this issue 5 years ago
I did some quick analysis on the quality (using quantity as a proxy) of MetaMap annotations.
$ find . -type f | head -n 100 | xargs cut -f 3 | sort | uniq -c | sort -rn | head -20
34258 various
21546 combinations
9516 enzymes
7712 insulin
6796 calcium
6748 level
6088 drug
4398 plasma
4176 iron
3766 mediated
3658 animals
3602 aim
3568 water
2909 liver
2539 drugs
2379 basis
2068 duration
2064 lead
2048 potassium
1971 zinc
It seems we need to filter out some of these terms.
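A minimal sketch of such a filter, assuming the annotation files are tab-separated with the preferred term in the third column (as the `cut -f 3` call above implies). The blocklist `GENERIC_TERMS` and both function names are hypothetical; which terms count as too generic is a judgment call that the thread does not settle.

```python
from collections import Counter

# Hypothetical blocklist: a few of the generic-looking terms from the counts above.
GENERIC_TERMS = {"various", "combinations", "level", "aim", "basis", "duration", "mediated"}


def count_terms(lines):
    """Count the third tab-separated field (preferred term) of each line."""
    counts = Counter()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 3:  # skip malformed lines
            counts[fields[2]] += 1
    return counts


def drop_generic(counts, blocklist=GENERIC_TERMS):
    """Return a new Counter without the blocklisted terms (case-insensitive)."""
    return Counter({term: n for term, n in counts.items() if term.lower() not in blocklist})
```

Calling `drop_generic(count_terms(fh)).most_common(20)` on an open file handle `fh` would give the same kind of top-20 view as the shell pipeline, minus the blocklisted terms.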
Maybe we should redo this analysis looking at top-k docs retrieved by our system.
This may not be trivial and may need some form of parallelization. Shall we do it in UIMA, @khituras?
Some stats to aid in the decision process:
Optionally, we could do it at query time, but that would need some extra RAM/CPU/time to load the file into memory.
Sorry, my head is not up to date here, what exactly would you like to do that is hard to do? The numbers don't look too shocking to me at first sight.
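For a sense of what the query-time option involves, here is a rough sketch that loads the annotations into an in-memory lookup. It assumes tab-separated lines with the PMID in the first column and the UMLS CUI in the second; `load_annotations` and `cuis_for` are hypothetical names, not part of our code base.

```python
from collections import defaultdict


def load_annotations(lines):
    """Map each PMID to the set of UMLS CUIs annotated for it."""
    by_pmid = defaultdict(set)
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:  # skip malformed lines
            continue
        pmid, cui = fields[0], fields[1]
        by_pmid[pmid].add(cui)  # a set drops CUIs repeated within one abstract
    return dict(by_pmid)


def cuis_for(by_pmid, pmid):
    """Look up the CUIs for one retrieved document; empty if it has none."""
    return by_pmid.get(pmid, frozenset())
```

The whole mapping stays resident for the lifetime of the process, which is where the extra RAM and start-up time would come from.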
Treatments are in the index, closing this now.
Parse MML treatment file and add it to ES index.
Format is: PMID UMLS_CUI preferred_term
Each PMID links to 0:n UMLS CUIs. UMLS CUIs are not unique, even in a single abstract.
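One way this could look, as a sketch only: group the rows by PMID, collapse repeated CUIs, and emit newline-delimited JSON for the Elasticsearch bulk API as partial-document updates. The index name `abstracts`, the field name `treatments`, the use of the PMID as document `_id`, and the tab separator are all assumptions, and older Elasticsearch versions may also expect a `_type` in the action line.

```python
import json


def group_treatments(lines):
    """Group rows by PMID; the inner dict keeps one preferred term per CUI."""
    grouped = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:  # skip malformed lines
            continue
        pmid, cui, term = fields[:3]
        grouped.setdefault(pmid, {})[cui] = term  # repeated CUIs collapse here
    return grouped


def bulk_lines(grouped, index="abstracts", field="treatments"):
    """Yield action/document line pairs for partial updates via the bulk API."""
    for pmid, cuis in grouped.items():
        yield json.dumps({"update": {"_index": index, "_id": pmid}})
        yield json.dumps({"doc": {field: [{"cui": c, "term": t} for c, t in cuis.items()]}})
```

The request body would then be `"\n".join(bulk_lines(grouped)) + "\n"`, since the bulk endpoint expects a trailing newline. PMIDs with no annotations simply produce no update, which matches the 0:n relationship.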