Paper Review: Detection and Normalization of Medical Terms using Domain-specific Term Frequency and Adaptive Ranking

Publisher

University of Alberta

Link to The Paper

https://sites.ualberta.ca/~miyoung2/Papers/2010_ITAB.pdf

Name of The Authors

Mi-Young Kim and Randy Goebel

Year of Publication

2010

Summary

The paper proposes a tool that automatically extracts information about diseases and treatments by detecting and normalizing medical terms using the Unified Medical Language System (UMLS) Metathesaurus. This is combined with a document retrieval technique based on domain-specific term frequency and adaptive ranking, which dynamically determines the relevant documents for each sentence and so avoids the need for a static cut-off threshold. UMLS integrates over 2M names for some 900K concepts from more than 60 families of biomedical vocabularies, along with 12M relations among these concepts. The method maps medical sentences to the UMLS concept IDs of the corresponding medical terms automatically and without supervision.

Important measures used in the experiment:

Document retrieval method: a language-model-based method is used because it shows the best performance. It computes the probability that a word is generated from document d and subtracts that value from 1. This term acts as a penalty for a word that does not exist in the query.

Domain-specific term frequency (specificity): each word that appears in a document is assigned a domain-specificity value according to the extent to which it is domain-specific.

Document frequency (ellipsibility): the larger the document frequency of a term, the greater its ellipsibility, because a more common term carries less information content.

Distance between words (proximity): query term occurrences in a document are weighted by proximity. When words t and r occur in the same document and the same query, the closer an occurrence of t is to an occurrence of r within the same input sentence, the more it contributes to t's weight in the document.

Adaptive ranking: the relevant documents for each query are determined adaptively, without a static cut-off threshold. For two words in a relevant document, if one appears in the query and the other does not, the word that appears in the query should not have less information content than the word that does not. To decide whether a new document should be added to the list of relevant documents, the method checks whether the query words found in the new document are also in the existing relevant documents.
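The adaptive cut-off can be illustrated with a small sketch. This is one possible reading of the check described above, not the authors' implementation: a candidate document is kept only if it matches at least one query word that the already-selected documents do not cover. The function name select_relevant, the (doc_id, words) input shape, and the document IDs are all hypothetical.

```python
def select_relevant(query_words, ranked_docs):
    """Pick relevant documents without a fixed top-k threshold.

    ranked_docs is a list of (doc_id, words) pairs, best-ranked first.
    Assumption: a document is redundant when every query word it
    contains is already covered by the documents selected so far.
    """
    query = set(query_words)
    covered = set()      # query words matched by selected documents
    relevant = []
    for doc_id, words in ranked_docs:
        matched = query & set(words)
        # Keep the document only if it contributes a new query word.
        if matched - covered:
            relevant.append(doc_id)
            covered |= matched
        # Stop early once every query word has a supporting document.
        if covered == query:
            break
    return relevant


query = ["chronic", "kidney", "disease", "dialysis"]
ranked = [
    ("doc_a", {"chronic", "kidney", "disease"}),
    ("doc_b", {"kidney", "disease"}),
    ("doc_c", {"dialysis"}),
    ("doc_d", {"dialysis", "procedure"}),
]
print(select_relevant(query, ranked))  # ['doc_a', 'doc_c']
```

Here doc_b is skipped because its query words are already covered by doc_a, and the loop ends after doc_c, so the number of returned documents depends on the query rather than on a preset limit.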

The experiment used two kinds of evaluation data. The first is annotated data from EBI for the detection of diseases (600 sentences, 924 disease terms). The second is data from UC Berkeley for detecting the functional relations between diseases and treatments (3655 sentences, 1643 disease terms, 1182 treatment terms); this data is not annotated, so it was manually mapped into the UMLS ontology. The authors constructed a disease lexicon and a treatment lexicon using the concepts of each.

Results: on the EBI data, the method achieved a precision of 71.37%, a recall of 76.03%, and an F-measure of 73.63%. It outperformed the statistical method, the dictionary-based method of EBI, and MetaMap by 5.24 to 8.24%, and the method using traditional term frequency by 11.49%. On the UC Berkeley data, it achieved an F-measure of 73.93% for diseases and 72.71% for treatments. Without using annotated data, it outperformed both the existing CRF-based method built on various features and the method using traditional term frequency.
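As a quick consistency check, the reported F-measure can be recomputed from the reported precision and recall, assuming the standard balanced F1 definition (the harmonic mean of the two). The helper name f_measure is ours.

```python
def f_measure(precision, recall):
    """Balanced F-measure (F1): harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported for the first data set, in percent.
print(round(f_measure(71.37, 76.03), 2))  # 73.63
```

The recomputed value agrees with the 73.63% quoted above.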

Contributions of The Paper

The paper applies a document retrieval technique that uses domain-specific term frequency, which is used to compute the normalized term entropy between two domains. To choose the relevant documents for each query, the authors use adaptive ranking based on the principle of ‘one document per named entity’. In the experiments, the method outperforms previous methods in detecting and normalizing medical terms.
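To make the idea of term entropy across two domains concrete, here is a minimal sketch. It is an assumption about what such a measure could look like, not the formula from the paper: the term's occurrences in a medical corpus and a general corpus are turned into a two-way distribution, and the Shannon entropy of that distribution is divided by its maximum. Raw counts are used for simplicity; in practice they would likely be normalized by corpus size first. The function name and the counts are hypothetical.

```python
import math


def normalized_term_entropy(count_in_domain, count_in_general):
    """Entropy of a term's distribution over two corpora, scaled to [0, 1].

    Values near 0 mean the term is concentrated in one corpus
    (more domain-specific); values near 1 mean it is spread evenly.
    """
    total = count_in_domain + count_in_general
    if total == 0:
        raise ValueError("term does not occur in either corpus")
    entropy = 0.0
    for count in (count_in_domain, count_in_general):
        if count > 0:  # a zero count contributes nothing to the entropy
            p = count / total
            entropy -= p * math.log2(p)
    # The maximum entropy over two outcomes is log2(2) = 1.
    return entropy / math.log2(2)


print(normalized_term_entropy(50, 50))  # 1.0, evenly spread
print(normalized_term_entropy(10, 0))   # 0.0, only in the medical corpus
```

Under this reading, a term seen mostly in medical text scores low and would be treated as more domain-specific than a term spread evenly across both corpora.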

Comments

The approach is limited to the medical field.
