bst-mug / n2c2

Support code for participation at the 2018 n2c2 Shared-Task Track 1
https://n2c2.dbmi.hms.harvard.edu
Apache License 2.0

Successful strategies from i2b2 shared tasks #5

Closed by michelole 6 years ago

michelole commented 6 years ago

Take a look at what other people did in the past on similar tasks.

See https://dbmi.hms.harvard.edu/programs/healthcare-data-science-program/clinical-nlp-challenges/7-2014-deid-heartdisease

kugami commented 6 years ago

According to the paper "Identifying risk factors for heart disease over time: Overview of 2014 i2b2/UTHealth shared task Track 2":

General information: the task focused on identifying medical risk factors related to Coronary Artery Disease (CAD) in the longitudinal medical records of diabetic patients. A note about the corpus: the training data consisted of 60% of the total corpus (790 records), and the testing data consisted of the remaining 40% (514 records).

Submissions -- what did other people do? Note: these are listed by their past ranking, so 1st place equals first place in that competition.

1st place: approached it as a mention-level classification task.

The training data was used as follows. Preprocessing identified section headers, negation words, modality words, and output from ConText. Rules were used for locating trigger words, medications, and measurements, and SVM classifiers were used to identify the validity and polarity of each mention. Smoking status was identified using a single 5-way classifier, and a separate rule-based classifier handled family history.
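The rule-based part of that pipeline can be sketched very roughly. The trigger and negation lists below are tiny stand-ins for the curated lexicons and the ConText algorithm the team actually used, and the SVM validity/polarity classifiers are replaced by a crude look-back-for-a-negation-cue heuristic:

```python
import re

# Hypothetical mini-lexicons for illustration only; the real system used
# much larger trigger lists and the ConText algorithm for polarity.
TRIGGERS = {
    "CAD": ["coronary artery disease", "angina"],
    "hypertension": ["hypertension", "high blood pressure"],
}
NEGATION_CUES = {"no", "denies", "without", "not"}

def find_mentions(text, window=4):
    """Locate trigger words and assign a crude polarity by checking for a
    negation cue in the few tokens preceding the mention (same sentence)."""
    mentions = []
    lowered = text.lower()
    for factor, terms in TRIGGERS.items():
        for term in terms:
            for m in re.finditer(r"\b" + re.escape(term) + r"\b", lowered):
                # look back only within the current clause/sentence
                clause = re.split(r"[.;]", lowered[:m.start()])[-1]
                prefix = clause.split()[-window:]
                polarity = "negative" if any(t in NEGATION_CUES for t in prefix) else "positive"
                mentions.append((factor, term, polarity))
    return mentions

note = "Patient denies angina. History of hypertension, treated with lisinopril."
print(find_mentions(note))
```

In the actual system this rule output would feed mention-level SVM classifiers rather than being the final answer.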

2nd place: divided the risk factors into three categories, after preprocessing the texts with MedEx (a medication information extraction system for clinical narratives).
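As a toy illustration of the kind of structured output MedEx produces (drug, dose, frequency), here is a regex sketch. The drug list and patterns are placeholders; MedEx itself relies on comprehensive drug lexicons and a semantic tagger:

```python
import re

# Stand-in drug list; MedEx uses full drug lexicons, not a short alternation.
DRUGS = r"(metformin|lisinopril|aspirin|atorvastatin)"
PATTERN = re.compile(
    DRUGS + r"\s+(\d+(?:\.\d+)?\s*mg)\s*(daily|bid|tid|qd)?",
    re.IGNORECASE,
)

def extract_medications(text):
    """Return (drug, dose, frequency) tuples found in a clinical note."""
    return [(drug.lower(), dose, freq or "")
            for drug, dose, freq in PATTERN.findall(text)]

meds = extract_medications("Started Metformin 500 mg BID; continue aspirin 81 mg daily.")
print(meds)
```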

3rd place: approached it as a multiple text categorization task.
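The multi-label framing amounts to scoring each record independently for every (risk factor, time attribute) label. A minimal sketch, with keyword rules standing in for the per-label classifiers the team actually trained:

```python
# Placeholder keyword rules; the real system trained one classifier per
# (risk factor, time attribute) label over the whole document.
LABEL_KEYWORDS = {
    ("diabetes", "present"): ["diabetes", "dm2", "a1c"],
    ("CAD", "past"): ["prior mi", "stent", "cabg"],
    ("obesity", "present"): ["obese", "bmi"],
}

def categorize(record):
    """Return the set of labels whose binary 'classifier' fires."""
    text = record.lower()
    return {label for label, kws in LABEL_KEYWORDS.items()
            if any(kw in text for kw in kws)}

labels = categorize("Obese patient with DM2, s/p stent placement in 2010.")
print(labels)
```

The key design point is that labels are assigned independently, so one record can carry any subset of them.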

Important takeaways: there were some similarities between the top-performing approaches. All used pre-processing tools to gain syntactic information, and only one (3rd place) added temporal attributes. Nearly all the systems used medical lexicons, such as UMLS, Drugs.com, and Wikipedia; only one team did not mention using a lexicon of medical terms. Hypertension and Family History had the best performance of all risk factors (a result partly due to the collection of files, which mostly indicated no family history at all). The top few teams all showed similar performance in the system tests.

michelole commented 6 years ago

👍 so... no neural nets?