EmilyAlsentzer / clinicalBERT

repository for Publicly Available Clinical BERT Embeddings
MIT License

Fix Date Bug in I2B2 2006 Preprocessing #27

Closed jenchen1398 closed 4 years ago

jenchen1398 commented 4 years ago

In the to_conll.py preprocessing code for the I2B2 2006 DEID data, many DATE examples are labeled 'O' instead of 'B-DATE' and still contain __phi__ as a token in the processed tsv files. The deid_surrogate_train_all_version2.xml file contains 5167 TYPE="DATE" markers, but in the train.tsv I got after running to_conll.py, 3035 of these examples were labeled 'O' and carried the __phi__ token.

I believe the script fails when a PHI XML element has text immediately before or after it, with no space in between. Broadly, there are two types of cases:

  1. <PHI TYPE="DATE">01/01</PHI>/1995 or <PHI TYPE="DATE">01-01</PHI>-95
  2. 1995<PHI TYPE="DATE">0101</PHI>

The updated to_conll.py script checks for these cases and splits them accordingly:

  1. 01/01 B-DATE 1995 O
  2. 1995 O 0101 B-DATE
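
For illustration, the splitting described above could be sketched roughly as follows. This is not the code from this PR: the regex, the helper names (`outside_tokens`, `tag_tokens`), the `I-` prefix for continuation tokens, and the choice to drop a dangling `/` or `-` at a PHI boundary are all assumptions, made only so the output matches the two examples above.

```python
import re

# Hypothetical pattern for inline PHI elements; the real to_conll.py
# may parse the XML differently.
PHI_RE = re.compile(r'<PHI TYPE="([^"]+)">(.*?)</PHI>')


def outside_tokens(text):
    """Label text outside any PHI element as 'O'.

    Assumption: separator characters left dangling at a PHI boundary
    ('/' or '-') are stripped, so '/1995' becomes '1995'.
    """
    stripped = (tok.strip("/-") for tok in text.split())
    return [(tok, "O") for tok in stripped if tok]


def tag_tokens(line):
    """Return (token, label) pairs for one line of the DEID XML text.

    PHI spans are split from adjacent text even when no space separates them.
    """
    pairs = []
    pos = 0
    for match in PHI_RE.finditer(line):
        # Text glued to the front of the PHI element.
        pairs += outside_tokens(line[pos:match.start()])
        phi_type = match.group(1)
        # First token in the span gets B-, later ones I- (assumed BIO scheme).
        for i, tok in enumerate(match.group(2).split()):
            prefix = "B" if i == 0 else "I"
            pairs.append((tok, f"{prefix}-{phi_type}"))
        pos = match.end()
    # Text glued to the back of the last PHI element.
    pairs += outside_tokens(line[pos:])
    return pairs


if __name__ == "__main__":
    for example in ('<PHI TYPE="DATE">01/01</PHI>/1995',
                    '1995<PHI TYPE="DATE">0101</PHI>'):
        print(tag_tokens(example))
```

Because the PHI element boundaries are taken from the regex match rather than from whitespace, a date glued to surrounding text never falls through to the 'O' / __phi__ path.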