The to_conll.py processing code for the I2B2 2006 DEID data drops many DATE examples: they are labeled 'O' instead of 'B-DATE', and __phi__ is left behind as a token in the processed tsv files. The deid_surrogate_train_all_version2.xml file contains 5167 TYPE="DATE" markers, but in the train.tsv I got after running to_conll.py, 3035 of these examples were labeled 'O' and still had the __phi__ token.
I believe the script fails when a PHI XML element has text immediately before or after it, with no space in between. Broadly, there are two types of cases:
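A rough way to reproduce counts like these is sketched below. It is not part of to_conll.py; the helper names are hypothetical, and it assumes the tsv has the token in the first column and the label in the last column. It also counts every 'O' row containing __phi__, whatever the original PHI type was, so the second number is only an upper bound on the DATE cases.

```python
import re


def count_date_tags(xml_text):
    """Count opening PHI tags whose TYPE attribute is DATE."""
    return len(re.findall(r'<PHI\s+TYPE="DATE"\s*>', xml_text))


def count_unlabeled_phi(tsv_text, phi_token="__phi__"):
    """Count rows whose token contains phi_token but whose label is 'O'.

    Assumes tab-separated rows: token first, label last (an assumption
    about the tsv layout, not something taken from to_conll.py).
    """
    count = 0
    for line in tsv_text.splitlines():
        parts = line.split("\t")
        if len(parts) < 2:
            continue  # blank line between sentences, or malformed row
        token, label = parts[0], parts[-1]
        if phi_token in token and label == "O":
            count += 1
    return count
```

Running both helpers on the raw XML text and on the contents of train.tsv gives the two numbers to compare.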
<PHI TYPE="DATE">01/01</PHI>/1995 or <PHI TYPE="DATE">01-01</PHI>-95
1995<PHI TYPE="DATE">0101</PHI>
The updated to_conll.py script checks for these cases and splits them accordingly.
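For illustration only, one way to handle both cases is to insert a space wherever a PHI element touches neighboring text, so that a whitespace tokenizer sees the element as its own token. This is a minimal sketch and not necessarily how the updated script does it; pad_phi_tags is a hypothetical helper name.

```python
import re


def pad_phi_tags(text):
    """Insert a space wherever a PHI element is glued to adjacent text."""
    # Non-whitespace directly before an opening tag: 1995<PHI ...> -> 1995 <PHI ...>
    text = re.sub(r'(?<=\S)(?=<PHI\s)', ' ', text)
    # Non-whitespace directly after a closing tag: </PHI>/1995 -> </PHI> /1995
    text = re.sub(r'(?<=</PHI>)(?=\S)', ' ', text)
    return text


print(pad_phi_tags('<PHI TYPE="DATE">01/01</PHI>/1995'))
print(pad_phi_tags('1995<PHI TYPE="DATE">0101</PHI>'))
```

With this approach the leftover fragment (for example /1995) becomes a separate token outside the PHI span, and text that is already space-separated is left unchanged.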