Closed tschaffter closed 4 years ago
From the project folder, the dependencies can now be installed using:
pip3 install -r requirements.txt
We then run the following command to detect PHI in the example i2b2 clinical notes (plain-text format) provided in this repository.
$ python3 main.py -i ./data/i2b2_notes/ -o ./data/i2b2_results/ -f ./configs/philter_delta.json --prod=True --outputformat=i2b2
FutureWarning: Possible nested set at position 772 in file filters/regex/safe/hospital_safe.txt
reading text from ./data/i2b2_notes/111-01.txt
reading text from ./data/i2b2_notes/110-04.txt
reading text from ./data/i2b2_notes/110-03.txt
reading text from ./data/i2b2_notes/110-02.txt
reading text from ./data/i2b2_notes/110-01.txt
['<?xml version="1.0" ?>\n', '<Philter>\n', '<TEXT><![CDATA[', '\n\n\nRecord date: 2083-07-20\n\n SILVER RIDGE EMERGENCY DEPT VISIT\n\n \n\nOROZCO,KYLE 560-40-78-5 VISIT DATE: 07/20/83\n\nPRESENTING COMPLAINT: Groin abscess. \n\nHISTORY OF PRESENTING COMPLAINT: This is a 58 year-old male who \n\nhad a renal catheterization via the right groin for renal artery \n\nstenosis on 8-09-83, who now comes in with progressive redness, \n\nswelling and some drainage over the last two days. Some low grade \n\nfever, no chills, no rigors, no cough, no chest pain. \n\nPAST MEDICAL HISTORY: Angioplasty of his renal artery, insulin \n\ndependent diabetes mellitus, hypertension. \n\nMEDICATIONS: Zestril, Zocor, hydrochlorothiazide, insulin. \n\nALLERGIES: No known drug allergies. \n\nPHYSICAL EXAMINATION: This is a well-nourished, well-developed \n\nmale. SKIN: Warm and dry without rash or diaphoresis. HEENT: \n\nNormocephalic, atraumatic, pupils are equal, round and reactive to \n\nlight. NECK: Supple, full range of motion. LUNGS: Clear. \n\nHEART: Regular rate and rhythm. ABDOMEN: Soft and nontender. \n\nThe right groin has a purulent abscess which is extremely tender. \n\nIt is not pulsatile. There is purulent drainage from it. There is \n\nsome surrounding erythema that extends on to the testicles. The \n\ntesticles are nontender. \n\nCONSULTATIONS (including PCP): I have discussed the case with the \n\nprimary care physician. \n\nFINAL DIAGNOSIS: Groin abscess. \n\nDISPOSITION (including condition upon discharge): The patient is \n\nadmitted to the operating room in stable condition. \n\n___________________________________ XW277/90683 \n\nFILBERT BRIGHT, M.D. FB59 D:07/20/83 \n\n T:07/20/83 \n\nDictated by: FILBERT BRIGHT, M.D. 
FB59 \n\n Not reviewed by Attending Physician \n\n\n\n\n\n', ']]></TEXT>\n', '<TAGS>\n', '<', 'DATE', ' id="P', '0', '" start="', '16', '" end="', '26', '" text="', '2083-07-20', '" TYPE="', 'DATE', '" comment="" />\n', '<', 'DATE', ' id="P', '1', '" start="', '145', '" end="', '153', '" text="', '07/20/83', '" TYPE="', 'DATE', '" comment="" />\n', '<', 'DATE', ' id="P', '2', '" start="', '338', '" end="', '348', '" text="', 'on 8-09-83', '" TYPE="', 'DATE', '" comment="" />\n', '<', 'DATE', ' id="P', '3', '" start="', '1605', '" end="', '1609', '" text="', '7/90', '" TYPE="', 'DATE', '" comment="" />\n', '<', 'DATE', ' id="P', '4', '" start="', '1666', '" end="', '1674', '" text="', '07/20/83', '" TYPE="', 'DATE', '" comment="" />\n', '<', 'DATE', ' id="P', '5', '" start="', '1734', '" end="', '1742', '" text="', '07/20/83', '" TYPE="', 'DATE', '" comment="" />\n', '<', 'OTHER', ' id="P', '6', '" start="', '16', '" end="', '26', '" text="', '2083-07-20', '" TYPE="', 'OTHER', '" comment="" />\n', '<', 'OTHER', ' id="P', '7', '" start="', '56', '" end="', '61', '" text="', 'RIDGE', '" TYPE="', 'OTHER', '" comment="" />\n', '<', 'OTHER', ' id="P', '8', '" start="', '62', '" end="', '71', '" text="', 'EMERGENCY', '" TYPE="', 'OTHER', '" comment="" />\n', '<', 'OTHER', ' id="P', '9', '" start="', '87', '" end="', '93', '" text="', 'OROZCO', '" TYPE="', 'OTHER', '" comment="" />\n', '<', 'OTHER', ' id="P', '10', '" start="', '94', '" end="', '98', '" text="', 'KYLE', '" TYPE="', 'OTHER', '" comment="" />\n', '<', 'OTHER', ' id="P', '11', '" start="', '101', '" end="', '112', '" text="', '560-40-78-5', '" TYPE="', 'OTHER', '" comment="" />\n', '<', 'OTHER', ' id="P', '12', '" start="', '229', '" end="', '233', '" text="', 'This', '" TYPE="', 'OTHER', '" comment="" />\n', '<', 'OTHER', ' id="P', '13', '" start="', '341', '" end="', '348', '" text="', '8-09-83', '" TYPE="', 'OTHER', '" comment="" />\n', '<', 'OTHER', ' id="P', '14', '" start="', '1601', '" end="', '1612', 
'" text="', 'XW277/90683', '" TYPE="', 'OTHER', '" comment="" />\n', '<', 'OTHER', ' id="P', '15', '" start="', '1615', '" end="', '1630', '" text="', 'FILBERT BRIGHT,', '" TYPE="', 'OTHER', '" comment="" />\n', '<', 'OTHER', ' id="P', '16', '" start="', '1639', '" end="', '1643', '" text="', 'FB59', '" TYPE="', 'OTHER', '" comment="" />\n', '<', 'OTHER', ' id="P', '17', '" start="', '1759', '" end="', '1774', '" text="', 'FILBERT BRIGHT,', '" TYPE="', 'OTHER', '" comment="" />\n', '<', 'OTHER', ' id="P', '18', '" start="', '1783', '" end="', '1787', '" text="', 'FB59', '" TYPE="', 'OTHER', '" comment="" />\n', '</TAGS>\n', '</Philter>\n']
Traceback (most recent call last):
File "main.py", line 138, in <module>
main()
File "main.py", line 117, in main
filterer.transform()
File "/Users/tschaffter/dev/philter-ucsf/philter.py", line 802, in transform
f.write(contents)
TypeError: write() argument must be str, not None
The above commit fixes the issue. It also comments out a print() call to make the content of stdout more concise.
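The traceback shows f.write(contents) receiving None. The actual fix lives in the referenced commit and is not reproduced here. The following is only a minimal sketch of one common way this kind of error arises, assuming a hypothetical helper that assembles the XML fragments, prints them, and never returns the joined string. The names build_xml_buggy and build_xml_fixed are made up for illustration and are not taken from philter.py.

```python
import io


def build_xml_buggy(fragments):
    # Hypothetical helper: prints the fragments but returns nothing,
    # so the caller receives None.
    print(fragments)


def build_xml_fixed(fragments):
    # Join the fragments into a single string and return it to the caller.
    return "".join(fragments)


fragments = ['<?xml version="1.0" ?>\n', "<Philter>\n", "</Philter>\n"]

buffer = io.StringIO()
error_message = None
try:
    # Writing None to a text stream raises TypeError, as in the traceback.
    buffer.write(build_xml_buggy(fragments))
except TypeError as err:
    error_message = str(err)

# With the fixed helper, the write succeeds.
buffer.write(build_xml_fixed(fragments))
```

In this sketch the buggy helper also floods stdout with the fragment list, which resembles the long list printed before the traceback above.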
$ python3 main.py -i ./data/i2b2_notes/ -o ./data/i2b2_results/ -f ./configs/philter_delta.json --prod=True --outputformat=i2b2
FutureWarning: Possible nested set at position 772 in file filters/regex/safe/hospital_safe.txt
reading text from ./data/i2b2_notes/111-01.txt
reading text from ./data/i2b2_notes/110-04.txt
reading text from ./data/i2b2_notes/110-03.txt
reading text from ./data/i2b2_notes/110-02.txt
reading text from ./data/i2b2_notes/110-01.txt
As an example, here are the PHI tokens detected by Philter in the clinical note ./data/i2b2_notes/111-01.txt. The results are saved to ./data/i2b2_results/111-01.xml.
<TAGS>
<DATE id="P0" start="16" end="26" text="2069-04-07" TYPE="DATE" comment="" />
<DATE id="P1" start="89" end="97" text="November" TYPE="DATE" comment="" />
<DATE id="P2" start="1708" end="1716" text="04/07/69" TYPE="DATE" comment="" />
<DATE id="P3" start="1721" end="1729" text="04/15/69" TYPE="DATE" comment="" />
<DATE id="P4" start="1734" end="1742" text="04/07/69" TYPE="DATE" comment="" />
<OTHER id="P5" start="16" end="26" text="2069-04-07" TYPE="OTHER" comment="" />
<OTHER id="P6" start="38" end="46" text="Villegas" TYPE="OTHER" comment="" />
<OTHER id="P7" start="89" end="97" text="November" TYPE="OTHER" comment="" />
<OTHER id="P8" start="1382" end="1386" text="Will" TYPE="OTHER" comment="" />
<OTHER id="P9" start="1666" end="1685" text="Xzavian G. Tavares," TYPE="OTHER" comment="" />
</TAGS>
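To check quickly which PHI types appear in a result file, the children of the TAGS element can be counted by their TYPE attribute. This is a minimal sketch using the standard library; the sample string is a trimmed copy of the output shape shown above, and the helper name count_phi_types is an assumption made for illustration.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Trimmed sample shaped like the Philter i2b2 output shown above.
SAMPLE = """<?xml version="1.0" ?>
<Philter>
<TEXT><![CDATA[...]]></TEXT>
<TAGS>
<DATE id="P0" start="16" end="26" text="2069-04-07" TYPE="DATE" comment="" />
<DATE id="P1" start="89" end="97" text="November" TYPE="DATE" comment="" />
<OTHER id="P6" start="38" end="46" text="Villegas" TYPE="OTHER" comment="" />
</TAGS>
</Philter>
"""


def count_phi_types(xml_text):
    """Count annotations per PHI type among the children of <TAGS>."""
    root = ET.fromstring(xml_text)
    tags = root.find("TAGS")
    if tags is None:
        return Counter()
    return Counter(child.get("TYPE") for child in tags)


counts = count_phi_types(SAMPLE)
```

For a file on disk, ET.parse(path).getroot() can replace ET.fromstring. Running this over a whole results folder would show at a glance whether only DATE and OTHER are being emitted.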
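Hey @kmuenzen, I have a few questions:
- I have applied Philter to all the clinical notes in the evaluation set of the 2014 i2b2 NLP De-id challenge (https://portal.dbmi.hms.harvard.edu/projects/n2c2-nlp/). When looking at all the annotations generated by Philter (i.e., children XML nodes of ), Philter only reports the detection of PHI types DATE and OTHER. I am particularly interested in the detection of NAMEs. Am I doing something wrong? In comparison, here are the annotations included in the gold standard files of the evaluation set.
- There are several source code files that appear in both the project folder and the sub-folder philter_ucsf. Why is that so?
Taking a shot at the issue #1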
First, install pandas, which is required. pandas is now listed in the requirements.txt added in this PR.
Running this command copied/pasted from README throws an error:
$ python improve_i2b2_notes.py -i data/i2b2_xml/ -o data/i2b2_xml_updated/
Traceback (most recent call last):
File "improve_i2b2_notes.py", line 7, in <module>
import xmltodict
ModuleNotFoundError: No module named 'xmltodict'
This can be fixed by installing xmltodict (added to requirements.txt). The new error is now:
$ python improve_i2b2_notes.py -i data/i2b2_xml/ -o data/i2b2_xml_updated/
Output directory already exists.
Curating: 111-01.xml
Traceback (most recent call last):
File "improve_i2b2_notes.py", line 194, in <module>
main()
File "improve_i2b2_notes.py", line 146, in main
for key, value in tags_dict.iteritems():
AttributeError: 'collections.OrderedDict' object has no attribute 'iteritems'
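This error is typical of Python 2 code run under Python 3: dict.iteritems() was removed in Python 3, where items() provides the same iteration and is also available on OrderedDict. A minimal sketch of the change, using a made-up tags_dict because the real contents built by improve_i2b2_notes.py are not shown here:

```python
from collections import OrderedDict

# Hypothetical stand-in for the tags_dict built in improve_i2b2_notes.py.
tags_dict = OrderedDict([("DATE", ["2069-04-07"]), ("NAME", ["Villegas"])])

# Python 2 only:  for key, value in tags_dict.iteritems():
# Python 3:       items() returns a view that iterates the same pairs.
pairs = []
for key, value in tags_dict.items():
    pairs.append((key, value))
```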
This PR is ready for review (I've removed the "work in progress" tag that I added when creating this PR).
Hi @tschaffter,
Thanks so much for your great suggestions! I'll merge your pull request shortly. In response to a few of your questions:
- I have applied Philter to all the clinical notes in the evaluation set of the 2014 i2b2 NLP De-id challenge (https://portal.dbmi.hms.harvard.edu/projects/n2c2-nlp/). When looking at all the annotations generated by Philter (i.e., children XML nodes of ), Philter only reports the detection of PHI types DATE and OTHER. I am particularly interested in the detection of NAMEs. Am I doing something wrong?
The PHI tags assigned by Philter originate from the configuration file, specified by the --filters (-f) argument. Each filter object in the config file has a "phi_type" attribute where you can define the inferred PHI type of any text matched by that particular filter. We originally included only the "DATE" and "OTHER" PHI types because our date filters had a very low false positive rate, whereas other filters had varying false positive rates and sometimes tagged PHI inappropriately. However, I've created another config file (configs/philter_delta_phi_tags/json) that has additional PHI types like NAME, LOCATION, and ID included in case you would like to use that instead. Again, the specificity of these tags may vary, but you are more than welcome to give it a try. Please note that if you would like to add/modify PHI tags, you will need to add these new tags to a pre-defined list in Line 112 of philter.py. Here is the current list:
self.phi_type_list = ['DATE','Patient_Social_Security_Number','Email', \
'Provider_Address_or_Location','Age','Name','OTHER','ID','NAME','LOCATION', \
'CONTACT','AGE']
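As a rough illustration of that check, the sketch below collects the phi_type values from a list of filter objects and reports any that are missing from the list above. The filter entries, their "title" field, and the helper name unknown_phi_types are hypothetical; only the "phi_type" attribute and the allowed values come from the description above.

```python
# Allowed PHI types, copied from the list quoted above.
PHI_TYPE_LIST = ['DATE', 'Patient_Social_Security_Number', 'Email',
                 'Provider_Address_or_Location', 'Age', 'Name', 'OTHER', 'ID',
                 'NAME', 'LOCATION', 'CONTACT', 'AGE']


def unknown_phi_types(filters, allowed=PHI_TYPE_LIST):
    """Return the phi_type values used by filters but absent from allowed."""
    used = {f["phi_type"] for f in filters if "phi_type" in f}
    return sorted(used - set(allowed))


# Hypothetical filter objects; a real config may carry other fields.
example_filters = [
    {"title": "dates", "phi_type": "DATE"},
    {"title": "names", "phi_type": "NAME"},
    {"title": "phones", "phi_type": "PHONE"},
]

missing = unknown_phi_types(example_filters)
```

Here missing contains "PHONE", which would need to be added to the list in philter.py before use. With a real config, the filters could be loaded with json.load, assuming the file is a JSON array of filter objects.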
- There are several source code files that appear in both the project folder and the sub-folder philter_ucsf. Why is that so?
The files in the philter_ucsf subdirectory are the source files for the PyPI package. I had to make slight modifications to main.py and philter.py to make the pip version of Philter functional.
Overview
This PR describes issues encountered when using Philter to detect PHI in i2b2 clinical notes.
Protocol
First clone the repo and create a Python virtual environment.
Here are the versions of python and pip.
We attempt to detect PHI in the example notes in txt format included in data/i2b2_notes:
A few packages must be installed. Unfortunately, there is no requirements.txt to easily install them. After a few tries, here are the dependencies that need to be installed.
Here are the packages installed in the Python environment: