BCHSI / philter-ucsf

Open source clinical text de-identification
BSD 3-Clause "New" or "Revised" License

Hotfix for multiple bugs #7

Closed tschaffter closed 4 years ago

tschaffter commented 4 years ago

Overview

This PR describes issues encountered when using Philter to detect PHI in i2b2 clinical notes.

Protocol

First clone the repo and create a Python virtual environment.

cd philter-ucsf
python3 -m venv env
source env/bin/activate

Here are the versions of Python and pip.

$ python --version
Python 3.7.7
$ pip --version
pip 19.2.3 from /Users/tschaffter/dev/philter-ucsf/env/lib/python3.7/site-packages/pip (python 3.7)

We attempt to detect PHI in the example notes in txt format included in data/i2b2_notes:

$ python3 main.py -i ./data/i2b2_notes/ -o ./data/i2b2_results/ -f ./configs/philter_delta.json --prod=True --outputformat=i2b2
Traceback (most recent call last):
  File "main.py", line 5, in <module>
    from philter import Philter
  File "/Users/tschaffter/dev/philter-ucsf/philter.py", line 5, in <module>
    import nltk
ModuleNotFoundError: No module named 'nltk'

A few packages must be installed first. Unfortunately, there is no requirements.txt to install them easily. After a few tries, here are the dependencies that need to be installed.

pip install nltk chardet numpy

Here are the packages installed in the Python environment:

$ pip freeze
chardet==3.0.4
click==7.1.2
joblib==0.15.1
nltk==3.5
numpy==1.19.0
regex==2020.6.8
tqdm==4.46.1
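
To avoid discovering missing packages one traceback at a time, a small check can be run before main.py. This is a minimal sketch, not part of Philter: the module list only contains the three packages named above, and `find_missing` is a hypothetical helper.

```python
import importlib.util

# Modules named in this thread; extend the list for your own checkout.
REQUIRED_MODULES = ["nltk", "chardet", "numpy"]


def find_missing(modules):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in modules if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = find_missing(REQUIRED_MODULES)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All required modules are importable.")
```

The check only looks up top-level module names, so it says nothing about whether the installed versions are compatible.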
tschaffter commented 4 years ago

From the project folder, the dependencies can now be installed using:

pip3 install -r requirements.txt
tschaffter commented 4 years ago

We run the following command to detect PHI in the i2b2 clinical notes in text format provided as examples in this repository.

$ python3 main.py -i ./data/i2b2_notes/ -o ./data/i2b2_results/ -f ./configs/philter_delta.json --prod=True --outputformat=i2b2
FutureWarning: Possible nested set at position 772 in file filters/regex/safe/hospital_safe.txt
reading text from ./data/i2b2_notes/111-01.txt
reading text from ./data/i2b2_notes/110-04.txt
reading text from ./data/i2b2_notes/110-03.txt
reading text from ./data/i2b2_notes/110-02.txt
reading text from ./data/i2b2_notes/110-01.txt
['<?xml version="1.0" ?>\n', '<Philter>\n', '<TEXT><![CDATA[', '\n\n\nRecord date: 2083-07-20\n\n                     SILVER RIDGE EMERGENCY DEPT VISIT\n\n \n\nOROZCO,KYLE   560-40-78-5                     VISIT DATE: 07/20/83\n\nPRESENTING COMPLAINT:  Groin abscess. \n\nHISTORY OF PRESENTING COMPLAINT:  This is a 58 year-old male who \n\nhad a renal catheterization via the right groin for renal artery \n\nstenosis on 8-09-83, who now comes in with progressive redness, \n\nswelling and some drainage over the last two days.  Some low grade \n\nfever, no chills, no rigors, no cough, no chest pain. \n\nPAST MEDICAL HISTORY:  Angioplasty of his renal artery, insulin \n\ndependent diabetes mellitus, hypertension. \n\nMEDICATIONS:  Zestril, Zocor, hydrochlorothiazide, insulin. \n\nALLERGIES:  No known drug allergies. \n\nPHYSICAL EXAMINATION:  This is a well-nourished, well-developed \n\nmale.  SKIN:  Warm and dry without rash or diaphoresis.  HEENT: \n\nNormocephalic, atraumatic, pupils are equal, round and reactive to \n\nlight.  NECK:  Supple, full range of motion.  LUNGS:  Clear. \n\nHEART:  Regular rate and rhythm.  ABDOMEN:  Soft and nontender. \n\nThe right groin has a purulent abscess which is extremely tender. \n\nIt is not pulsatile.  There is purulent drainage from it.  There is \n\nsome surrounding erythema that extends on to the testicles.  The \n\ntesticles are nontender. \n\nCONSULTATIONS (including PCP):  I have discussed the case with the \n\nprimary care physician. \n\nFINAL DIAGNOSIS:  Groin abscess. \n\nDISPOSITION (including condition upon discharge):  The patient is \n\nadmitted to the operating room in stable condition. \n\n___________________________________                    XW277/90683 \n\nFILBERT BRIGHT, M.D.    FB59                     D:07/20/83 \n\n                                                       T:07/20/83 \n\nDictated by:  FILBERT BRIGHT, M.D.    
FB59 \n\n          Not reviewed by Attending Physician         \n\n\n\n\n\n', ']]></TEXT>\n', '<TAGS>\n', '<', 'DATE', ' id="P', '0', '" start="', '16', '" end="', '26', '" text="', '2083-07-20', '" TYPE="', 'DATE', '" comment="" />\n', '<', 'DATE', ' id="P', '1', '" start="', '145', '" end="', '153', '" text="', '07/20/83', '" TYPE="', 'DATE', '" comment="" />\n', '<', 'DATE', ' id="P', '2', '" start="', '338', '" end="', '348', '" text="', 'on 8-09-83', '" TYPE="', 'DATE', '" comment="" />\n', '<', 'DATE', ' id="P', '3', '" start="', '1605', '" end="', '1609', '" text="', '7/90', '" TYPE="', 'DATE', '" comment="" />\n', '<', 'DATE', ' id="P', '4', '" start="', '1666', '" end="', '1674', '" text="', '07/20/83', '" TYPE="', 'DATE', '" comment="" />\n', '<', 'DATE', ' id="P', '5', '" start="', '1734', '" end="', '1742', '" text="', '07/20/83', '" TYPE="', 'DATE', '" comment="" />\n', '<', 'OTHER', ' id="P', '6', '" start="', '16', '" end="', '26', '" text="', '2083-07-20', '" TYPE="', 'OTHER', '" comment="" />\n', '<', 'OTHER', ' id="P', '7', '" start="', '56', '" end="', '61', '" text="', 'RIDGE', '" TYPE="', 'OTHER', '" comment="" />\n', '<', 'OTHER', ' id="P', '8', '" start="', '62', '" end="', '71', '" text="', 'EMERGENCY', '" TYPE="', 'OTHER', '" comment="" />\n', '<', 'OTHER', ' id="P', '9', '" start="', '87', '" end="', '93', '" text="', 'OROZCO', '" TYPE="', 'OTHER', '" comment="" />\n', '<', 'OTHER', ' id="P', '10', '" start="', '94', '" end="', '98', '" text="', 'KYLE', '" TYPE="', 'OTHER', '" comment="" />\n', '<', 'OTHER', ' id="P', '11', '" start="', '101', '" end="', '112', '" text="', '560-40-78-5', '" TYPE="', 'OTHER', '" comment="" />\n', '<', 'OTHER', ' id="P', '12', '" start="', '229', '" end="', '233', '" text="', 'This', '" TYPE="', 'OTHER', '" comment="" />\n', '<', 'OTHER', ' id="P', '13', '" start="', '341', '" end="', '348', '" text="', '8-09-83', '" TYPE="', 'OTHER', '" comment="" />\n', '<', 'OTHER', ' id="P', '14', '" start="', '1601', '" 
end="', '1612', '" text="', 'XW277/90683', '" TYPE="', 'OTHER', '" comment="" />\n', '<', 'OTHER', ' id="P', '15', '" start="', '1615', '" end="', '1630', '" text="', 'FILBERT BRIGHT,', '" TYPE="', 'OTHER', '" comment="" />\n', '<', 'OTHER', ' id="P', '16', '" start="', '1639', '" end="', '1643', '" text="', 'FB59', '" TYPE="', 'OTHER', '" comment="" />\n', '<', 'OTHER', ' id="P', '17', '" start="', '1759', '" end="', '1774', '" text="', 'FILBERT BRIGHT,', '" TYPE="', 'OTHER', '" comment="" />\n', '<', 'OTHER', ' id="P', '18', '" start="', '1783', '" end="', '1787', '" text="', 'FB59', '" TYPE="', 'OTHER', '" comment="" />\n', '</TAGS>\n', '</Philter>\n']
Traceback (most recent call last):
  File "main.py", line 138, in <module>
    main()
  File "main.py", line 117, in main
    filterer.transform()
  File "/Users/tschaffter/dev/philter-ucsf/philter.py", line 802, in transform
    f.write(contents)
TypeError: write() argument must be str, not None
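
The traceback is consistent with a helper that prints its result instead of returning it, so the caller passes None to f.write(). The sketch below illustrates that class of bug only; `build_xml_buggy` and `build_xml_fixed` are hypothetical names and are not taken from philter.py or from the fixing commit.

```python
def build_xml_buggy(parts):
    """Print the fragments; without a return statement the result is None."""
    print(parts)


def build_xml_fixed(parts):
    """Join the fragments and return them as one string that write() accepts."""
    return "".join(parts)


# Hypothetical fragments, shaped like the list printed in the log above.
parts = ['<?xml version="1.0" ?>\n', "<Philter>\n", "</Philter>\n"]

contents = build_xml_fixed(parts)
assert isinstance(contents, str)
```

Passing the return value of the buggy variant to a file's write() method would raise the same TypeError as above.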
tschaffter commented 4 years ago

The above commit fixes the issue. It also comments out a print() call to make the content of stdout more concise.

$ python3 main.py -i ./data/i2b2_notes/ -o ./data/i2b2_results/ -f ./configs/philter_delta.json --prod=True --outputformat=i2b2
FutureWarning: Possible nested set at position 772 in file filters/regex/safe/hospital_safe.txt
reading text from ./data/i2b2_notes/111-01.txt
reading text from ./data/i2b2_notes/110-04.txt
reading text from ./data/i2b2_notes/110-03.txt
reading text from ./data/i2b2_notes/110-02.txt
reading text from ./data/i2b2_notes/110-01.txt

As an example, here are the PHI tokens detected by Philter in the clinical note ./data/i2b2_notes/111-01.txt. The results are saved to ./data/i2b2_results/111-01.xml.

<TAGS>
<DATE id="P0" start="16" end="26" text="2069-04-07" TYPE="DATE" comment="" />
<DATE id="P1" start="89" end="97" text="November" TYPE="DATE" comment="" />
<DATE id="P2" start="1708" end="1716" text="04/07/69" TYPE="DATE" comment="" />
<DATE id="P3" start="1721" end="1729" text="04/15/69" TYPE="DATE" comment="" />
<DATE id="P4" start="1734" end="1742" text="04/07/69" TYPE="DATE" comment="" />
<OTHER id="P5" start="16" end="26" text="2069-04-07" TYPE="OTHER" comment="" />
<OTHER id="P6" start="38" end="46" text="Villegas" TYPE="OTHER" comment="" />
<OTHER id="P7" start="89" end="97" text="November" TYPE="OTHER" comment="" />
<OTHER id="P8" start="1382" end="1386" text="Will" TYPE="OTHER" comment="" />
<OTHER id="P9" start="1666" end="1685" text="Xzavian G. Tavares," TYPE="OTHER" comment="" />
</TAGS>
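
Note that several spans appear twice in the output above, once as DATE and once as OTHER. A short script can summarize this; the sketch below uses only the standard library and a shortened copy of the tags above, and `summarize_tags` is a hypothetical helper rather than part of Philter.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Shortened copy of the <TAGS> block shown above.
SAMPLE = """<TAGS>
<DATE id="P0" start="16" end="26" text="2069-04-07" TYPE="DATE" comment="" />
<DATE id="P1" start="89" end="97" text="November" TYPE="DATE" comment="" />
<OTHER id="P5" start="16" end="26" text="2069-04-07" TYPE="OTHER" comment="" />
<OTHER id="P6" start="38" end="46" text="Villegas" TYPE="OTHER" comment="" />
<OTHER id="P7" start="89" end="97" text="November" TYPE="OTHER" comment="" />
</TAGS>"""


def summarize_tags(tags_xml):
    """Count tags per TYPE and list the (start, end) spans tagged more than once."""
    root = ET.fromstring(tags_xml)
    type_counts = Counter(tag.get("TYPE") for tag in root)
    span_counts = Counter((tag.get("start"), tag.get("end")) for tag in root)
    repeated = sorted((int(s), int(e)) for (s, e), n in span_counts.items() if n > 1)
    return type_counts, repeated


type_counts, repeated_spans = summarize_tags(SAMPLE)
print(dict(type_counts))   # {'DATE': 2, 'OTHER': 3}
print(repeated_spans)      # [(16, 26), (89, 97)]
```

For a full result file, the TAGS element would first have to be located under the Philter root (for example with root.find("TAGS")), assuming the file has the layout printed in the earlier log.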
tschaffter commented 4 years ago

Hey @kmuenzen, I have a few questions:

image

In comparison, here are the annotations included in the gold standard files of the evaluation set.

image

tschaffter commented 4 years ago

Taking a shot at issue #1

First, install pandas, which is required. I added pandas to the requirements.txt introduced in this PR.

Running this command, copied from the README, throws an error:

$ python improve_i2b2_notes.py -i data/i2b2_xml/ -o data/i2b2_xml_updated/
Traceback (most recent call last):
  File "improve_i2b2_notes.py", line 7, in <module>
    import xmltodict
ModuleNotFoundError: No module named 'xmltodict'

This can be fixed by installing xmltodict (added to requirements.txt). The next error is:

$ python improve_i2b2_notes.py -i data/i2b2_xml/ -o data/i2b2_xml_updated/
Output directory already exists.

Curating: 111-01.xml
Traceback (most recent call last):
  File "improve_i2b2_notes.py", line 194, in <module>
    main()
  File "improve_i2b2_notes.py", line 146, in main
    for key, value in tags_dict.iteritems():
AttributeError: 'collections.OrderedDict' object has no attribute 'iteritems'
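
This error is a Python 2 leftover: dict.iteritems() no longer exists in Python 3, where items() returns an iterable view instead. A minimal sketch of the change follows; the contents of `tags_dict` are made up for illustration and are not taken from improve_i2b2_notes.py.

```python
from collections import OrderedDict

# Hypothetical stand-in for the dictionary iterated in improve_i2b2_notes.py.
tags_dict = OrderedDict([("DATE", ["2069-04-07"]), ("NAME", ["Villegas"])])

# Python 2 only: tags_dict.iteritems() raises AttributeError on Python 3.
# Python 3: items() returns a view that is iterated the same way.
pairs = [(key, value) for key, value in tags_dict.items()]
print(pairs)
```

The same replacement applies to any other iteritems(), iterkeys(), or itervalues() calls that the script may contain.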
tschaffter commented 4 years ago

This PR is ready for review (I've removed the "work in progress" tag that I added when creating this PR).

kmuenzen commented 4 years ago

Hi @tschaffter,

Thanks so much for your great suggestions! I'll merge your pull request shortly. In response to a few of your questions:

  • I have applied Philter to all the clinical notes in the evaluation set of the 2014 i2b2 NLP De-id challenge (https://portal.dbmi.hms.harvard.edu/projects/n2c2-nlp/). When looking at all the annotations generated by Philter (i.e., the children XML nodes of <TAGS>), Philter only reports the detection of PHI types DATE and OTHER. I am particularly interested in the detection of NAMEs. Am I doing something wrong?

The PHI tags assigned by Philter originate from the configuration file, specified by the --filters (-f) argument. Each filter object in the config file has a "phi_type" attribute where you can define the inferred PHI type of any text matched by that particular filter. We originally included only the "DATE" and "OTHER" PHI types because our date filters had a very low false positive rate, whereas other filters had varying false positive rates and sometimes tagged PHI inappropriately. However, I've created another config file (configs/philter_delta_phi_tags/json) that has additional PHI types like NAME, LOCATION, and ID included in case you would like to use that instead. Again, the specificity of these tags may vary, but you are more than welcome to give it a try. Please note that if you would like to add/modify PHI tags, you will need to add these new tags to a pre-defined list in Line 112 of philter.py. Here is the current list:

self.phi_type_list = ['DATE','Patient_Social_Security_Number','Email', \
'Provider_Address_or_Location','Age','Name','OTHER','ID','NAME','LOCATION', \
'CONTACT','AGE']
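
Before running Philter with a modified config, it can help to check that every "phi_type" used by the filters is in that list. The sketch below assumes the config is a JSON array of filter objects, each with an optional "phi_type" key; the example fragment, its "title" values, and the `unknown_phi_types` helper are hypothetical.

```python
import json

# Copy of the list quoted above from philter.py.
PHI_TYPE_LIST = ['DATE', 'Patient_Social_Security_Number', 'Email',
                 'Provider_Address_or_Location', 'Age', 'Name', 'OTHER', 'ID',
                 'NAME', 'LOCATION', 'CONTACT', 'AGE']


def unknown_phi_types(filters, known=PHI_TYPE_LIST):
    """Return phi_type values used by the filters but absent from the known list."""
    used = {f["phi_type"] for f in filters if "phi_type" in f}
    return sorted(used - set(known))


# Hypothetical config fragment; a real config would be read with json.load().
example_config = json.loads("""
[
  {"title": "dates", "phi_type": "DATE"},
  {"title": "names", "phi_type": "NAME"},
  {"title": "phones", "phi_type": "PHONE"}
]
""")

print(unknown_phi_types(example_config))  # ['PHONE']
```

Any value reported here would need to be added to the list in philter.py, as described above.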
  • There are several source code files that appear in both the project folder and the sub-folder philter_ucsf. Why is that so?

The files in the philter_ucsf subdirectory are the source files for the PyPI package. I had to make slight modifications to main.py and philter.py to make the pip version of Philter functional.