BCHSI / philter-ucsf

Open source clinical text de-identification
BSD 3-Clause "New" or "Revised" License

Is it possible to use Philter with non-English language text? #12

Open ewartj opened 2 years ago

ewartj commented 2 years ago

Hi, thanks for releasing this software. I was wondering whether there is any way to enable Philter to process non-English text?

I had a quick try with the default settings (python main.py -i ./data/i2b2_notes/ -o ./data/i2b2_results/ -f ./configs/philter_delta.json --prod=True --outputformat "asterisk"), and it seems to anonymise nearly everything by default. For example:

This:

"pitkävuoroHengitys :alkuun vm <NAME> <NHS_NUMB> 4890262253 <NHS_NUMB> Margot <NAME> 40% , co2 nousee , vaihdettu 28% <NI_NUMB> <ADDRESS> 0487 Hull Village Suite 759, New Donald <ADDRESS>, <POSTCODE> EX13 5LY <POSTCODE> KK218196A <NI_NUMB> , jolla saturaatio laskee ad 84 ja co2 edelleen nousee , viikset , joilla saturoituu 90-91.<NAME> <ADDRESS> 94892 Garcia Cliffs, Thomasville <ADDRESS>, <POSTCODE> PO41 0SD <POSTCODE> <NHS_NUMB> <NI_NUMB> CJ389083D <NI_NUMB> 4890262253 <NHS_NUMB> Ibrahim <NAME> Hengitys pinnallista ja krohisevaa.

Became:

****ä************* :****** <NHS_NUMB> ********** <NHS_NUMB> <ADDRESS> **** **** ******* ***** ***, New ****** <ADDRESS>, <POSTCODE> *** 6BP <POSTCODE> vm 40% , co2 ****** , <NAME> ******** <NAME> ********* 28% , ***** <**_NUMB> ********* <**_NUMB> ********** ****** ad ** ** *** ******** ****** , ******* , ****** ********** 90-91.<ADDRESS> ***** ****** ******, Thomasville <ADDRESS>, <POSTCODE> *** 7BE <POSTCODE> <**_NUMB> ********* <**_NUMB> <NHS_NUMB> ********** <NHS_NUMB> ******** <NAME> **** <NAME> *********** ** **********.

Is there a way of modifying this so that only the regex patterns are anonymised?

RedChrists commented 2 years ago

Philter is essentially a whitelist approach: everything unknown is redacted by default. You would need to translate (or re-create) everything in the config file and its patterns for the non-English language. Doable, but it is a lot of work and would of course need testing.
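
To make the difference concrete, here is a minimal sketch in plain Python of the two behaviours being discussed: whitelist-style redaction (mask every word that is not known to be safe) versus the regex-only redaction asked about above. This is not Philter's code or API. The names SAFE_WORDS, PHI_PATTERNS, whitelist_redact and regex_only_redact are hypothetical, and the two patterns are assumed shapes for a 10-digit NHS number and a National Insurance number, not the regexes shipped in Philter's configs.

```python
import re

# Hypothetical, tiny English whitelist for illustration only.
# Philter's real configuration is much larger and is not reproduced here.
SAFE_WORDS = {"breathing", "is", "shallow", "and", "rattling"}

# Assumed identifier shapes, written for this example only.
PHI_PATTERNS = [
    re.compile(r"\b\d{10}\b"),              # 10-digit number (NHS-number shape)
    re.compile(r"\b[A-Z]{2}\d{6}[A-D]\b"),  # National Insurance number shape
]


def _mask(text):
    """Replace a string with asterisks of the same length."""
    return "*" * len(text)


def whitelist_redact(text, safe_words=SAFE_WORDS):
    """Mask every alphabetic word that is NOT on the whitelist."""
    def repl(match):
        word = match.group(0)
        return word if word.lower() in safe_words else _mask(word)
    # [^\W\d_]+ matches runs of Unicode letters, so "ä" counts as a letter.
    return re.sub(r"[^\W\d_]+", repl, text)


def regex_only_redact(text, patterns=PHI_PATTERNS):
    """Mask only the spans matched by the identifier patterns."""
    for pattern in patterns:
        text = pattern.sub(lambda m: _mask(m.group(0)), text)
    return text


if __name__ == "__main__":
    finnish = "Hengitys 4890262253 KK218196A pinnallista ja krohisevaa."
    print(whitelist_redact(finnish))   # Finnish words are unknown, so they are masked
    print(regex_only_redact(finnish))  # only the two identifiers are masked
```

With an English whitelist, none of the Finnish words are recognised, so the first function masks all of them, which matches the behaviour reported above. The second function leaves the clinical text readable but will miss any identifier that no pattern describes (names, addresses), which is the trade-off of a regex-only setup.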