This repo contains a full suite of software for anonymisation of text burned into the pixels of DICOM images. The software has been used on the complete archive of a whole national population, across a variety of modalities (CT, MR, CR, DX, etc.), and has proven highly effective.

It also contains software which can be used to create dummy or synthetic DICOM files based on originals, changing only the content of the image frames, not the metadata.

What it does not do: anonymise the metadata in the DICOM tags; this is best left to other tools (see CTP, for example).
Contents:

Utilities:

dcmaudit.py - interactive GUI to mark rectangles for redaction in DICOM image frames and overlays
dicom_redact_db.py - redact every file in the database which has rectangles
extract_all.py - extract as JSON every document from every image modality in MongoDB
extract_BIA.py - extract all the DICOM tags relevant to annotations, overlays and frames from every document from every image modality in MongoDB
csv_groupby_filter.py - group CSV rows and output a selection from each group
summary.py - report a count of the unique values in each column of the CSV
summary_overlay.py - print the overlay-related columns from the CSV
random_combinations.sh - run random_combinations.py for every image modality CSV file
random_combinations.py - read a CSV file and output a randomly-selected set of lines for each of every combination of values in a given set of columns
random_combinations_files.py - convert the output from random_combinations.py into a set of filenames
ocr_files_parallel.sh - run two OCR processes on the output of random_combinations.sh
pydicom_images.py - extract all the image frames, overlays and overlay frames in PNG format from a DICOM file; optionally run them through OCR to get text, and optionally run that through NER to find PII
dbrects.sh - display the rectangles in the database (simple sqlite3 wrapper)
dbtext.sh - display the OCR text in the database (simple sqlite3 wrapper)
dbtags.sh - display the table of files marked as Done in the database (simple sqlite3 wrapper)
dbtagged.sh - display the filenames marked as Done in the database (simple sqlite3 wrapper)
dbtext_for_tagged.sh - display OCR details of files marked as Done
dbrects_for_tagged.sh - display rectangles of files marked as Done
dbrects_to_deid_rules.py - convert rectangles from files marked as Done into deid rules
dicomls.py - simply list all DICOM tags and values from a file
dicom_pixel_anon.sh - anonymise a DICOM file by running OCR and redacting all rectangles
build_allowlist.py - create a list of regex rules for allowlisting OCR output and write it to a file; optionally reduce the number of rules by 20 percent (leading to more redactions of non-PII data, but a significantly shorter runtime)
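For example, to inspect a batch of files in the dcmaudit GUI (the -i option, taking one or more DICOM paths, is shown in the Windows example later in this document; the path here is illustrative):

python3 dcmaudit.py -i /path/to/dicoms/*.dcm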
Environment variables:

$SMI_ROOT - this will be used to find data and configuration files
$PACS_ROOT - this will be used to find DICOM files (e.g. if a path to a DICOM file is relative and the file cannot be found, then PACS_ROOT will be prepended)
export HF_HUB_OFFLINE=1 if using flair inside a safe haven without internet access, to prevent it from trying to download models from huggingface (and crashing when it can't connect)
PYTHONPATH=../../library/ if you want to try any of the applications from their directory without building and installing the library
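A typical environment setup might look like this (the directory paths are illustrative, not prescribed):

# where data and configuration files live
export SMI_ROOT=/opt/smi
# prepended to relative DICOM paths which cannot otherwise be found
export PACS_ROOT=/pacs
# only needed when using flair inside a safe haven without internet access
export HF_HUB_OFFLINE=1
# only needed to run the applications without installing the library
export PYTHONPATH=../../library/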
Data files go in $SMI_ROOT/data (you can set $SMI_ROOT anywhere):

copy data/ocr_allowlist_regex.txt into $SMI_ROOT/data/dicompixelanon/ocr_allowlist_regex.txt if required for dicom_redact
copy data/deid.dicom.smi into $SMI_ROOT/data/deid/deid.dicom.smi
copy scannedforms_model.pth into $SMI_ROOT/data/dicompixelanon
To update an existing installation and rebuild the library (the wheel is built in the src/library directory):

git pull
cp data/ocr_allowlist_regex.txt $SMI_ROOT/data/dicompixelanon/
cp data/deid.dicom.smi $SMI_ROOT/data/deid/
cd src/library
python3 ./setup.py bdist_wheel
pip install $(ls dist/*whl|tail -1)
Now you can run the applications:
See below for a suggested workflow.
Some sample data is provided as part of the GDCM repo:
Useful sample files:
gdcm-US-ALOKA-16.dcm - has Sequence of Ultrasound Regions (3) plus text within the image regions
US-GE-4AICL142.dcm - has SequenceOfUltrasoundRegions
CT_OSIRIX_OddOverlay.dcm - has 1 overlay
XA_GE_JPEG_02_with_Overlays.dcm - has 8 overlays in high bits
PHILIPS_Brilliance_ExtraBytesInOverlay.dcm - has 1 overlay
MR-SIEMENS-DICOM-WithOverlays.dcm - has separate overlays
GE_DLX-8-MONO2-Multiframe.dcm - has multiple frames

Before installing these requirements please read the Installation Notes below.
Python requirements - see requirements.txt (the database of rectangles and OCR text is stored in sqlite format).

Optional Python requirements - see requirements.txt.

OS packages - see the Installation Notes below.
Before installing the requirements from requirements.txt, you must install the CPU version of PyTorch if you don't have a GPU available:
pip3 install torch torchvision --extra-index-url https://download.pytorch.org/whl/cpu
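To confirm which build was installed, a quick check (on a CPU-only build this should print the version followed by False for CUDA availability):

python3 -c "import torch; print(torch.__version__, torch.cuda.is_available())"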
pydicom has some additional packages which need to be installed. To handle compressed images you need to install pylibjpeg and pylibjpeg_libjpeg. See the tables in the pydicom documentation: https://pydicom.github.io/pydicom/stable/old/image_data_handlers.html#supported-transfer-syntaxes
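For example, assuming a standard pip setup (pip also accepts the hyphenated spelling of the package name):

pip3 install pylibjpeg pylibjpeg-libjpeg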
PyTesseract must be pinned to version 0.3.8 if you are stuck with Python 3.6 (as found in CentOS-7). See also tesseract below.
Stanford NER (the original CoreNLP, not Stanza) requires Java 1.8. It can be made to work with Java 9 and Java 10 but will not work with Java 11 because a dependency has been removed from the JRE.
The easyocr model hub is https://www.jaided.ai/easyocr/modelhub/. Download the English model from https://github.com/JaidedAI/EasyOCR/releases/download/v1.3/english_g2.zip and the text detection model from https://github.com/JaidedAI/EasyOCR/releases/download/pre-v1.1.6/craft_mlt_25k.zip. Unpack the zip files and copy the .pth files into $SMI_ROOT/data/easyocr
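One possible way to fetch and install the models from the command line (assuming wget and unzip are available, and that the zips unpack to .pth files in the current directory):

# download the English recognition model and the text detection model
wget https://github.com/JaidedAI/EasyOCR/releases/download/v1.3/english_g2.zip
wget https://github.com/JaidedAI/EasyOCR/releases/download/pre-v1.1.6/craft_mlt_25k.zip
# unpack and copy the .pth files into place
unzip english_g2.zip && unzip craft_mlt_25k.zip
mkdir -p $SMI_ROOT/data/easyocr
cp *.pth $SMI_ROOT/data/easyocr/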
You might need to specify a version when installing spacy, because the most recent version on PyPI (a dev version of 4.0.0) does not have the language models available yet; for example, pip install spacy==3.6.0
Inside your virtual environment run python -m spacy download en_core_web_trf
Download the file eng.traineddata from https://github.com/tesseract-ocr/tessdata/raw/main/eng.traineddata and copy it to $SMI_ROOT/data/tessdata
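For example (wget's -P option sets the download directory):

mkdir -p $SMI_ROOT/data/tessdata
wget -P $SMI_ROOT/data/tessdata https://github.com/tesseract-ocr/tessdata/raw/main/eng.traineddata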
Download the file pytorch_model.bin from https://huggingface.co/flair/ner-english, copy it to $SMI_ROOT/data/flair/models/ner-english/, and make a symlink from 4f4cdab26f24cb98b732b389e6cebc646c36f54cfd6e0b7d3b90b25656e4262f and/or from 4f4cdab26f24cb98b732b389e6cebc646c36f54cfd6e0b7d3b90b25656e4262f.8baa8ae8795f4df80b28e7f7b61d788ecbb057d1dc85aacb316f1bd02837a4a4
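A sketch of those steps as shell commands (assuming pytorch_model.bin has already been downloaded to the current directory; the hash-named symlinks are the cache entries flair appears to look up):

mkdir -p $SMI_ROOT/data/flair/models/ner-english
cp pytorch_model.bin $SMI_ROOT/data/flair/models/ner-english/
cd $SMI_ROOT/data/flair/models/ner-english
ln -s pytorch_model.bin 4f4cdab26f24cb98b732b389e6cebc646c36f54cfd6e0b7d3b90b25656e4262f
ln -s pytorch_model.bin 4f4cdab26f24cb98b732b389e6cebc646c36f54cfd6e0b7d3b90b25656e4262f.8baa8ae8795f4df80b28e7f7b61d788ecbb057d1dc85aacb316f1bd02837a4a4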
Download the repo https://github.com/philipperemy/Stanford-NER-Python and run init.sh to unpack the zip into the stanford-ner directory. Copy the contents of the stanford-ner directory into $SMI_ROOT/data/stanford_ner/
Note that this includes the CoreNLP Java software, which needs Java 1.8 (possibly also 9 and 10, but it is not compatible with Java 11).
Download the models from https://huggingface.co/stanfordnlp/stanza-en/resolve/v1.4.1/models/default.zip and unpack default.zip into $SMI_ROOT/data/stanza/en/
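For example (unzip's -d option sets the extraction directory):

mkdir -p $SMI_ROOT/data/stanza/en
unzip default.zip -d $SMI_ROOT/data/stanza/en/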
Notes:

If a package fails to build from source you may need pip install scikit-build.
Pin spacy with pip install spacy==3.6.0 (see above).
Use the --prefer-binary option (or --only-binary :all:) to prefer pre-built wheels.
If matplotlib causes trouble, run pip install --prefer-binary matplotlib; this is caused by an old binary version of deid asking for an old version of matplotlib.

Windows installation, using dicompixelanon\src\library\requirements.txt:
Create the virtual environment (venv) using your preferred version of Python, for example use one of these:
python -m venv c:\tmp\venv
C:\Program Files\Python310\python.exe -m venv c:\tmp\venv
C:\Users\Guneet\AppData\Local\Programs\Python\Python310\python.exe -m venv c:\tmp\venv
c:\tmp\venv\Scripts\activate.bat
pip install torch torchvision --extra-index-url https://download.pytorch.org/whl/cpu
pip install --prefer-binary pydicom pydal easyocr numpy Pillow spacy flair pylibjpeg pylibjpeg_libjpeg --only-binary=sentencepiece
python -m spacy download en_core_web_trf
cd c:\tmp
git clone https://github.com/SMI/SmiServices
git clone https://github.com/SMI/StructuredReports
git clone https://github.com/SMI/dicompixelanon
pip install --prefer-binary -r c:\tmp\StructuredReports\src\library\requirements.txt
pip install --prefer-binary --no-binary=deid -r c:\tmp\dicompixelanon\src\library\requirements.txt
cd c:\tmp\StructuredReports\src\library
python .\setup.py install
cd c:\tmp\dicompixelanon\src\library
python .\setup.py install
cd c:\tmp\dicompixelanon\src\applications
set SMI_ROOT=c:\tmp\SmiServices
python dcmaudit.py -i C:\tmp\SmiServices\tests\common\Smi.Common.Tests\TestData\*.dcm
A suggested workflow for producing rules to anonymise a consistent set of DICOM files (sketched as commands below):

Use dcmaudit.py to redact the PII in one of the images; the rectangles will be saved in the database.
Run dbrects_to_deid_rules.py to create deid rules which will automatically redact all DICOM files which match the Manufacturer etc. rules.
Redact with dicom_redact.py; once the rules exist you won't need the database.
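A sketch of that workflow as commands (only the -i option of dcmaudit.py is documented in this section; the other two tools are shown without their arguments, which are not documented here):

# 1. mark rectangles interactively; they are saved in the database
python3 dcmaudit.py -i representative.dcm
# 2. convert the rectangles from files marked as Done into deid rules
python3 dbrects_to_deid_rules.py
# 3. redact matching files using the rules alone; the database is no longer needed
python3 dicom_redact.py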
A suggested workflow for testing OCR on a whole Modality: