
DICOM Pixel Anonymisation

dicompixelanon

Anonymisation of text burned into the pixels of DICOM images. This software has been used on the complete imaging archive of a whole national population, across a variety of modalities (CT, MR, CR, DX, etc.), and has proven highly effective.

This repo contains a full suite of software for

It also contains software which can be used to create dummy or synthetic DICOM files based on originals, changing only the content of the image frames, not the metadata.

What it does not do: anonymise the metadata in the DICOM tags; that is best left to other tools (for example, CTP).

Contents:

Utilities:

Usage

Environment variables

Setup

Update

```
git pull
cp data/ocr_allowlist_regex.txt $SMI_ROOT/data/dicompixelanon/
cp data/deid.dicom.smi $SMI_ROOT/data/deid/
cd src/library
python3 ./setup.py bdist_wheel
pip install $(ls dist/*whl|tail -1)
```

Run

Now you can run the applications:

See below for a suggested workflow.

Sample data

Some sample data is provided as part of the GDCM repo:

Useful sample files:

Requirements

Before installing these requirements please read the Installation Notes below.

Python requirements

Optional Python requirements

OS packages

Installation notes

pytorch

Before installing the requirements from requirements.txt you must install the CPU version of PyTorch if you don't have a GPU available:

```
pip3 install torch torchvision --extra-index-url https://download.pytorch.org/whl/cpu
```

pydicom

pydicom needs some additional packages to be installed. To handle compressed images you need to install pylibjpeg and pylibjpeg_libjpeg. See the tables in the pydicom documentation: https://pydicom.github.io/pydicom/stable/old/image_data_handlers.html#supported-transfer-syntaxes

pytesseract

PyTesseract must be pinned to version 0.3.8 if you are stuck with Python 3.6 (as found in CentOS-7). See also tesseract below.

Stanford NER

Stanford NER (the original CoreNLP, not Stanza) requires Java 1.8. It can be made to work with Java 9 and Java 10 but will not work with Java 11 because a dependency has been removed from the JRE.

easyocr

The easyocr model hub is https://www.jaided.ai/easyocr/modelhub/. Download the English recognition model from https://github.com/JaidedAI/EasyOCR/releases/download/v1.3/english_g2.zip and the text detection model from https://github.com/JaidedAI/EasyOCR/releases/download/pre-v1.1.6/craft_mlt_25k.zip. Unpack the zip files and copy the .pth files into $SMI_ROOT/data/easyocr
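The unpack-and-copy step might look like this (a sketch assuming the two zips have already been downloaded to the current directory; SMI_ROOT defaults to /tmp/smi here purely for illustration):

```shell
SMI_ROOT="${SMI_ROOT:-/tmp/smi}"
mkdir -p "$SMI_ROOT/data/easyocr"
# Assumption: each zip contains its .pth model file at the top level
for z in english_g2.zip craft_mlt_25k.zip; do
    if [ -f "$z" ]; then
        unzip -o "$z" -d "$SMI_ROOT/data/easyocr"
    fi
done
ls "$SMI_ROOT/data/easyocr"
```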

spacy

You might need to specify a version when installing spacy, because the most recent version on PyPI (a dev version of 4.0.0) does not yet have the language models available. For example: pip install spacy==3.6.0

Inside your virtual environment, run: python -m spacy download en_core_web_trf

tesseract

Download the file eng.traineddata from https://github.com/tesseract-ocr/tessdata/raw/main/eng.traineddata and copy it to $SMI_ROOT/data/tessdata.

flair

Download the file pytorch_model.bin from https://huggingface.co/flair/ner-english and copy it to $SMI_ROOT/data/flair/models/ner-english/. Then create a symlink to it named 4f4cdab26f24cb98b732b389e6cebc646c36f54cfd6e0b7d3b90b25656e4262f and/or one named 4f4cdab26f24cb98b732b389e6cebc646c36f54cfd6e0b7d3b90b25656e4262f.8baa8ae8795f4df80b28e7f7b61d788ecbb057d1dc85aacb316f1bd02837a4a4
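For example (a sketch; SMI_ROOT defaults to /tmp/smi purely for illustration, and pytorch_model.bin is assumed to have been downloaded already):

```shell
SMI_ROOT="${SMI_ROOT:-/tmp/smi}"
DEST="$SMI_ROOT/data/flair/models/ner-english"
mkdir -p "$DEST"
# cp pytorch_model.bin "$DEST/"   # the model downloaded from huggingface
cd "$DEST"
# The hash-named symlinks point at the real model file:
ln -sf pytorch_model.bin 4f4cdab26f24cb98b732b389e6cebc646c36f54cfd6e0b7d3b90b25656e4262f
ln -sf pytorch_model.bin 4f4cdab26f24cb98b732b389e6cebc646c36f54cfd6e0b7d3b90b25656e4262f.8baa8ae8795f4df80b28e7f7b61d788ecbb057d1dc85aacb316f1bd02837a4a4
```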

stanford

Download the repo https://github.com/philipperemy/Stanford-NER-Python and run init.sh to unpack the zip into the stanford-ner directory. Copy the contents of the stanford-ner directory into $SMI_ROOT/data/stanford_ner/. Note that this includes the CoreNLP Java software, which needs Java 1.8 (possibly also 9 and 10, but it is not compatible with Java 11).

stanza

Download the models from https://huggingface.co/stanfordnlp/stanza-en/resolve/v1.4.1/models/default.zip and unpack default.zip into $SMI_ROOT/data/stanza/en/
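The unpack step, sketched (SMI_ROOT defaults to /tmp/smi purely for illustration; default.zip is assumed to have been downloaded to the current directory):

```shell
SMI_ROOT="${SMI_ROOT:-/tmp/smi}"
mkdir -p "$SMI_ROOT/data/stanza/en"
if [ -f default.zip ]; then
    unzip -o default.zip -d "$SMI_ROOT/data/stanza/en"
fi
```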

Windows installation

Notes:

Create the virtual environment (venv) using your preferred version of Python, for example use one of these:

```
python -m venv c:\tmp\venv
"C:\Program Files\Python310\python.exe" -m venv c:\tmp\venv
C:\Users\Guneet\AppData\Local\Programs\Python\Python310\python.exe -m venv c:\tmp\venv
c:\tmp\venv\Scripts\activate.bat
pip install torch torchvision --extra-index-url https://download.pytorch.org/whl/cpu
pip install --prefer-binary pydicom pydal easyocr numpy Pillow spacy flair pylibjpeg pylibjpeg_libjpeg --only-binary=sentencepiece
python -m spacy download en_core_web_trf
cd c:\tmp
git clone https://github.com/SMI/SmiServices
git clone https://github.com/SMI/StructuredReports
git clone https://github.com/SMI/dicompixelanon
pip install --prefer-binary -r c:\tmp\StructuredReports\src\library\requirements.txt
pip install --prefer-binary --no-binary=deid -r c:\tmp\dicompixelanon\src\library\requirements.txt
cd c:\tmp\StructuredReports\src\library
python .\setup.py install
cd c:\tmp\dicompixelanon\src\library
python .\setup.py install
cd c:\tmp\dicompixelanon\src\applications
set SMI_ROOT=c:\tmp\SmiServices
python dcmaudit.py -i C:\tmp\SmiServices\tests\common\Smi.Common.Tests\TestData\*.dcm
```

Workflow

A suggested workflow for producing rules to anonymise a consistent set of DICOM files:

A suggested workflow for testing OCR on a whole Modality: