SMI / dicompixelanon

DICOM Pixel Anonymisation

dicom_redact - implement whitelist #7

Closed howff closed 1 year ago

howff commented 1 year ago

Don't redact common non-PII words, such as patient orientation labels. Since dicom_redact doesn't know what text it is redacting, ideally we would use #2 to store the text in the database. The alternative is to modify dicom_ocr to ignore common words and not add them to the database.

Consider the likelihood of improvements to the OCR algorithm vs additions to (or removals from) the common words whitelist.
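
The second option above (filtering inside dicom_ocr before anything reaches the database) might look roughly like the sketch below. The names `ALLOWED`, `normalise` and `rects_to_redact`, and the `(x, y, w, h, text)` tuple layout, are hypothetical and are not taken from the repo.

```python
# Hypothetical sketch: drop OCR hits whose text is a known non-PII phrase.
ALLOWED = {"PA", "AP", "ERECT", "AP ERECT", "SUPINE"}  # illustrative subset only

def normalise(text):
    """Upper-case and collapse whitespace so 'ap  erect' compares equal to 'AP ERECT'."""
    return " ".join(text.upper().split())

def rects_to_redact(ocr_hits):
    """Keep only the (x, y, w, h, text) hits whose text is not in ALLOWED."""
    return [hit for hit in ocr_hits if normalise(hit[4]) not in ALLOWED]

hits = [(10, 10, 40, 12, "ap  erect"), (10, 30, 90, 12, "John Smith")]
print(rects_to_redact(hits))
```

An exact set lookup like this is cheap, but it only helps when the OCR output is clean, which is one reason to weigh it against improving the OCR itself.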

howff commented 1 year ago

Running OCR on 10,000 images and collating the ten most common strings gives:

487 PA
466 AP ERECT
422 AP
374 HBL
332 ERECT
286 PA ERECT
266 RED DOT
190 MOBILE
180 SUPINE
157 WEIGHT BEARING
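
A tally like the one above can be produced with a simple frequency count. This is only a sketch of the idea, not the command that was actually run; `most_common_strings` and the toy `sample` list are made up for illustration.

```python
from collections import Counter

def most_common_strings(ocr_texts, n=10):
    """Return the n most frequent OCR strings after trimming and upper-casing."""
    counts = Counter(t.strip().upper() for t in ocr_texts if t.strip())
    return counts.most_common(n)

# Toy input standing in for the OCR output of many images.
sample = ["PA", "pa ", "AP ERECT", "", "PA", "AP ERECT", "SUPINE"]
for text, count in most_common_strings(sample, n=3):
    print(count, text)
```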

howff commented 1 year ago

A more comprehensive list has been put into SharePoint and supplied to Kara, to see whether she can improve the current implementation of ocr_allow_list here: https://github.com/SMI/dicompixelanon/blob/main/src/dicom_redact.py#L53

howff commented 1 year ago

Kara has implemented a whitelist creation script that generates a set of regex patterns. These have been added to the repo.
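
The script itself is not shown in this thread. As a rough illustration of the general idea only, a list of phrases could be turned into regex patterns along these lines; `phrase_to_pattern` and `is_whitelisted` are hypothetical names, and the real script may work quite differently.

```python
import re

def phrase_to_pattern(phrase):
    """Compile a case-insensitive regex matching the phrase with flexible spacing."""
    words = [re.escape(w) for w in phrase.split()]
    return re.compile(r"\s+".join(words), re.IGNORECASE)

def is_whitelisted(text, patterns):
    """True if the whole (trimmed) text matches any one pattern."""
    return any(p.fullmatch(text.strip()) for p in patterns)

patterns = [phrase_to_pattern(p) for p in ["AP ERECT", "RED DOT", "WEIGHT BEARING"]]
print(is_whitelisted("ap  erect", patterns))         # whole string is a known phrase
print(is_whitelisted("AP ERECT J SMITH", patterns))  # extra text, so not whitelisted
```

Requiring a full match rather than a substring match keeps a string that merely contains a whitelisted phrase from escaping redaction.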

Matching against these regex patterns has been implemented in the NER class, so you can simply do:

ner = NER('ocr_whitelist')
# an empty result from detect() means the text is whitelisted
whitelisted = ner.detect('text') == []