Closed by howff 1 year ago
Running OCR on 10,000 images and collating the ten most common text strings gives:
487 PA
466 AP ERECT
422 AP
374 HBL
332 ERECT
286 PA ERECT
266 RED DOT
190 MOBILE
180 SUPINE
157 WEIGHT BEARING
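For anyone wanting to reproduce this kind of tally, here is a minimal sketch of the counting step. It assumes the OCR output is already available as a plain list of strings; the function name `most_common_strings` and the normalisation (upper-casing and collapsing whitespace) are illustrative assumptions, not necessarily how the counts above were produced.

```python
from collections import Counter


def most_common_strings(ocr_texts, n=10):
    """Return the n most frequent OCR strings as (text, count) pairs.

    Hypothetical helper: normalises case and whitespace so that
    'ap  erect' and 'AP ERECT' are counted together.
    """
    counts = Counter(
        " ".join(text.upper().split())  # upper-case, collapse runs of spaces
        for text in ocr_texts
        if text.strip()                 # skip empty OCR results
    )
    return counts.most_common(n)


if __name__ == "__main__":
    sample = ["PA", "pa ", "AP ERECT", "", "ap  erect", "PA"]
    for text, count in most_common_strings(sample):
        print(count, text)
```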
A more comprehensive list has been put into SharePoint and supplied to Kara to see if she can improve the current implementation of ocr_allow_list here: https://github.com/SMI/dicompixelanon/blob/main/src/dicom_redact.py#L53
Kara has implemented a whitelist creation script that generates a set of regex patterns. These have been added to the repo.
Matching against these regex patterns has been implemented in the NER class, so you simply do:
ner = NER('ocr_whitelist')
whitelisted = ner.detect('text') == []
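To illustrate the idea behind a regex-based allow list, here is a rough standalone sketch. It is not the code in the repo: the patterns are hand-written from the frequency list above rather than generated by the script, and `ALLOW_PATTERNS` and `is_allowlisted` are hypothetical names. The key assumption is that a string only counts as allow-listed if a pattern matches the whole string, so "AP SMITH" is still treated as text to redact.

```python
import re

# Hypothetical patterns, hand-written for illustration only; the real
# patterns are generated by the whitelist creation script in the repo.
ALLOW_PATTERNS = [
    re.compile(r"(AP|PA)( ERECT)?", re.IGNORECASE),
    re.compile(r"ERECT|SUPINE|MOBILE|HBL", re.IGNORECASE),
    re.compile(r"RED DOT", re.IGNORECASE),
    re.compile(r"WEIGHT BEARING", re.IGNORECASE),
]


def is_allowlisted(text, patterns=ALLOW_PATTERNS):
    """Return True if the whole OCR string matches one allow-list pattern."""
    normalised = " ".join(text.split())  # trim and collapse whitespace
    # fullmatch (not search) so a common word next to other text is not allowed
    return any(p.fullmatch(normalised) for p in patterns)


if __name__ == "__main__":
    for candidate in ["AP  erect", "Weight Bearing", "AP SMITH", "JOHN SMITH"]:
        print(repr(candidate), is_allowlisted(candidate))
```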
Don't redact common non-PII words such as patient orientation. Since dicom_redact doesn't know what text it is redacting, ideally we would use #2 to store the text in the database. The alternative is to modify dicom_ocr to ignore common words and not add them to the database.
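A rough sketch of the second option, filtering at the OCR stage so that common words never reach the database. The data shape (a list of `(rectangle, text)` tuples), the `COMMON_WORDS` set and the `drop_common_words` function are all assumptions made for illustration; dicom_ocr's actual output format and storage call may differ.

```python
# Hypothetical set of common non-PII strings, taken from the counts in this issue.
COMMON_WORDS = {"PA", "AP", "AP ERECT", "PA ERECT", "ERECT", "SUPINE"}


def drop_common_words(ocr_results, common_words=COMMON_WORDS):
    """Keep only OCR results whose text is not a known common word.

    ocr_results is assumed to be a list of (rectangle, text) tuples,
    where rectangle is (left, top, right, bottom) in pixels.
    """
    kept = []
    for rect, text in ocr_results:
        normalised = " ".join(text.upper().split())
        if normalised not in common_words:
            kept.append((rect, text))  # still needs redacting, so store it
    return kept


if __name__ == "__main__":
    results = [((0, 0, 10, 10), "AP erect"), ((5, 5, 50, 20), "J SMITH")]
    print(drop_common_words(results))
```

The trade-off is that anything dropped here is invisible to later stages, so a change to the common-words list would mean re-running OCR rather than just re-running the redaction.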
Consider the likelihood of improvements to the OCR algorithm vs additions to (or removals from) the common words whitelist.