Tooling to determine original dcm file from anonymised path

SMI / SmiServices

Scale-able loading, linking and anonymisation of DICOM images for healthcare research environments (e.g. Safe Havens)

GNU General Public License v3.0


Tooling to determine original dcm file from anonymised path #1280

Open rkm opened 2 years ago

rkm commented 2 years ago

When investigating issues with an anonymised file in an extraction, it is often useful to review the original file for comparison. This is currently difficult to do as there is no direct link from the anonymised file back to the source file.

A tool, or a new application in the smi binary, could achieve this by looking up the original path:

Either in the CohortPackager database for the extraction, or
in the metadata database

tznind commented 2 years ago

I think the metadata database would be most powerful. That way it could support identifiable UID or anonymous UID and it wouldn't have to rely on an image having been extracted to be able to look it up.

That would also let us answer other questions, like 'for this image in the SR NLP db / MongoDB, is it in the relational database too, or not?'

tznind commented 2 years ago

Nothing stopping it drawing info from both though.

howff commented 1 year ago

At the moment I've just got a big text file of filenames which I grep ;-)
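For anyone wanting the same thing without a shell, here is a rough Python sketch of that grep. It assumes a listing with one original path per line, and that the anonymised name is the original name with an -an.dcm suffix; the function name and the example paths are made up for illustration.

```python
from pathlib import PurePosixPath
from typing import Iterable, Iterator

ANON_SUFFIX = "-an.dcm"  # assumption: suffix carried by anonymised files


def grep_listing(lines: Iterable[str], anon_path: str) -> Iterator[str]:
    """Yield listed paths whose file name contains the anonymised file's stem."""
    name = PurePosixPath(anon_path).name
    if not name.endswith(ANON_SUFFIX):
        raise ValueError(f"not an anonymised file name: {name!r}")
    needle = name[: -len(ANON_SUFFIX)]
    for line in lines:
        path = line.rstrip("\n")
        # Substring match, like a fixed-string grep, so treat hits as candidates.
        if needle in PurePosixPath(path).name:
            yield path
```

Passing an open file object as `lines` streams the listing rather than loading it all, but it is still a linear scan on every lookup.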

Another method might be to see if MongoDB can give you a list of keys in the index (by quickly reading the index rather than slowly reading the database), which you could then grep. If it only stores hashes then this won't work.

Another method might be to see if MongoDB can create a computed index. You could create a new index called FileName, computed from Basename(dicomFilePath). Postgres supports computed indexes, so maybe MongoDB does too. Then you could replace the -an.dcm in the anonymised filename and look up the result in the computed index.
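To make the idea concrete, here is a minimal in-memory sketch of what such a computed index would hold. It uses a plain Python dict in place of a real MongoDB index, and it assumes the original file name is the anonymised name with -an.dcm swapped for .dcm. Both function names are hypothetical.

```python
from collections import defaultdict
from pathlib import PurePosixPath
from typing import Dict, Iterable, List


def original_file_name(anon_path: str,
                       anon_suffix: str = "-an.dcm",
                       orig_suffix: str = ".dcm") -> str:
    """Derive the presumed original file name from an anonymised path."""
    name = PurePosixPath(anon_path).name
    if not name.endswith(anon_suffix):
        raise ValueError(f"not an anonymised file name: {name!r}")
    return name[: -len(anon_suffix)] + orig_suffix


def build_file_name_index(dicom_file_paths: Iterable[str]) -> Dict[str, List[str]]:
    """Map Basename(dicomFilePath) to every full path that has that basename."""
    index: Dict[str, List[str]] = defaultdict(list)
    for path in dicom_file_paths:
        index[PurePosixPath(path).name].append(path)
    return index
```

A lookup is then `index.get(original_file_name(anon_path), [])`. The values are lists because nothing here guarantees that basenames are unique across directories.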

Or have I completely misunderstood what you mean by "metadata database"? Were you referring to one of the MySQL or SQL Server databases?

howff commented 1 year ago

Unless I'm mistaken, the anonymised path ends with the SOPInstanceUID plus -an.dcm, so adding a MongoDB index on SOPInstanceUID would help immensely. Could we also add the study and series IDs?
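If that naming holds, the lookup key can be taken straight from the path. Below is a small sketch that extracts the UID and builds a filter document for it. The -an.dcm suffix and the top-level SOPInstanceUID field name are assumptions about the extraction output and the MongoDB documents, not something confirmed here.

```python
import re
from pathlib import PurePosixPath
from typing import Dict

ANON_SUFFIX = "-an.dcm"  # assumption: suffix carried by anonymised files
# Loose DICOM UID check: dot-separated digit groups, at most 64 characters.
_UID_RE = re.compile(r"^[0-9]+(\.[0-9]+)*$")


def sop_instance_uid_from_anon_path(anon_path: str) -> str:
    """Return the UID embedded in an anonymised file name."""
    name = PurePosixPath(anon_path).name
    if not name.endswith(ANON_SUFFIX):
        raise ValueError(f"not an anonymised file name: {name!r}")
    uid = name[: -len(ANON_SUFFIX)]
    if len(uid) > 64 or not _UID_RE.match(uid):
        raise ValueError(f"does not look like a DICOM UID: {uid!r}")
    return uid


def metadata_filter(anon_path: str, field: str = "SOPInstanceUID") -> Dict[str, str]:
    """Build a filter document; the field name is an assumption about the schema."""
    return {field: sop_instance_uid_from_anon_path(anon_path)}
```

With pymongo the result could be passed to `collection.find_one(...)`, and the index itself created with `collection.create_index("SOPInstanceUID")`, assuming the field really is stored under that name.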