chop-dbhi / dicom-anon

Python DICOM Anonymizer
BSD 2-Clause "Simplified" License
66 stars 22 forks source link

Uncaught Exception for bad tag in dicom file. #16

Open sgithens opened 9 years ago

sgithens commented 9 years ago

While running a large batch of mammograms today, dicom-anon blew up with the following exception. It appears that the error is in the mammogram file, but I'm wondering if we shouldn't be blowing up when it occurs.

Would the appropriate action in this case be to quarantine the image? I can try to put together a patch if you think this is the appropriate behavior. (If not, what should we be doing in this case?)

Traceback (most recent call last):
  File "dicom_anon.py", line 915, in <module>
    white_list_file=white_list_file, log_file=options.log_file, rename=options.rename,profile=options.profile, overlay=options.overlay)
  File "dicom_anon.py", line 760, in driver
    ds = anonymize(ds, white_list, org_root, profile, overlay)
  File "dicom_anon.py", line 678, in anonymize
    ds.remove_private_tags()
  File "/home/sgithens/Envs/opal/lib/python2.7/site-packages/dicom/dataset.py", line 483, in remove_private_tags
    self.walk(RemoveCallback)
  File "/home/sgithens/Envs/opal/lib/python2.7/site-packages/dicom/dataset.py", line 595, in walk
    callback(self, data_element)  # self = this Dataset
  File "/usr/local/lib/python2.7/contextlib.py", line 35, in __exit__
    self.gen.throw(type, value, traceback)
  File "/home/sgithens/Envs/opal/lib/python2.7/site-packages/dicom/tagtools.py", line 21, in tag_in_exception
    raise type(e)(err)
ValueError: Invalid tag (0018, 1152): invalid literal for int() with base 10: '147.3'
cancan101 commented 9 years ago

It seems that most fields can just be blanked out. i.e. I think this exception is thrown when reading the value for the purposes of writing the audit db.

jeffmax commented 9 years ago

This appears to be thrown from a function in pydicom that is supposed to remove all non-standard tags (https://www.medicalconnections.co.uk/kb/DICOM_Private_Elements). I'll take a closer look.

jeffmax commented 9 years ago

So, interestingly, the field it is throwing the exception on is not a private data element, but the "Exposure" which has a VR of IS (Integer String). It looks like it is just an incorrect DICOM file that has placed a decimal value in an IS field. It looks like remove_prive_tags() recurses over each attribute looking for private tags. It is a shame to quarantine the file completely for such a minor issue, but taking on the task of fixing up bad DICOM files so pydicom will operate on them sounds out of scope for this script so it is probably the best we can do.

This seems to be a similar issue: https://code.google.com/p/pydicom/issues/detail?id=152

jeffmax commented 9 years ago

On the bright-side, looking into this I learned about this: https://docs.python.org/2/library/contextlib.html#contextlib.contextmanager

sgithens commented 9 years ago

@jeffmax Thanks for the diagnostic. Do you think the best thing then is to wrap the anonymize at line 760 in a try/except and then quarantine it the same as on lines 745-749?

cancan101 commented 9 years ago

What line do you think is actually throwing the exception? Because of the way the walk is called, the original call site of the exception is lost in the trace above.

sgithens commented 9 years ago

The first site in dicom_anon or in the third party libraries? I'm pretty sure the trace above has the accurate point for that being dicom_anon.py:678. I was going to wrap it at at 760 though for that anonymize call, since at that point all the arguments necessary to call the quarantine_file method are available.

cancan101 commented 9 years ago

So the issue looks like pydicom tried to parse the element even though you are just checking the is_private status:

pydicom/dataset.pyc in __getitem__(self, key)
    306                 character_set = default_encoding
    307             # Not converted from raw form read from file yet; do so now
--> 308             self[tag] = DataElement_from_raw(data_elem, character_set)
    309         return dict.__getitem__(self, tag)

One option would be to have DataElement_from_raw inject a default value: https://github.com/darcymason/pydicom/blob/9a97a7955a6bcb5015bb65e5c55e42597eb164b4/source/dicom/dataelem.py#L319-L324

sgithens commented 9 years ago

I think for this particular issue ( a bad tag in my dicom file), I probably would tend to concur with @jeffmax above, I don't really want to patch the pydicom project for this issue. I didn't run in to this until about mammogram 1500 of my 5000 I'm churning through right now, so I don't expect it to have too often, and I'm Ok with just quarantining it for now to deal with manually later. (Although if it does become more of a problem I'd likely dig further into pydicom! :p ) I'm syncing my directories for the ones that have already been processed right now, but after that am going to go ahead and wrap that ValueError and run it again. ( If everything is awesome after that will submit a pull. )

jeffmax commented 9 years ago

@cancan101 started looking at pydicom tickets related to this issue and it seems you have been involved in the discussion

https://github.com/darcymason/pydicom/pull/193 and https://github.com/darcymason/pydicom/pull/197

I guess I will leave this ticket open for now in case there is an API change that allows us to handle this better.

jeffmax commented 9 years ago

Looks like there is a good solution for addressing this in https://github.com/darcymason/pydicom/pull/198 Will keep this open until this is pulled into a release. Thanks @kevlarkevin!

sgithens commented 9 years ago

Sounds good thanks guys! If there is a fix in pydicom and we pull in a new version and you want me to test on any of my broken dcm's let me know!