ImagingDataCommons / highdicom

High-level DICOM abstractions for the Python programming language
https://highdicom.readthedocs.io
MIT License

Errors in input data and legacy conversion #17

Open fedorov opened 4 years ago

fedorov commented 4 years ago

While evaluating legacy SOP classes, @afshinmessiah and I ran into the (not unexpected) issues related to invalid input data. At least in some cases, these errors are rather trivial, such as a mismatch of VR between SH and CS.
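For instance, a minimal sketch (not part of highdicom) of how such an SH-vs-CS mismatch could be repaired in memory with pydicom before conversion; the tag and file name used here are only illustrations:

```python
# Hypothetical example: re-declare an element that was encoded with VR SH
# although the standard defines it as CS (e.g., Body Part Examined).
from pydicom import dcmread

ds = dcmread("legacy_instance.dcm")  # placeholder file name

tag = 0x00180015  # Body Part Examined, defined as CS in the standard
if tag in ds and ds[tag].VR != "CS":
    ds[tag].VR = "CS"  # keep the value, correct only the VR
```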

This raised the discussion with @dclunie below. Even after this discussion, I personally think it would make more sense, and would be more practical, to try to fix issues as they come up in the process of legacy conversion:

  • this will be easier to understand and use for users trying to use legacy conversion functionality
  • it may be easier to correct only errors that are important for legacy conversion specifically, rather than develop a tool that tries to fix all errors (e.g., not all of the attributes will be used for legacy conversion)
  • if we first patch the input datasets, and then do the conversion, we would need to reference the ephemeral patched datasets, if we want to do things right

@hackermd did you think about this?


From: David Clunie
Date: Thu, Feb 20, 2020 at 10:33 AM
Subject: Re: MF conversion and source dataset errors
To: Andrey Fedorov
Cc: Akbarzadeh, Afshin; Steve Pieper

Hi Andrey

I also copied Steve.

In short, probably option 2 (patch the source dataset, and then do the conversion).

It depends a lot on exactly what the "errors" are, and what you would do with any intermediate files.

E.g., if there is an error where a value is invalid for the VR (e.g., a bad character or too long), and the data element is being copied (either into the top-level data set of the new object or into a functional group, e.g., per-frame unassigned), then the choice is to "fix" it (remove the bad character, truncate the too-long string) before copying.

Alternatively, if it is an optional attribute, one could just drop it (not copy it); but that risks losing something useful.
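To make the two options above concrete, here is a minimal, hedged sketch using pydicom; the attribute choices, file name, and the 64-character limit for VR LO are illustrative, not taken from an actual conversion:

```python
from pydicom import dcmread

ds = dcmread("legacy_instance.dcm")  # placeholder file name

# Option A: "fix" the value before it is copied into the new object.
if "StudyDescription" in ds and len(ds.StudyDescription or "") > 64:
    ds.StudyDescription = ds.StudyDescription[:64]  # truncate a too-long LO value
if "Modality" in ds:
    ds.Modality = ds.Modality.replace("\x00", "").strip()  # drop bad characters

# Option B: drop an optional (Type 3) attribute that cannot be repaired,
# accepting the risk of losing something useful.
if "ImageComments" in ds:
    del ds.ImageComments
```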

I don't always bother fixing these when converting in bulk, and just propagate the errors, since trying to find and fix each special case may be more work than I can justify.

But if one can make a fix, it would be nice to.

There is also an update to the standard that allows saving the original bad values; see CP 1766 and Nonconforming Modified Attributes Sequence:

  ftp://medical.nema.org/medical/dicom/final/cp1766_ft_ExtendOriginalAttributesSequence.pdf

  http://dicom.nema.org/medical/dicom/current/output/chtml/part03/sect_C.12.html#sect_C.12.1.1.9.2
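A rough sketch, with pydicom, of how a corrected value could be preserved in the Original Attributes Sequence; when the previous value cannot be encoded conformantly with its standard VR, the Nonconforming Modified Attributes Sequence introduced by CP 1766 would be used instead (keyword support depends on the pydicom data dictionary version). The attribute, system name, and reason code below are illustrative:

```python
from datetime import datetime

from pydicom import dcmread
from pydicom.dataset import Dataset
from pydicom.sequence import Sequence

ds = dcmread("legacy_instance.dcm")  # placeholder file name

# Preserve the element as it was before the fix.
previous = Dataset()
previous.BodyPartExamined = ds.BodyPartExamined

item = Dataset()
item.ModifiedAttributesSequence = Sequence([previous])
item.AttributeModificationDateTime = datetime.now().strftime("%Y%m%d%H%M%S")
item.ModifyingSystem = "legacy conversion pipeline"  # placeholder system name
item.SourceOfPreviousValues = ""
item.ReasonForTheAttributeModification = "CORRECT"
ds.OriginalAttributesSequence = Sequence([item])

# Apply the actual fix to the main data set.
ds.BodyPartExamined = ds.BodyPartExamined.strip().upper()
```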

In terms of "when" to do the fix, if you are going to fix things, I have done it both ways (and sometimes another way, which is to propagate the errors into the multi-frame, and then fix them up in yet another separate final cleanup step).

I assume that when you say "patch the source dataset", you mean fix a temporary copy, not "return a fixed copy to TCIA to replace their bad stuff".

In which case, either approach (or a combination of both) seems fine to me, since any intermediate files don't need to be persisted.

In the past, when creating "true" enhanced MF samples for CT and MR (for the NEMA sponsored project), I actually used my "antodc" tool in dicom3tools to "fix" and "enhance" the single frame objects, by converting stuff from private attributes to standard attributes (even if they weren't in the single frame IOD), and then handled the "merging" into multi-frame (and factoring out of shared stuff) in dcmulti.

This worked well because I separated most of the modality-specific knowledge from the generic single to multi-frame conversion, as well as providing a useful tool for making single frame images "better", when I didn't need to make multi-frame ones.

This was long before I added the MultiframeImageFactory to the PixelMed toolkit, and I have propagated very little if any of the modality-specific stuff to that tool so far.

When/if I revisit the question of trying to create modality-specific legacy converted or true enhanced multi-frame images in PixelMed, I will very likely use the two-step process of first fixing the single frame headers, and then merging them into a multi-frame, since I find that division of labor more elegant.

It would also allow me to provide other input sources (e.g., NIfTI files with a BIDS metadata file) to feed the same DICOM enhanced multi-frame pipeline. Though I have to admit that I usually do that sort of thing with separate classes with methods applied successively, rather than separate, distinct command line utilities.

BTW. For referential integrity updates (e.g., fixing all SRs or SEGs that reference single frame images to reference the new MF objects), I might make that yet another separate step in the pipeline, especially if I could find other uses for it.

David

PS. I have attached a copy of antodc.cc from dicom3tools ... I haven't used this tool for more than a decade, but it may give you some insight into more complicated fixes I sometimes used to perform, and how I extracted standard information from private data elements.

PPS. In case they are informative, I have attached archives of the Makefiles that I used for the CT and MR NEMA project ... these will not execute without my source images, various tools and out of band information, but they may give some (outdated) insight into the process of handcrafting examples versus producing an operational converter.

On 2/19/20 11:18 PM, Andrey Fedorov wrote:

Hi David,

As Afshin is working on the MF conversion task, we wanted to ask you a "fundamental" question.

As you know, TCIA datasets may often have errors. What should be the strategy for addressing those? Should we:

 1. carry forward those errors into MF representation, and just ignore those while validating MF?
 2. patch the source dataset, and then do the conversion?
 3. patch the errors in the MF representation in the process of conversion, and keep the originals intact?

I would probably prefer option 3.

AF

hackermd commented 4 years ago

This is a tricky question. Generally, I would argue that the library has to assume passed data sets are standard compliant and I would be cautious about just "fixing" things magically. The problem with this approach is that users will keep thinking that there is no problem with the data sets they provided and will be surprised if things break elsewhere.

On the one hand, I think it would be reasonable to encourage the user to correct the input data set if there are any "errors" in the input single-frame data sets that raise an exception upon conversion. Note that this would not necessarily require the end user to change the original images (files), but the application that uses the highdicom library could apply these changes at Python runtime before passing the data sets to the constructor of the Legacy Converted Enhanced MR/CT/PET Image class. However, it may ultimately not be clear how attributes of the output (multi-frame legacy converted enhanced object) relate to attributes of the inputs (single-frame legacy objects).

On the other hand, I see the opportunity for legacy conversion to further enhance data sets by addressing compliance issues and the Nonconforming Modified Attributes Sequence would be an elegant way to do this in a transparent manner. However, if we decide to correct compliance issues internally, we should at least log an ERROR message to indicate to the user that there was a problem with the input data. The log output could serve users as a conversion report.

Personally, I favor the second approach, even though that would mean that the library takes on more responsibility.
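A rough sketch of that second approach, assuming pydicom for the runtime patching; the LegacyConvertedEnhancedMRImage constructor arguments shown here are assumptions about the highdicom API rather than a confirmed signature, and the file names and UIDs are placeholders:

```python
import logging

from pydicom import dcmread
from highdicom.legacy.sop import LegacyConvertedEnhancedMRImage

logger = logging.getLogger("legacy_conversion")

# Read the single-frame legacy instances (file names are placeholders).
sf_datasets = [dcmread(f) for f in ("mr_001.dcm", "mr_002.dcm")]

for ds in sf_datasets:
    # Example fix applied at runtime; log an ERROR so the user learns that
    # the input was not standard compliant (the log doubles as a report).
    if ds.Modality != ds.Modality.strip():
        logger.error(
            "Stripped nonconforming Modality value %r in %s",
            ds.Modality, ds.SOPInstanceUID,
        )
        ds.Modality = ds.Modality.strip()

mf_image = LegacyConvertedEnhancedMRImage(
    legacy_datasets=sf_datasets,
    series_instance_uid="1.2.826.0.1.3680043.8.498.1",  # placeholder UID
    series_number=100,
    sop_instance_uid="1.2.826.0.1.3680043.8.498.2",  # placeholder UID
    instance_number=1,
)
mf_image.save_as("legacy_converted_mr.dcm")
```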

dclunie commented 4 years ago

@fedorov wrote "if we first patch the input datasets, and then do the conversion, we would need to reference the ephemeral patched datasets, if we want to do things right"

I would keep the same SOP Instance UIDs for the "ephemeral" patched single frame instances, since they are being discarded and are only a temporary intermediate step; so no special effort would be required to make sure that the references from the legacy enhanced multi-frame images to the source are to the originals.

Further (wrt. "do[ing] things right"), even in a production environment, not every change to a DICOM attribute requires a new SOP Instance UID, but that is a complicated discussion (and common misunderstanding) that we don't need to get into here.

Also, since legacy conversion normally dumps everything not otherwise specifically handled into the "unassigned" sequences, I am not sure how you will determine what is "important for legacy conversion" (all those not "unassigned"?), but it is an interesting question.

Note that for propagating composite context correctly (patient and study information entity attributes in particular, but also common series, frame of reference and instance (esp. SOP Common Module) stuff), I usually handle those separately, rather than dumping (some of them) into unassigned ... this depends on having knowledge of which attributes are at which "level", or handling of specific modules (see com.pixelmed.dicom.CompositeInstanceContext as used in com.pixelmed.dicom.MultiFrameImageFactory; these are a bit out of date wrt. the current standard in this respect, so I will update them).
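A minimal sketch of that idea in Python (not PixelMed's implementation); the selection of patient/study-level attributes below is illustrative only:

```python
from pydicom.dataset import Dataset

# Patient and study "level" attributes to propagate explicitly rather than
# letting them fall into the unassigned functional groups (illustrative set).
PATIENT_STUDY_KEYWORDS = (
    "PatientName", "PatientID", "PatientBirthDate", "PatientSex",
    "StudyInstanceUID", "StudyDate", "StudyTime", "StudyID",
    "ReferringPhysicianName", "AccessionNumber",
)

def propagate_composite_context(source: Dataset, target: Dataset) -> None:
    """Copy selected composite context attributes from source to target."""
    for keyword in PATIENT_STUDY_KEYWORDS:
        if keyword in source:
            setattr(target, keyword, getattr(source, keyword))
```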

pieper commented 4 years ago

For reference, Slicer provides a DICOMPatcher module that handles a number of common issues. This is another set of code (like the plugins) that we could factor out into highdicom for more general use.

fedorov commented 4 years ago

Thanks for mentioning DICOMPatcher, Steve. This reminded me that we already had this discussion in the context of dcmqi development; here are the related issues. DCMTK has to deal with a similar situation when creating SEGs while carrying forward composite context from the referenced objects; we had quite some discussions about this with @michaelonken:

I need to think about the discussions above!

fedorov commented 4 years ago

Another related component to be mentioned here is https://github.com/innolitics/dicom-standard, which can be used to identify errors that should be patched. @afshinmessiah is going to look into this, the Slicer DICOMPatcher, and @dclunie's tools, and then we will think about how to proceed. @hackermd maybe after that we should discuss whether that validator/patcher belongs in highdicom or should be a separate tool.
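For illustration, a rough sketch of how the machine-readable standard could be used to flag VR problems; the attributes.json file name and its "keyword"/"valueRepresentation" fields are assumptions about the layout of the JSON published by that repository:

```python
import json

from pydicom import dcmread

# Load the attribute table published by innolitics/dicom-standard
# (file name and field names are assumptions about that repository).
with open("attributes.json") as f:
    standard_attributes = json.load(f)

expected_vr = {
    entry["keyword"]: entry["valueRepresentation"]
    for entry in standard_attributes
    if entry.get("keyword")
}

ds = dcmread("legacy_instance.dcm")  # placeholder file name
for elem in ds:
    expected = expected_vr.get(elem.keyword)
    # Substring check tolerates multi-VR entries such as "US or SS".
    if expected and elem.VR not in expected:
        print(f"{elem.keyword}: found VR {elem.VR}, standard defines {expected}")
```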