Proposal: Defining and Prototyping "Labelmap" Segmentations in DICOM Format

Project Description

The DICOM Segmentation format is used to store image segmentations in DICOM format. Using DICOM Segmentations, which use the DICOM information model and can be communicated over DICOM interfaces, has many advantages when it comes to deploying automated segmentation algorithms in practice. However, DICOM Segmentations are criticized for being inefficient, both in terms of their storage utilization and in terms of the speed at which they can be read and written. This is in comparison to other widely-used segmentation formats within the medical imaging community such as NifTi and NRRD.

While improvements in tooling may alleviate this to some extent, there appears to be an emerging consensus that changes to the standard are also necessary to allow DICOM Segmentations to compete with other formats. One of the major reasons for poor performance is that in segmentation images containing multiple segments (sometimes referred to as "classes"), each segment must be stored as an independent set of binary frames. This is in contrast to formats like NifTi and NRRD that store "labelmap" style arrays in which a pixel's value represents its segment membership and thus many (non-overlapping) segments can be stored in the same array. While the DICOM Segmentation has the advantage that it allows for overlapping segments, in my experience the overwhelming majority of segmentations consists of non-overlapping segments, and thus this representation is very inefficient when there are a large number of segments.

The goal of this project is to gather a team of relevant experts to formulate changes to the standard to address some issues with DICOM Segmentation. I propose to focus primarily on "labelmap" style segmentations, but I am open to other suggestions for focus.

The specific goals would be to complete or make significant progress on the following:

Formulate changes to the standard to allow for labelmap segmentations (@dclunie)
Complete prototype implementations within the highdicom library (of which I am a maintainer), dcmjs (@pieper) and possibly dcmqi (@fedorov )
Create example datasets for dissemination to others wishing to implement the changes
Begin the process of reaching out to others in the open source community to accelerate other implementations, particularly viewers such as slicer (@pieper ) and OHIF

Open questions:

Should we implement a new IOD or a new SegmentationType within the existing Segmentation IOD?
Should we implement "instance" segmentations, in which each segment is assumed to be a different instances of the same type, and thus need not be described separately, in addition to label-map style semantic segmentations?
Should we also allow 16 bit pixels to allow for more segments? How does this interact with the choice of new IOD vs new SegmentationType?

Other possible (alternative) topics:

Single bit compression to allow for more space-efficient storage
Omitting the per-frame functional group (like TILED FULL) for other types of segmentation image.
The inefficiency of pydicom in parsing long sequences, such as the per-frames functional groups sequence in segmentations, is a key bottleneck in Python. We could think through how to overcome this

Relevant team members: @fedorov @dclunie @pieper (@hackermd ) please give your feedback to help shape this project!

I think DimensionOrganizationType = 3D might be it actually. Maybe this one can be used in the new LABELMAP object?

Will respond further later, but I am aware of this but think it needs clarification. Does it require that ImageOrientation is in the shared functional groups? It appears not to. Is this incompatible with omitting empty frames?

I would vote against any proposal that excluded imaging modalities because that probably means the proposal hasn't been well enough thought out yet.

It is a provocative thought. Is RTSTRUCT compatible with SM? Is SM Bulk Annotation compatible with MR? Is such compatibility and resulting complexity truly warranted?

I completely agree compatibility of this kind should be considered and explored, but I also strongly believe there are limits on trying to make things compatible across domains that have very different needs, communities and experiences. It is not a black and white situation. I would caution against making decision on whether to vote or not for a specific proposal based on such a general requirement.

A couple of additional thoughts on this from the perspective of what is already in the standard and what might need to be added:

you can't mess with the existing SOP Class in a manner that is not backward compatible, so even if you were to try to re-use the existing SEG IOD, new parts of it (such as a labelmap) would need to be defined as conditional (on the new SOP Class), so as to not break existing implementations - cleanest is just to create a new labelmap IOD and SOP class and reuse relevant parts of the existing Segmentation IOD
you need to think about PhotometricInterpretation and what value is semantically accurate and is appropriate for the compression schemes; a labelmap is an index and the values are arbitrary and not monotonic; this has an impact on the model underlying compression schemes that care about that (e.g., have a predictor), in theory if not in practice; so, the closest thing in DICOM is PALETTE COLOR (i.e., you can't use MONOCHROME2) - this has three implications (1) there can only be one label map in an instance (because the index values have to be in the same "space"), (2) if color information is to be sent it would normally be in a color palette table rather than the SEG way (though the NM IOD, unusually, allows PALETTE COLOR and does not require a color LUT to be encoded (colors are unspecified), (3) for PALETTE COLOR only lossless compression schemes are applicable and even though permitted may not work as well as expected, depending on the values chosen for the indices and their spatial relationships
DICOM already has a means of identifying and sending the color palette separately (e.g., as a well-known or user supplied stored palette that is patient independent), and one might want to extend this mechanism to also include the semantics of each label (e.g., add codes to the color palette IOD or something like it)
what bit depth is needed for labelmap pixel data - is 8 (OB) or 16 (OW) sufficient or do you also want to allow for 32 (OL) or 64 (OV), recognizing that toolkit support may be an issue for the larger ones - for instance of class label maps maybe the larger the better, but if you go that route, how do you encode what class the instance is of (answer is probably separate SOP instances, one for each class, indices are only instances not classes) - as a use case, to encode every nucleus in a whole slide image labelmap at with 100 nuclei per tile and 200,000 tiles 32 bits would be more than sufficient but 16 bits would not.
what bit depth is needed for the color values if they are encoded as a color palette - is 8 sufficient or is 16 needed?

I have now created a full project definition for this project: #710

I also wanted to follow up on the broad topic of the per-frame metadata. I generally feel that the flexibility/expressiveness of the per-frame functional groups and dimension organization sequence is a strength but the cost in terms of simplicity for the majority of simple use cases is too high. Having written code in highdicom to interpret the metadata and use it to "reconstruct" segmentation masks from the frames, I can say that this process feels far more complicated than it ought to be. However I do believe it is possible to allow the flexibility for those who need it while introducing optional attributes that make it much simpler to work in the very common special cases. There are two major related issues currently:

The overwhelming majority of segmentations store frames in regularly spaced increments along each dimension. A receiver should be able to determine whether this is true, and if so determine the spacing along the dimension, without having to parse the metadata of each frame and perform arithmetic operations to determine the spacing. When thinking about the very common case of 3D images, there is a mechanism by which the creator can convey that planes are equally spaced in 3D space by setting the DimensionOrganizationType to '3D'. This helps a bit, but does not require that SpacingBetweenSlices attribute be present in the SharedFunctionalGroupsSequence, so the receiver in the general case still needs to calculate the spacing for themselves. Neither does it actually require the ImageOrientationPatient to be present in the SharedFunctionalGroupsSequence. So really the '3D' DimensionOrganization is largely "toothless". It is worth noting that nothing in this issue is specific to Segmentations, it is true of all multiframe DICOM object. Furthermore I would assume this rules out omitting empty slices, though this is not totally clear to me right now (for this reason highdicom does not currently ever create segmentations using the 3D dimension organization type).
In my opinion, a much worse problem is when the above is combined with the fact that empty slices may be omitted from segmentations. I actually feel that this is a very well-motivated decision, since segmentations of medical images are often very sparse (think segmentation of lymph node in a chest/abdomen CT) and it makes sense to save the space by omitting them, but the way it has been implemented gives rise to a number of problems. The core of the issue is that there is nowhere in the segmentation object to store information about the slices that were omitted due to being empty. The first problem this gives rise to is ambiguous semantics: if a slice is not explicitly listed as a source image of any segmentation frame, does that mean that it wasn't segmented or that it was segmented and the segment(s) was/were found not to be present so the slice was omitted when the segmentation was created? These are semantically very different but not distinguished currently. The second is the "reversibility" problem. Programmers working with Seg objects would reasonably assume that if they pass a segmentation mask to a routine to create a Segmentation instance, store that file, then read it in again, that they would be able to recover exactly the same mask that they put in. In fact this is not possible currently, because there is nowhere in the segmentation object to store information about the slices that were omitted due to being empty. Currently, in highdicom we at least store the ordered list of segmented instances in the ReferencedSeriesSequence at the root of the object, but this is just our convention and not one that can be relied upon to be understood by other implementations.

Having thought about this on and off for a while, I am fairly convinced that my strong preference for the best way to fix all of these issues would be to do (optionally) for 3D images what is done for tiled images by introducing the concept of a "3D TotalPixelMatrix" (a TotalVoxelVolume?), and link it to a new value of DimensionOrganizationType (e.g. "3D_VOLUME") that actually implies some requirements. The 3D array described by the TotalVoxelVolume would conceptually exist, even if not every voxel within it is actually explicitly encoded within the dataset. Analogously to the TotalPixelMatrix's TotalPixelMatrixOriginSequence TotalPixelMatrixRows and TotalPixelMatrixColumns, the information about the origin and full size of this TotalVoxelVolume would be explicitly recorded (and I would also add spacing between slices as a requirement), and individual slices could give their SlicePositionInTotalVoxelVolume, analogously to RowPositionInTotalImagePixelMatrix and ColumnPositionInTotalImagePixelMatrix are now used for tiled images. This way it would be very clear that the slices present exist within a known 3D volume with an explicitly defined spatial affine matrix. I would be very interested to hear people's thoughts on this, and its plausibility. I would like to discuss this at project week, but I'm not sure whether we will have the time to make this concrete (even assuming there would be a consensus behind it).

Failing this, I would propose that we either disallow omission of slices from our new segmentation IOD or introduce some mechanism that stores information about omitted frames somewhere in the segmentation instance.

Thanks @dclunie for the thoughtful reply. This all makes sense to me.

cleanest is just to create a new labelmap IOD and SOP class and reuse relevant parts of the existing Segmentation IOD

Originally I was hoping we could avoid this (by creating a new value of SegmentationType) but at this point I think I am convinced that a new IOD is required. However, I also feel that this shouldn't mean that we "abandon" the old one, since I feel that there would still be value in binary segmentation IOD and I believe there are things there that could be improved (mostly regarding pixel compression).

you need to think about PhotometricInterpretation and what value is semantically accurate and is appropriate for the compression schemes

I agree that using PALETTE COLOR is the best option currently available. I take your point about the issue of the values being non-ordinal and this breaking the assumptions of compression schemes and therefore being potentially sub-optimal. In practice I am not too concerned as I suspect that, say JPEG LS Lossless would still work well, if sub-optimally, on these images and would be considerably better than the current situation and therefore would be a practical compromise. It would be good to do some experiments though to see how well it works in practice.

I am less keen on the semantics of PALETTE COLOR as I feel that segmentation pixel values semantically encode the labels rather than colors. It certainly makes sense for the segmentation SOP Class to suggest that certain colors could be used for display, but a viewer should be at liberty to change this and display the segments how it likes, in my opinion. Would it be crazy to define a new PhotometricInterpretation that is largely similar to PALETTE COLOR practically speaking but does not explicitly imply a color mapping?

there can only be one label map in an instance (because the index values have to be in the same "space")

Yes, I would also prefer this, such that the existing SegmentSequence does not have to change. There would have to be a little thought about how to do this if we introduce "layers" but I do not foresee any insurmountable problems

one might want to extend this mechanism to also include the semantics of each label (e.g., add codes to the color palette IOD or something like it)

I have to say that I am not at all keen on this idea. I feel that the meaning of the segmentations should be encoded within the segmentation if the object is to act as a clinical record of some sort of segmentation process.

how do you encode what class the instance is of

I think we need to make sure we are discussing the same thing here. To use the terminology of computer vision, semantic segmentation is where pixel values denote the "class" of the pixels (e.g. "nucleus") and instance segmentation is where the pixel values denote the instance of the class (e.g. "nucleus number 123456"). I have primarily been discussing semantic segmentation, however it would be nice to be able to support instance segmentation too (so that the segment description only needs to appear once). Probably we could do this quite easily with a single code string attribute telling you which of the two it is, and in the case of instance segmentation, limit the segment sequence to length 1 with all pixel values then representing a single instance of that single class. However I do not think it would be wise to try and support a mixture of the two (i.e. multiple classes each of which have potentially multiple distinct instances) within a single instance. That would add considerable complexity. I don't know of any format that can realistically do that, even the vaunted nrrd.

what bit depth is needed for labelmap pixel data - is 8 (OB) or 16 (OW) sufficient or do you also want to allow for 32 (OL) or 64 (OV), recognizing that toolkit support may be an issue for the larger ones

This is a good point. I would probably err towards allowing up to 32 bit (I can only imagine this being practical for "instance segmentation" style arrays rather than "semantic segmentation style arrays" otherwise the segment metadata would get absurd) accepting that there may be some work required on toolkits to support this. I anticipate that 16 bit would be sufficient for the overwhelming majority of cases so most segmentations would be usable immediately. Definitely something to discuss further.

what bit depth is needed for the color values if they are encoded as a color palette - is 8 sufficient or is 16 needed?

I don't know, would probably want to think this through further with someone who writes a viewer

NA-MIC / ProjectWeek

Proposal: Defining and Prototyping "Labelmap" Segmentations in DICOM Format #643

Project Description