ImagingDataCommons / CloudSegmentator

Medical imaging segmentation workflows for FireCloud (Terra) and Seven Bridges Cancer Genomics Cloud
Apache License 2.0

Experiment with grouping segments into separate SEG instances to bring per-frame FG size under 1MB #52

Closed: fedorov closed this issue 7 months ago

fedorov commented 7 months ago

@vkt1414 here's the idea. Can you find the largest (by the number of slices) CT series, confirm its per-frame FG size is ~3.9M, and then experiment with splitting segments into separate SEG instances? We could, for example, follow roughly the grouping used in the TS front page https://github.com/wasserth/TotalSegmentator?tab=readme-ov-file#totalsegmentator, which already defines 5 groups. It might be best to start with just separating the "skeleton" group, since it might be the one resulting in the largest number of frames. If that group alone results in FG size above 1MB, then we could split ribs from vertebrae, for example.
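A back-of-the-envelope sketch of why grouping should help. Only the ~3.9 MB figure comes from this thread; the slice count, segment count, and per-item byte size below are hypothetical assumptions, and the model ignores that a real SEG only emits frames for slices where a segment is non-empty (so it overestimates).

```python
# Rough model: one per-frame FG item is emitted per frame, i.e. per
# (slice, segment) pair. All numbers except ~3.9 MB are assumptions.

def perframe_fg_size_bytes(n_slices: int, n_segments: int, bytes_per_item: int) -> int:
    """Estimated size of the Per-Frame Functional Groups Sequence."""
    return n_slices * n_segments * bytes_per_item

# Suppose the largest CT series has ~900 slices and TS v1 yields ~104
# segments; solve for the implied per-item size from the ~3.9 MB figure.
total = int(3.9 * 1024 * 1024)
n_slices, n_segments = 900, 104
bytes_per_item = total // (n_slices * n_segments)  # roughly 43 bytes/item

# Splitting segments into 5 roughly equal groups shrinks each file's
# per-frame FG sequence proportionally to that group's segment count.
per_group = perframe_fg_size_bytes(n_slices, n_segments // 5, bytes_per_item)
print(per_group < 1024 * 1024)  # → True under these assumptions
```

Under these (hypothetical) numbers each group would land well under 1 MB, but a lopsided group such as "skeleton" could still exceed it, which is why it may need a further ribs/vertebrae split.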

cc: @dclunie

fedorov commented 7 months ago

Series identified by @vkt1414 for confirming we are within limits while grouping:

fedorov commented 7 months ago

@vkt1414 here's the script that breaks segment frames into 4 groups - can you update it to save into groups based on the TS group assignment? https://github.com/ImagingDataCommons/CloudSegmentator/pull/53

vkt1414 commented 7 months ago

> @vkt1414 here's the script that breaks segment frames into 4 groups - can you update it to save into groups based on the TS group assignment? #53

Thank you!

I could not fully understand the script. The CSV does not have label IDs, and since segment numbers are not necessarily the same as label IDs, I'm not sure how we can use the segment numbers in the DICOM SEG to find which class they belong to.

vkt1414 commented 7 months ago

I guess it's doable, but we will also need the SEG NIfTI file. Please correct me if there's a better way than this:

  1. Map the CSV file with label IDs from TS v1.
  2. Extract label IDs from NIfTI, sort them in ascending order, and then map them to DICOM SEG segment numbers.
  3. Based on the label IDs, determine the class they belong to, divide the PerFrame group into up to five DICOM files with your code, and then calculate the size of the resulting DICOM files.
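The first two steps above could be sketched roughly as follows. The label IDs, names, and group assignments in this snippet are placeholders for illustration, not the real TotalSegmentator v1 mapping.

```python
import numpy as np

# Hypothetical subset of the TS v1 labelID -> name mapping (placeholders).
TS_LABELS = {1: "spleen", 2: "kidney_right", 18: "vertebrae_L5", 58: "rib_left_1"}
# Hypothetical name -> group assignment, following the 5 groups on the TS README.
TS_GROUPS = {"spleen": "organs", "kidney_right": "organs",
             "vertebrae_L5": "skeleton", "rib_left_1": "skeleton"}

def segment_number_to_group(nifti_labels):
    """Map DICOM SEG segment numbers to TS label names and groups.

    Segment numbers are assigned 1..N in ascending order of the label IDs
    actually present in the SEG NIfTI (step 2 above).
    """
    present = np.unique(nifti_labels)
    present = present[present != 0]  # 0 is background
    mapping = {}
    for seg_number, label_id in enumerate(sorted(int(l) for l in present), start=1):
        name = TS_LABELS[label_id]
        mapping[seg_number] = (name, TS_GROUPS[name])
    return mapping

# Simulated SEG NIfTI label volume containing labels 2, 18 and 58.
volume = np.array([0, 2, 18, 58, 18, 0])
print(segment_number_to_group(volume))
# → {1: ('kidney_right', 'organs'), 2: ('vertebrae_L5', 'skeleton'),
#    3: ('rib_left_1', 'skeleton')}
```

Once each segment number is resolved to a group, the existing frame-splitting code can bucket the per-frame items by group (step 3).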
fedorov commented 7 months ago

Vamsi, the only purpose of that script is to confirm that, if we split segments into groups by the categories defined in TotalSegmentator, the total size of the Per-Frame Functional Groups Sequence stays below 1 MB.

It is not the purpose of this script to split the SEG as part of the overall workflow.

fedorov commented 7 months ago

> The csv does not have labelIDs and since segment numbers are not necessarily same as labelIDs

Independently of this specific issue, this is an important point. I think it may make sense to encode the TotalSegmentator label name in SegmentDescription. It may be useful to users. What do you think?
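A minimal sketch of what that could look like, using plain dictionaries to stand in for the per-segment metadata that the SEG conversion tooling would ultimately write. The label names and the description format string are hypothetical.

```python
# Sketch: carry the TotalSegmentator label name into each segment's
# SegmentDescription, so users can recover the TS class from the DICOM
# SEG alone without consulting external label tables.

def build_segment_metadata(segment_labels):
    """segment_labels: dict of SegmentNumber -> TotalSegmentator label name."""
    segments = []
    for seg_number, ts_name in sorted(segment_labels.items()):
        segments.append({
            "SegmentNumber": seg_number,
            "SegmentLabel": ts_name,
            # Encode the TS label name verbatim (format is an assumption).
            "SegmentDescription": f"TotalSegmentator: {ts_name}",
        })
    return segments

meta = build_segment_metadata({1: "kidney_right", 2: "vertebrae_L5"})
print(meta[1]["SegmentDescription"])  # → TotalSegmentator: vertebrae_L5
```

This would make the SEG self-describing with respect to TS classes, at the cost of relying on a free-text attribute rather than coded content.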

vkt1414 commented 7 months ago

Here are the notebook and sample files (created by running the notebook) that I think extend your code:

https://colab.research.google.com/drive/1R4VyzgrVxoRyg9ngpU4J9LjEOb3g-z-C?usp=sharing
https://drive.google.com/file/d/19j-ilkjpR3zclDXfRqOLjochDm2Agw-6/view?usp=sharing

fedorov commented 7 months ago

@vkt1414 communication in discord:

> even for 800 slices, except for the gastrointestinal tract group, all others were slightly or significantly over 1 MB.

https://github.com/vkt1414/CloudSegmentator/releases/download/test/800-slices-perframe.zip
https://github.com/vkt1414/CloudSegmentator/releases/download/test/800_slices_seg_nifti_dseg.zip

fedorov commented 7 months ago

Based on the experiments, discussions and reflections, I decided we should not attempt this optimization for the current experiment for the following reasons:

We may revisit this at a later time at a subsequent iteration of processing, if such iteration takes place.