ImagingDataCommons / CloudSegmentator

Medical imaging segmentation workflows for FireCloud (Terra) and Seven Bridges Cancer Genomics Cloud
Apache License 2.0
3 stars 2 forks source link

ENH: revisit encoding of algorithm in SR #49

Closed fedorov closed 9 months ago

fedorov commented 9 months ago

Instead of encoding algorithm details for each individual feature in radiomics features SR, do this at the level of the measurement group. This helps significantly reduce the size of the SR content sequence, which should help work around the 1MB SQ limit of Google Healthcare.

In the process of implementing this refinement, clean up the code to make it possible to complete the test runs much faster, and make the code a bit more robust.

review-notebook-app[bot] commented 9 months ago

Check out this pull request on  ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.


Powered by ReviewNB

vkt1414 commented 9 months ago

Sorry for the late review and in the interest of time, I did a couple of things:

  1. I noticed that colab link to the notebook is removed, may have been by mistake (so I added it)
  2. I would like to keep the notebook containing content only the relevant to the task. I do understand that it is used for validating the DICOM files but I do not see relevance in the production notebook. So, I'll remove the dcm2k cell and keep it for future reference, as I replicate the environment in docker container as well as notebooks if you look at the dockerfiles.

if 'google.colab' in sys.modules: !cd /content && wget https://github.com/DCMTK/dcmtk/releases/download/DCMTK-3.6.6/dcmtk-3.6.6.tar.gz !bunzip dcmtk-3.6.6.tar.gz !tar -xvf dcmtk-3.6.6.tar !export DCMDICTPATH=/content/dcmtk-3.6.8-linux-x86_64-static/share/dcmtk-3.6.8/acrnema.dic

fedorov commented 9 months ago

I do understand that it is used for validating the DICOM files but I do not see relevance in the production notebook. So, I'll remove the dcm2k

I actually strongly disagree with this. There are important benefits to keep that cell, and really no downsides.

  1. "production" only applies to the non-colab execution of the notebook. The purpose of the colab mode is: 1) help users understand how things work; 2) help users reuse individual components outside the workflow; 3) help maintainers test, troubleshoot and refine the workflow. dcmtk tools are absolutely critical for validating that the workflow actually works as expected and checking the content of the produced files. If we follow your argument, we would have to remove, for example, test input files, since they have no role in the production process!
  2. that cell is not affecting cloud production runs at all, since no code is executed in that situation.

The main issue is that at the moment dsrdump fails with segmentation fault, and that's why I labeled that cell as WIP. But even with that deficiency, it is better than nothing - it's a starting point to look into why it fails and fix it, which should be feasible.