ImagingDataCommons / CloudSegmentator

Medical imaging segmentation workflows for FireCloud (Terra) and Seven Bridges Cancer Genomics Cloud
Apache License 2.0
3 stars 2 forks source link

Simplify the manifest organization and data access #45

Open fedorov opened 8 months ago

fedorov commented 8 months ago

As discussed today, we should be able to parameterize execution of the workflow with just the list of SeriesInstanceUIDs and the version of IDC that should be used. We should also switch to using idc-index for resolving to series URLs and downloading the files.

This is dependent on update of idc-index to IDC v17, and other updates which I am working on in https://github.com/ImagingDataCommons/idc-index/pull/23.

vkt1414 commented 8 months ago

I was able to pass the list of SeriesInstanceUIDs to papermill and to use idc-index. But as I imagine how these lists are passed on from Terra and Seven Bridges (SB-CGC), I realized that batching in SB-CGC is possible only by files. i.e previously I was passing the csv manifests to papermill on both Terra and SB-CGC. Now with the lists, we cannot run SB-CGC in batches without csv manifests..

one way to make this work is, we continue to use manifests for SB-CGC, but the papermill commands for Terra and SB-CGC will be different if we use list of SeriesInstanceUIDs

what do you think?

image

fedorov commented 8 months ago

I am not sure I understand - I thought we keep using CSV, just change its content. I thought about this issue as about simplifying manifest - not getting rid of it.

vkt1414 commented 8 months ago

I see. Getting rid of it makes sense for Terra users as they can simply import the sample data table we will provide. But with csv manifests..they will need to update the location of the csv manifests in the data table to get started.

this is all that's needed for Terra users image

fedorov commented 8 months ago

I don't have the experience to weigh in, I am afraid .... But we should make sure IDC version is included as a column!

vkt1414 commented 8 months ago

I forgot about the idc-version. I will add the column. I think passing the list of seriesInstanceUIDs directly from the data table is the cleanest and most simplest approach for Terra. So if it is okay with you to use different papermill commands on Terra and SB-CGC, I'll use this approach. Otherwise I can stick to the csv manifest approach.

fedorov commented 8 months ago

So if it is okay with you to use different papermill commands on Terra and SB-CGC

Yes, it's ok - at least for now

vkt1414 commented 8 months ago

I realized I could use the same papermill command but use an additional step for SB-CGC, which is just reading the manifest and passing the list of seriesInstanceUIDs to the papermill command.