ImagingDataCommons / IDC-Tutorials

Self-guided notebook tutorials to help get started with using IDC
BSD 3-Clause "New" or "Revised" License

'gs://idc-tcia-nsclc-radiomics/dicom/' link broken #6

Closed: adwaykanhere closed this issue 2 years ago

adwaykanhere commented 2 years ago

Thank you for making these example notebooks!

I'm trying to replicate the demo notebook nsclc_radiomics_demo_release.ipynb.

When I run the code cell that uses gsutil to download the data from the generated file of GCS paths, no files are downloaded to my Colab instance. The code that queries the data and generates the gsutil path file runs properly, but I get no files when running:

```
%%capture
# if everything is allright, proceed with the download
!mkdir -p data/nsclc-radiomics/dicom

!cat gcs_paths.txt | gsutil -u trialidc -m cp -Ir ./data/nsclc-radiomics/dicom
```

While the notebook indicates that the 25 patients should take about 1.7 GB of storage, running the code downloads only 8 KB, which shows the download was not successful.

When I query

https://storage.googleapis.com/idc-tcia-nsclc-radiomics/1.3.6.1.4.1.32722.99.99.203715003805996641695765332389135385095

I get a message saying no storage bucket was found. Are the storage bucket and object names deprecated?
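For completeness, here is how I checked from the shell that the bucket is gone (just a sketch; gsutil ls is expected to fail here with a bucket-not-found error):

```bash
# Sketch: check whether the old bucket still exists; gsutil ls is
# expected to fail with a bucket-not-found error if it was removed.
gsutil ls gs://idc-tcia-nsclc-radiomics/
```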

Please let me know how I can fix this.

fedorov commented 2 years ago

@denbonte can you please help the user? Does this notebook need to rely on a storage bucket?

fedorov commented 2 years ago

We should go over the notebooks and update them, and also archive those that are duplicates and/or do not add value.

This statement is no longer accurate, since all of our buckets are now free and are NOT "requester pays": "The Imaging Data Commons GCS buckets are "requester pays" buckets."
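For reference, with free buckets a manifest download should now look something like the sketch below (not the notebook's verbatim code; gcs_paths.txt is the manifest generated earlier in the notebook, and no -u billing project is needed anymore):

```bash
# Sketch only: with free (non requester-pays) buckets, gsutil needs no
# "-u <billing-project>" flag; gcs_paths.txt is the manifest of gs://
# URLs generated earlier in the notebook.
mkdir -p data/nsclc-radiomics/dicom
cat gcs_paths.txt | gsutil -m cp -I data/nsclc-radiomics/dicom
```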

denbonte commented 2 years ago

Hey @adwaykanhere - good morning, and thanks for bringing this up!

> Does this notebook need to rely on a storage bucket?

@fedorov it does, yes - and one that has indeed been deprecated (this has to do with several developments in the platform; when we released that notebook, the NSCLC-Radiomics dataset was not even officially in IDC yet).

I will update the notebook ASAP and comment here on what needed to be changed, why, etc.!

adwaykanhere commented 2 years ago

@denbonte - Good morning, thanks for reaching out to help and also for creating the notebook!

It would be great to get a free bucket to access this data.

I was able to access the files through the GCS Healthcare API using the bucket gs://gcs-public-data--healthcare-tcia-nsclc-radiomics/dicom/, but I think this bucket is 'requester pays'.

denbonte commented 2 years ago

Hey @adwaykanhere !

> It would be great to get a free bucket to access this data.
>
> I was able to access the files through the GCS Healthcare API using the bucket gs://gcs-public-data--healthcare-tcia-nsclc-radiomics/dicom/, but I think this bucket is 'requester pays'.

All the data that IDC ingests and hosts is completely free to access - so you don't need to worry about that šŸ˜ƒ

Plus, as opposed to the GCS Healthcare API, IDC supports the creation of cohorts that aggregate data from different datasets, and it supports versioning as well!

The bucket that is giving you problems was indeed something temporary that we moved away from (previously, IDC buckets were requester pays, so we had to handle things differently - but that hasn't been the case for some months now!)

We will push a fix to the notebook within a few days, as it's nothing major - but of course I want to make sure everything works and is documented properly!

Thanks for the patience and sorry for the inconvenience!

adwaykanhere commented 2 years ago

@denbonte Great! Thank you so much for your support.

denbonte commented 2 years ago

Hey @adwaykanhere - good morning!

I have just pushed an updated version of the notebook, which you can find at the same link as before! I have updated quite a few things, so I will list them below and include some of the rationale behind them.

Fixing the Main Bug

As discussed above, the notebook was implemented when the dataset in question was not yet hosted by IDC. Right now, IDC hosts 100+ collections and more than 30 TB of imaging data, among which is the NSCLC-Radiomics dataset used in this (and other) use cases.

Since the dataset is hosted on IDC, instead of needing a manifest to download all the data, the user can query the IDC tables. Here is the example used in the updated version of the notebook, which selects all the patients from the NSCLC-Radiomics collection:

```sql
SELECT
  PatientID,
  StudyInstanceUID,
  SeriesInstanceUID,
  SOPInstanceUID,
  gcs_url
FROM
  `bigquery-public-data.idc_current.dicom_all`
WHERE
  Modality IN ("CT",
    "RTSTRUCT")
  AND Source_DOI = "10.7937/K9/TCIA.2015.PF0M9REI"
ORDER BY
  PatientID
```
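If you want to run the query above outside the BigQuery console, a rough shell sketch with the bq CLI could look like the following (the exact flags and the header-stripping are assumptions on my part, not the notebook's verbatim code; it assumes gcloud auth and a default project are already configured):

```bash
# Rough sketch: run the query with the bq CLI and keep only the
# gcs_url column as a download manifest (flags are assumptions;
# tail -n +2 strips the CSV header row).
bq query --use_legacy_sql=false --format=csv --max_rows=200000 \
  'SELECT gcs_url
   FROM `bigquery-public-data.idc_current.dicom_all`
   WHERE Modality IN ("CT", "RTSTRUCT")
     AND Source_DOI = "10.7937/K9/TCIA.2015.PF0M9REI"' \
  | tail -n +2 > gcs_paths.txt
```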

FYI, in other example notebooks (e.g., AI-based Thoracic Organs at Risk segmentation from CT scans) we show how to build cohorts using the platform itself to select specific subsets (e.g., "only patients from dataset X with manual segmentations for Y for which the CT SliceThickness is Z").

You can find additional information about the organization of the data in IDC on the dedicated page of the documentation.

I have also cleaned up the download code a bit and made sure to stick with the current best practices we use for the example notebooks.

Introducing DICOM Data Sorting

As you will notice from the cells of the notebook, when downloading data after parsing the IDC BigQuery tables, the dataset has no structure at all. Using dicomsort, the user can make the structure of the DICOM dataset uniform prior to any pre-processing operation (so that the downstream pre-processing code can stay identical even if the dataset changes)! The tool works by exploiting DICOM metadata, so it always works on IDC-hosted datasets (and likely on non-IDC-hosted ones too, provided no important metadata fields are stripped from the dataset).
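As a sketch (treat the exact script name and %Tag path template as assumptions based on how dicomsort names its output tree, rather than the notebook's verbatim code), sorting could look like:

```bash
# Hedged sketch: sort the flat DICOM download into a
# Patient/Study/Series tree using dicomsort's %Tag path templating
# (the exact invocation may differ from the notebook).
python dicomsort.py data/nsclc-radiomics/dicom \
  "data/nsclc-radiomics/sorted/%PatientID/%StudyInstanceUID/%SeriesInstanceUID/%SOPInstanceUID.dcm"
```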

Restructuring the Data Pre-processing and AI Inference

These are minor fixes, really, propagating the changes described above. For instance, I removed some logs that became redundant now that the data is hosted on IDC!

I re-ran the whole notebook several times and everything went smoothly - but I could very well have missed something, so don't hesitate to point out any extra bugs! It shouldn't take more than 10-15 minutes to download, preprocess, and run inference on the 10-subject cohort I'm using in the example; analysing bigger cohorts (IDC or not) with the free tier of Google Colab is very feasible as well - it will just take a bit more time (note that the pre-processing part is the slowest, but that is always the case when resampling etc. is involved, especially on instances with a low core count!)


If you have any other questions, curiosities, or feedback, don't hesitate to also reach us at the IDC Discourse Forum!

No matter how basic your doubts - we are here to help people familiarise themselves with the cloud resources and the platform šŸ˜ƒ

adwaykanhere commented 2 years ago

Good morning, @denbonte!

This is fantastic! Thank you so much for all your support.