ImagingDataCommons / idc-index

Python package to simplify access to the data available from NCI Imaging Data Commons
MIT License
10 stars 5 forks source link

Support download of manifests that refer to prior IDC versions #100

Open fedorov opened 2 months ago

fedorov commented 2 months ago

As experienced by @bcli4d, perfectly valid IDC manifests that point to items available in prior versions of IDC would not work with idc-index download functionality.

I think we should:

  1. (stopgap measure) improve error reporting to inform the user when no matches were found, and suggest earlier version as a possible reason
  2. (a more involved stopgap measure) allow download of the content defined by the manifest with no validation: if no matches found, it should be possible to pass the manifest to the download function, which would alert the user that sorting the downloaded files into hierarchy will not be possible
  3. we could introduce a cross-version series-level index, which would have all series UUIDs ever present in any of the IDC series, and to manage size would only contain columns that are required for directory sorting, and which would be used for resolving input manifests. @bcli4d do you happen to have something like such a cross-version table in one of your dev projects?
vkt1414 commented 2 months ago

regarding 2: We do have validate_manifest argument in download_from_manifest already

https://github.com/ImagingDataCommons/idc-index/blob/main/idc_index/index.py#L1166

I was thinking about about enabling users to work with any idc-version they would want. For that I thought about creating a table that shows which idc-index contains which idc-index-data. We could simply add idc_version as another argument to download functions. Fetch latest idc-index-data for requested idc version

I don't know if we should encourage research on cross-versioned data unless necessary like working with a new collection introduced in newer idc releases.

fedorov commented 2 months ago

I don't know if we should encourage research cross-versioned data unless necessary like working with a new collection introduced in newer idc releases.

It is definitely necessary. In the general case, we will not know what version of IDC was used to generate a specific manifest. There is also nothing wrong, in principle, in having a manifest that refers to items from different versions.

bcli4d commented 2 months ago

There is a hierarchy of BQ tables, version/collection/study/series/instance, in each idc_vX_dev that captures all idc-collected/generated metadata up through that version. So the hierarchy in idc_v19_dev will capture all such metadata up through v19. (It does not include any dicom metadata. ) There are various views which flatten the hierarchy across all IDC versions (all_joined), restricted to only public data (all_joined_public), or restricted to public data in the current IDC version (all_joined_public_and_current). The results include corresponding URLs. Not sure what additional data idc needs, and the result has way more than you'd want. But, in short, it should be easy to build a table of of all IDC data.

fedorov commented 1 month ago

@bcli4d this is the query that I understand would give me what I need - details for all of the series that are not available in the latest version. Am I correct? I will also need your advice on how to add additional columns to this table form dicom_all across versions!

WITH
  cross_series AS (
  SELECT
    DISTINCT collection_id,
    submitter_case_id AS PatientID,
    study_instance_uid AS StudyInstanceUID,
    series_instance_uid AS SeriesInstanceUID,
    se_uuid AS crdc_series_uuid,
  IF
    (se_sources.tcia, pub_gcs_tcia_url, pub_gcs_idc_url) gcs_bucket,
  IF
    (se_sources.tcia, pub_aws_tcia_url, pub_aws_idc_url) aws_bucket,
    se_init_idc_version,
    se_rev_idc_version,
    se_final_idc_version
  FROM
    `idc-dev-etl.idc_v18_dev.all_joined_public`
  ORDER BY
    collection_id,
    PatientID,
    StudyInstanceUID,
    SeriesInstanceUID)
SELECT
  cross_series.*
FROM
  cross_series
LEFT JOIN
  `bigquery-public-data.idc_v18.dicom_all` AS dicom_all
ON
  cross_series.crdc_series_uuid = dicom_all.crdc_series_uuid
WHERE
  dicom_all.crdc_series_uuid IS NULL

@vkt1414 once we sort this out, I am thinking to add this as a version delta table (right now it is just about 55K rows) to the pypi release, and then use it to try to resolve unmatched UUIDs from the manifest in idc-index.

bcli4d commented 1 month ago

@fedorov, I don't understand why you want to exclude series in the current release, rather that a table of series across all releases. Anyway, the above SQL appears to do what you want. For additional data such as Modality, we need to build a table by doing a union across all versions of dicom_all of selected data, joining it with the above.

vkt1414 commented 1 month ago

@vkt1414 once we sort this out, I am thinking to add this as a version delta table (right now it is just about 55K rows) to the pypi release, and then use it to try to resolve unmatched UUIDs from the manifest in idc-index.

Let's say someone cares about reproducibility and was interested in knowing the version of data they are working with. Per our documentation, this answer is currently answered by calling get_idc_version().

so my point again is to absolutely allow the users to work with any version of data, but not cross versions. In my opinion, that is the only way to truly reproduce results. Now, I'm not completely opposed to the solutions we are trying to find, but we must at least notify the user that we are referring to other versions of idc.

As for the query, I would simply have one table. i.e we could change our idc-index main query to capture every series, pointing to the highest idc version, wherever applicable. I'll write the query and update this

bcli4d commented 1 month ago

The se_rev_idc_version, and se_final_idc_version tell you to which IDC versions a series belongs. I.E. it is in IDC versions se_rev_idc_version through se_final_idc_version if se_final_idc_version != 0. Else it is in IDC versions idc_rev_idc_version through the current IDC version. So, given a table with these columns, one can query for the series in any IDC version.

vkt1414 commented 1 month ago

Thanks @bcli4d. That's very helpful info.

fedorov commented 1 month ago

I don't understand why you want to exclude series in the current release, rather that a table of series across all releases.

@bcli4d the information about the series in the current release is included already in the main index within this package. The purpose of this query is to be able to resolve crdc_series_uuid for the series that are not present in the current release. It augments the main index just for the purposes of supporting download of items from manifests that refer to prior IDC releases.

For additional data such as Modality, we need to build a table by doing a union across all versions of dicom_all of selected data, joining it with the above.

Yes, that's what I thought. If you don't have that ready, I can figure it out.

Let's say someone cares about reproducibility and was interested in knowing the version of data they are working with. Per our documentation, this answer is currently answered by calling get_idc_version().

@vkt1414 the main purpose of the additional delta table is to support download of items referenced from manifests from prior releases. I do not understand what you are referring in the context of reproducibility. For all purposes, the user will work with the main index, which is aligned with the IDC data release in a given package version. The delta index will be separate and will only be useful in case the user wants to know what changed across the versions.

As for the query, I would simply have one table. i.e we could change our idc-index main query to capture every series, pointing to the highest idc version, wherever applicable. I'll write the query and update this

No, I definitely do not want to have a single table that mixes multiple versions. This goes against the way IDC data is organized in BigQuery. idc_current corresponds to a single version. idc-index main index is extracted from the specific version. The user always works with a single version of data. delta index main purpose is to support downloads and sorting of the downloaded files into folders.

bcli4d commented 1 month ago

@bcli4d the information about the series in the current release is included already in the main index within this package. The purpose of this query is to be able to resolve crdc_series_uuid for the series that are not present in the current release. It augments the main index just for the purposes of supporting download of items from manifests that refer to prior IDC releases.

You can only exclude the series that are new in the current release. Any series that were in the previous release but unchanged in the current release must be included in your table. So the table you have in mind will have most, but not all, of the series in the current release. Seems like that could be confusing.

fedorov commented 1 month ago

Interesting. The result of that query contains only 56285 rows, which seems to mean that it does not include most or all of the items from the current release. Maybe we should sit together to define the query to fit my needs?

bcli4d commented 1 month ago

Yes, that's what I thought. If you don't have that ready, I can figure it out. No, I don't.

vkt1414 commented 1 month ago

I'm able to generate the list of seriesInstanceUIDs along with strictly necessary metadata completely using public facing tables.

Here are my observations during this process:

  1. In the past, there was idc-open-idc bucket in use, however, it looks like it is no longer in use. It is not present in idc-dev-etl.idc_v18_dev.all_joined_public table. So if someone generated a manifest from the time this bucket was in use, they would not be able to download data.
  2. After excluding the data in idc-open-idc bucket, I found that there is no discrepancy in total number (5465 series) of series in public tables and idc-dev-etl.idc_v18_dev.all_joined_public table. However, for 93 seriesInstanceUIDs in these 5465 series, there are 186 rows. This seems to have resulted from renaming the collection_id apollo to apollo_5_lscc , and overwriting some files in cptac_gbm collection. Please see the csv below. As these 93 series did have different bucket locations, in my opinion, we must include these as there may have been manifests using both collection names, and hence I'm not sure if idc-dev-etl.idc_v18_dev.all_joined_public table is complete. bquxjob_4c75d7e5_1910baae51f.csv
  3. @bcli4d
SELECT max(idc_version)
  FROM 
 `bigquery-public-data.idc_current.version_metadata`

this table did not seem to have been updated to v18.


As for implementing this feature, I submitted a PR to idc-index-data repo to write the results of this query ( see here). Once we have it available, we can use union_by_name in duckdb to union the main index, and then we can download data from any public facing idc-version ever existed.

query to find series not present in v18:

select
distinct series_instance_uid from `idc-dev-etl.idc_v18_dev.all_joined_public`
where
series_instance_uid not in (select distinct seriesInstanceUID from bigquery-public-data.idc_current.dicom_all)

query to find series along with necessary metadata for implementing this feature: Returns 5558 rows but contains only 5465 series as 93 series' partial metadata is missing from idc-dev-etl.idc_v18_dev.all_joined_public

-- Step 1: Declare variables
DECLARE idc_versions ARRAY<INT64>;
DECLARE latest_idc_version INT64;
DECLARE union_all_query STRING;

-- Step 2: Get all idc_versions
SET idc_versions = (
  SELECT [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18]
  --SELECT ARRAY_AGG(idc_version)
  --FROM 
  --`bigquery-public-data.idc_current.version_metadata`
);

SET latest_idc_version = (
  SELECT 18
  --SELECT max(idc_version)
  --FROM 
  --`bigquery-public-data.idc_current.version_metadata`
);

-- Step 3: Generate the UNION ALL query dynamically
SET union_all_query = (
  SELECT STRING_AGG(
    FORMAT("""
    SELECT 
    %d AS idc_version, 
    collection_id,
    PatientID,
    SeriesInstanceUID,
    StudyInstanceUID,
    Modality,
    regexp_extract(gcs_url, 'gs://([^/]+)/') as gcs_bucket,
    crdc_series_uuid,
    ROUND(SUM(SAFE_CAST(instance_size AS float64))/1000000, 2) AS series_size_MB,
  FROM
  `bigquery-public-data.idc_v%d.dicom_all` AS dicom_all
  where SeriesInstanceUID not in (select distinct seriesInstanceUID from `bigquery-public-data.idc_v%d.dicom_all`)
  GROUP BY
  1,2,3,4,5,6,7,8

""", version, version, latest_idc_version),
    " UNION ALL "
  )
  FROM UNNEST(idc_versions) AS version
);

-- Step 4: Execute the complete query
EXECUTE IMMEDIATE FORMAT("""
WITH all_versions AS (
  %s
)
SELECT  
  collection_id,
  PatientID,
  SeriesInstanceUID,
  StudyInstanceUID,
  Modality,
  gcs_bucket,
  crdc_series_uuid,
  series_size_MB,
  CASE 
  WHEN gcs_bucket='public-datasets-idc' THEN CONCAT('s3://','idc-open-data/',crdc_series_uuid, '/*')
  WHEN gcs_bucket='idc-open-idc1' THEN CONCAT('s3://','idc-open-data-two/',crdc_series_uuid, '/*')
  WHEN gcs_bucket='idc-open-cr' THEN CONCAT('s3://','idc-open-data-cr/',crdc_series_uuid, '/*')
    END AS series_aws_url,
  MIN(idc_version) AS min_idc_version,
  MAX(idc_version) AS max_idc_version
FROM all_versions
where gcs_bucket not in ('idc-open-idc')
--and seriesInstanceUID not in (select
--distinct series_instance_uid from `idc-dev-etl.idc_v18_dev.all_joined_public`
--where
--series_instance_uid not in (select distinct seriesInstanceUID from bigquery-public-data.idc_current.dicom_all))
GROUP BY   
 1,2,3,4,5,6,7,8
""", union_all_query
);
fedorov commented 1 month ago

In the past, there was idc-open-idc bucket in use, however, it looks like it is no longer in use. It is not present in idc-dev-etl.idc_v18_dev.all_joined_public table. So if someone generated a manifest from the time this bucket was in use, they would not be able to download data.

@bcli4d can you confirm if this is correct?

vkt1414 commented 1 month ago

It turns out that seriesInstanceUID is not reliable as multiple crdc_series_uuid's have the same seriesInstanceUID.

so there are now a total of 45004 crdc_series_uuids not present in current dicom_all a/c to public tables, but there are 45016 crdc_series_uuids a/c to idc-dev-etl.idc_v18_dev.all_joined_public

I'm investigating about those 12 discrepant series uuids.

ccb73a3d-d65b-442e-91c0-212621480366 395bb20e-4def-4ab1-b4ed-d04afa87ded4 bb71dba7-9d06-4a14-bff9-b267748b9655 1eafaa1c-5669-4761-be1d-232d5c02fa4b a81d3b56-a228-471f-8f75-f5b55c49e556 4ff61914-9e6c-46d3-b66a-370d68e65f2e bbebe295-47c4-471a-904c-64ccaf54ad59 3ec3c3a4-b69f-4fa3-99fc-d35d01101a1f 3c0fcc59-6c2e-41c6-98cc-1f9ee6833b96 0b8f5238-9919-470b-a9f4-5260b1e72f41 6636b629-177e-4c26-b2d9-6b7b9f9ea5b9 85a19884-70cc-4e0b-9cee-6341afe0116a

The revised query to find crdc_series_uuids not present in current dicom_all using idc-dev-etl.idc_v18_dev.all_joined_public

select
distinct se_uuid
 from `idc-dev-etl.idc_v18_dev.all_joined_public`
where
se_uuid not in (select distinct crdc_series_uuid from bigquery-public-data.idc_current.dicom_all)

The revised query (uses crdc_series_uuid instead of seriesInstanceUID to determine changes) to find series along with necessary metadata for implementing this feature:

-- Step 1: Declare variables
DECLARE idc_versions ARRAY<INT64>;
DECLARE latest_idc_version INT64;
DECLARE union_all_query STRING;

-- Step 2: Get all idc_versions
SET idc_versions = (
  SELECT [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18]
  --SELECT ARRAY_AGG(idc_version)
  --FROM 
  --`bigquery-public-data.idc_current.version_metadata`
);

SET latest_idc_version = (
  SELECT 18
  --SELECT max(idc_version)
  --FROM 
  --`bigquery-public-data.idc_current.version_metadata`
);

-- Step 3: Generate the UNION ALL query dynamically
SET union_all_query = (
  SELECT STRING_AGG(
    FORMAT("""
    SELECT 
    %d AS idc_version, 
    collection_id,
    PatientID,
    SeriesInstanceUID,
    StudyInstanceUID,
    Modality,
    regexp_extract(gcs_url, 'gs://([^/]+)/') as gcs_bucket,
    crdc_series_uuid,
    ROUND(SUM(SAFE_CAST(instance_size AS float64))/1000000, 2) AS series_size_MB,
  FROM
  `bigquery-public-data.idc_v%d.dicom_all` AS dicom_all
  where crdc_series_uuid not in (select distinct crdc_series_uuid from `bigquery-public-data.idc_v%d.dicom_all`)
  GROUP BY
  1,2,3,4,5,6,7,8

""", version, version, latest_idc_version),
    " UNION ALL "
  )
  FROM UNNEST(idc_versions) AS version
);

-- Step 4: Execute the complete query
EXECUTE IMMEDIATE FORMAT("""
WITH all_versions AS (
  %s
)
SELECT  
  collection_id,
  PatientID,
  SeriesInstanceUID,
  StudyInstanceUID,
  Modality,
  gcs_bucket,
  crdc_series_uuid,
  series_size_MB,
  CASE 
  WHEN gcs_bucket='public-datasets-idc' THEN CONCAT('s3://','idc-open-data/',crdc_series_uuid, '/*')
  WHEN gcs_bucket='idc-open-idc1' THEN CONCAT('s3://','idc-open-data-two/',crdc_series_uuid, '/*')
  WHEN gcs_bucket='idc-open-cr' THEN CONCAT('s3://','idc-open-data-cr/',crdc_series_uuid, '/*')
    END AS series_aws_url,
  MIN(idc_version) AS min_idc_version,
  MAX(idc_version) AS max_idc_version
FROM all_versions
where gcs_bucket not in ('idc-open-idc')
--and seriesInstanceUID not in (select
--distinct series_instance_uid from `idc-dev-etl.idc_v18_dev.all_joined_public`
--where
--series_instance_uid not in (select distinct seriesInstanceUID from bigquery-public-data.idc_current.dicom_all))
GROUP BY   
 1,2,3,4,5,6,7,8
""", union_all_query
);
bcli4d commented 1 month ago

There is a lot to digest here, but a couple of quick comments:

  1. idc-open-idc was our public bucket before we moved most data to the Google owned public-datasets-idc. We decided at the time to not touch BQ. To deal with this and other cases where some metadata can change (Licences), we include the mutable_metadata table which maps crdc_instance_uuid to current gcs_url, aws_url, license, doi.
  2. Re: "It turns out that seriesInstanceUID is not reliable as multiple crdc_series_uuid's have the same seriesInstanceUID." That is as it should be. Each version of a series (each version os a SeriesInstanceUID) is assigned a new crdc_series_uuid. Thus there are one or more crdc_series_uuids per SeriesInstanceUID.
  3. Re: "this table did not seem to have been updated to v18.". Oops!. Mea culpa.

I don't have time now to study the SQL above. In general, I recommend using all_joined_public joined with a table of (crdc_series_uuid, Modality, whatever).

fedorov commented 1 month ago

Once the aforementioned PR is integrated, I propose the following.

We need to support download by either IDs that may correspond to content that can change across versions (collection ID, patient ID, study/series UID) (DICOM identifiers), or by series_instance_uuid from the manifest (UUID).

I think we should treat the new prior_versions_index as a separate dataframe, alongside with the main index. Whenever there is a request to download an item, we first check the main index. If it is not located in the main index, we inform the user that what is requested is not available in the latest version of IDC, and try prior_versions_index as a fallback. If the item is found, we download and notify the user about the version it is coming from. If user requested item by a DICOM identifier, we fetch the most recent version of the files corresponding to that identifier. If request is by UUID, then there is no ambiguity.

fedorov commented 1 month ago

@vkt1414 can you please let me know if you have any concerns or questions about the above?

vkt1414 commented 1 month ago

I don't have any problem with the overall idea.

Implementation is more convoluted than one may think, as this touches a lot of methods. I'm working on this in my fork, and have made progress. Downloading a manifest with previous idc-data works, under default settings. Working on s5cmd sync now, and then plan to work on download by uuid.

Current implementation of filtering is not ideal, https://github.com/ImagingDataCommons/idc-index/blob/main/idc_index/index.py#L114 does not care if some values passed are invalid. Will fix that and query previous index when needed.

fedorov commented 1 month ago

We don't necessarily need to update all methods at once. I would first update the calls that process download from manifest and resolve CRDC UUIDs. In the following improvements we can handle DICOM UIDs. This would also make code review easier.