ImagingDataCommons / idc-index-data

Python package providing the index to query and download data hosted by the NCI Imaging Data Commons
MIT License

include gcs urls similar to aws urls in the index #14

Closed vkt1414 closed 4 months ago

vkt1414 commented 5 months ago

@DanielaSchacherer noted that it would also be great to have gcs URLs in the index. This will allow us to provide an option for the user to pick the platform of their choice.

@fedorov your thoughts?

jcfr commented 5 months ago

Currently the AWS and GCP endpoints are set in idc_index/index.py^1. Then, in the idc_index.index.download_from_manifest function, the GCP endpoint is used^2 if downloading from AWS fails.

A few questions ...

If data are available in only one of the storage backends, updating the index to indicate which one should be used to download the data would indeed be sensible.

vkt1414 commented 5 months ago

Are AWS and GCP storage mirrors of each other?

Yes, data in GCP buckets are cloned to AWS buckets. So, we maintain two copies of the data.

I'll try my best to explain how the data is stored. For about a year now, all data has been stored in folders at the series level. See below for an example: crdc_series_uuid is the unique ID assigned (by the Cancer Research Data Commons, I believe) to a series folder, and similarly crdc_instance_uuid identifies an individual DICOM file.

(screenshot: example series-level bucket layout, i.e. bucket/crdc_series_uuid/crdc_instance_uuid.dcm)

So, the URLs are the same in GCP and AWS from the folder level down; only the bucket names differ between GCP and AWS. To sum it up, the GCP and AWS URLs for any given DICOM file differ only in their prefixes:

gs://bucket1/crdc_series_uuid/crdc_instance_uuid.dcm
s3://bucket2/crdc_series_uuid/crdc_instance_uuid.dcm

What makes distinguishing a GCS URL from an AWS URL complicated is that s5cmd requires any S3-style URL to use the s3:// prefix. So, we must rewrite GCS URLs from gs:// to s3:// to make them work with s5cmd. When a user downloads a manifest from the IDC portal, all URLs start with s3://, and there is no way to tell which endpoint to use just by looking at the URLs. So, what Andrey did in [2] is to list the contents of a folder with s5cmd ls and check whether any response comes back, to determine the correct endpoint.
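Not the actual idc-index logic, just a rough sketch of that probe, assuming s5cmd exits non-zero when a listing fails (the bucket and UUID below are placeholders):

```python
import subprocess

# Hypothetical series folder; real manifests use s3://<bucket>/<crdc_series_uuid>/*
series_url = "s3://example-bucket/00000000-0000-0000-0000-000000000000/*"

def listable(url, endpoint=None):
    """Return True if `s5cmd ls` finds anything at the given URL/endpoint."""
    cmd = ["s5cmd", "--no-sign-request"]
    if endpoint:
        cmd += ["--endpoint-url", endpoint]
    cmd += ["ls", url]
    return subprocess.run(cmd, capture_output=True).returncode == 0

if listable(series_url):
    endpoint = None  # default AWS S3 endpoint works
elif listable(series_url, "https://storage.googleapis.com"):
    endpoint = "https://storage.googleapis.com"  # GCS via its S3-compatible API
else:
    raise RuntimeError("series not found on either endpoint")
```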

So, all we need to know is the GCS bucket name, and we can build GCS URLs from AWS URLs efficiently. But we may want to have them directly available in the index instead of expecting the user to construct GCS URLs themselves.
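For illustration, the URL rewriting described above amounts to something like this (the bucket names and the hypothetical gcs_bucket value are placeholders, not actual IDC buckets):

```python
def aws_to_gcs_url(aws_url, gcs_bucket):
    """Swap the AWS bucket for the GCS bucket; the object path is identical."""
    # aws_url looks like s3://<aws_bucket>/<crdc_series_uuid>/<crdc_instance_uuid>.dcm
    path = aws_url.removeprefix("s3://").split("/", 1)[1]
    return f"gs://{gcs_bucket}/{path}"

def gcs_url_for_s5cmd(gcs_url):
    """s5cmd only accepts s3:// URLs, so swap the scheme and point s5cmd at
    the GCS S3-compatible endpoint (https://storage.googleapis.com)."""
    return "s3://" + gcs_url.removeprefix("gs://")

print(aws_to_gcs_url("s3://bucket2/series-uuid/instance-uuid.dcm", "bucket1"))
# gs://bucket1/series-uuid/instance-uuid.dcm
```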

DanielaSchacherer commented 5 months ago

So, the URLs are the same in GCP and AWS from the folder level down; only the bucket names differ. To sum it up, the GCP and AWS URLs for any given DICOM file differ only in their prefixes: gs://bucket1/crdc_series_uuid/crdc_instance_uuid.dcm vs s3://bucket2/crdc_series_uuid/crdc_instance_uuid.dcm

But isn't bucket1 = public-datasets-idc and bucket2 = idc-open-data in all cases, so we could just hardcode this?

vkt1414 commented 5 months ago

Unfortunately, not always!

(screenshots: example index rows where the GCS and AWS bucket names differ from those defaults)

DanielaSchacherer commented 5 months ago

Ok, I see.

pieper commented 5 months ago

I understood we prefer AWS URLs since they provide usage data, and that there is no cost difference to users (either should be free). So is there a performance argument for providing GCS URLs? It would be a shame if adding GCS URLs made the package a lot bigger.

fedorov commented 5 months ago

I am against adding GCS URLs.

  1. It will significantly add to the size
  2. It will not add any new functionality - data is the same in AWS
  3. We discourage downloads from GCS because we cannot track usage from GCS
  4. It is just one new column, but it will add to the overall complexity of the index. It's a slippery slope.
vkt1414 commented 5 months ago

The impact of adding GCS URLs seems minimal for now. The raw CSV is 185 MB vs 214 MB, and the compressed CSV zip is 48 MB vs 49.1 MB after adding the GCS URLs. But I agree that the index will keep growing with every release and we must be cautious.

@jcfr, a question for you: the v18 compressed CSV index will be about 75 MB and the Parquet is 94 MB. If we make the switch to Parquet, I believe we will be right at PyPI's 100 MB limit. What are our options to be sustainable going forward?

fedorov commented 5 months ago

The impact of adding GCS URLs seems minimal for now.

Of course it would be minimal. But every extra column will have an impact, and should be added only if absolutely necessary.

What are our options to be sustainable going forward?

I suggest we strive to keep idc-index under 100 MB, dropping columns if necessary, and download additional indices on demand at runtime, without packaging them. I don't have any better ideas right now, but I am very interested in your opinion @jcfr.
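For what it is worth, "download on demand" could be as simple as the sketch below, fetching an index that is not packaged in the wheel the first time it is needed (the release-asset URL and file name are placeholders, not existing artifacts):

```python
from pathlib import Path
from urllib.request import urlretrieve

# Placeholder URL: a hypothetical extra index published as a GitHub release asset.
INDEX_URL = "https://github.com/ImagingDataCommons/idc-index-data/releases/download/vX.Y.Z/extra_index.parquet"

def fetch_index(cache_dir=Path.home() / ".idc-index-data"):
    """Download the extra index once and cache it locally for later runs."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    target = cache_dir / "extra_index.parquet"
    if not target.exists():
        urlretrieve(INDEX_URL, target)
    return target
```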

jcfr commented 5 months ago

to provide an option for the user to pick the platform of their choice

Why do we think the user cares about where the data are organized? I would think this is an implementation detail.


selection of endpoint

If providing a way for the user to select a specific platform for retrieving the data is sensible, it should probably be done through the idc-index package API.

Based on https://github.com/ImagingDataCommons/idc-index-data/issues/14#issuecomment-2035093705, it looks like there is no need to explicitly provide such an option.

That said, since we have full control over how idc-index-data is used, generating the correct GCS URLs from within idc-index-data seems reasonable for the time being.


csv.zip vs parquet

Why should we distribute both? It looks like, moving forward, we could distribute only the .parquet file and provide a convenient way to generate the .csv.zip through a function in idc-index (or idc-index-data).
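If we go that way, the helper could be as small as this sketch (function and file names are illustrative, not an existing API):

```python
import pandas as pd

def parquet_to_csv_zip(parquet_path="idc_index.parquet", csv_zip_path="idc_index.csv.zip"):
    """Regenerate the csv.zip from the shipped Parquet file for users who want CSV."""
    df = pd.read_parquet(parquet_path)
    df.to_csv(csv_zip_path, index=False, compression={"method": "zip"})
```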


parquet and csv compression

Moving away from zip and looking into zstd^1 may be sensible here:

It helps reduce the size:

$ ls -1ash idc_index*
 31M idc_index-10.parquet.zst
 31M idc_index-15.parquet.zst
 28M idc_index-22.parquet.zst
 33M idc_index-3.parquet.zst
177M idc_index.csv
 46M idc_index.csv.zip
 41M idc_index.csv.zst

based on the following code:

import pandas as pd

# Load the packaged CSV index, then re-save it as zstd-compressed CSV and as
# Parquet at several zstd compression levels to compare sizes.
file_path = "/tmp/idc_index_data-17.0.0-py3-none-any/idc_index_data/idc_index.csv.zip"
index_df = pd.read_csv(file_path, dtype=str, encoding="utf-8")

csv_file_name = "/tmp/idc_index_data-17.0.0-py3-none-any/idc_index_data/idc_index.csv.zst"
print(f"csv_file_name [{csv_file_name}]")
index_df.to_csv(csv_file_name, compression={"method": "zstd"}, escapechar="\\")

for level in [3, 10, 15, 22]:
    parquet_file_name = f"/tmp/idc_index_data-17.0.0-py3-none-any/idc_index_data/idc_index-{level}.parquet.zst"
    print(f"parquet_file_name [{parquet_file_name}]")
    index_df.to_parquet(parquet_file_name, compression="zstd", compression_level=level)

Note that compression_level does not seem to be supported with to_csv

vkt1414 commented 5 months ago

Why do we think the user cares about where the data are organized? I would think this is an implementation detail.

It's only for advanced users who want to explore the index table itself, when the API options provided by idc-index are not sufficient for their use case. For example, the case Daniela stumbled upon, where she wanted to use GCS URLs to easily transition her notebooks from BigQuery to idc-index. But I think we cannot accommodate this until we figure out the PyPI limits.

csv.zip vs parquet: Why should we distribute both? It looks like, moving forward, we could distribute only the .parquet file and provide a convenient way to generate the .csv.zip through a function in idc-index (or idc-index-data).

Sorry for the confusion: we want to switch from CSV to Parquet, so there will be no CSV file at all. duckdb integrates extremely well with Parquet, as highlighted here. For the pathology index, I have proposed using a nested index, as the pathology folks require some instance-level attributes, and duckdb gives BigQuery a run for its money in terms of its ability to handle complex data types. I plan to do a performance comparison between CSV and Parquet as well.

https://duckdb.org/2021/05/14/sql-on-pandas.html https://www.christophenicault.com/post/large_dataframe_arrow_duckdb/
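To give a flavor of that integration (the file name and column name below are assumptions about the index schema, not guaranteed):

```python
import duckdb

# Query the Parquet index directly; DuckDB reads it without loading a CSV first.
top_collections = duckdb.query(
    """
    SELECT collection_id, COUNT(*) AS n_series
    FROM 'idc_index.parquet'
    GROUP BY collection_id
    ORDER BY n_series DESC
    LIMIT 5
    """
).to_df()
print(top_collections)
```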

@jcfr this is brilliant!

index_df.to_parquet(parquet_file_name, compression="zstd", compression_level=level)

I tried with v18: the level-22 zstd-compressed Parquet is 46 MB now, vs 96 MB with the default Snappy compression that pandas uses when writing Parquet. The pathology index that I'm currently working on is over 200 MB with Snappy, but just 11 MB compressed with zstd. Thank you!

I guess zstd will keep us happy for a few more releases, but I'm curious how we can leverage GitHub release attachments again, where we can store up to 2 GB per file. I guess this would mean publishing only source releases and not wheels. But I also do not want to increase complexity by expecting the user to have access to BigQuery.

fedorov commented 5 months ago

For example, the case Daniela stumbled upon, where she wanted to use GCS URLs to easily transition her notebooks from BigQuery to idc-index.

@DanielaSchacherer what are the downsides of using aws_url?

DanielaSchacherer commented 5 months ago

@fedorov right now the main downside for me is that it requires multiple changes to my utility code in idc-comppath-reproducibility, which I was hoping to keep the way it was. Also, I usually found it more consistent to have both the notebook and the data on Google.

I know those are not strong arguments in this discussion. I wasn't aware of the complications that GCS URLs would bring along for idc-index.

fedorov commented 5 months ago

I think it is better to change the utility code. It will have to evolve over time anyway. We can tag the current version with a release that accompanies the paper, for reproducibility.

vkt1414 commented 4 months ago

I'll close this issue.

download additional indices on demand at runtime, without packaging them.

If we ever cross PyPI's 100 MB limit, I agree that we can use GitHub release attachments to store large files. Looking much further into the future, if we cross the 2 GB/file release-attachment limit, duckdb has us covered.

import duckdb

# Rewrite a set of Parquet files into zstd-compressed output, split into ~1.9 GB chunks.
sql = '''
COPY
    (FROM '/content/parquetfiles/parquet_export/*.parquet')
    TO 'pivot.parquet'
    (FORMAT 'parquet', COMPRESSION 'zstd', FILE_SIZE_BYTES '1.9GB');
'''
duckdb.query(sql)

A simple query like this can split a large Parquet dataset into files of approximately 2 GB each. We can download multiple files if needed, and duckdb can query across them as if they were a single file, as sketched below.
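To sketch that last point (the glob below is a placeholder for wherever the downloaded chunks live), querying the chunks as one table could look like:

```python
import duckdb

# A glob of Parquet chunks is treated as a single table.
counts = duckdb.query(
    "SELECT COUNT(*) AS n FROM read_parquet('indices/pivot_*.parquet')"
).to_df()
print(counts["n"][0])
```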