chanzuckerberg / cellxgene-census

CZ CELLxGENE Discover Census
https://chanzuckerberg.github.io/cellxgene-census/
MIT License
84 stars 20 forks source link

The APIs `download_source_h5ad`/`get_source_h5ad_uri` must get fetch h5ads with Discover API #616

Closed pablo-gar closed 9 months ago

pablo-gar commented 1 year ago

Blocked by #6 and #548. Related to #89

Once dataset_version_id is included in the Census data, then the APIs download_source_h5ad/get_source_h5ad_uri must use ~the Discover API~ GET from CloudFront instead of H5AD copies currently deposited in the Census S3 bucket (see #289)

Tasks:

bkmartinjr commented 1 year ago

Why not just remove / deprecate these API? I.e., what convenience do they provide over the Discover API?

If you do elect to retain them, there are other issues to be resolved, e.g., the return type assumes it can return any URI scheme (e.g., s3://...), but with this change it will only return https://.... Probably should simplify the function to return only a URI string

See related notes in #89 - eg, we likely need a schema change as well.

pablo-gar commented 1 year ago

@atolopko-czi did a retroactive analysis on the S3 logs for the Census bucket h5ads. He showed that, since May 2023:

There have been 959,924 H5AD downloads:

These come from a total of 25 unique IPs, top 13:

remoteip    count(1)
134.174.140.55  192446
34.135.174.4    98065
35.238.225.20   91556
150.230.44.69   87392
165.1.79.188    85876
150.230.148.164 82975
173.82.202.31   54713
138.2.237.78    35299
54.151.103.146  7688
138.2.224.226   7662
74.125.77.23    4920
74.125.77.22    4755
74.125.77.21    4308

These data show that users (albeit a small number) likely use these APIs (download_source_h5ad/get_source_h5ad_uri) for programatic access to H5ADs. With this information I'm inclined to keep maintaining the APIs.

bkmartinjr commented 9 months ago

Obsoleted by #935 - will not fix.