chanzuckerberg / cellxgene-census

CZ CELLxGENE Discover Census
https://chanzuckerberg.github.io/cellxgene-census/
MIT License

[python] cellxgene-census context handling no longer allows reading buckets outside of us-west-2 #908

Closed bkmartinjr closed 10 months ago

bkmartinjr commented 10 months ago

As of the 1.9 release, the context building in cellxgene-census has broken the ability to pass a user-specified context with an S3 region other than us-west-2. The code in _build_soma_tiledb_context() clobbers the region the user has specified in the context.
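To illustrate the kind of clobbering described above, here is a minimal sketch using plain dictionaries. It is not the actual cellxgene_census code: `merge_clobbering`, `merge_preserving`, and `DEFAULT_REGION` are hypothetical names, and the only thing taken from this report is the `vfs.s3.region` key and the two region values.

```python
# Hypothetical stand-ins for illustration; NOT the cellxgene_census internals.
DEFAULT_REGION = "us-west-2"  # assumed library default


def merge_clobbering(user_config: dict, default_region: str = DEFAULT_REGION) -> dict:
    """Apply the default region last, so it overwrites the user's value."""
    merged = dict(user_config)
    merged["vfs.s3.region"] = default_region  # unconditional write: clobbers
    return merged


def merge_preserving(user_config: dict, default_region: str = DEFAULT_REGION) -> dict:
    """Apply the default region only when the user has not set one."""
    merged = dict(user_config)
    merged.setdefault("vfs.s3.region", default_region)
    return merged


user = {"vfs.s3.region": "us-east-1"}
print(merge_clobbering(user)["vfs.s3.region"])  # us-west-2 (user setting lost)
print(merge_preserving(user)["vfs.s3.region"])  # us-east-1 (user setting kept)
```

With the first variant, a request to a us-east-1 bucket is sent to the us-west-2 endpoint, which is consistent with a PermanentRedirect response from S3.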

Example, using an array that is in a us-east-1 bucket. soma.open works fine, but our code does not:

In [19]: context=soma.options.SOMATileDBContext(tiledb_config={"vfs.s3.region":"us-east-1"})

In [20]: soma.open(uri, context=context)
Out[20]: <SparseNDArray 's3://czi.bruce-public/embed/CxG-contrib-5/' (open for 'r')>

In [21]: cellxgene_census.experimental.get_embedding_metadata(embedding_uri='s3://czi.bruce-public/embed/CxG-contrib-5',context=context)
---------------------------------------------------------------------------
TileDBError                               Traceback (most recent call last)
Cell In[21], line 1
----> 1 cellxgene_census.experimental.get_embedding_metadata(embedding_uri='s3://czi.bruce-public/embed/CxG-contrib-5',context=context)

[...snip...]

TileDBError: [TileDB::S3] Error: Error while listing with prefix 's3://czi.bruce-public/embed/CxG-contrib-5/__schema/' and delimiter '/'[Error Type: 100] [HTTP Response Code: 301] [Exception: PermanentRedirect] [Remote IP: 52.218.229.128] [Request ID: MHBH13SW1C1PK4SF] [Headers: 'content-type' = 'application/xml' 'date' = 'Wed, 20 Dec 2023 17:40:51 GMT' 'server' = 'AmazonS3' 'transfer-encoding' = 'chunked' 'x-amz-bucket-region' = 'us-east-1' 'x-amz-id-2' = 'ZvuoiXDCftJh9qEMzsmtwj6u6U/49gzog5SNmCZshyUM17JpUbmjlohuP/PgzrqJ1czl4jnXk4g=' 'x-amz-request-id' = 'MHBH13SW1C1PK4SF'] : Unable to parse ExceptionName: PermanentRedirect Message: The bucket you are attempting to access must be addressed using the specified endpoint. Please send all future requests to this endpoint.

versions:

In [23]: soma.show_package_versions()
tiledbsoma.__version__        1.6.1
TileDB-Py tiledb.version()    (0, 24, 0)
TileDB core version           2.18.2
libtiledbsoma version()       libtiledb=2.18.2
python version                3.10.12.final.0
OS version                    Linux 6.2.0-1017-aws

In [24]: cellxgene_census.__version__
Out[24]: '1.9.1'

atolopko-czi commented 10 months ago

The associated code in 1.9 dates from May, so this appears to be a long-standing bug. However, the issue may already be fixed on main as of 1cdc45023d6f63407ccf50171308023af22392ff or, if not, possibly by #902. I will add unit tests to ensure that the vfs.s3.region config setting is not clobbered when a URI is specified in open_soma().
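A regression check of the kind mentioned here could look roughly like the sketch below. It runs against a hypothetical stand-in, `resolve_s3_region`, rather than the real open_soma() code path, and the precedence it encodes (user setting first, then a region associated with the URI, then a fallback) is an assumption made for illustration, not a statement of how the library resolves regions.

```python
from typing import Optional

FALLBACK_REGION = "us-west-2"  # assumed default, for illustration only


def resolve_s3_region(user_config: dict, locator_region: Optional[str] = None) -> str:
    """Hypothetical stand-in: pick the region, giving the user's setting priority."""
    if "vfs.s3.region" in user_config:
        return user_config["vfs.s3.region"]
    return locator_region or FALLBACK_REGION


def test_user_region_is_not_clobbered() -> None:
    # (user tiledb_config, region associated with the URI, expected result)
    cases = [
        ({"vfs.s3.region": "us-east-1"}, "us-west-2", "us-east-1"),
        ({}, "eu-west-1", "eu-west-1"),
        ({}, None, "us-west-2"),
    ]
    for user_config, locator_region, expected in cases:
        assert resolve_s3_region(user_config, locator_region) == expected


test_user_region_is_not_clobbered()
```

The first case is the one from this report: a user-supplied us-east-1 must survive even when the URI would otherwise imply a different region.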

bkmartinjr commented 10 months ago

Fixed by #902