chanzuckerberg / cellxgene-census

CZ CELLxGENE Discover Census
https://chanzuckerberg.github.io/cellxgene-census/
MIT License

Incorrect size of result when passing `obs_coords` #1157

Open ivirshup opened 1 month ago

ivirshup commented 1 month ago

Describe the bug

There are a number of tests that look like:

```python
adata = cellxgene_census.get_anndata(
    census,
    organism="Mus musculus",
    obs_coords=slice(10000),
    var_value_filter="feature_id == 'ENSMUSG00000069581'",
)
assert adata.n_obs == 10001
```

However, this goes against the normal Python interpretation of slices. For instance:

```python
assert len(list(range(100))[slice(10)]) == 10
# or
import numpy as np
assert len(np.arange(100)[slice(10)]) == 10
```

`get_anndata`, and presumably other methods, seem to return one extra element when a slice is passed to the `{obs,var}_coords` arguments.
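To make the mismatch concrete, here is a hedged sketch (the helper `to_inclusive_coords` is hypothetical, not part of the cellxgene-census API) of how a standard half-open Python slice would need to be adjusted before being handed to a doubly inclusive reader:

```python
def to_inclusive_coords(s: slice, n: int) -> slice:
    """Hypothetical helper: convert a half-open Python slice into the
    doubly inclusive slice that a reader with TileDB-style semantics
    would interpret as selecting the same elements."""
    start, stop, step = s.indices(n)
    assert step == 1, "only contiguous forward slices are supported"
    # Python's stop is exclusive; an inclusive reader wants stop - 1.
    return slice(start, stop - 1)

# Python semantics: slice(10000) selects 10_000 elements (indices 0..9999).
assert len(range(100_000)[slice(10000)]) == 10_000

# A doubly inclusive reader would treat slice(0, 10000) as 10_001 elements
# (indices 0..10000), which matches the off-by-one seen in the bug report.
assert to_inclusive_coords(slice(10000), 100_000) == slice(0, 9999)
```

This is only an illustration of the semantic gap, not a proposed fix.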

Environment

Provide a description of your system and the software versions.

``` Package Version Editable project location ------------------ -------------------- ------------------------------------------------------------------- accelerate 0.30.1 accumulation_tree 0.6.2 aiobotocore 2.13.0 aiohttp 3.9.5 aioitertools 0.11.0 aiosignal 1.3.1 anndata 0.10.7 appnope 0.1.4 array_api_compat 1.6 asttokens 2.4.1 attrs 23.2.0 autopep8 2.1.0 backports.tarfile 1.1.1 botocore 1.34.106 cellxgene-census 1.13.2.dev7+gaa698f3 /Users/ivirshup/github/cellxgene-census/api/python/cellxgene_census certifi 2024.2.2 cfgv 3.4.0 charset-normalizer 3.3.2 click 8.1.7 comm 0.2.2 contourpy 1.2.1 coverage 7.5.1 cycler 0.12.1 datasets 2.19.1 debugpy 1.8.1 decorator 5.1.1 dill 0.3.8 distlib 0.3.8 docutils 0.21.2 execnet 2.1.1 executing 2.0.1 filelock 3.14.0 fonttools 4.51.0 frozenlist 1.4.1 fsspec 2024.3.1 geneformer 0.0.1 h5py 3.11.0 huggingface-hub 0.23.0 identify 2.5.36 idna 3.7 importlib_metadata 7.1.0 iniconfig 2.0.0 ipykernel 6.29.4 ipython 8.24.0 jaraco.classes 3.4.0 jaraco.context 5.3.0 jaraco.functools 4.0.1 jedi 0.19.1 Jinja2 3.1.4 jmespath 1.0.1 joblib 1.4.2 jupyter_client 8.6.1 jupyter_core 5.7.2 keyring 25.2.1 kiwisolver 1.4.5 legacy-api-wrap 1.4 llvmlite 0.42.0 loompy 3.0.7 markdown-it-py 3.0.0 MarkupSafe 2.1.5 matplotlib 3.9.0 matplotlib-inline 0.1.7 mdurl 0.1.2 more-itertools 10.2.0 mpmath 1.3.0 multidict 6.0.5 multiprocess 0.70.16 natsort 8.4.0 nbqa 1.8.5 nest-asyncio 1.6.0 networkx 3.3 nh3 0.2.17 nodeenv 1.8.0 numba 0.59.1 numpy 1.26.4 numpy-groupies 0.11.1 owlready2 0.46 packaging 24.0 pandas 2.2.2 parso 0.8.4 patsy 0.5.6 pexpect 4.9.0 pillow 10.3.0 pip 24.0 pkginfo 1.10.0 platformdirs 4.2.2 pluggy 1.5.0 pre-commit 3.7.1 prompt-toolkit 3.0.43 psutil 5.9.8 ptyprocess 0.7.0 pure-eval 0.2.2 pyarrow 12.0.1 pyarrow-hotfix 0.6 pycodestyle 2.11.1 Pygments 2.18.0 pynndescent 0.5.12 pyparsing 3.1.2 pytest 8.2.0 pytest-xdist 3.6.1 python-dateutil 2.9.0.post0 pytz 2024.1 pyudorandom 1.0.0 PyYAML 6.0.1 pyzmq 26.0.3 readme_renderer 43.0 regex 2024.5.15 requests 
2.31.0 requests-mock 1.12.1 requests-toolbelt 1.0.0 rfc3986 2.0.0 rich 13.7.1 s3fs 2024.3.1 safetensors 0.4.3 scanpy 1.10.1 scikit-learn 1.4.2 scikit-misc 0.3.1 scipy 1.13.0 seaborn 0.13.2 session_info 1.0.0 setuptools 69.5.1 six 1.16.0 somacore 1.0.10 stack-data 0.6.3 statsmodels 0.14.2 stdlib-list 0.10.0 sympy 1.12 tdigest 0.5.2.2 threadpoolctl 3.5.0 tiledb 0.27.1 tiledbsoma 1.9.5 tokenize-rt 5.2.0 tokenizers 0.19.1 tomli 2.0.1 torch 2.2.2 torchdata 0.7.1 tornado 6.4 tqdm 4.66.4 traitlets 5.14.3 transformers 4.41.0 twine 5.1.0 typing_extensions 4.11.0 tzdata 2024.1 umap-learn 0.5.6 urllib3 2.2.1 virtualenv 20.26.2 wcwidth 0.2.13 wheel 0.43.0 wrapt 1.16.0 xxhash 3.4.1 yarl 1.9.4 zipp 3.18.2 ```
prathapsridharan commented 1 month ago

@ivirshup @pablo-gar @ebezzi - If you trace the code down, I believe you end up calling obs.read() somewhere, which is the read() method on the DataFrame class in SOMA. In that class, the docs for read() say that _"Slices are doubly inclusive"_.

So, going by the specification, I don't think this is a bug, but I understand that it certainly seems like one from the user's perspective. We should investigate why slices are specified as doubly inclusive, which violates the Pythonic meaning of slices, where the right endpoint is excluded.

I think there is probably a good reason for it, but the justification for the design choice is not written down anywhere (at least I can't find it), so we will need to dig in and derive it.

pablo-gar commented 1 month ago

@ivirshup in your opinion, is the dissonance large enough to warrant a change in the parameter design?

ivirshup commented 1 month ago

Yes.

Is this the behaviour of tiledb or is it specific to tiledb-soma? (update: AFAICT this is only tiledb-soma)

prathapsridharan commented 1 month ago

I think it is a TileDB constraint:

> Unlike NumPy array indexing, multi_index respects TileDB's range semantics: slice ranges are inclusive of the start- and end-point, and negative ranges do not wrap around (because a TileDB dimension may have a negative domain)

It is worth asking the TileDB team for a more detailed explanation of these slice semantics (I think most languages use open-right-endpoint semantics).
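The difference between the two semantics can be sketched with plain NumPy (no TileDB required); the helper `inclusive_slice` below is an illustrative stand-in for TileDB-style inclusive ranges, not a real API:

```python
import numpy as np

a = np.arange(100)

# NumPy / Python semantics: half-open, so [0, 10) selects 10 elements.
assert a[0:10].size == 10

def inclusive_slice(arr, lo, hi):
    """Illustrative stand-in for doubly inclusive range semantics:
    selects indices lo..hi, including both endpoints."""
    return arr[lo : hi + 1]

# Doubly inclusive semantics: [0, 10] selects 11 elements.
assert inclusive_slice(a, 0, 10).size == 11
```

Under inclusive semantics a degenerate range like `[k, k]` still selects one element, which is one practical reason array databases with potentially negative integer domains often choose it.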