chanzuckerberg / cellxgene-census

CZ CELLxGENE Discover Census
https://chanzuckerberg.github.io/cellxgene-census/
MIT License

KeyError on download_source_h5ad with Valid Dataset ID in cellxgene_census #1100

Closed ubyndr closed 3 months ago

ubyndr commented 3 months ago

Describe the bug

When attempting to download a dataset using the cellxgene_census Python library, the function download_source_h5ad fails with a KeyError indicating an 'Unknown dataset_id'. The dataset ID used does exist as per the URL provided, suggesting a possible issue with the dataset ID mapping or retrieval process within the library.

To Reproduce

Steps to reproduce the behavior:

  1. Import the cellxgene_census library in Python.
  2. Attempt to download the dataset with the following code:
    
    import cellxgene_census
    cellxgene_census.download_source_h5ad("0500e103-38db-456d-9c3f-b96b8a693ab2", to_path="0500e103-38db-456d-9c3f-b96b8a693ab2_.h5ad")
  3. Observe the KeyError on execution.

Expected behavior

The dataset with ID 0500e103-38db-456d-9c3f-b96b8a693ab2 should download without errors and be saved to the specified output path, 0500e103-38db-456d-9c3f-b96b8a693ab2_.h5ad.

Environment



Package                   Version
------------------------- --------------
aiobotocore               2.12.0
aiohttp                   3.9.3
aioitertools              0.11.0
aiosignal                 1.3.1
airium                    0.2.6
anndata                   0.9.2
annotated-types           0.6.0
antlr4-python3-runtime    4.9.3
anyio                     4.3.0
appdirs                   1.4.4
appnope                   0.1.4
argon2-cffi               23.1.0
argon2-cffi-bindings      21.2.0
array_api_compat          1.4.1
arrow                     1.3.0
asttokens                 2.4.1
async-lru                 2.0.4
async-timeout             4.0.3
attrs                     23.2.0
awscli                    1.32.53
Babel                     2.14.0
bcp47                     0.0.4
beautifulsoup4            4.12.3
binaryornot               0.4.4
black                     24.2.0
bleach                    6.1.0
botocore                  1.34.51
cas-tools                 0.0.1.dev32
cattrs                    23.2.3
cell-annotation-schema    0.2b0
cellxgene-census          1.13.0
certifi                   2024.2.2
cffi                      1.16.0
chardet                   5.2.0
charset-normalizer        3.3.2
class_resolver            0.4.3
click                     8.1.7
colorama                  0.4.4
comm                      0.2.1
contourpy                 1.2.0
cookiecutter              2.6.0
curies                    0.7.9
cycler                    0.12.1
dataclasses-json          0.6.4
debugpy                   1.8.1
decorator                 5.1.1
deepmerge                 1.1.0
defusedxml                0.7.1
Deprecated                1.2.14
deprecation               2.1.0
docutils                  0.16
et-xmlfile                1.1.0
eutils                    0.6.0
exceptiongroup            1.2.0
executing                 2.0.1
fastjsonschema            2.19.1
fastobo                   0.12.3
fonttools                 4.49.0
fqdn                      1.5.1
frozenlist                1.4.1
fsspec                    2024.2.0
funowl                    0.2.3
get-annotations           0.1.2
h11                       0.14.0
h5py                      3.10.0
hbreader                  0.9.1
httpcore                  1.0.4
httpx                     0.27.0
idna                      3.6
ijson                     3.2.3
importlib-metadata        7.0.1
importlib_resources       6.1.2
iniconfig                 2.0.0
ipykernel                 6.29.3
ipython                   8.18.1
ipython-genutils          0.2.0
ipywidgets                8.1.2
isodate                   0.6.1
isoduration               20.11.0
isort                     5.13.2
jedi                      0.19.1
Jinja2                    3.1.3
jmespath                  1.0.1
joblib                    1.3.2
json-flattener            0.1.9
json5                     0.9.17
jsonasobj                 1.3.1
jsonasobj2                1.0.4
jsonpointer               2.4
jsonschema                4.4.0
jsonschema-specifications 2023.12.1
jsonschema2md             1.1.0
jupyter                   1.0.0
jupyter_client            8.6.0
jupyter-console           6.6.3
jupyter_core              5.7.1
jupyter-events            0.9.0
jupyter-lsp               2.2.3
jupyter_server            2.12.5
jupyter_server_terminals  0.5.2
jupyterlab                4.1.2
jupyterlab_pygments       0.3.0
jupyterlab_server         2.25.3
jupyterlab_widgets        3.0.10
kgcl-rdflib               0.5.0
kgcl_schema               0.6.8
kiwisolver                1.4.5
lark                      1.1.9
linkml-renderer           0.3.0
linkml-runtime            1.7.5
llvmlite                  0.42.0
lxml                      5.2.1
markdown-it-py            3.0.0
MarkupSafe                2.1.5
marshmallow               3.21.0
matplotlib                3.8.3
matplotlib-inline         0.1.6
mdurl                     0.1.2
mistune                   3.0.2
more-click                0.1.2
multidict                 6.0.5
mypy-extensions           1.0.0
natsort                   8.4.0
nbclient                  0.9.0
nbconvert                 7.16.1
nbformat                  5.9.2
ndex2                     3.8.0
nest-asyncio              1.6.0
networkx                  3.2.1
notebook                  5.7.5
notebook_shim             0.2.4
numba                     0.59.0
numpy                     1.26.4
oaklib                    0.5.25
ols-client                0.1.4
ontoportal-client         0.0.4
openpyxl                  3.1.2
ordered-set               4.1.0
overrides                 7.7.0
packaging                 23.2
pandas                    2.2.1
pandasaurus               0.3.8
pandasaurus-cxg           0.1.11
pandocfilters             1.5.1
pansql                    0.0.1
parso                     0.8.3
pathspec                  0.12.1
patsy                     0.5.6
pexpect                   4.9.0
pillow                    10.2.0
pip                       23.3.1
platformdirs              4.2.0
pluggy                    1.4.0
prefixcommons             0.1.12
prefixmaps                0.2.3
prometheus_client         0.20.0
prompt-toolkit            3.0.43
pronto                    2.5.6
psutil                    5.9.8
ptyprocess                0.7.0
pure-eval                 0.2.2
pyarrow                   12.0.1
pyarrow-hotfix            0.6
pyasn1                    0.5.1
pycparser                 2.21
pydantic                  2.7.0
pydantic_core             2.18.1
Pygments                  2.17.2
pygraphviz                1.11
PyJSG                     0.11.10
pynndescent               0.5.11
pyparsing                 3.1.1
pyrsistent                0.20.0
pysolr                    3.9.0
pystow                    0.5.4
pytest                    8.1.1
pytest-logging            2015.11.4
python-dateutil           2.9.0.post0
python-json-logger        2.0.7
python-slugify            8.0.4
PyTrie                    0.4.0
pytz                      2024.1
PyYAML                    6.0.1
pyzmq                     25.1.2
qtconsole                 5.5.1
QtPy                      2.4.1
ratelimit                 2.2.1
rdflib                    6.3.2
rdflib-jsonld             0.6.1
rdflib-shim               1.0.3
referencing               0.33.0
requests                  2.31.0
requests-cache            1.2.0
requests-toolbelt         1.0.0
rfc3339-validator         0.1.4
rfc3986-validator         0.1.1
rfc3987                   1.3.8
rich                      13.7.1
rpds-py                   0.18.0
rsa                       4.7.2
ruamel.yaml               0.18.6
ruamel.yaml.clib          0.2.8
s3fs                      2024.2.0
s3transfer                0.10.0
scanpy                    1.9.8
scikit-learn              1.4.1.post1
scipy                     1.12.0
seaborn                   0.13.2
semsimian                 0.2.15
semsql                    0.3.3
Send2Trash                1.8.2
session-info              1.0.0
setuptools                68.2.2
six                       1.16.0
sniffio                   1.3.1
somacore                  1.0.10
sortedcontainers          2.4.0
soupsieve                 2.5
SPARQLWrapper             2.0.0
SQLAlchemy                2.0.29
SQLAlchemy-Utils          0.38.3
sssom                     0.4.7
sssom-schema              0.15.2
stack-data                0.6.3
statsmodels               0.14.1
stdlib-list               0.10.0
terminado                 0.18.0
text-unidecode            1.3
threadpoolctl             3.3.0
tiledb                    0.27.1
tiledbsoma                1.9.3
tinycss2                  1.2.1
tomli                     2.0.1
tornado                   6.4
tqdm                      4.66.2
traitlets                 5.14.1
types-python-dateutil     2.9.0.20240316
typing_extensions         4.10.0
typing-inspect            0.9.0
tzdata                    2024.1
umap-learn                0.5.5
uri-template              1.3.0
url-normalize             1.4.3
urllib3                   1.26.18
validators                0.28.0
wcwidth                   0.2.13
webcolors                 1.13
webencodings              0.5.1
websocket-client          1.7.0
wheel                     0.41.3
widgetsnbextension        4.0.10
wrapt                     1.16.0
yarl                      1.9.4
zipp                      3.17.0

Additional context

This issue blocks data download tasks that are critical for downstream analysis. The dataset is available and can be accessed directly in a browser, which points to a potential problem in the library's URI handling or dataset ID validation logic.

ebezzi commented 3 months ago

Hey @ubyndr,

the dataset is not part of the Census because its assay (snmC-Seq2, ontology term EFO:0030027) is not among the accepted assays.
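
One way to check this before calling download_source_h5ad is to look the ID up in the Census datasets table. The sketch below is illustrative and is not the library's own validation logic: `is_in_census` and `load_census_dataset_ids` are hypothetical helper names, and the code assumes the table is reachable at `census["census_info"]["datasets"]` and has a `dataset_id` column.

```python
from typing import Iterable, Set


def is_in_census(dataset_id: str, census_dataset_ids: Iterable[str]) -> bool:
    """Return True if dataset_id appears among the given Census dataset IDs."""
    return dataset_id in set(census_dataset_ids)


def load_census_dataset_ids(census_version: str = "stable") -> Set[str]:
    """Read the dataset IDs listed in the Census datasets table.

    Hypothetical helper: needs network access and the cellxgene_census package.
    Assumes the table lives at census["census_info"]["datasets"] and exposes
    a "dataset_id" column.
    """
    # Imported lazily so is_in_census() stays usable without the package.
    import cellxgene_census

    with cellxgene_census.open_soma(census_version=census_version) as census:
        datasets = census["census_info"]["datasets"].read().concat().to_pandas()
    return set(datasets["dataset_id"])
```

Given the explanation above, `is_in_census("0500e103-38db-456d-9c3f-b96b8a693ab2", load_census_dataset_ids())` would be expected to return False, which matches the 'Unknown dataset_id' KeyError.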

If you're only interested in downloading its h5ad, you can generate a download link from its collection page in the CELLxGENE Discover portal. For this dataset, you can use this link:

https://datasets.cellxgene.cziscience.com/ef4ead27-5d4f-4c6c-9b44-69aaa049388c.h5ad
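
To fetch such a link from a script instead of a browser, a minimal standard-library sketch could look like the following. `download_h5ad` is a hypothetical helper, not part of cellxgene_census, and it assumes the URL can be read without authentication.

```python
import shutil
import urllib.request
from pathlib import Path


def download_h5ad(url: str, to_path: str, chunk_size: int = 1024 * 1024) -> Path:
    """Stream the file at `url` to `to_path` and return the destination path.

    Hypothetical helper; assumes the URL is readable without authentication.
    """
    destination = Path(to_path)
    # Create the target directory if it does not exist yet.
    destination.parent.mkdir(parents=True, exist_ok=True)
    # Copy in chunks so a large H5AD file is never held fully in memory.
    with urllib.request.urlopen(url) as response, destination.open("wb") as out:
        shutil.copyfileobj(response, out, length=chunk_size)
    return destination
```

For the dataset in this issue, the call would be `download_h5ad("https://datasets.cellxgene.cziscience.com/ef4ead27-5d4f-4c6c-9b44-69aaa049388c.h5ad", "0500e103-38db-456d-9c3f-b96b8a693ab2_.h5ad")`.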

Let me know if you have any other questions.

ubyndr commented 3 months ago

Thank you for clarifying this. I appreciate it. I don't have any further questions.