chanzuckerberg / cellxgene-census

Out of memory when downloading one dataset #891

Closed rcannood closed 7 months ago

rcannood commented 10 months ago

Describe the bug

When I try to extract an AnnData from CELLxGENE census using the Python interface, the process runs out of memory (>200GB) and crashes.

To Reproduce

The following script runs out of memory and crashes:

import cellxgene_census
import anndata as ad
import tiledbsoma as soma

species = "mus_musculus"
dataset_id = "49e4ffcc-5444-406d-bdee-577127404ba8"

census = cellxgene_census.open_soma()

# this runs out of memory and crashes
# (assigned to "adata" so it does not shadow the anndata import above)
adata = cellxgene_census.get_anndata(
    census=census,
    obs_value_filter=f"dataset_id == '{dataset_id}'",
    organism=species
)

However, simply downloading the source h5ad to disk and reading it from there works fine.

# downloading the source h5ad works fine (about 4 GB)
cellxgene_census.download_source_h5ad(dataset_id, to_path="dataset.h5ad")
adata = ad.read_h5ad("dataset.h5ad")
print(f"adata.shape: {adata.shape}") # (301796, 31334)

By digging a little into the implementations of cellxgene_census, tiledbsoma and somacore, I got a general idea of what could be going wrong. When I run the query manually, I get the same number of cells as in the source h5ad:

# fetching the query manually
exp = census["census_data"][species]
query = exp.axis_query(
    "RNA",
    obs_query=soma.AxisQuery(value_filter=f"dataset_id == '{dataset_id}'"),
)
n_obs = query.n_obs
n_vars = query.n_vars
print(f"query shape: ({n_obs}, {n_vars})") # (301796, 52392)

When I fetch the obs and the var, I also get reasonable dimensions:

# first get the obs and var
obs = query.obs().concat().to_pandas()
var = query.var().concat().to_pandas()
print(f"obs.shape: {obs.shape}") # (301796, 21)
print(f"var.shape: {var.shape}") # (52392, 4)

However, when I take a look at the X, the number of cells suddenly jumps from 301'796 to 5'255'245.

X = query.X("raw")
print(f"X.shape: {X.shape}") # (5255245, 52392)

Fetching all of the data as a sparse matrix takes a while, but it works fine.

Xcoo = X.coos().concat()
print(f"Xcoo.shape: {Xcoo.shape}") # (5255245, 52392)

This leaves me wondering why the number of cells differs depending on whether I'm looking at the .obs or the .X, and whether the .X matrix is being fetched as a dense matrix when querying cellxgene_census.

Am I overlooking something here?

Expected behavior

I'd expect cellxgene_census.get_anndata(...) to return a 301'796 × 52'392 AnnData object with X being a sparse matrix.

Environment

Machine: x86-64 system with 32 threads and 128 GB RAM
OS: Fedora 39
Python: 3.11

Output of pip list:

```
$ pip list
Package                   Version             Editable project location
------------------------- ------------------- ---------------------------------------------------------------------------
aiobotocore               2.7.0
aiohttp                   3.8.6
aioitertools              0.11.0
aiosignal                 1.3.1
anndata                   0.8.0
anyio                     4.0.0
argon2-cffi               23.1.0
argon2-cffi-bindings      21.2.0
array-api-compat          1.4
arrow                     1.3.0
asttokens                 2.4.0
async-lru                 2.0.4
async-timeout             4.0.3
attrs                     23.1.0
Babel                     2.13.0
backcall                  0.2.0
bleach                    6.1.0
botocore                  1.31.64
cachetools                5.3.1
cellxgene-census          1.9.1.dev1+g3892ef4 /home/rcannood/workspace/cellxgene-census/api/python/cellxgene_census
cellxgene-schema          3.1.3
certifi                   2023.7.22
chardet                   5.2.0
charset-normalizer        3.3.2
click                     8.1.3
comm                      0.1.4
contourpy                 1.2.0
coverage                  7.2.3
cssutils                  2.9.0
cycler                    0.12.1
Cython                    0.29.34
debugpy                   1.8.0
decorator                 5.1.1
defusedxml                0.7.1
executing                 2.0.0
fastjsonschema            2.18.1
fastobo                   0.12.2
fonttools                 4.44.3
fqdn                      1.5.1
frozenlist                1.4.0
fsspec                    2023.10.0
h11                       0.14.0
h5py                      3.10.0
httpcore                  0.18.0
httpx                     0.25.0
idna                      3.4
iniconfig                 2.0.0
ipykernel                 6.25.2
ipython                   8.16.1
ipython-genutils          0.2.0
ipywidgets                8.1.1
isoduration               20.11.0
jedi                      0.19.1
Jinja2                    3.1.2
jmespath                  1.0.1
joblib                    1.3.2
json5                     0.9.14
jsonpointer               2.4
jsonschema                4.19.1
jsonschema-specifications 2023.7.1
jupyter                   1.0.0
jupyter_client            8.4.0
jupyter-console           6.6.3
jupyter_core              5.4.0
jupyter-events            0.8.0
jupyter-lsp               2.2.0
jupyter_server            2.8.0
jupyter_server_terminals  0.4.4
jupyterlab                4.0.7
jupyterlab-pygments       0.2.2
jupyterlab_server         2.25.0
jupyterlab-widgets        3.0.9
kiwisolver                1.4.5
llvmlite                  0.40.1
matplotlib                3.8.1
matplotlib-inline         0.1.6
mistune                   3.0.2
multidict                 6.0.4
natsort                   8.4.0
nbclient                  0.8.0
nbconvert                 7.9.2
nbformat                  5.9.2
nest-asyncio              1.5.8
networkx                  3.2.1
notebook                  7.0.6
notebook_shim             0.2.3
numba                     0.57.0
numpy                     1.23.2
overrides                 7.4.0
Owlready2                 0.38
packaging                 23.2
pandas                    1.4.4
pandocfilters             1.5.0
parso                     0.8.3
patsy                     0.5.3
pickleshare               0.7.5
Pillow                    10.1.0
pip                       23.3
platformdirs              3.11.0
pluggy                    1.3.0
premailer                 3.10.0
prometheus-client         0.17.1
pronto                    2.5.5
pure-eval                 0.2.2
pyarrow                   14.0.1
pyarrow-hotfix            0.6
pynndescent               0.5.10
pyparsing                 3.1.1
pytest                    7.2.2
python-dateutil           2.8.2
python-json-logger        2.0.7
python-telegram-bot       20.6
pytz                      2023.3.post1
PyYAML                    6.0
pyzmq                     25.1.1
qtconsole                 5.4.4
QtPy                      2.4.0
referencing               0.30.2
requests                  2.31.0
requests-mock             1.11.0
rfc3339-validator         0.1.4
rfc3986-validator         0.1.1
rpds-py                   0.10.6
ruamel.yaml               0.18.0
s3fs                      2023.10.0
scanpy                    1.9.6
scikit-learn              1.3.2
scipy                     1.11.3
seaborn                   0.12.2
semver                    3.0.0
Send2Trash                1.8.2
session-info              1.0.0
setuptools                68.0.0
six                       1.16.0
sniffio                   1.3.0
somacore                  1.0.6
stack-data                0.6.3
statsmodels               0.14.0
stdlib-list               0.9.0
tbb                       2021.11.0
terminado                 0.17.1
threadpoolctl             3.2.0
tiledb                    0.24.0
tiledbsoma                1.6.0
tinycss2                  1.2.1
tornado                   6.3.3
tqdm                      4.66.1
traitlets                 5.11.2
types-python-dateutil     2.8.19.14
typing_extensions         4.8.0
tzdata                    2023.3
umap-learn                0.5.4
uri-template              1.3.0
urllib3                   2.0.7
webcolors                 1.13
webencodings              0.5.1
websocket-client          1.6.4
wheel                     0.40.0
widgetsnbextension        4.0.9
wrapt                     1.16.0
yagmail                   0.15.293
yarl                      1.9.2
```
rcannood commented 10 months ago

It appears that the Xcoo object I'm extracting from the query is the whole cellxgene census, since:

>>> exp.obs.read().concat().to_pandas().shape
(5255245, 21)

I can therefore use the obs["soma_joinid"] to retrieve the expected X:

Xcoos = Xcoo.to_scipy().tocsr()
Xcoos_subset = Xcoos[obs["soma_joinid"]]
print(f"Xcoos_subset.shape: {Xcoos_subset.shape}") # (301796, 52392)

A valid workaround for my issue can therefore be:

def query_to_anndata(query):
    obs = query.obs().concat().to_pandas()
    var = query.var().concat().to_pandas()

    # the concatenated COO tensor spans the whole experiment, so convert it
    # to scipy and keep only the rows belonging to the queried cells
    X = query.X("raw").coos().concat().to_scipy().tocsr()
    X_subset = X[obs["soma_joinid"]]

    return ad.AnnData(
        X=X_subset,
        obs=obs,
        var=var
    )
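
For reference, this is how the workaround would be used with the query built earlier (a sketch; the expected shape is the one reported above):

adata = query_to_anndata(query)
print(f"adata.shape: {adata.shape}") # expected: (301796, 52392)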
rcannood commented 10 months ago

I can reproduce the issue by running:

docker run --rm -i --memory=20g --entrypoint python ghcr.io/openproblems-bio/datasets/loaders/query_cellxgene_census@sha256:62a0b3ea4beb4432231c28168326cc5eb74b7ee263d09e1eec18c18f56a960ba - << EOF
import cellxgene_census

# this runs out of memory and crashes
with cellxgene_census.open_soma() as census:
  ad = cellxgene_census.get_anndata(
    census=census,
    organism="mus_musculus",
    obs_value_filter=f"dataset_id == '49e4ffcc-5444-406d-bdee-577127404ba8'"
  )

  print(f"AnnData: {ad}", flush=True)
EOF

echo "Exit code: $?"

This results in exit code 137 (128 + SIGKILL), i.e. the container was killed for exceeding its memory limit:

$ ./script.sh 
The "stable" release is currently 2023-07-25. Specify 'census_version="2023-07-25"' in future calls to open_soma() to ensure data consistency.
Exit code: 137

I'm going to try to build a container from scratch to see if I can get it to reproduce.

ebezzi commented 10 months ago

Hey @rcannood, thanks for your interest in the Census!

I was able to run your original get_anndata snippet on my machine (an M2 laptop with 32 GB of memory) without issues; in total, the Python process ended up taking about 17 GB. Could you post more information about your system configuration: the output of free, whether you ran this exclusively within Docker, etc.? If you prefer, you can also mail soma@chanzuckerberg.com and we can move the conversation there.

To clarify the issue with the matrix size: when you call query.X("raw"), what you get back is an "indexed" matrix with the dimensions of the whole Census experiment, whose rows/columns are addressed by their corresponding soma_joinid. The returned matrix is sparse and is only defined (non-zero) where it matches the query result, so the effective allocated memory is going to be much smaller than the nominal shape suggests.
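
To make that concrete, here is a sketch of remapping the whole-experiment coordinates into compact per-query coordinates. It assumes the query object from the snippets above and the query.indexer helper; treat it as an illustration rather than an official recipe:

import scipy.sparse as sp

# read X as Arrow tables; soma_dim_0/soma_dim_1 hold whole-experiment joinids
tbl = query.X("raw").tables().concat()

# remap joinids to 0..n_obs-1 / 0..n_vars-1 positions within this query
obs_idx = query.indexer.by_obs(tbl["soma_dim_0"])
var_idx = query.indexer.by_var(tbl["soma_dim_1"])

X_compact = sp.coo_matrix(
    (tbl["soma_data"].to_numpy(), (obs_idx, var_idx)),
    shape=(query.n_obs, query.n_vars),
).tocsr()
print(f"X_compact.shape: {X_compact.shape}") # (301796, 52392), not (5255245, 52392)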

pablo-gar commented 9 months ago

@rcannood please let us know if you keep observing this issue consistently. As Emanuele mentioned it would be helpful to get as much info as possible from your system.

imtiendat0311 commented 7 months ago

I'm currently facing a similar issue, but I want to download all of the data. Is there a way to download everything using get_anndata without running out of memory?

I'm using a MacBook Pro M1 Max, 64 GB RAM, 2 TB SSD.

Here is the code I used to download the data and save it locally:

import cellxgene_census
import anndata

with cellxgene_census.open_soma(census_version="2023-12-15") as census:
    adata = cellxgene_census.get_anndata(
        census,
        organism = "Homo sapiens",
        obs_value_filter = "is_primary_data == True"
    )
    # filter adata for only obs assay column mentioning "10x" or including "microwell-seq"
    adata = adata[adata.obs['assay'].isin(['10x', 'microwell-seq']), :]
    # save adata
    adata.write('human.h5ad')
rcannood commented 7 months ago

@rcannood please let us know if you keep observing this issue consistently. As Emanuele mentioned it would be helpful to get as much info as possible from your system.

Hey Pablo! After having rebuilt my Docker container using the latest packages, I could no longer reproduce the issue. Some combination of packages must have triggered this behaviour.

If ever the issue arises again, I can track down the Docker container in question. For now, I will close this issue.

I'm currently facing a similar issue, but I want to download all of the data.

Given the amount of RAM you have, I think this is expected behaviour. If you want to import pretty much all of cellxgene_census, AnnData might not be the right data format for you; it might be better to stick with TileDB-SOMA.

Since the issue you're encountering is technically different from mine, I'll close this issue. If you want to continue the discussion on how best to ingest all of cellxgene-census, it might be best to start a separate discussion; feel free to link to this issue if you do! A sketch of the kind of chunked TileDB-SOMA access I mean is below.
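
As an illustration, here is a minimal sketch of streaming a large query in batches instead of materializing one AnnData. Assumptions: the axis_query API shown earlier, X(...).tables() yielding Arrow tables incrementally, and process_batch as a hypothetical per-batch handler:

import cellxgene_census
import tiledbsoma as soma

with cellxgene_census.open_soma(census_version="2023-12-15") as census:
    exp = census["census_data"]["homo_sapiens"]
    query = exp.axis_query(
        "RNA",
        obs_query=soma.AxisQuery(value_filter="is_primary_data == True"),
    )
    # tables() returns an iterator, so only one batch is resident at a time
    for batch in query.X("raw").tables():
        process_batch(batch)  # hypothetical handler, e.g. accumulate statistics
    query.close()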

bkmartinjr commented 7 months ago

@imtiendat0311 - your example code is attempting to load almost the entire Census into memory, followed by in-memory slicing on assay metadata. In this version of the Census, the example get_anndata() call will read ~36 million cells, which will need at least 0.5 TiB of memory to store the resulting AnnData object.

If you instead filter assay during the get_anndata() call, using obs_value_filter (i.e. read only the cells you need), it will require far less memory. For example:

    adata = cellxgene_census.get_anndata(
        census,
        organism = "Homo sapiens",
        obs_value_filter = "is_primary_data == True and assay in ['10x', 'microwell-seq']"
    )

To see the number of cells that will be returned by any given filter, you can load just the obs dataframe and inspect its size, e.g.:

In [13]: obs_value_filter = """is_primary_data == True and assay in ['10x', 'microwell-seq']"""
    ...: 
    ...: with cellxgene_census.open_soma(census_version="2023-12-15") as census:
    ...:     obs_df = (
    ...:         census["census_data"]["homo_sapiens"]
    ...:         .obs.read(
    ...:             value_filter=obs_value_filter,
    ...:             column_names=["soma_joinid", "is_primary_data", "assay"],
    ...:         )
    ...:         .concat()
    ...:         .to_pandas()
    ...:     )
    ...: 
    ...:     print(repr(obs_df))
    ...: 
        soma_joinid  is_primary_data          assay
0          35847242             True  microwell-seq
1          35847243             True  microwell-seq
2          35847244             True  microwell-seq
3          35847245             True  microwell-seq
4          35847246             True  microwell-seq
...             ...              ...            ...
625170     62923341             True  microwell-seq
625171     62923342             True  microwell-seq
625172     62923343             True  microwell-seq
625173     62923344             True  microwell-seq
625174     62923345             True  microwell-seq

[625175 rows x 3 columns]

The resulting AnnData returned by this request is much smaller (<4 GiB):

In [24]: adata = cellxgene_census.get_anndata(
    ...:     census,
    ...:     organism = "Homo sapiens",
    ...:     obs_value_filter = "is_primary_data == True and assay in ['10x', 'microwell-seq']"
    ...: )

In [25]: adata.__sizeof__()/1024**3
Out[25]: 3.2756535205990076