Closed · rcannood closed this issue 7 months ago
It appears that the `Xcoo` object I'm extracting from the query is the whole CELLxGENE Census, since:

```python
>>> exp.obs.read().concat().to_pandas().shape
(5255245, 21)
```

I can therefore use `obs["soma_joinid"]` to retrieve the expected X:

```python
Xcoos = Xcoo.to_scipy().tocsr()
Xcoos_subset = Xcoos[obs["soma_joinid"]]
print(f"Xcoos_subset.shape: {Xcoos_subset.shape}")  # (301796, 52392)
```
A valid workaround for my issue can therefore be:

```python
import anndata as ad

def query_to_anndata(query):
    obs = query.obs().concat().to_pandas()
    var = query.var().concat().to_pandas()
    # convert the Arrow COO tensor to scipy before slicing
    X = query.X("raw").coos().concat().to_scipy().tocsr()
    X_subset = X[obs["soma_joinid"]]
    return ad.AnnData(X=X_subset, obs=obs, var=var)
```
I can reproduce the issue by running:

```shell
docker run --rm -i --memory=20g --entrypoint python ghcr.io/openproblems-bio/datasets/loaders/query_cellxgene_census@sha256:62a0b3ea4beb4432231c28168326cc5eb74b7ee263d09e1eec18c18f56a960ba - << EOF
import cellxgene_census

# this runs out of memory and crashes
with cellxgene_census.open_soma() as census:
    ad = cellxgene_census.get_anndata(
        census=census,
        organism="mus_musculus",
        obs_value_filter="dataset_id == '49e4ffcc-5444-406d-bdee-577127404ba8'"
    )
    print(f"AnnData: {ad}", flush=True)
EOF
echo "Exit code: $?"
```
This results in exit code 137, i.e. the process was killed after running out of memory:

```shell
$ ./script.sh
The "stable" release is currently 2023-07-25. Specify 'census_version="2023-07-25"' in future calls to open_soma() to ensure data consistency.
Exit code: 137
```
I'm going to try to build a container from scratch to see if I can get it to reproduce.
Hey @rcannood , thanks for your interest in the Census!
I was able to run your original snippet with the `get_anndata` call on my machine (an M2 laptop with 32 GB of memory) without issues. In total, the Python process ended up taking about 17 GB. Could you post more information about your system configuration, such as the output of `free`, whether you ran this exclusively within Docker, etc.? If you prefer, you can also mail soma@chanzuckerberg.com and we can move the conversation there.
To clarify the issue with the matrix size: when you call `query.X("raw")`, what you get is an "indexed" matrix that has the dimensions of the whole Census experiment, and you can access rows/columns by their corresponding `soma_joinid`. The returned matrix is sparse and is only defined (non-zero) where it matches the query result, so the effective allocated memory is going to be smaller.
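To illustrate this shape behaviour with a toy example (the sizes, values, and join ids below are made up for illustration, not real Census data): a sparse matrix can carry the full experiment's row dimension while only the queried rows hold data, and slicing the CSR form by the join ids yields the compact result.

```python
import numpy as np
from scipy import sparse

# Toy "census": 10 rows in total, but only rows 2, 5 and 7 match our query.
n_rows_total, n_cols = 10, 4
join_ids = np.array([2, 5, 7])

# Build a COO matrix with nonzeros only in the queried rows,
# mimicking an "indexed" matrix sized to the whole experiment.
rows = np.repeat(join_ids, n_cols)
cols = np.tile(np.arange(n_cols), len(join_ids))
vals = np.arange(1, len(rows) + 1, dtype=float)
X = sparse.coo_matrix((vals, (rows, cols)), shape=(n_rows_total, n_cols))

print(X.shape)         # full experiment shape: (10, 4)
X_subset = X.tocsr()[join_ids]
print(X_subset.shape)  # compact query shape: (3, 4)
```

The memory cost is driven by the number of stored nonzeros, not by the nominal row dimension, which is why the census-sized matrix is cheaper than its shape suggests.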
@rcannood please let us know if you keep observing this issue consistently. As Emanuele mentioned it would be helpful to get as much info as possible from your system.
I'm currently facing a similar issue, but I want to download all of the data. Is there a way to download everything using get_anndata without running out of memory?
I'm using a MacBook Pro M1 Max with 64 GB RAM and a 2 TB SSD.
Here is the code I used to download the data and save it locally:
```python
import cellxgene_census
import anndata

with cellxgene_census.open_soma(census_version="2023-12-15") as census:
    adata = cellxgene_census.get_anndata(
        census,
        organism="Homo sapiens",
        obs_value_filter="is_primary_data == True"
    )

# filter adata for only obs assay column mentioning "10x" or including "microwell-seq"
adata = adata[adata.obs['assay'].isin(['10x', 'microwell-seq']), :]

# save adata
adata.write('human.h5ad')
```
Hey Pablo! After having rebuilt my Docker container using the latest packages, I could no longer reproduce the issue. Some combination of packages must have triggered this behaviour.
If ever the issue arises again, I can track down the Docker container in question. For now, I will close this issue.
> I'm currently facing a similar issue, but I want to download all of the data.
Given the amount of RAM you have, I think that this is expected behaviour. If you want to import pretty much all of cellxgene_census, AnnData might not be the right data format for you and it might be better to stick with TileDB-SOMA.
Since the issue you're encountering is technically different from mine, I'll close this issue. If you want to continue discussing how best to ingest all of cellxgene-census, it might be best to start a separate discussion. (Feel free to link to this issue if you do!)
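As a generic illustration of the out-of-core idea (a toy scipy sketch of chunked processing, not the TileDB-SOMA API; the chunk size and data are made up): rather than materializing one giant matrix, you can stream over row chunks so only one chunk is resident at a time.

```python
import numpy as np
from scipy import sparse

def process_in_chunks(X_csr, chunk_rows, fn):
    """Apply fn to successive row slices of a CSR matrix.

    Stand-in for streaming reads: only one chunk is held in memory at a time.
    """
    results = []
    for start in range(0, X_csr.shape[0], chunk_rows):
        chunk = X_csr[start:start + chunk_rows]
        results.append(fn(chunk))
    return results

# Toy data: 1000 "cells" x 50 "genes".
rng = np.random.default_rng(0)
X = sparse.random(1000, 50, density=0.05, random_state=rng, format="csr")

# Example: per-chunk row sums, concatenated back together.
sums = np.concatenate([np.asarray(s).ravel()
                       for s in process_in_chunks(X, 100, lambda c: c.sum(axis=1))])
print(sums.shape)  # (1000,)
```

The same pattern applies whenever a per-cell computation (QC metrics, filtering, normalization) doesn't need the whole matrix at once.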
@imtiendat0311 - your example code is attempting to load almost the entire Census into memory, followed by in-memory slicing on `assay` metadata. In this version of the Census, the example `get_anndata()` call will read ~36 million cells, which will need at least 0.5 TiB of memory to store the resulting AnnData object.
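A rough back-of-envelope check makes the scale concrete (only the ~36 million cell count comes from the comment above; the per-cell nonzero count and byte costs below are illustrative assumptions, not measured Census statistics):

```python
# Hypothetical figures for a rough estimate.
n_cells = 36_000_000     # cells matched by the filter (from the comment above)
nnz_per_cell = 2_000     # assumed average nonzero counts per cell
bytes_per_nnz = 4 + 4    # assumed CSR cost: float32 value + int32 column index

x_bytes = n_cells * nnz_per_cell * bytes_per_nnz
print(f"~{x_bytes / 2**40:.2f} TiB for X alone")  # prints "~0.52 TiB for X alone"
```

Even under these conservative assumptions, the X matrix alone exceeds half a tebibyte before counting obs/var metadata and indexing overhead.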
If you instead filter `assay` during the `get_anndata()` call using `obs_value_filter` (i.e., read only the cells you need), it will require far less memory. For example:
```python
adata = cellxgene_census.get_anndata(
    census,
    organism="Homo sapiens",
    obs_value_filter="is_primary_data == True and assay in ['10x', 'microwell-seq']"
)
```
To see the number of cells that will be returned by any given filter, you can load just the `obs` dataframe and inspect its size, e.g.:
```python
In [13]: obs_value_filter = """is_primary_data == True and assay in ['10x', 'microwell-seq']"""
    ...:
    ...: with cellxgene_census.open_soma(census_version="2023-12-15") as census:
    ...:     obs_df = (
    ...:         census["census_data"]["homo_sapiens"]
    ...:         .obs.read(
    ...:             value_filter=obs_value_filter,
    ...:             column_names=["soma_joinid", "is_primary_data", "assay"],
    ...:         )
    ...:         .concat()
    ...:         .to_pandas()
    ...:     )
    ...:
    ...: print(repr(obs_df))

        soma_joinid  is_primary_data          assay
0          35847242             True  microwell-seq
1          35847243             True  microwell-seq
2          35847244             True  microwell-seq
3          35847245             True  microwell-seq
4          35847246             True  microwell-seq
...             ...              ...            ...
625170     62923341             True  microwell-seq
625171     62923342             True  microwell-seq
625172     62923343             True  microwell-seq
625173     62923344             True  microwell-seq
625174     62923345             True  microwell-seq

[625175 rows x 3 columns]
```
The resulting AnnData returned by this request is much smaller (<4 GiB):

```python
In [24]: adata = cellxgene_census.get_anndata(
    ...:     census,
    ...:     organism="Homo sapiens",
    ...:     obs_value_filter="is_primary_data == True and assay in ['10x', 'microwell-seq']"
    ...: )

In [25]: adata.__sizeof__() / 1024**3
Out[25]: 3.2756535205990076
```
Describe the bug
When I try to extract an AnnData from CELLxGENE census using the Python interface, the process runs out of memory (>200GB) and crashes.
To Reproduce
The following script runs out of memory and crashes:
However, simply downloading the source h5ad to disk and reading it in works quite well.
By digging a little into the implementation of `cellxgene_census`, `tiledbsoma` and `somacore`, I get a general idea of what could be going wrong. When running the query manually, I get the same number of cells as in the source h5ad. When I fetch the obs and the var, I also get reasonable dimensions.
However, when I take a look at the X, the number of cells suddenly jumps from 301'796 to 5'255'245.
Fetching all of the data as a sparse matrix takes a bit of time, but it works quite well.
This leaves me wondering why the number of cells differs depending on whether I'm looking at the `.obs` or the `.X`, and whether the `.X` matrix is being fetched as a dense matrix when querying `cellxgene_census`. Am I overlooking something here?
Expected behavior
I'd expect `cellxgene_census.get_anndata(...)` to return a 301'796 × 52'392 AnnData object with `X` being a sparse matrix.

Environment
Machine: x86-64 system with 32 threads and 128 GB RAM
OS: Fedora 39
Python: 3.11
Output of `pip list`:

```
$ pip list
Package  Version  Editable project location
aiobotocore  2.7.0
aiohttp  3.8.6
aioitertools  0.11.0
aiosignal  1.3.1
anndata  0.8.0
anyio  4.0.0
argon2-cffi  23.1.0
argon2-cffi-bindings  21.2.0
array-api-compat  1.4
arrow  1.3.0
asttokens  2.4.0
async-lru  2.0.4
async-timeout  4.0.3
attrs  23.1.0
Babel  2.13.0
backcall  0.2.0
bleach  6.1.0
botocore  1.31.64
cachetools  5.3.1
cellxgene-census  1.9.1.dev1+g3892ef4  /home/rcannood/workspace/cellxgene-census/api/python/cellxgene_census
cellxgene-schema  3.1.3
certifi  2023.7.22
chardet  5.2.0
charset-normalizer  3.3.2
click  8.1.3
comm  0.1.4
contourpy  1.2.0
coverage  7.2.3
cssutils  2.9.0
cycler  0.12.1
Cython  0.29.34
debugpy  1.8.0
decorator  5.1.1
defusedxml  0.7.1
executing  2.0.0
fastjsonschema  2.18.1
fastobo  0.12.2
fonttools  4.44.3
fqdn  1.5.1
frozenlist  1.4.0
fsspec  2023.10.0
h11  0.14.0
h5py  3.10.0
httpcore  0.18.0
httpx  0.25.0
idna  3.4
iniconfig  2.0.0
ipykernel  6.25.2
ipython  8.16.1
ipython-genutils  0.2.0
ipywidgets  8.1.1
isoduration  20.11.0
jedi  0.19.1
Jinja2  3.1.2
jmespath  1.0.1
joblib  1.3.2
json5  0.9.14
jsonpointer  2.4
jsonschema  4.19.1
jsonschema-specifications  2023.7.1
jupyter  1.0.0
jupyter_client  8.4.0
jupyter-console  6.6.3
jupyter_core  5.4.0
jupyter-events  0.8.0
jupyter-lsp  2.2.0
jupyter_server  2.8.0
jupyter_server_terminals  0.4.4
jupyterlab  4.0.7
jupyterlab-pygments  0.2.2
jupyterlab_server  2.25.0
jupyterlab-widgets  3.0.9
kiwisolver  1.4.5
llvmlite  0.40.1
matplotlib  3.8.1
matplotlib-inline  0.1.6
mistune  3.0.2
multidict  6.0.4
natsort  8.4.0
nbclient  0.8.0
nbconvert  7.9.2
nbformat  5.9.2
nest-asyncio  1.5.8
networkx  3.2.1
notebook  7.0.6
notebook_shim  0.2.3
numba  0.57.0
numpy  1.23.2
overrides  7.4.0
Owlready2  0.38
packaging  23.2
pandas  1.4.4
pandocfilters  1.5.0
parso  0.8.3
patsy  0.5.3
pickleshare  0.7.5
Pillow  10.1.0
pip  23.3
platformdirs  3.11.0
pluggy  1.3.0
premailer  3.10.0
prometheus-client  0.17.1
pronto  2.5.5
pure-eval  0.2.2
pyarrow  14.0.1
pyarrow-hotfix  0.6
pynndescent  0.5.10
pyparsing  3.1.1
pytest  7.2.2
python-dateutil  2.8.2
python-json-logger  2.0.7
python-telegram-bot  20.6
pytz  2023.3.post1
PyYAML  6.0
pyzmq  25.1.1
qtconsole  5.4.4
QtPy  2.4.0
referencing  0.30.2
requests  2.31.0
requests-mock  1.11.0
rfc3339-validator  0.1.4
rfc3986-validator  0.1.1
rpds-py  0.10.6
ruamel.yaml  0.18.0
s3fs  2023.10.0
scanpy  1.9.6
scikit-learn  1.3.2
scipy  1.11.3
seaborn  0.12.2
semver  3.0.0
Send2Trash  1.8.2
session-info  1.0.0
setuptools  68.0.0
six  1.16.0
sniffio  1.3.0
somacore  1.0.6
stack-data  0.6.3
statsmodels  0.14.0
stdlib-list  0.9.0
tbb  2021.11.0
terminado  0.17.1
threadpoolctl  3.2.0
tiledb  0.24.0
tiledbsoma  1.6.0
tinycss2  1.2.1
tornado  6.3.3
tqdm  4.66.1
traitlets  5.11.2
types-python-dateutil  2.8.19.14
typing_extensions  4.8.0
tzdata  2023.3
umap-learn  0.5.4
uri-template  1.3.0
urllib3  2.0.7
webcolors  1.13
webencodings  0.5.1
websocket-client  1.6.4
wheel  0.40.0
widgetsnbextension  4.0.9
wrapt  1.16.0
yagmail  0.15.293
yarl  1.9.2
```