chanzuckerberg / cellxgene

An interactive explorer for single-cell transcriptomics data
https://chanzuckerberg.github.io/cellxgene/
MIT License

[BUG] s3 .h5ad files larger than 1 GB don't launch #2530

Open brookmay opened 2 years ago

brookmay commented 2 years ago

Describe the bug: cellxgene fails to launch S3 datasets (.h5ad files) that are larger than 1 GB.

To Reproduce: I'm using cellxgene version 1.0.1. We have .h5ad files ranging from 300 MB to 6-7 GB each. Files smaller than 1 GB launch from S3 without any issues, but every file larger than 1 GB fails with "Error: File not found or is inaccessible. File must be an .h5ad object. Please check your input and try again." I have tried the --backed parameter, but it still fails.

For example, below is the cellxgene command and output I get for a file about 4.5 GB in size -

(cxg) $ cellxgene launch "s3://xxxxxxxxxx/out.h5ad" --config-file config.yml --backed
[cellxgene] Starting the CLI...
[cellxgene] Loading data from out.h5ad.
Error: File not found or is inaccessible. File must be an .h5ad object. Please check your input and try again.

Note: These files launch fine if they're first locally downloaded, but we want to be able to launch cellxgene using s3 urls for our project.


bkmartinjr commented 2 years ago

Hi @brookmay - thank you for the bug report.

I have a couple of questions to help diagnose next steps:

  1. Can you provide a test case (ie, a publicly readable S3 path that fails for you, which we can also test)?
  2. cellxgene opens S3 objects by first downloading the entire H5AD into the local machine's /tmp directory, and then opens the local copy. It seems likely that the download is failing for some reason. Can you verify you have sufficient disk space in your local /tmp partition to save a file of that size? If not, are you able to create sufficient space?

Also, as you have already determined, the --backed flag is very unlikely to influence this behavior, as that flag affects the application behavior after the file is available (ie, already downloaded and available for reading).
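The download-then-open flow described above can be sketched as follows. This is a minimal illustration, not cellxgene's actual code: the helper name `copy_to_local_temp`, its parameters, and the `cellxgene_` temp-file prefix are assumptions made for the example, and the source is any readable binary stream rather than a real S3 object. It checks free space before copying and verifies the byte count afterwards, so a truncated copy is reported instead of being left behind as an unusable file.

```python
import os
import shutil
import tempfile


def copy_to_local_temp(src, expected_size, tmp_dir=None, chunk_size=1024 * 1024):
    """Copy a readable binary stream into a temporary .h5ad file.

    Hypothetical helper for illustration only; returns the local path.
    """
    tmp_dir = tmp_dir or tempfile.gettempdir()

    # Fail early if the temp partition cannot hold the whole file.
    free = shutil.disk_usage(tmp_dir).free
    if free < expected_size:
        raise OSError(
            f"need {expected_size} bytes in {tmp_dir}, only {free} bytes free"
        )

    tmp = tempfile.NamedTemporaryFile(
        prefix="cellxgene_", suffix=".h5ad", dir=tmp_dir, delete=False
    )
    try:
        with tmp:
            # Stream in chunks so the file never has to fit in memory.
            shutil.copyfileobj(src, tmp, chunk_size)
        written = os.path.getsize(tmp.name)
        if written != expected_size:
            raise OSError(
                f"incomplete copy: wrote {written} of {expected_size} bytes"
            )
    except Exception:
        # Do not leave a partial or empty file behind.
        os.unlink(tmp.name)
        raise
    return tmp.name
```

With a real remote object, `src` would be whatever file-like handle the storage client returns, and `expected_size` would come from the object's metadata.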

bkmartinjr commented 2 years ago

I have not been able to locally reproduce (tested this with a 3GB H5AD on S3, running cellxgene on my laptop). Will need a bit more info to help diagnose.


Side note: looking at the code that is likely involved (DataLocator.local_handle()), we could definitely do a better job of reporting errors if they occur -- that might help diagnose this type of failure.
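One way to report such errors more clearly is to wrap the fetch step and re-raise with the underlying cause in the message. The sketch below is only an illustration of that idea and does not reproduce DataLocator.local_handle(); `DataLoadError` and `fetch_with_context` are hypothetical names, and `fetch` stands for any callable that retrieves the data.

```python
class DataLoadError(Exception):
    """Hypothetical error type that keeps the underlying cause visible."""


def fetch_with_context(fetch, uri):
    """Call fetch(uri) and surface the real failure reason on error."""
    try:
        return fetch(uri)
    except Exception as exc:
        # Include the original exception type and text, and chain it,
        # instead of collapsing everything into a generic message.
        raise DataLoadError(
            f"Could not fetch {uri!r}: {type(exc).__name__}: {exc}"
        ) from exc
```

With this pattern, a failure such as a full disk or a denied request would show up in the message itself rather than as a generic "File not found or is inaccessible" error.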

brookmay commented 2 years ago

Hi @bkmartinjr,

It's weird cause I do have space on my local machine -

~$ df -h /tmp
Filesystem     Size   Used  Avail Capacity  iused     ifree %iused  Mounted on
/dev/disk1s1  466Gi  377Gi   61Gi    87% 11040905 634988320    2%   /System/Volumes/Data

brookmay commented 2 years ago

We're also running cellxgene in a Docker container via cellxgene-gateway (https://github.com/Novartis/cellxgene-gateway), which also uses cellxgene version 1.0.1.

On the container, I can see some h5ad files in /tmp.

[ec2-user@i-xyzxzyzyz ~]$ docker exec -it et9573bd1 /bin/bash
(base) [docker@et9573bd1 ~]$ ls -l /tmp/
total 12
-rw-------. 1 docker docker   0 Jul  6 21:35 cellxgene__1eukuuu.h5ad
-rw-------. 1 docker docker   0 Jul  6 21:35 cellxgene_kp9_uhqn.h5ad
-rw-------. 1 docker docker   0 Jul  7 12:20 cellxgene_nt405nm_.h5ad
-rw-------. 1 docker docker   0 Jul  6 21:35 cellxgene_rs_9h_jb.h5ad
-rwx------. 1 root   root   701 Sep 15  2021 ks-script-4luisyla
-rwx------. 1 root   root   671 Sep 15  2021 ks-script-o23i7rc2
-rwx------. 1 root   root   291 Sep 15  2021 ks-script-x6ei4wuu

And the /tmp directory has space too -

(base) [docker@et9573bd1 ~]$ df -h /tmp/
Filesystem      Size  Used Avail Use% Mounted on
overlay          42G   13G   27G  33% /
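The four cellxgene_*.h5ad files in the listing above are all 0 bytes, which would be consistent with a download that created the temporary file but never wrote any data (an inference from the listing, not something confirmed in this thread). A small sketch for spotting such files; the function name `find_empty_h5ad` and the glob pattern are assumptions based on the file names shown above:

```python
from pathlib import Path


def find_empty_h5ad(tmp_dir="/tmp", pattern="cellxgene_*.h5ad"):
    """Return zero-byte temp files matching the pattern, sorted by path."""
    return sorted(
        p
        for p in Path(tmp_dir).glob(pattern)
        if p.is_file() and p.stat().st_size == 0
    )


if __name__ == "__main__":
    for path in find_empty_h5ad():
        print(f"empty temp file: {path}")
```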

@bkmartinjr Do you have any publicly readable s3 files that I can try out?

ebezzi commented 2 years ago

Hi @brookmay, we're actively investigating this and we have two questions that could help us:

  1. Were you able to reproduce this issue without running cellxgene in Docker?
  2. When running it in Docker, how much allocated RAM does your Docker environment have?

brookmay commented 2 years ago

Hi @ebezzi, to answer your questions -

  1. Yes, I get the same error when I try to launch cellxgene using an S3 URL. You can refer to the first comment in this issue.
  2. Here is a snap of docker stats -
    
    CONTAINER ID   NAME         CPU %   MEM USAGE / LIMIT     MEM %   NET I/O          BLOCK I/O
    1abc           container1   0.01%   58.27MiB / 61.98GiB   0.09%   2.49GB / 416MB   0B / 0B
    2xyz           container2   0.01%   61.74MiB / 10GiB      0.60%   6.82GB / 471MB   0B / 0B

I tried to launch cellxgene using an S3 URL on both containers, and both failed.

bkmartinjr commented 2 years ago

> Do you have any publicly readable s3 files that I can try out?

I have temporarily put a very large (4.8GB, 1M+ cell) H5AD here: s3://czi.bruce-public/tmp/be48f323-749f-4ac4-b95e-51831778eca1.h5ad

Please let me know the results of your test, so that I can delete the file when you are finished.

I have confirmed it works fine when launched from my laptop (albeit slowly, as it had to download):

$ python --version
Python 3.9.7
$ cellxgene --version
[cellxgene] Version 1.0.1
$ cellxgene launch --verbose s3://czi.bruce-public/tmp/be48f323-749f-4ac4-b95e-51831778eca1.h5ad 
[cellxgene] Starting the CLI...
[cellxgene] Loading data from be48f323-749f-4ac4-b95e-51831778eca1.h5ad.
[cellxgene] Warning: Anndata data matrix is sparse, but not a CSC (columnar) matrix.  Performance may be improved by using CSC. 
[cellxgene] Warning: Obs annotation 'sample' has 1001288 categories, this may be cumbersome or slow to display. We recommend setting the --max-category-items option to 500, this will hide categorical annotations with more than 500 categories in the UI 
[cellxgene] Warning: Var annotation 'feature_name' has 46483 categories, this may be cumbersome or slow to display. We recommend setting the --max-category-items option to 500, this will hide categorical annotations with more than 500 categories in the UI 
WARNING:root:Type float64 will be converted to 32 bit float and may lose precision.
WARNING:root:Type float64 will be converted to 32 bit float and may lose precision.
WARNING:root:Type float64 will be converted to 32 bit float and may lose precision.
[cellxgene] CAUTION: due to the size of your dataset, running differential expression may take longer or fail.
[cellxgene] Launching! Please go to http://localhost:5005 in your browser.
[cellxgene] Type CTRL-C at any time to exit.

If this doesn't work, I suspect @ebezzi will need to provide you with an instrumented version to test. Based on the info above, it appears to be failing during the S3 download (before it tries to open the file, it copies it to the tmp directory).

Could you also provide us with the output of pip list and pip --version so that we can see the package versions you use to run cellxgene?

ebezzi commented 2 years ago

@brookmay I have prepared a version of cellxgene with additional logging that will hopefully help debug your issue. If you can reach out to me at ebezzi@chanzuckerberg.com, I will send you the package.