aertslab / pycisTopic

pycisTopic is a Python module to simultaneously identify cell states and cis-regulatory topics from single cell epigenomics data.

Problems running the `compute_qc_stats()` [PERFORMANCE] #31

Closed: maxim-h closed this issue 1 year ago

maxim-h commented 2 years ago

Describe the bug

When following Single_sample_workflow.ipynb, I get an error while executing `compute_qc_stats()`. I'm using one of the samples from the NeurIPS 2021 BMMC dataset.

To Reproduce

```
from pycisTopic.qc import *
path_to_regions = {'s1d1': outDir + 'consensus_peak_calling/consensus_regions.bed'}
metadata_bc, profile_data_dict = compute_qc_stats(fragments_dict = fragments_dict,
                tss_annotation = annot,
                stats=['barcode_rank_plot', 'duplicate_rate', 'insert_size_distribution', 'profile_tss', 'frip'],
                label_list = None,
                path_to_regions = path_to_regions,
                n_cpu = 12,
                valid_bc = None,
                n_frag = 100,
                n_bc = None,
                tss_flank_window = 1000,
                tss_window = 50,
                tss_minimum_signal_window = 100,
                tss_rolling_window = 10,
                remove_duplicates = True,
                _temp_dir = "/scratch/kholmato/" + 'ray_spill')
```
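The call above references `fragments_dict`, `annot`, and `outDir` from earlier tutorial steps. Roughly, they look like this (paths are illustrative, and the BioMart query is a shortened sketch of the tutorial's, with scaffold filtering and 'chr' prefixing omitted):

```
import pybiomart as pbm

outDir = '/scratch/kholmato/pycisTopic_output/'  # hypothetical output directory
fragments_dict = {'s1d1': '/scratch/kholmato/PROJECTS/NeurIPS2021/fragment_files/s1d1/atac_fragments.tsv.gz'}

# TSS annotation: transcription start sites from Ensembl BioMart
dataset = pbm.Dataset(name='hsapiens_gene_ensembl', host='http://www.ensembl.org')
annot = dataset.query(attributes=['chromosome_name', 'transcription_start_site',
                                  'strand', 'external_gene_name', 'transcript_biotype'])
annot.columns = ['Chromosome', 'Start', 'Strand', 'Gene', 'Transcript_type']
annot = annot[annot.Transcript_type == 'protein_coding']
```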

Error output

Command output:

```
2022-05-10 09:28:21,105 cisTopic     INFO     n_cpu is larger than the number of samples. Setting n_cpu to the number of samples
2022-05-10 09:28:27,947 INFO services.py:1412 -- View the Ray dashboard at http://127.0.0.1:8265
(raylet)     loop.run_until_complete(agent.run())
(raylet)   File "/home/kholmato/.micromamba/envs/scenicplus/lib/python3.8/asyncio/base_events.py", line 616, in run_until_complete
(raylet)     return future.result()
(raylet)   File "/home/kholmato/.micromamba/envs/scenicplus/lib/python3.8/site-packages/ray/dashboard/agent.py", line 178, in run
(raylet)     modules = self._load_modules()
(raylet)   File "/home/kholmato/.micromamba/envs/scenicplus/lib/python3.8/site-packages/ray/dashboard/agent.py", line 120, in _load_modules
(raylet)     c = cls(self)
(raylet)   File "/home/kholmato/.micromamba/envs/scenicplus/lib/python3.8/site-packages/ray/dashboard/modules/reporter/reporter_agent.py", line 161, in __init__
(raylet)     self._metrics_agent = MetricsAgent(
(raylet)   File "/home/kholmato/.micromamba/envs/scenicplus/lib/python3.8/site-packages/ray/_private/metrics_agent.py", line 75, in __init__
(raylet)     prometheus_exporter.new_stats_exporter(
(raylet)   File "/home/kholmato/.micromamba/envs/scenicplus/lib/python3.8/site-packages/ray/_private/prometheus_exporter.py", line 332, in new_stats_exporter
(raylet)     exporter = PrometheusStatsExporter(
(raylet)   File "/home/kholmato/.micromamba/envs/scenicplus/lib/python3.8/site-packages/ray/_private/prometheus_exporter.py", line 265, in __init__
(raylet)     self.serve_http()
(raylet)   File "/home/kholmato/.micromamba/envs/scenicplus/lib/python3.8/site-packages/ray/_private/prometheus_exporter.py", line 319, in serve_http
(raylet)     start_http_server(
(raylet)   File "/home/kholmato/.micromamba/envs/scenicplus/lib/python3.8/site-packages/prometheus_client/exposition.py", line 168, in start_wsgi_server
(raylet)     TmpServer.address_family, addr = _get_best_family(addr, port)
(raylet)   File "/home/kholmato/.micromamba/envs/scenicplus/lib/python3.8/site-packages/prometheus_client/exposition.py", line 157, in _get_best_family
(raylet)     infos = socket.getaddrinfo(address, port)
(raylet)   File "/home/kholmato/.micromamba/envs/scenicplus/lib/python3.8/socket.py", line 918, in getaddrinfo
(raylet)     for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
(raylet) socket.gaierror: [Errno -2] Name or service not known
(raylet)
(raylet) During handling of the above exception, another exception occurred:
(raylet)
(raylet) Traceback (most recent call last):
(raylet)   File "/home/kholmato/.micromamba/envs/scenicplus/lib/python3.8/site-packages/ray/dashboard/agent.py", line 407, in <module>
(raylet)     gcs_publisher = GcsPublisher(args.gcs_address)
(raylet) TypeError: __init__() takes 1 positional argument but 2 were given
2022-05-10 09:31:45,265 WARNING worker.py:1326 -- A worker died or was killed while executing a task by an unexpected system error. To troubleshoot the problem, check the logs for the dead worker. RayTask ID: c844f38e73700f0a4092937dde2f6c2a1e9317c301000000 Worker ID: 90e5694e1b68f0f6d4866e912f15492dc66870249d2681b505f79e20 Node ID: d0e2524ba25878415b949b86b56f6569d19338e5c52ec713e016e65b Worker IP address: 10.133.125.109 Worker port: 37471 Worker PID: 365
(compute_qc_stats_ray pid=1047) 2022-05-10 09:32:02,008 cisTopic     INFO     Reading s1d1
2022-05-10 09:35:07,922 WARNING worker.py:1326 -- A worker died or was killed while executing a task by an unexpected system error. To troubleshoot the problem, check the logs for the dead worker. RayTask ID: 5d185699b76f18f8685f983012c6b70ff6ad8df401000000 Worker ID: e350ccece30580bf12005d7cd2ef2c161257d8b3d0a2c7e6a2197a51 Node ID: d0e2524ba25878415b949b86b56f6569d19338e5c52ec713e016e65b Worker IP address: 10.133.125.109 Worker port: 40743 Worker PID: 1047
(compute_qc_stats_ray pid=1126) 2022-05-10 09:35:13,087 cisTopic     INFO     Reading s1d1
2022-05-10 09:38:18,093 WARNING worker.py:1326 -- A worker died or was killed while executing a task by an unexpected system error. To troubleshoot the problem, check the logs for the dead worker. RayTask ID: b23d4387bf3f51e56bf41be17b75897f9c1294ad01000000 Worker ID: 8aa66876fe55248e290ed116d9da9a48333811ca55c382d2429cd844 Node ID: d0e2524ba25878415b949b86b56f6569d19338e5c52ec713e016e65b Worker IP address: 10.133.125.109 Worker port: 45711 Worker PID: 1126
2022-05-10 09:41:28,109 WARNING worker.py:1326 -- A worker died or was killed while executing a task by an unexpected system error. To troubleshoot the problem, check the logs for the dead worker. RayTask ID: 2744be47b942047896a10447f27a3e3b5525f1f201000000 Worker ID: 667d84edea8e1a0de4ab3869c7354e844f178fa695cb592ee538868e Node ID: d0e2524ba25878415b949b86b56f6569d19338e5c52ec713e016e65b Worker IP address: 10.133.125.109 Worker port: 35070 Worker PID: 1168
---------------------------------------------------------------------------
WorkerCrashedError                        Traceback (most recent call last)
Input In [17], in <cell line: 3>()
      1 from pycisTopic.qc import *
      2 path_to_regions = {'s1d1':outDir + 'consensus_peak_calling/consensus_regions.bed'}
----> 3 metadata_bc, profile_data_dict = compute_qc_stats(fragments_dict = fragments_dict,
      4                 tss_annotation = annot,
      5                 stats=['barcode_rank_plot', 'duplicate_rate', 'insert_size_distribution', 'profile_tss', 'frip'],
      6                 label_list = None,
      7                 path_to_regions = path_to_regions,
      8                 n_cpu = 12,
      9                 valid_bc = None,
     10                 n_frag = 100,
     11                 n_bc = None,
     12                 tss_flank_window = 1000,
     13                 tss_window = 50,
     14                 tss_minimum_signal_window = 100,
     15                 tss_rolling_window = 10,
     16                 remove_duplicates = True,
     17                 _temp_dir =  "/scratch/kholmato/" + 'ray_spill')

File ~/.builds/scenicplus/pycisTopic/pycisTopic/qc.py:859, in compute_qc_stats(fragments_dict, tss_annotation, stats, label_list, path_to_regions, n_cpu, partition, valid_bc, n_frag, n_bc, tss_flank_window, tss_window, tss_minimum_signal_window, tss_rolling_window, min_norm, check_for_duplicates, remove_duplicates, **kwargs)
    856     n_cpu = len(fragments_list)
    858 ray.init(num_cpus=n_cpu, **kwargs)
--> 859 qc_stats = ray.get(
    860     [
    861         compute_qc_stats_ray.remote(
    862             fragments_list[i],
    863             tss_annotation=tss_annotation,
    864             stats=stats,
    865             label=label_list[i],
    866             path_to_regions=path_to_regions[i],
    867             valid_bc=valid_bc,
    868             n_frag=n_frag,
    869             n_bc=n_bc,
    870             tss_flank_window=tss_flank_window,
    871             tss_window=tss_window,
    872             tss_minimum_signal_window=tss_minimum_signal_window,
    873             tss_rolling_window=tss_rolling_window,
    874             min_norm=min_norm,
    875             partition=partition,
    876             check_for_duplicates=check_for_duplicates,
    877             remove_duplicates=remove_duplicates)
    878         for i in range(len(fragments_list))
    879     ]
    880 )
    881 ray.shutdown()
    882 metadata_dict = {key: x[key] for x in list(
    883     list(zip(*qc_stats))[0]) for key in x.keys()}

File ~/.micromamba/envs/scenicplus/lib/python3.8/site-packages/ray/_private/client_mode_hook.py:105, in client_mode_hook.<locals>.wrapper(*args, **kwargs)
    103     if func.__name__ != "init" or is_client_mode_enabled_by_default:
    104         return getattr(ray, func.__name__)(*args, **kwargs)
--> 105 return func(*args, **kwargs)

File ~/.micromamba/envs/scenicplus/lib/python3.8/site-packages/ray/worker.py:1765, in get(object_refs, timeout)
    1763         raise value.as_instanceof_cause()
    1764     else:
-> 1765         raise value
    1767 if is_individual_id:
    1768     values = values[0]

WorkerCrashedError: The worker died unexpectedly while executing this task. Check python-core-worker-*.log files for more information.
```

Expected behavior

Not an error :)

Version (please complete the following information):

```
adjustText==0.7.3 aiohttp==3.8.1 aiohttp-cors==0.7.0 aioredis==1.3.1 aiosignal==1.2.0 anndata==0.7.8 annoy==1.17.0 appdirs==1.4.4
arboreto==0.1.6 argon2-cffi==21.3.0 argon2-cffi-bindings==21.2.0 asttokens==2.0.5 async-timeout==4.0.2 attr==0.3.1 attrs==21.4.0 backcall==0.2.0
bbknn==1.5.1 beautifulsoup4==4.11.1 bioservices==1.8.4 bleach==5.0.0 blessed==1.19.1 bokeh==2.4.2 boltons==21.0.0 cachetools==5.0.0
cattrs==1.10.0 certifi==2021.10.8 cffi==1.15.0 charset-normalizer==2.0.12 click==8.1.2 cloudpickle==2.0.0 colorama==0.4.4 colorful==0.5.4
colorlog==6.6.0 cryptography==36.0.2 ctxcore==0.1.1 cycler==0.11.0 Cython==0.29.28 cytoolz==0.11.2 dask==2022.4.0 dataclasses-json==0.5.7
debugpy==1.6.0 decorator==5.1.1 defusedxml==0.7.1 Deprecated==1.2.13 deprecation==2.1.0 dill==0.3.4 distributed==2022.4.0 docutils==0.18.1
easydev==0.12.0 entrypoints==0.4 executing==0.8.3 fastjsonschema==2.15.3 fbpca==1.0 filelock==3.6.0 fonttools==4.32.0 frozendict==2.3.1
frozenlist==1.3.0 fsspec==2022.3.0 future==0.18.2 gensim==4.1.2 geosketch==1.2 germalemma==0.1.3 gevent==21.12.0 globre==0.1.5
google-api-core==2.7.2 google-auth==2.6.4 googleapis-common-protos==1.56.0 gpustat==1.0.0b1 greenlet==1.1.2 grequests==0.6.0 grpcio==1.43.0 gseapy==0.10.8
h5py==3.6.0 harmonypy==0.0.5 HeapDict==1.0.1 hiredis==2.0.0 idna==3.3 igraph==0.9.10 imageio==2.16.2 importlib-resources==5.7.0
interlap==0.2.7 intervaltree==3.1.0 ipykernel==6.13.0 ipympl==0.9.0 ipython==8.2.0 ipython-genutils==0.2.0 ipywidgets==7.7.0 jedi==0.18.1
Jinja2==3.1.1 joblib==1.1.0 jsonschema==4.4.0 jupyter-client==7.2.2 jupyter-core==4.9.2 jupyterlab-pygments==0.2.2 jupyterlab-widgets==1.1.0 kiwisolver==1.4.2
lda==2.0.0 leidenalg==0.8.9 llvmlite==0.38.0 locket==0.2.1 loompy==3.0.7 loomxpy==0.4.1 lxml==4.8.0 MACS2 @ file:///opt/conda/conda-bld/macs2_1645526782437/work
MarkupSafe==2.1.1 marshmallow==3.15.0 marshmallow-enum==1.5.1 matplotlib==3.5.1 matplotlib-inline==0.1.3 mistune==0.8.4 msgpack==1.0.3 mudata==0.1.1
multidict==6.0.2 multiprocessing-on-dill==3.5.0a4 muon==0.1.2 mypy-extensions==0.4.3 natsort==8.1.0 nbclient==0.6.0 nbconvert==6.5.0 nbformat==5.3.0
ncls==0.0.64 nest-asyncio==1.5.5 networkx==2.8 nltk==3.7 notebook==6.4.10 numba==0.55.1 numexpr==2.8.1 numpy @ file:///home/conda/feedstock_root/build_artifacts/numpy_1649572648093/work
numpy-groupies==0.9.14 nvidia-ml-py3==7.352.0 opencensus==0.8.0 opencensus-context==0.1.2 packaging==21.3 pandas==1.4.2 pandocfilters==1.5.0 parso==0.8.3
partd==1.2.0 patsy==0.5.2 PatternLite==3.6 pbr==3.1.1 pexpect==4.8.0 pickleshare==0.7.5 Pillow==9.1.0 prometheus-client==0.14.1
prompt-toolkit==3.0.29 protobuf==3.20.0 psutil==5.9.0 ptyprocess==0.7.0 pure-eval==0.2.2 py-spy==0.3.11 pyarrow==0.16.0 pyasn1==0.4.8
pyasn1-modules==0.2.8 pybedtools==0.9.0 pyBigWig==0.3.18 pybiomart==0.2.0
-e git+ssh://git@github.com/aertslab/pycistarget.git@d2571bf92579addfc8a21465bfbcd0de86aedc61#egg=pycistarget
-e git+ssh://git@github.com/aertslab/pycisTopic.git@ddb3caf583312f1169776a21650bc47ff4daebe6#egg=pycisTopic
pycparser==2.21 pyfasta==0.5.2 Pygments==2.11.2 pynndescent==0.5.6 pyOpenSSL==22.0.0 pyparsing==3.0.8 pyphen==0.12.0 pyranges==0.0.115
pyrle==0.0.34 pyrsistent==0.18.1 pysam==0.19.0 pyscenic==0.11.2 python-dateutil==2.8.2 python-igraph==0.9.10 python-Levenshtein==0.12.2 pytz==2022.1
PyWavelets==1.3.0 PyYAML==6.0 pyzmq==22.3.0 ray==1.11.0 redis==4.2.2 regex==2022.3.15 requests==2.27.1 requests-cache==0.9.3
rsa==4.8 scanorama==1.7.2 scanpy==1.9.1
-e git+ssh://git@github.com/aertslab/scenicplus.git@6faa2d62416e3c78aeb45eead5c70cdafecb762f#egg=scenicplus
scikit-image==0.19.2 scikit-learn==0.24.2 scipy==1.8.0 scrublet==0.2.3 seaborn==0.11.2 Send2Trash==1.8.0 session-info==1.0.0 six==1.16.0
sklearn==0.0 smart-open==5.2.1 sorted-nearest==0.0.33 sortedcontainers==2.4.0 soupsieve==2.3.2.post1 stack-data==0.2.0 statistics==1.0.3.5 statsmodels==0.13.2
stdlib-list==0.8.0 suds-community==1.1.0 tables==3.7.0 tabulate==0.8.9 tblib==1.7.0 terminado==0.13.3 texttable==1.6.4 threadpoolctl==3.1.0
tifffile==2022.4.8 tinycss2==1.1.1 tmtoolkit==0.9.0 toolz==0.11.2 tornado==6.1 tqdm==4.64.0 traitlets==5.1.1 typing==3.7.4.3
typing-inspect==0.7.1 typing_extensions==4.1.1 umap-learn==0.5.3 url-normalize==1.4.3 urllib3==1.26.9 wcwidth==0.2.5 webencodings==0.5.1 widgetsnbextension==3.6.0
wrapt==1.14.0 xlrd==1.2.0 xmltodict==0.12.0 yarl==1.7.2 zict==2.1.0 zipp==3.8.0 zope.event==4.5.0 zope.interface==5.4.0
```

Additional context

I'm running in a JupyterHub instance hosted at our institute. The whole working directory is on an NFS.

cbravo93 commented 2 years ago

Hi!

You are running out of memory; can you try lowering the number of cores?

Cheers,

C

P.S.: Also, can you show your fragments dict? Do all samples have the same name?

maxim-h commented 2 years ago

Thanks for the quick response :)

I've tried fewer cores, but even with n_cpu=1 the same thing happens. Is there currently any rule of thumb for how much RAM is required? I'm on a 64 GB system with no swap; that's typically the largest machine I can get that will run Jupyter. Otherwise I have to wrap each function call in a script and submit it to the HPC cluster, which is also what I had to do with export_pseudobulk, as sketched below.
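For reference, a rough sketch of that wrap-and-submit workaround (the script name, scheduler flags, pickle files, and the consensus-regions path are illustrative assumptions, not taken from this thread):

```
# run_qc.py -- standalone job script, submitted with something like:
#   sbatch --mem=128G --cpus-per-task=1 --wrap "python run_qc.py"
import pickle

from pycisTopic.qc import compute_qc_stats

fragments_dict = {'s1d1': '/scratch/kholmato/PROJECTS/NeurIPS2021/fragment_files/s1d1/atac_fragments.tsv.gz'}
path_to_regions = {'s1d1': '/scratch/kholmato/consensus_peak_calling/consensus_regions.bed'}

# TSS annotation saved beforehand from the notebook session.
with open('annot.pkl', 'rb') as f:
    annot = pickle.load(f)

metadata_bc, profile_data_dict = compute_qc_stats(
    fragments_dict=fragments_dict,
    tss_annotation=annot,
    stats=['barcode_rank_plot', 'duplicate_rate', 'insert_size_distribution', 'profile_tss', 'frip'],
    label_list=None,
    path_to_regions=path_to_regions,
    n_cpu=1,
    valid_bc=None,
    n_frag=100,
    n_bc=None,
    tss_flank_window=1000,
    tss_window=50,
    tss_minimum_signal_window=100,
    tss_rolling_window=10,
    remove_duplicates=True,
    _temp_dir='/scratch/kholmato/ray_spill')

# Persist the results so the interactive session can load them afterwards.
with open('qc_results.pkl', 'wb') as f:
    pickle.dump((metadata_bc, profile_data_dict), f)
```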

The fragments dict only has one sample:

```
fragments_dict = {'s1d1': '/scratch/kholmato/PROJECTS/NeurIPS2021/fragment_files/s1d1/atac_fragments.tsv.gz'}
```

I've decided to start with just one sample since there's no multisample tutorial yet.

P.S.: I've noticed that Jupyter kind of loses track of functions that use Ray: it tells me the kernel is idle even though the function is still running. Or maybe the function returns before the child process finishes executing. Is there a way to force Jupyter (or the function itself) to wait until execution ends?

P.P.S.: It looks like Ray is retrying the failed "jobs" several times, judging by the line:

(compute_qc_stats_ray pid=3268) 2022-05-10 13:08:22,702 cisTopic     INFO     Reading s1d1

that shows up multiple times in the log. If that's the case, is there a way to ask it not to retry any failed steps?

cbravo93 commented 2 years ago

Yes, I haven't seen this before, weird... I'm also not sure why Ray is showing the same log line multiple times in your case (which version do you have? We use 1.12).
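One possible explanation for the repeated log line: Ray in the 1.x series re-runs tasks whose workers die from system errors (a few retries per task by default, if I remember correctly), which would re-run `compute_qc_stats_ray` and print `Reading s1d1` again each time. Disabling that is a per-task setting in Ray generally; a minimal sketch (not pycisTopic-specific):

```
import ray

# A task whose worker dies from a system error (e.g. OOM kill) is normally
# retried; max_retries=0 turns that off for this task.
@ray.remote(max_retries=0)
def my_task(x):
    return x + 1

# For an existing remote function it can also be set at call time, e.g.
# (hypothetically): compute_qc_stats_ray.options(max_retries=0).remote(...)
```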

Since it seems to get stuck already while reading the file, can you try reading the fragments file independently of Ray (and check how much memory it takes, and how big the file is)? You can use the function read_fragments_from_file. If you manage to read it, you may need to adjust Ray's object store memory to that range of values when using Ray.
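For example, a minimal way to check this outside of Ray (the psutil-based measurement and the `pycisTopic.utils` import path are assumptions, not from the original comment):

```
# Read the fragments file without Ray and see how much resident memory it costs.
import os

import psutil
from pycisTopic.utils import read_fragments_from_file

proc = psutil.Process(os.getpid())
rss_before = proc.memory_info().rss

fragments = read_fragments_from_file(
    '/scratch/kholmato/PROJECTS/NeurIPS2021/fragment_files/s1d1/atac_fragments.tsv.gz',
    use_polars=True)  # polars parsing is much faster than the pandas reader

rss_after = proc.memory_info().rss
print(f"Loading the fragments grew resident memory by ~{(rss_after - rss_before) / 1e9:.1f} GB")
```

If that succeeds, note that `compute_qc_stats` forwards extra keyword arguments to `ray.init` (visible in the traceback above), so passing something like `object_store_memory=20 * 1024**3` to `compute_qc_stats` could be one way to size Ray's object store accordingly (the exact value is an assumption to tune).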

Cheers,

Carmen

maxim-h commented 2 years ago

Ah, yes, it is just that the fragments file is too big: just calling read_fragments_from_file crashes the session. It's around 10 GB uncompressed in my case, but converting to PyRanges probably brings it to 20 GB in total, and maybe there's some other junk floating around in memory. By the way, use_polars is really nice. I still run out of memory, but at least I don't have to wait minutes for it to happen :)

I'll try to find a bigger system if I can, or just skip the QC for now and initialize the cisTopic object from a precomputed count matrix. Is that option already available?

Regarding Ray repeating the same step: I used 1.11 first; updating to 1.12 didn't change anything.

cbravo93 commented 2 years ago

Hi Max!

Indeed, in the near future our intention is to move to polars/pytables, which should hopefully also make these steps more memory efficient. At the moment we have a dependency on pyarrow 0.17 (from pyscenic/ctxcore) that we are resolving so we can upgrade to pyarrow 7, and then we will start the updates here as well :).

It is possible to start from a precomputed matrix (you can also plug in QC stats if you have them); you can use:

```
# Create cisTopic object from a precomputed count matrix
import pandas as pd
from pycisTopic.cistopic_class import *

matrix_path = '/staging/leuven/stg_00002/lcb/cbravo/Multiomics_pipeline/toy_data/toy_data/cistopic_countmatrix.tsv'
path_to_blacklist = '/staging/leuven/stg_00002/lcb/cbravo/Multiomics_pipeline/pycisTopic/blacklist/hg19-blacklist.v2.bed'
cisTopic_obj = create_cistopic_object_from_matrix_file(matrix_path, path_to_blacklist)

# Add cell information (this is where you can plug in precomputed QC stats)
cell_data = pd.read_csv('/staging/leuven/stg_00002/lcb/cbravo/Multiomics_pipeline/toy_data/toy_data/celldata_mel.tsv', sep='\t')
cisTopic_obj.add_cell_data(cell_data)
```

Preferably you would use the pseudobulk peaks to create this matrix. Normally we use the fragments file and the consensus peaks to create the count matrix, but I am afraid you will have the same issue (since it requires reading the fragments file).
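For contrast, a hedged sketch of that fragments-based route (parameter names follow the pycisTopic tutorials; the regions path is illustrative, and this reads the whole fragments file, so it would hit the same memory limit):

```
from pycisTopic.cistopic_class import create_cistopic_object_from_fragments

cistopic_obj = create_cistopic_object_from_fragments(
    path_to_fragments='/scratch/kholmato/PROJECTS/NeurIPS2021/fragment_files/s1d1/atac_fragments.tsv.gz',
    path_to_regions='/scratch/kholmato/consensus_peak_calling/consensus_regions.bed',
    n_cpu=1)
```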

maxim-h commented 2 years ago

Ok, thanks! I can create the count matrix based on pseudobulk peaks in R: Signac::FeatureMatrix can compute it without loading the fragments file into memory. Maybe there's even a Python alternative, but I'm not aware of one. I'll let you know how it goes.

Cheers, Max

cbravo93 commented 1 year ago

Check the polars branch :)
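(Installing straight from a branch with pip is one way to try it, e.g. `pip install git+https://github.com/aertslab/pycisTopic.git@polars`; the command is standard pip syntax rather than an officially documented one, with the branch name taken from the docs link in the next comment.)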

ghuls commented 6 months ago

Speed improvements in QC calculation are in the polars branch: https://pycistopic.readthedocs.io/en/polars/notebooks/human_cerebellum.html#QC