aertslab / scenicplus

SCENIC+ is a python package to build gene regulatory networks (GRNs) using combined or separate single-cell gene expression (scRNA-seq) and single-cell chromatin accessibility (scATAC-seq) data.

`calculate_TFs_to_genes_relationships` fails during initialization on large datasets #109

Open dburkhardt opened 1 year ago

dburkhardt commented 1 year ago

My total dataset size is 220,000 peaks, 14,000 genes, and 160,000 cells. I cannot get the scenicplus function `calculate_TFs_to_genes_relationships` to run, even with 5TB of disk and 2TB of RAM.

As mentioned in previous issues, the function seems to stall during initialization while spilling objects to disk, and the error isn't clear. I suspect something isn't configured correctly, because with an instance of this size I shouldn't be hitting RAM errors.

I'm opening a new issue because the previous one was closed: https://github.com/aertslab/scenicplus/issues/52.

Here's the last log output I see before the ray instance crashes.

initializing:   8%|▊         | 1162/14106 [17:15:08<189:36:29, 52.73s/it](raylet) [2023-02-12 04:35:09,760 E 45605 45629] (raylet) file_system_monitor.cc:105: /home/jovyan/ray_spill/session_2023-02-11_11-18-22_070880_42848 is over 95% full, available space: 264205611008; capacity: 5284267606016. Object creation will fail if spilling is required.

initializing:   8%|▊         | 1162/14106 [17:16:08<192:22:05, 53.50s/it]
Local disk is full
The object cannot be created because the local object store is full and the local disk's utilization is over capacity (95% by default).Tip: Use `df` on this node to check disk usage and `ray memory` to check object store memory usage.
2023-02-12 04:38:00,570 TF2G         INFO     Took 62333.003101587296 seconds
2023-02-12 04:38:00,571 TF2G         INFO     Adding correlation coefficients to adjacencies.
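
For reference, the two byte counts in the raylet message above can be turned into a utilization figure and compared against the 95% threshold the error mentions. This is only a rough sketch: the helper names are made up for illustration, and `shutil.disk_usage` may not report exactly the same "available" figure that Ray's file system monitor uses.

```python
import shutil

def utilization(available_bytes: int, capacity_bytes: int) -> float:
    """Fraction of the disk in use, given available and total bytes."""
    return 1.0 - available_bytes / capacity_bytes

# Values copied from the raylet log line above.
LOG_AVAILABLE = 264205611008
LOG_CAPACITY = 5284267606016

log_util = utilization(LOG_AVAILABLE, LOG_CAPACITY)
print(f"Utilization at time of failure: {log_util:.4%}")

def spill_dir_utilization(path: str) -> float:
    """Current utilization of the filesystem holding `path` (hypothetical helper)."""
    usage = shutil.disk_usage(path)
    return utilization(usage.free, usage.total)

# Example: check the filesystem holding the current working directory.
print(f"Utilization now: {spill_dir_utilization('.'):.4%}")
```

With the logged numbers the disk is just past the 95% mark, so roughly 5 TB had already been spilled at about 8% of the initialization.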
SeppeDeWinter commented 1 year ago

Hi @dburkhardt

I've been working a bit on this code. A week ago I pushed some changes to the development branch that might improve the performance of this step. I was not able to test them on a large dataset, but on a small one, at least, it is more efficient.

https://github.com/aertslab/scenicplus/commit/ec1d8d0a6fc398f88bf64e94af4ae4b4280fe6c9

I hope this helps.

Best,

Seppe

rsavur commented 1 year ago

Hi @SeppeDeWinter

I'm interested in trying the updated code you mentioned on the pbmc tutorial dataset. How do I go about updating the code? Do I just copy your new TFtoGene and util .py files and overwrite the old ones?

Thank you,

Savur

SeppeDeWinter commented 1 year ago

Hi @rsavur

Sorry for the late reply.

Indeed, you can pull the code from the development branch and use that one instead.
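
After switching to the development branch and reinstalling, a quick way to confirm which copy of the package Python actually picks up is something like the sketch below. It uses only the standard library; `describe_install` is a made-up helper name, and the reported version string depends on how the package was installed.

```python
from importlib import metadata, util

def describe_install(package: str = "scenicplus"):
    """Return (version, location) for an importable package, or None if absent."""
    spec = util.find_spec(package)
    if spec is None:
        return None  # package is not importable in this environment
    try:
        version = metadata.version(package)
    except metadata.PackageNotFoundError:
        version = "unknown"  # importable, but no installed distribution metadata
    return version, spec.origin

print(describe_install())
```

If the printed location points at your checkout of the repository, the code from the development branch is the one being used.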

Best,

Seppe