Hmmm, that's a very large dataset to analyze. SCReadCounts (actually the readCounts tool, which does the counting) does not have native support for slurm. It does support multiple threads or multiple communicating processes on multiprocessor machines. However, I/O (for the BAM file) tends to be the rate-limiting factor, and accessing the BAM file via NFS from a globally available partition will also slow things down (not uncommon for slurm clusters).
Your best bet for the time being, I think, would be a wrapper script that a) pulls the BAM file to local disk (/scratch or /tmp), and b) executes on a subset of the SNPs at a time (ideally near one another on the genome, so sort by locus and chunk the file into pieces). This is an old-school manual partitioning approach, but it would do the trick with slurm; see the sketch below. Medium-term, I will think about how to make this more seamless.
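Something like the following (untested) sketch is what I have in mind. The readCounts flag names used here (-s for the SNP file, -r for the BAM, -o for output) are assumptions, so check them against your install:

```python
#!/usr/bin/env python3
"""Hypothetical slurm wrapper: stage the BAM on node-local disk, then run
readCounts on one chunk of a locus-sorted SNP file per array task."""
import os
import shutil
import subprocess
import sys

CHUNK_SIZE = 100000  # SNPs per array task; tune for your cluster

bam, snps, chunk_index = sys.argv[1], sys.argv[2], int(sys.argv[3])

# a) Stage the BAM (and its index, if present) on local disk to avoid
#    repeated NFS round-trips.
scratch = os.environ.get("TMPDIR", "/tmp")
local_bam = shutil.copy(bam, scratch)
if os.path.exists(bam + ".bai"):
    shutil.copy(bam + ".bai", scratch)

# b) Select this task's slice of the SNP file (assumed pre-sorted by locus).
with open(snps) as fh:
    lines = fh.readlines()
start = chunk_index * CHUNK_SIZE
chunk_file = os.path.join(scratch, f"snps.{chunk_index}.txt")
with open(chunk_file, "w") as out:
    out.writelines(lines[start:start + CHUNK_SIZE])

# Run readCounts on just this chunk. Flag names are assumptions; adjust
# to match your readCounts installation.
subprocess.run(["readCounts", "-s", chunk_file, "-r", local_bam,
                "-o", f"counts.{chunk_index}.tsv", "-t", "4"],
               check=True)
```

Submit it as a slurm array job (one array index per chunk) and concatenate the per-chunk outputs afterwards.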
Another option, if you only care to count loci that have some variants on them, is to use the varLoci tool to find loci that might have variants. This is expected to be a much smaller set, and it could be merged with COSMIC later. Of course, this will not identify loci that have only reference alleles.
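For the merge, a simple intersection on (chromosome, position) keys would do. A minimal sketch, assuming both files are tab-separated with chromosome and position in the first two columns and no header (the file names are placeholders; adjust to the actual varLoci output format):

```python
import csv

def load_loci(path):
    """Read (chrom, pos) keys from a tab-separated loci file."""
    with open(path) as fh:
        return {(row[0], row[1]) for row in csv.reader(fh, delimiter="\t")}

varloci = load_loci("varloci_output.tsv")    # hypothetical file names
cosmic = load_loci("cosmic_mutations.tsv")

# Keep only COSMIC loci that varLoci flagged as potentially variant.
shared = varloci & cosmic
with open("snps_to_count.tsv", "w") as out:
    for chrom, pos in sorted(shared, key=lambda k: (k[0], int(k[1]))):
        out.write(f"{chrom}\t{pos}\n")
```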
Lastly, finding a way to speed up readCounts is definitely something we are interested in. Single-cell data is big, so there is no free lunch, but we would like to make this scale of computation more feasible.
Hope this helps...
I also tried with just 1000 mutations (on the BRCA1 gene) and it still had not completed after 5 days. Are you saying that it is not using the cores because of slurm?
I'm saying that it doesn't know anything about slurm. Slurm may be manipulating the CPUs it sees. It may be scheduling the job on a machine with only one available CPU. I/O may be limiting the speed of execution to the point where less than one CPU's worth of CPU time is used, even though it has access to more. Given that you used only a few SNPs from a limited region, I'm suspicious that I/O is the issue: slurm clusters often access data files via NFS, which can be slow for this type of file access.
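One quick way to check is to compare the CPU time a run consumes against its wall-clock time. A rough sketch (the readCounts command line is whatever you normally run; this just wraps it):

```python
import os
import subprocess
import sys
import time

cmd = sys.argv[1:]  # e.g. readCounts -t 20 <your usual arguments>

start = time.perf_counter()
subprocess.run(cmd, check=True)
wall = time.perf_counter() - start

# os.times() reports CPU time consumed by child processes.
t = os.times()
cores = (t.children_user + t.children_system) / wall
print(f"effective cores used: {cores:.2f}")
# A value well below the requested thread count (e.g. ~1 instead of 20)
# points to an I/O bottleneck rather than a CPU one.
```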
Hi @Ilarius, I'm back from vacation and was wondering whether I could try to figure out your performance issues a bit. Is your scRNA-Seq FASTQ/BAM file public? If so, can you point me at it? Thanks...
I did some testing and verified that, at least on my system, the threads option works as intended.
Closing due to lack of information...
Hello, I have 2.6 million COSMIC mutations that I want to check with SCReadCounts. I set the number of cores to 20 (-t 20); however, if I check the statistics (with the slurm command sstat), I can see that only one task is associated with the job (even though 20 CPUs are requested). In the .out file I can see that 20 threads have been used, but when I look at the % of CPU, I see it is very small.
The job has been running for more than 2 days and is not done yet. Is this normal?
Advanced:
Min. Reads (-m): 5 (applied only to VAF, FVAF, RVAF)
Max. Reads (-M): None
Directional Counts (-D): False
Valid Cell Barcode (-b):
Threads (-t): 20
Force (-F): False
Quiet (-q): False