About ld refinement on putative somatic SNVs, spend a lot of time

monoplasty commented 10 months ago

Hello, I am running somatic SNV calling on scRNA-seq data. The second step (cellScan) takes a very long time, 30+ hours. Is there any way to speed it up?

The BAM file is about 8 GB, and the machine has 32 CPU cores.

Could you please provide some guidance on how to resolve this issue? Thank you in advance for your assistance.

jinzhuangdou commented 10 months ago

Yes, the cellScan step usually takes a long time because we need to extract cell-level read information. Could you let me know how many cells you have included? You can select cells using the option --keep 0.8 (select cells with most variable reads) to reduce the computational burden.

monoplasty commented 10 months ago

@jinzhuangdou thank you for your reply! The sample I ran had 8609 cells. I left --keep at its default value of 0.8.

slinnarsson commented 10 months ago

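The cellScan step is written so that execution time is quadratic in the number of cells. It takes the list of cell barcodes and, for each barcode, scans the entire BAM file to find the reads from that cell. Because the BAM file itself grows roughly in proportion to the number of cells, the total work grows with the square of the cell count.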

for cell in cell_lst:
    para = "merge" + ":" + cell + ":" + args.out + ":" + args.app_path
    joblst.append(para)
with Pool(processes=args.nthreads) as pool:
    result = pool.map(bamSplit, joblst)  # <--- bamSplit scans the whole BAM file for each cell

This means that if 8609 cells take 30 hours, 2x8609 cells would take about five days (4x the time) and 10x8609 cells would take about four months (100x the time).
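
The arithmetic behind these estimates can be checked with a minimal sketch. It assumes runtime scales exactly with the square of the cell count, which is an idealization; the function name projected_hours is hypothetical and not part of Monopogen.

```python
def projected_hours(base_cells: int, base_hours: float, cells: int) -> float:
    """Extrapolate runtime, assuming cost grows as the square of the cell count."""
    scale = cells / base_cells
    return base_hours * scale ** 2


# Baseline from this thread: 8609 cells took roughly 30 hours.
for factor in (2, 10):
    hours = projected_hours(8609, 30.0, factor * 8609)
    print(f"{factor}x cells: {hours:.0f} hours, about {hours / 24:.0f} days")
```

Doubling the cells gives 120 hours (five days), and ten times the cells gives 3000 hours (125 days, roughly four months).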

It could be rewritten to scan the BAM file just once, writing all the cell-specific BAM files in parallel and on the fly. That would likely reduce execution time from 30+ hours to a few minutes. It would make it possible to run Monopogen on much larger samples.
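
A minimal sketch of the single-pass idea, not Monopogen's actual code: the input is read once and each read is routed to a per-cell writer that is opened on first use. Everything here is an assumption for illustration: split_by_barcode, ListWriter and open_writer are hypothetical names, reads are modeled as (barcode, record) pairs, and the in-memory writer stands in for a real per-cell BAM writer.

```python
from typing import Callable, Dict, Iterable, List, Set, Tuple


class ListWriter:
    """In-memory stand-in for a per-cell BAM writer (hypothetical)."""

    def __init__(self) -> None:
        self.records: List[object] = []
        self.closed = False

    def write(self, record: object) -> None:
        self.records.append(record)

    def close(self) -> None:
        self.closed = True


def split_by_barcode(
    reads: Iterable[Tuple[str, object]],
    wanted: Set[str],
    open_writer: Callable[[str], ListWriter],
) -> Dict[str, int]:
    """Route every read to its cell's writer in one pass over the input."""
    writers: Dict[str, ListWriter] = {}
    counts: Dict[str, int] = {}
    try:
        for barcode, record in reads:
            if barcode not in wanted:
                continue  # read belongs to a cell that was not selected
            if barcode not in writers:
                writers[barcode] = open_writer(barcode)  # opened lazily
                counts[barcode] = 0
            writers[barcode].write(record)
            counts[barcode] += 1
    finally:
        for writer in writers.values():
            writer.close()  # close all outputs even if the scan fails
    return counts


# Toy input: four reads from three cells, of which two cells are selected.
demo_reads = [("AAAC", "r1"), ("TTTG", "r2"), ("AAAC", "r3"), ("GGGA", "r4")]
demo_sinks: Dict[str, ListWriter] = {}


def demo_open(barcode: str) -> ListWriter:
    demo_sinks[barcode] = ListWriter()
    return demo_sinks[barcode]


demo_counts = split_by_barcode(demo_reads, {"AAAC", "TTTG"}, demo_open)
print(demo_counts)
```

The work is then proportional to the number of reads rather than reads times cells. One caveat with this design is that it keeps one output open per cell, so thousands of cells may exceed the operating system's default limit on open files; processing the barcodes in a few batches, one pass per batch, would be one way around that.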

KChen-lab / Monopogen

About ld refinement on putative somatic SNVs, spend a lot of time #18