ComparativeGenomicsToolkit / cactus

Official home of genome aligner based upon notion of Cactus graphs

cactus-graphmap-split taking a long time - is this expected? #1464

Open bhdobrin opened 2 months ago

bhdobrin commented 2 months ago

Hi, I want to produce a vcf file (for phylogenetic analysis) for 15 whole genome, same-species assemblies. I am following the step-by-step pipeline documented under the heading "Yeast Genome" in Minigraph-Cactus help:

cactus-minigraph
cactus-graphmap
cactus-graphmap-split
cactus-align --batch
cactus-graphmap-join

Having searched through the issue threads a bit, I see that you recommend using the single-command cactus-pangenome and skipping the multiple steps. However, I am wondering about the expected time to completion and about threaded execution:

1) cactus-graphmap-split has been running for 3 and a half days. Is this normal? At the moment, it is executing the samtools faidx calls at cactus-graphmap-split.py line ~500, ~ 3/4 of the way through the script.

2) My execution log entries (delivered through the Slurm error log) all originate from [Main Thread]. Could this mean I failed to invoke multithreading? My installation is an Apptainer container installed on a cluster by my organization, so the cactus-specific threading commands do not work. I requested 64 cores using the usual SBATCH commands. My scontrol output confirms the 64-core request, but I do not know how to find out whether the program is actually using those cores.
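One way to check this on a Linux compute node is to read the kernel's per-process status file. The sketch below is not part of cactus; the helper names are made up for illustration, and it assumes the process of interest is visible under /proc from wherever the script is run. It reports the `Threads` and `Cpus_allowed_list` fields of `/proc/<pid>/status`.

```python
def parse_proc_status(text):
    """Extract thread count and allowed-CPU list from /proc/<pid>/status text."""
    fields = {}
    for line in text.splitlines():
        # Each line looks like "Key:\tvalue"; skip anything without a colon.
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return {
        "threads": int(fields["Threads"]) if "Threads" in fields else None,
        "cpus_allowed_list": fields.get("Cpus_allowed_list"),
    }


def inspect_process(pid):
    """Read /proc/<pid>/status (Linux only) and summarise it.

    `pid` is whatever process ID you want to inspect, e.g. one found
    with `ps` on the node running the job (hypothetical usage).
    """
    with open(f"/proc/{pid}/status") as handle:
        return parse_proc_status(handle.read())
```

A thread count of 1 for a single process does not by itself prove the whole job is serial, since a pipeline may spread work across several child processes rather than threads. Comparing the allowed-CPU list with the Slurm allocation at least shows whether the cores are available to the process.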

I am using cactus 2.7.2.

Thanks.

glennhickey commented 2 months ago

Yeah, that seems way too long, even for one thread. It runs faidx once per reference contig per input genome. If there are tons of reference contigs, this may be a factor, though I think there are some heuristics to choose only the biggest contigs and lump the other ones in "chrOther". You should be able to get a sense of this from the logs by counting how many faidx invocations there are. Using the cactus-pangenome interface may help in this regard, or trying the --refContigs option. The only other thing I can think of is if your disk is extremely slow (as this code is very I/O bound).
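A rough way to count the faidx invocations from a log is sketched below. This is an illustration only: it assumes each call shows up on a log line containing the text `samtools faidx` followed by the FASTA path, which may not match the real log format exactly, and the function names are hypothetical.

```python
import re
from collections import Counter

# Assumption: every invocation appears in the log as "samtools faidx <fasta> ...".
FAIDX_PATTERN = re.compile(r"samtools\s+faidx\s+(\S+)")


def count_faidx_calls(log_lines):
    """Return (total, per-file Counter) of 'samtools faidx' mentions."""
    per_file = Counter()
    for line in log_lines:
        match = FAIDX_PATTERN.search(line)
        if match:
            # group(1) is the first argument after "faidx", taken to be the FASTA.
            per_file[match.group(1)] += 1
    return sum(per_file.values()), per_file


def count_faidx_in_file(path):
    """Convenience wrapper: count invocations in a log file on disk."""
    with open(path, errors="replace") as handle:
        return count_faidx_calls(handle)
```

Calling `count_faidx_in_file` on the Slurm error log gives a total and a per-FASTA breakdown. A total far above (number of genomes) x (a few dozen contigs) would suggest that many small reference contigs are being processed individually.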

bhdobrin commented 2 months ago

Thank you, based on this answer I think I know what went wrong with this run :) .