ComparativeGenomicsToolkit / cactus

Official home of genome aligner based upon notion of Cactus graphs

cactus-hal2maf takes a long time to run #1366

Open Sanat-Mishra opened 2 months ago

Sanat-Mishra commented 2 months ago

Hi, here is my command for running cactus-hal2maf:

cactus-hal2maf ./jobstore /ocean/projects/bio200049p/smishra1/Files/241-mammalian-2020v2.hal /ocean/projects/bio200049p/smishra1/Tools_Installed/cactus-bin-v2.8.1/ENST00000293981.10.maf.gz --refGenome Homo_sapiens --bedRanges /ocean/projects/bio200049p/zjiang2/Files/cactus_test/ENST00000293981.10.bed --noAncestors --chunkSize 10000 --workDir /ocean/projects/bio200049p/smishra1/Cactus/ --batchCores 1 --batchSystem slurm --maxMemory 240G

This has been running for more than an hour. Is that expected, or can I speed it up somehow?

Thanks!

glennhickey commented 2 months ago

If your cluster has more than one node, you can try using --batchCount to increase the number of jobs.

If your nodes have multiple cores, you can also consider --batchCores to allow each job to use more than one thread.

Sanat-Mishra commented 2 months ago

Thanks! I'm also concerned that the code is taking too long because I have a huge hal file (~805 GB). Does cactus-hal2maf create copies of the hal file to distribute among worker nodes?

glennhickey commented 2 months ago

Yes, on Slurm, Cactus will create a local copy for each batch (job). So you typically want to set --batchCores high enough that at most one job runs on a given node at a time...

Sanat-Mishra commented 2 months ago

Got it.

We're trying to pull out alignments corresponding to different transcript BED files and then concatenate the blocks in the MAF file for downstream analysis. Do you have any suggestions on how we can use --maxrefgaps, or on concatenation in general?

glennhickey commented 2 months ago

This probably doesn't answer your question, but the --bedRanges option here will produce a MAF of your regions concatenated together. If you want the MAF blocks of consecutive regions actually merged together, it's probably simplest to merge the regions in the BED file first.

As for reference gaps, you control those with --maximumGapLength, but I'm not sure how high that can practically be scaled (I think the default is around 50).
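
For anyone following the "merge them in the BED first" suggestion, here is a minimal Python sketch of one way to do it. It is not part of cactus: the function name merge_bed_intervals and the max_gap parameter are made up for illustration. It assumes a standard BED file whose first three whitespace-separated columns are chrom, start and end (0-based, half-open), and it drops any further columns such as name or strand.

```python
from collections import defaultdict


def merge_bed_intervals(lines, max_gap=0):
    """Merge BED intervals on the same chromosome that overlap,
    touch, or lie within max_gap bases of each other.

    Hypothetical helper for illustration; only the first three
    BED columns (chrom, start, end) are kept.
    """
    by_chrom = defaultdict(list)
    for line in lines:
        line = line.strip()
        # Skip blank lines and common BED header/comment lines.
        if not line or line.startswith(("#", "track", "browser")):
            continue
        fields = line.split()
        by_chrom[fields[0]].append((int(fields[1]), int(fields[2])))

    merged = []
    for chrom in sorted(by_chrom):
        current = None
        for start, end in sorted(by_chrom[chrom]):
            if current is None:
                current = [start, end]
            elif start <= current[1] + max_gap:
                # Close enough to the running interval: extend it.
                current[1] = max(current[1], end)
            else:
                merged.append((chrom, current[0], current[1]))
                current = [start, end]
        if current is not None:
            merged.append((chrom, current[0], current[1]))
    return merged


if __name__ == "__main__":
    example = ["chr19\t100\t200", "chr19\t200\t300", "chr19\t500\t600"]
    for chrom, start, end in merge_bed_intervals(example):
        print(f"{chrom}\t{start}\t{end}")
```

The merged intervals can be written back out as a BED file and passed to --bedRanges. Whether adjacent exons should be merged at all (and with what max_gap) depends on the downstream analysis, so treat the default of 0 as a starting point rather than a recommendation.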