ComparativeGenomicsToolkit / cactus

Official home of genome aligner based upon notion of Cactus graphs

Monitoring progress/expected run time? #1124

Closed evo-eco-gen closed 1 year ago

evo-eco-gen commented 1 year ago

Hi, I am using Progressive Cactus to align 15 rodent genomes (2-3 Gbp each), almost all quite contiguous (N50 in the millions). Is there a way to estimate the runtime or see how much remains to be done? I am using 40 AVX-enabled CPUs, and the job has been running for >6 days (>5,800 CPU-hours). Based on figures from the 2020 paper, I expected it to be done by now. Is there something in the log that indicates what percentage of the job is done?

glennhickey commented 1 year ago

At the beginning of the log, it'll print out your tree, e.g.

Tree: ((simHuman_chr6:0.144018,(simMouse_chr6:0.084509,simRat_chr6:0.091589)mr:0.271974)Anc1:0.020593,(simCow_chr6:0.18908,simDog_chr6:0.16303)Anc2:0.032898)Anc0;

The ancestor nodes will be labelled Anc0, Anc1, etc. It's going to progressively align up the tree, so Anc0 will be the last job. You can see which node(s) it's working on or has completed by looking at the cactus_consolidated messages in the log. For example, this bit tells me it just finished Anc2:

...
cactus_consolidated(Anc2): Dumped reference sequences, 13 seconds have elapsed
cactus_consolidated(Anc2): Cactus consolidated is done!, 13 seconds have elapsed
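
The approach described above can be roughly automated. The following is a minimal Python sketch, not part of Cactus: it assumes the tree is given as a Newick string like the one in the log excerpt and that completion lines contain the text "Cactus consolidated is done!" exactly as shown above. The function names (`internal_nodes`, `finished_nodes`, `progress`) are hypothetical helpers made up for this illustration.

```python
import re

# Internal (ancestor) node labels directly follow a ")" in the Newick string,
# e.g. ")Anc1:0.020593" or ")Anc0;". Assumption: labels start with a letter or "_".
INTERNAL_NODE = re.compile(r"\)([A-Za-z_]\w*)")

# Completion message, assumed to look like the log excerpt quoted in this thread.
DONE = re.compile(r"cactus_consolidated\((\w+)\): Cactus consolidated is done!")


def internal_nodes(newick):
    """Return the labelled internal nodes of a Newick string, in order of appearance."""
    return INTERNAL_NODE.findall(newick)


def finished_nodes(log_lines):
    """Return the nodes whose completion message appears in the log, without duplicates."""
    seen = []
    for line in log_lines:
        match = DONE.search(line)
        if match and match.group(1) not in seen:
            seen.append(match.group(1))
    return seen


def progress(newick, log_lines):
    """Split the tree's internal nodes into (done, pending) based on the log."""
    total = internal_nodes(newick)
    done = [node for node in finished_nodes(log_lines) if node in total]
    pending = [node for node in total if node not in done]
    return done, pending


if __name__ == "__main__":
    tree = (
        "((simHuman_chr6:0.144018,(simMouse_chr6:0.084509,simRat_chr6:0.091589)"
        "mr:0.271974)Anc1:0.020593,(simCow_chr6:0.18908,simDog_chr6:0.16303)"
        "Anc2:0.032898)Anc0;"
    )
    log = [
        "cactus_consolidated(Anc2): Dumped reference sequences, 13 seconds have elapsed",
        "cactus_consolidated(Anc2): Cactus consolidated is done!, 13 seconds have elapsed",
    ]
    done, pending = progress(tree, log)
    print(f"done: {done}; pending: {pending}")
```

Note that the count of finished nodes is only a coarse indicator, not a time estimate: the per-node cost varies, and the root (Anc0) is always aligned last. Any labelled internal node counts here, including ones with custom names such as `mr` in the example tree.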

Off the top of my head, I'd estimate about a month running time for your 15 rodents on 40 cores. Very curious which figure in the 2020 paper would lead you to believe it could be done in under 6 days...

evo-eco-gen commented 1 year ago

Thanks, this has been helpful. As for the inaccurate runtime prediction: I can now see that the genomes you simulated in that study were 30 Mb in length, not 300 Mb, which explains why I was off by a factor of ten in my estimate.