ComparativeGenomicsToolkit / cactus

Official home of genome aligner based upon notion of Cactus graphs

Only trim outgroups if there are any #1399

Closed glennhickey closed 3 weeks ago

glennhickey commented 4 weeks ago

The job trim_unaligned_sequences currently estimates its memory as a function of its input size.

memory=cactus_clamp_memory(8*sum([seq.size for seq in sequences]) + 32*alignments.size)

This seems to have been working fairly well, but @ph09 ran into an issue where it would estimate the minimum value of 2G while needing roughly 11G. It turns out this happens at the root node of a 3-human alignment, where the input PAF is 40Mb but there are no input sequences, which explains the minimal estimate. My guess is that paffy to_bed uses memory proportional to the ingroup sequences, so the job runs out of memory.
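
To make the arithmetic concrete, here is a rough sketch of how the estimate collapses to the floor in that case. It is not the actual cactus code: `clamp_memory`, `SizedFile` and `MIN_MEMORY` are hypothetical stand-ins, the 2G floor is taken from the estimate reported above, and "40Mb" is assumed to mean 40 million bytes.

```python
from collections import namedtuple

# Hypothetical stand-in for a file handle that knows its size in bytes.
SizedFile = namedtuple("SizedFile", ["size"])

# Assumed 2G floor, matching the minimum estimate reported above.
MIN_MEMORY = 2 * 1024**3


def clamp_memory(estimate, min_memory=MIN_MEMORY):
    """Hypothetical clamp: never return less than the floor.

    The real cactus_clamp_memory may also apply an upper bound; that is
    left out of this sketch.
    """
    return max(min_memory, int(estimate))


def estimate_trim_memory(sequences, alignments):
    """Same formula as the job: 8x total sequence bytes + 32x PAF bytes."""
    raw = 8 * sum(seq.size for seq in sequences) + 32 * alignments.size
    return clamp_memory(raw)


# Root node of the 3-human alignment: a 40 MB PAF and no input sequences.
paf = SizedFile(size=40 * 10**6)
raw_estimate = 32 * paf.size          # only the PAF term contributes
root_estimate = estimate_trim_memory([], paf)
```

With an empty sequence list the first term is zero, so the raw estimate is only 32 times the PAF size (about 1.28G), which falls below the floor and gets clamped up to 2G.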

This PR changes the logic so that outgroup trimming is turned off when there are no outgroups. From what I can tell looking at the code, it's not doing anything in these cases other than sometimes running out of memory.
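
The shape of the change is roughly the following. This is only an illustrative sketch, not the file delta in this PR: `plan_alignment_steps` and the step names other than `trim_unaligned_sequences` are made up for the example.

```python
def plan_alignment_steps(ingroup_names, outgroup_names):
    """Return the ordered list of steps to run for one alignment node.

    Hypothetical helper: the trimming step is only scheduled when there
    is at least one outgroup to trim.
    """
    steps = ["align_ingroups"]
    if outgroup_names:
        # Outgroups present: align them and trim their unaligned sequence.
        steps.append("align_outgroups")
        steps.append("trim_unaligned_sequences")
    # No outgroups: skip trimming entirely instead of running a no-op job
    # with an underestimated memory requirement.
    steps.append("chain_and_finish")
    return steps


# Root node of a 3-human alignment: three ingroups, no outgroups.
root_steps = plan_alignment_steps(["hg38", "chm13", "hg002"], [])
# An internal node with one outgroup still gets the trimming step.
inner_steps = plan_alignment_steps(["hg38", "chm13"], ["hg002"])
```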

But @benedictpaten could you please take a quick look at the file delta here and confirm I'm not breaking anything by turning off outgroup trimming this way? Thanks!

benedictpaten commented 4 weeks ago

I think this LGTM, I can't think of a reason why this would mess with anything...