Instead of running a single minigraph -xggs construction call on all input sequences at once, it now runs a sequence of such calls on smaller batches, each in a separate Toil job.
The reasons for doing this are:
Each batch serves as a checkpoint from which a failed workflow can be resumed. For example, when aligning 200 sequences, if you run out of memory on the last one, you don't need to realign all 200; instead, it will start again at 150 (using the default batch size of 50).
Smaller command lines. Cactus will sometimes use strings for parameter lists (as per its piping, Docker, etc. logic). With enough input genomes, it's easy to overrun a system limit like MAX_ARG_STRLEN, which prevents the minigraph command from being run at all. Long command lines also make for messy logging.
Both these issues become more likely now that the HAL limitation on input genomes is fixed.
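To make the checkpoint arithmetic concrete, here is a minimal sketch of the batching idea in Python. It is not the code from this change: `make_batches` and the `seq_N` names are hypothetical, and the sketch only shows how 200 inputs split into batches of 50, not how each batch is handed to minigraph or to Toil.

```python
def make_batches(sequences, batch_size=50):
    """Split the inputs into consecutive batches of at most batch_size items.

    Hypothetical helper for illustration only; not a Cactus function.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [
        sequences[start:start + batch_size]
        for start in range(0, len(sequences), batch_size)
    ]


# 200 placeholder sequence names, numbered from 0.
sequences = [f"seq_{i}" for i in range(200)]
batches = make_batches(sequences, batch_size=50)

# If the final batch fails, only that batch needs to be rerun.
# Its first element is the sequence at index 150.
first_of_last_batch = batches[-1][0]
```

With these numbers there are four batches, and a failure in the last one leaves the first 150 sequences already processed.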
The batch size is controlled by <graphmap minigraphConstructBatchSize="50"> in the config XML (the default is 50, as shown).
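As an illustration of how such an attribute could be read, the sketch below parses a small XML fragment with Python's standard library. The `<config>` wrapper element and the `get_batch_size` helper are assumptions made for this example; only the `graphmap` element and its `minigraphConstructBatchSize` attribute come from the description above.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: the <config> root is a placeholder wrapper,
# not necessarily the root element of the real config file.
CONFIG_FRAGMENT = '<config><graphmap minigraphConstructBatchSize="50"/></config>'


def get_batch_size(xml_text, default=50):
    """Return the batch size from the graphmap element, or the default."""
    root = ET.fromstring(xml_text)
    node = root.find("graphmap")
    if node is None:
        return default
    return int(node.get("minigraphConstructBatchSize", default))


batch_size = get_batch_size(CONFIG_FRAGMENT)
```

Lowering the value in the config would produce more, smaller batches, and therefore more checkpoints and shorter command lines.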