ComparativeGenomicsToolkit / cactus

Official home of genome aligner based upon notion of Cactus graphs

Turn node chopping on for all output by default #1481

Closed glennhickey closed 2 months ago

glennhickey commented 2 months ago

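cactus-pangenome originally had two types of output: unchopped files (such as the gfa and vcf) and giraffe-related files whose nodes were chopped to fit a node length limit.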

A translation was kept between the two in a .trans file, which was later built into the .gbz format when we switched to that.

This was always confusing, as it was difficult to use vg to debug the gfa or vcf and sometimes giraffe would produce output in one coordinate system when you expected the other.

Now that .gbz is becoming a more general (not just for giraffe) interchange format, and the .dist file (which shares the node length limit) is starting to replace the old snarl format, having unchopped files hanging around only gets more confusing.

Also, a couple of near-future updates (path normalization and off-reference VCFs) will be working on .vg files (currently unchopped) but would need chopping for the distance index, and should be consistent with the chopped gbz.

Anyway, that's why this PR changes things to always chop to 1024bp right at the outset. This should guarantee all output files are node-id compatible with each other. If someone doesn't want to use vg they can use --unchopped-gfa to get an (explicitly) unchopped graph with the .unchopped.gfa.gz suffix. If someone prefers the old logic of having only the giraffe-related files chopped, they can set maxNodeLength to -1 in the config XML to bring it back.
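For readers unfamiliar with node chopping, here is a minimal Python sketch of the idea: every node longer than the limit is split into consecutive pieces, and a translation records where each piece came from. This is not the code cactus or vg actually runs; `chop_nodes`, the dict-based graph representation, and the sequential id numbering are all assumptions made for illustration, and edges and paths are ignored entirely.

```python
from typing import Dict, Tuple

MAX_NODE_LENGTH = 1024  # the chop length named in this PR


def chop_nodes(
    nodes: Dict[int, str], max_len: int = MAX_NODE_LENGTH
) -> Tuple[Dict[int, str], Dict[int, Tuple[int, int]]]:
    """Split every node sequence into pieces of at most max_len bases.

    Hypothetical helper for illustration only. Returns (chopped, translation),
    where translation maps each new node id to
    (original node id, offset of the piece within the original node).
    """
    if max_len <= 0:
        raise ValueError("max_len must be positive in this sketch")

    chopped: Dict[int, str] = {}
    translation: Dict[int, Tuple[int, int]] = {}
    next_id = 1  # assumption: new ids are assigned sequentially from 1

    for orig_id in sorted(nodes):
        seq = nodes[orig_id]
        # max(len(seq), 1) keeps an empty node as a single empty piece.
        for offset in range(0, max(len(seq), 1), max_len):
            chopped[next_id] = seq[offset:offset + max_len]
            translation[next_id] = (orig_id, offset)
            next_id += 1

    return chopped, translation


if __name__ == "__main__":
    graph = {1: "A" * 2500, 2: "C" * 10}
    chopped, translation = chop_nodes(graph)
    for node_id in sorted(chopped):
        print(node_id, len(chopped[node_id]), translation[node_id])
```

The sketch shows why mixing chopped and unchopped files is awkward: once a long node is split, its pieces get new ids, so any file still using the original ids needs the translation to be compared against the chopped ones. Chopping once at the outset avoids that lookup.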