cactus-pangenome originally had two types of output:
- giraffe-specific (.gbwt, .xg, .min, .dist), whose nodes were chopped to 1024bp
- everything else (.gfa.gz, .vcf.gz, .vg, odgi), which had no node length limit
A translation was kept between the two in a .trans file, which was later built into the .gbz format when we switched to that.
This was always confusing: it was difficult to use vg to debug the gfa or vcf, and sometimes giraffe would produce output in one coordinate system when you expected the other.
Now that .gbz is becoming a more general (not just for giraffe) interchange format, and the .dist file (which shares the node length limit) is starting to replace the old snarl format, having unchopped files hanging around only gets more confusing.
Also, a couple of near-future updates (path normalization and off-reference VCFs) will operate on .vg files, which are currently unchopped but would need chopping for the distance index and should be consistent with the chopped .gbz.
Anyway, that's why this PR changes things to always chop to 1024bp right at the outset. This should guarantee that all output files are node-id compatible with each other. If someone doesn't want to use vg, they can use --unchopped-gfa to get an (explicitly) unchopped graph with the .unchopped.gfa.gz suffix. If someone prefers the old logic of chopping only the giraffe-related files, they can set maxNodeLength to -1 in the config XML to bring it back.
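
For anyone unfamiliar with what chopping and the translation amount to, here is a minimal Python sketch. It is an illustration only, not the code cactus or vg actually runs: `chop_node`, `next_id`, and the way new ids are assigned are all hypothetical. It splits one node's sequence into pieces of at most 1024bp and records, for each new node, which original node and offset it came from (the kind of information the .trans file carried).

```python
from typing import Dict, List, Tuple

MAX_NODE_LENGTH = 1024  # node length limit described above


def chop_node(
    node_id: int,
    sequence: str,
    next_id: int,
    max_len: int = MAX_NODE_LENGTH,
) -> Tuple[List[Tuple[int, str]], Dict[int, Tuple[int, int]]]:
    """Split one node into pieces of at most max_len bp.

    Returns (pieces, translation):
      pieces      -- list of (new_id, subsequence), in order
      translation -- new_id -> (original node id, offset in original node)

    The id assignment (consecutive ids starting at next_id) is an
    assumption made for this sketch, not vg's actual scheme.
    """
    pieces: List[Tuple[int, str]] = []
    translation: Dict[int, Tuple[int, int]] = {}
    for offset in range(0, len(sequence), max_len):
        new_id = next_id + len(pieces)
        pieces.append((new_id, sequence[offset:offset + max_len]))
        translation[new_id] = (node_id, offset)
    return pieces, translation


# Example: a 2500bp node becomes three nodes, none longer than 1024bp.
pieces, translation = chop_node(node_id=7, sequence="A" * 2500, next_id=100)
```

With the old behaviour, a position reported on one of the new nodes had to be mapped back through `translation` before it could be compared against the unchopped gfa or vcf. Chopping everything at the outset means no such lookup is needed between output files.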