Open jooppenh opened 8 months ago
Hi, I see what you mean.
I think the confusing thing here is that GBZ does not store path end coordinates, so the conversion from GFA is lossy in this sense. (though end coordinates can in theory be recovered from walking the path).
To make matters more confusing, vg paths
is not distinguishing between a path with no subrange and a path that starts at 0, which leads the subrange and [0]
to be dropped as in your example.
I agree with you that this is a bug: your starting subpath is a subpath and should be explicitly reported as such with [0]
. The bad news is that by losing end coordinates, it's tough for vg to handle this case. The good news is that this is somewhat related to the vg issue here: https://github.com/vgteam/vg/issues/4240 and I think @adamnovak is working on a major overhaul that will solve it. @adamnovak can you please bear this case in mind while doing so?
I used
cactus-pangenome
(Cactus v2.7.2) to make a graph with two samples supplied to--reference
, and then used Giraffe to align Illumina reads to the clip graph and surject to the non-backbone reference.I noticed that resulting bam only had portions of some chromosomes present, specifically those where the chromosome starts at position 0.
Here's an example from this happening on chromosome 7 in the graph. In the clip gfa, there are two W-lines:
But in the path metadata from the GBZ (which I think is generated from the gfa?), only the first of these segments is present:
For comparison, the path metadata looks slightly different in chromosomes where the start is also clipped, with the subranges all specified.
I was hoping there might be an easy way to either: a) modify paths in the clip GBZ to add the other W-lines of chromosomes without clipped starts that don't end up as paths or b) perform the clipping from the full graph in such a way where these segments aren't dropped
Thanks!