ComparativeGenomicsToolkit / cactus

Official home of genome aligner based upon notion of Cactus graphs
Other
523 stars 111 forks source link

Paths dropped in clipped MC graph #1311

Open jooppenh opened 8 months ago

jooppenh commented 8 months ago

I used cactus-pangenome (Cactus v2.7.2) to make a graph with two samples supplied to --reference, and then used Giraffe to align Illumina reads to the clip graph and surject to the non-backbone reference.

I noticed that resulting bam only had portions of some chromosomes present, specifically those where the chromosome starts at position 0.

Here's an example from this happening on chromosome 7 in the graph. In the clip gfa, there are two W-lines:

W       bisBis  0       7       0       9728194
W       bisBis  0       7       9839261 109738145

But in the path metadata from the GBZ (which I think is generated from the gfa?), only the first of these segments is present:

bisBis#0#7      REFERENCE       bisBis  0       7       NO_PHASE_BLOCK  NO_SUBRANGE
bisBis#0#7[9839261]     REFERENCE       bisBis  0       7       NO_PHASE_BLOCK  9839261

For comparison, the path metadata looks slightly different in chromosomes where the start is also clipped, with the subranges all specified.

bisBis#0#6[135995]      REFERENCE       bisBis  0       6       NO_PHASE_BLOCK  135995
bisBis#0#6[6067444]     REFERENCE       bisBis  0       6       NO_PHASE_BLOCK  6067444
bisBis#0#6[55909666]    REFERENCE       bisBis  0       6       NO_PHASE_BLOCK  55909666

I was hoping there might be an easy way to either: a) modify paths in the clip GBZ to add the other W-lines of chromosomes without clipped starts that don't end up as paths or b) perform the clipping from the full graph in such a way where these segments aren't dropped

Thanks!

glennhickey commented 8 months ago

Hi, I see what you mean.

I think the confusing thing here is that GBZ does not store path end coordinates, so the conversion from GFA is lossy in this sense. (though end coordinates can in theory be recovered from walking the path).

To make matters more confusing, vg paths is not distinguishing between a path with no subrange and a path that starts at 0, which leads the subrange and [0] to be dropped as in your example.

I agree with you that this is a bug: your starting subpath is a subpath and should be explicitly reported as such with [0]. The bad news is that by losing end coordinates, it's tough for vg to handle this case. The good news is that this is somewhat related to the vg issue here: https://github.com/vgteam/vg/issues/4240 and I think @adamnovak is working on a major overhaul that will solve it. @adamnovak can you please bear this case in mind while doing so?