ekg opened this issue 5 years ago
Thank you for a beautiful piece of work that is certain to have a long-lasting impact on this nascent field! I look forward to carefully examining it and expect to find many applications for your method in future work.
I should apologize for making some annoying comments on Twitter about there being bugs in vg's long read alignment model. I thought it would be better to clarify them with some analyses based on your datasets, and Twitter doesn't have space for that. I considered making a gist, but perhaps it's fine to discuss this here. Sorry for the length.
There's nothing to fix and nothing for you to do. This post is just to satisfy my need to explain what you found when evaluating vg. @maickrau brought it to my attention, and I was able to put together a resolution.
First, some documentation about vg map that's missing because it's only in my thesis. (Having written this I think I should post it on the vg wiki...)
vg map for long reads
vg map uses a heuristic alignment model for long reads that is somewhat unusual. Reads are chunked into overlapping pieces of a fixed length (for instance 256bp, the default, which also keeps this mode from being invoked on short reads, or 128bp, which I've found to work better). Each chunk is multimapped using vg's MEM-based alignment model, in which we build a DAG over the MEMs, weighted by how collinear the MEMs are in the read and the graph, and pull out the best alignments with a kind of max-sum dynamic programming algorithm. The chunks are then unified with a similar model (call it the alignment chain DAG), applied to whole alignment chunks rather than MEMs, yielding a series of alignments that string together specific mappings for a subset of the chunks of the original read. Finally, unaligned bits in this chained alignment are "patched" by a banded alignment of the unaligned portions against the subgraphs downstream or upstream of the nearest mapped chunks in the read.
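To make the chunk-and-chain idea concrete, here's a minimal sketch in Python. This is not vg's code: the per-chunk aligner is omitted (in vg it's the MEM-based multimapper), the chain scoring is reduced to score minus a read-distance penalty, and all names here (`ChunkAln`, `chunk_read`, `chain_chunks`) are mine for illustration.

```python
from dataclasses import dataclass

@dataclass
class ChunkAln:
    read_start: int  # where this chunk's alignment begins in the read
    read_end: int    # where it ends in the read
    score: int       # score of the chunk's alignment to the graph
    # (vg would also carry the path through the graph here)

def chunk_read(read, width=128, overlap=32):
    """Cut a read into overlapping, fixed-width chunks."""
    step = width - overlap
    return [(i, read[i:i + width])
            for i in range(0, max(1, len(read) - overlap), step)]

def chain_chunks(alns, gap_penalty=1):
    """Max-sum dynamic programming over the 'alignment chain DAG':
    choose a collinear subset of chunk alignments that maximizes
    total score minus gap/overlap penalties in the read."""
    if not alns:
        return 0, []
    alns = sorted(alns, key=lambda a: a.read_start)
    best = [a.score for a in alns]  # best chain score ending at each node
    prev = [-1] * len(alns)         # backpointers for traceback
    for j, a in enumerate(alns):
        for i in range(j):
            b = alns[i]
            if b.read_start < a.read_start:  # collinear in the read
                # penalize both gaps and excess overlap between chunks
                cand = best[i] + a.score - gap_penalty * abs(a.read_start - b.read_end)
                if cand > best[j]:
                    best[j], prev[j] = cand, i
    j = max(range(len(alns)), key=lambda k: best[k])
    top, chain = best[j], []
    while j != -1:
        chain.append(alns[j])
        j = prev[j]
    # unaligned stretches between or around the chained chunks would then be
    # "patched" by banded alignment into the nearby subgraphs (not shown)
    return top, chain[::-1]
```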
As you've discovered, this approach is complicated and tricky to get right, and not at all as elegant as an exact alignment algorithm. But it's functional, and I applied it to a number of problems in my PhD studies, including progressive graph construction and a long read to complex graph experiment in the vg NBT paper. One key motivation for the design is that it allows alignment of long reads to graphs of arbitrary complexity, and in principle it can allow for the detection of any kind of structural variation (although the heuristic parameter of the chunk size interacts with the scale of SV that you can detect).
One problem that took me a long time to appreciate is that this model only works properly if the chunks inside the read are aligned with something like a global alignment algorithm. We want the chunk alignments to be full length, so that the chaining works and the full length alignment can be found as a path through the alignment chain DAG. This issue isn't so bad when you're progressively aligning high quality chromosomes with 1-2% maximum divergence. It becomes much more noticeable at the 10-15% error rates of single molecule reads. Using a local alignment algorithm, which was the default in v1.9, results in many failed or heavily softclipped alignments in this case.
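Here's a toy illustration of why, assuming nothing about vg's internals: under local scoring, an error burst near a chunk's end is worth clipping, so the chunk alignment stops short and the chain can't recover the full-length mapping; a bonus for reaching the read's ends tips the balance back. `best_clip` is a hypothetical function, gapless for brevity, with a scaled-down bonus playing the role of vg map's full length bonus.

```python
import random

def best_clip(read, ref, match=1, mismatch=-2, full_len_bonus=0):
    """Gapless alignment of read against ref (equal lengths), choosing the
    best left/right softclips. The bonus is granted per read end that is
    NOT clipped. O(n^2) purely to keep the illustration small."""
    per_base = [match if a == b else mismatch for a, b in zip(read, ref)]
    n = len(per_base)
    best = None
    for s in range(n + 1):         # left softclip length
        for e in range(s, n + 1):  # aligned region is read[s:e]
            score = sum(per_base[s:e])
            score += full_len_bonus * ((s == 0) + (e == n))
            if best is None or score > best[0]:
                best = (score, s, n - e)  # (score, left clip, right clip)
    return best

random.seed(1)
ref = "".join(random.choice("ACGT") for _ in range(128))
read = list(ref)
for i in range(100, 110):  # burst of errors near the end of the chunk
    read[i] = {"A": "C", "C": "G", "G": "T", "T": "A"}[read[i]]
read = "".join(read)

print(best_clip(read, ref, full_len_bonus=0))  # local: clips the noisy tail
print(best_clip(read, ref, full_len_bonus=5))  # bonus: stays full length
```

With no bonus, the best score comes from clipping the read after the burst; with the bonus, the full-length alignment wins despite the mismatches, so the chunk remains chainable.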
evaluating changes in vg long read alignment on L2 and L3
To demonstrate this, I've realigned L2.fq and L3.fq against LRC.vg.
The setup:

- vg version v1.16
- I've aligned the sets in "default" mode, which uses the same scoring parameters as minimap2 (`-u 2 -q 1 -z 2 -o 2 -y 1`) and should be fairly similar to how you ran it in the paper.
- The second mode is "long", which is equivalent to `-u 2 -q 1 -z 2 -o 2 -y 1 -L 63 -w 128 -O 32`. Here, `-L 63` provides a bonus that encourages full length alignment, while `-w 128` and `-O 32` change the chunk overlap parameters to be more sensitive.
- Any other changes from what you evaluated are bugfixes to the alignment code that are shared between both modes.

Alignment and table generation:
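(The original commands didn't survive the copy here. As a stand-in, a minimal sketch of one way to tabulate the metrics, assuming the alignments are written as GAM and converted to JSON with `vg view -a`; the `identity` and `score` fields are part of vg's Alignment records, while the normalization against a maximum possible score is my choice, not necessarily what was used.)

```python
import json
import subprocess
import sys

def aln_stats(gam_path, match=1, full_len_bonus=0):
    """Yield (identity, score fraction) per alignment in a GAM file,
    using `vg view -a` to convert GAM to one JSON record per line."""
    out = subprocess.run(["vg", "view", "-a", gam_path],
                         capture_output=True, text=True, check=True).stdout
    for line in out.splitlines():
        aln = json.loads(line)
        seq_len = len(aln.get("sequence", ""))
        # normalize against a perfect-match score (plus end bonuses, if used)
        max_score = match * seq_len + 2 * full_len_bonus
        yield aln.get("identity", 0.0), aln.get("score", 0) / max(1, max_score)

if __name__ == "__main__":
    # e.g. python aln_stats.py L2.gam > L2.tsv
    for identity, score_frac in aln_stats(sys.argv[1]):
        print(f"{identity:.4f}\t{score_frac:.4f}")
```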
Resulting in histograms of the identity and score metric for L2: [identity and score histograms for L2]

And L3: [identity and score histograms for L3]
summary
I can't exactly replicate your result in the paper with this change, but I think this is very suggestive of the same pattern you found: for L2, vg-heuristic wasn't doing very well, and for L3 it was basically failing to get full length mappings in many cases. To be certain it's the same as what you saw, I'll have to run the alignments with PaSGAL, which #1 blocks me from doing for the moment.
Please take this as you will. It's all really just to note that the heuristic alignment model in vg isn't fundamentally incapable of aligning these long reads. But you're totally correct: the parameterization in vg v1.9 certainly was, and I can confirm that on this simulated data. Thanks for bearing with me while I document this!
Thanks @ekg for the detailed comment, this is good to know for us and in general for anyone who is a fan of the vg tool :)

We'll need some time to re-run the experiments with the settings you recommend using our test instances. Happy to share the results as soon as I get them.