broadinstitute / gatk

Official code repository for GATK versions 4 and up
https://software.broadinstitute.org/gatk

Decouple dangling end rescue from assembly #5957

Open davidbenjamin opened 5 years ago

davidbenjamin commented 5 years ago

Currently we have elaborate code that adds artificial edges to the assembly graph in order to merge dangling paths back into the reference. This requires a lot of code and is hard to understand. It may be better to find haplotypes from the unmodified graph (we would need to be sure that the best haplotype finder doesn't reward dangling paths just for being short) and then pad the discovered haplotypes so that they all occupy the same reference span.
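To make the padding idea concrete, here is a minimal sketch in Python (not GATK code). It assumes we already know which half-open reference interval `[hap_start, hap_end)` a discovered haplotype covers; the function name `pad_to_reference_span` and its parameters are hypothetical and chosen only for illustration.

```python
def pad_to_reference_span(reference, hap_bases, hap_start, hap_end):
    """Extend a haplotype to the full reference span.

    Assumption: hap_bases replaces reference[hap_start:hap_end], so the
    flanks on either side can be filled in with reference bases.
    """
    if not 0 <= hap_start <= hap_end <= len(reference):
        raise ValueError("haplotype span must lie within the reference")
    # Left flank + haplotype bases + right flank.
    return reference[:hap_start] + hap_bases + reference[hap_end:]


if __name__ == "__main__":
    ref = "ACGTACGTAC"
    # A dangling haplotype covering reference positions 3..7 ("TACG").
    print(pad_to_reference_span(ref, "TAGGT", 3, 7))
```

With this approach every padded haplotype starts and ends at the same reference coordinates, which is the property the artificial graph edges currently provide.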

jamesemery commented 5 years ago

I will start working on this, building on the work that currently resides in #6034. The proposal is to perform KBestHaplotype finding for multiple source/sink vertices and then run Smith-Waterman on the resulting "dangling" haplotypes in order to recover the probable dangling sequence. Hopefully the number of haplotypes will have been reduced enough by that point for this operation to be tolerable in terms of cost.
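As a rough illustration of the Smith-Waterman step, the sketch below locally aligns a dangling haplotype against the reference and reports the reference interval where it anchors. This is a plain textbook implementation in Python, not GATK's aligner; the function name `smith_waterman_span` and the scoring values (`match=2`, `mismatch=-1`, linear `gap=-2`) are assumptions made for the example.

```python
def smith_waterman_span(reference, query, match=2, mismatch=-1, gap=-2):
    """Return (best_score, ref_start, ref_end) of the best local alignment.

    ref_start/ref_end form a half-open interval on the reference.
    Scoring values are illustrative, with a linear gap penalty.
    """
    rows, cols = len(query) + 1, len(reference) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    def sub(i, j):
        # Substitution score for query[i-1] against reference[j-1].
        return match if query[i - 1] == reference[j - 1] else mismatch

    # Fill the dynamic-programming matrix, flooring scores at zero.
    for i in range(1, rows):
        for j in range(1, cols):
            score[i][j] = max(
                0,
                score[i - 1][j - 1] + sub(i, j),
                score[i - 1][j] + gap,
                score[i][j - 1] + gap,
            )
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # Trace back from the best cell until the score drops to zero.
    i, j = best_pos
    ref_end = j
    while i > 0 and j > 0 and score[i][j] > 0:
        if score[i][j] == score[i - 1][j - 1] + sub(i, j):
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            i -= 1
        else:
            j -= 1
    return best, j, ref_end


if __name__ == "__main__":
    print(smith_waterman_span("AAACCCGGGTTT", "CCCGGG"))
```

The full matrix costs time proportional to the product of the two sequence lengths for each haplotype, which is why the number of dangling haplotypes reaching this step matters.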