broadinstitute / gatk

Official code repository for GATK versions 4 and up
https://software.broadinstitute.org/gatk
Other
1.68k stars 588 forks source link

Use Dijkstra's algorithm for finding best haplotypes from assembly graph #3561

Closed davidbenjamin closed 5 years ago

davidbenjamin commented 7 years ago

@vruano has pointed out that our method of finding the best haplotype paths in an assembly graph is equivalent to Dijkstra's algorithm for finding the shortest path in a directed graph, and that the latter is a much simpler implementation. [ Do I understand this correctly?].

We could simplify a bunch of code without changing the output of any tools by switching the implementation to Dijkstra's algorithm, which is implemented in jgrapht (our graphs extend this package's graph class) and apache commons.

@vruano has also pointed out that our definition of the best paths may not be optimal, which is a separate issue.

vruano commented 7 years ago

Dijkstra origina algorithm is about finding the single shortest route (or one of the in case of a tie), here we need the one that finds the K-shortest routes which is described here.

Is this one implanted in Jgraph? In that case, yes we could....

Otherwise if we have to implement the it from scratch... then there is no guaranteed the code is going to be simpler.... it could simpler just because I didn't bother to make the current one as simple as it could be.

davidbenjamin commented 7 years ago

jgrapht has an implementation for a single source to all other vertices. I think apache does the same.

vruano commented 7 years ago

However, the current algorithm and the k-dijkstra still would show the same problems in terms of doing a suboptimal selection of haplotypes in terms of their coverage of plausible variation.

I had implemented an alternative that fixed that issue... but I couldn't find the code ... perhaps just in my local machine (backups) need to find it.

In any case the idea is quite simple.... we simply simulate haplotypes based on those same furcation likelihoods and we stop when we have not discovered anything new for a while... the problem of such an approach is to make it deterministic. I guess we could fix a seed based on information on the current active region.... anyway,

vruano commented 7 years ago

jgrapht has an implementation for a single source to all other vertices

Here we rather need an algorithm that find the k-best paths from one vertice to another vertice. What you are describing sounds more like 1-best path from one vertice to any other vertice.