edawson / gfakluge

A C++ library and utilities for manipulating the Graphical Fragment Assembly format.
http://edawson.github.io/gfakluge/
MIT License
51 stars 20 forks source link

[FEATURE REQUEST] Update diff to use MinHash-based sequence comparison #45

Open edawson opened 5 years ago

edawson commented 5 years ago

The gfak diff command is currently a very (very) dumb comparison tool. I've finally got around to drawing up a useful way to compare two graphs, which may or may not have identical ID spaces. I don't want to implement full graph-graph alignment as it's going to be difficult and crazy expensive. Instead, I'm going to go with the following process.

Roughly:

  1. Check the number of S/E lines and check if they're within some threshold of each other. If this condition is not met, return false.
  2. Next, create MinHash sketches of all the S lines in the two graphs. Perform all vs all comparison on S lines above some length.
  3. Sort the all-v-all sketch comparisons by their approximate Jaccard index. In descending order, go through these matches, recording the ~ percent ID between an S line and its most-similar match in the other graph. This yield an overall similarity for the sequence-space of the graph.
  4. Now that we have an approximate alignment between the graphs, we can compare the connectivity of the individual segments. From this, we can determine if the two graphs are identical.

I'd also like to implement a way to output percent identity between two graphs, and I think this is a step on the way to doing so.