cucapra / pollen

generating hardware accelerators for pangenomic graph queries
MIT License
24 stars 1 forks source link

Fixes in `slow-odgi flip` #52

Closed anshumanmohan closed 1 year ago

anshumanmohan commented 1 year ago

There was a little more to flip than I had thought: when the algorithm flips a path, it also adds links to the graph in support of the newly-flipped path.

Splitting hairs a little, there are two options for when such links should be added:

  1. Exclusively in support of newly-flipped paths.
  2. In support of all paths, newly-flipped and otherwise.

This is being discussed in https://github.com/pangenome/odgi/issues/496. Arguably this distinction doesn't even matter, since it only really comes up when the input graph is invalid.

Anyway, I currently do (1) and odgi seems to do (2). I am happy to switch to (2); it's just a matter of relaxing the proposition passed to gen_links so it reads lambda _: True. The handmade file flip4.gfa shows this divergence.

We also continue to diverge on the matter of note5.gfa; let's see how that shakes out. Otherwise we diff out correctly versus odgi.

More details in the commit messages if curious!

anshumanmohan commented 1 year ago

tl;dr:

  1. If we change odgi flip as I propose (option 1 above), I believe note5 will also get fixed to my preference: running flip on it will become a no-op.
  2. However, it can be argued that the change that flip is making on note5 is a no-op, in which case it may be okay to leave as-is. I think this is stylistically iffy.

Fuller explanation

Check out note5.gfa. I claim this graph does not need to be flipped; the correct response is a no-op. Observe also that this is a valid graph.

In https://github.com/pangenome/odgi/pull/485#issuecomment-1496693803 I show what odgi flip does to the graph. The thing is, I think odgi considers those two links to be equal, and, in a manner of speaking, is doing a no-op. Slipping briefly into slow-odgi terminology, I think that in general,

Link(
  Handle(from, from_dir),
  Handle(to, to_dir)
)

is considered equal to

Link(
  Handle(to, not to_dir),
  Handle(from, not from_dir),
)

So really what's happened with note5 is:

  1. The link L 3 - 4 + 0M was already in the graph (you can see it on line 11).
  2. flip looked to see which paths needed flipping, and found the answer to be none.
  3. flip scrapped the links and began to build them afresh, with the paths (all old, in this case) for reference.
  4. For some reason L 4 - 3 + 0M was proposed, and it was added to the list of links. I'm not sure why it was proposed, since ...,4-,3+,... does not appear on any path (...,3-,4+,... does).
  5. Maybe at some point thereafter, L 3 - 4 + 0M was proposed but rejected by some kind of dedup checker on account of a clash with L 4 - 3 + 0M.

If this behavior is okay, I'll move on, I promise! My feeling stylistically is that it should be a no-op plain and simple, not a no-op modulo Link-rev-equality. FWIW, this is achieved "for free" if we adopt my option 1 above. Anyway, not a hill to die on!

sampsyo commented 1 year ago

Interesting! That equivalence is an interesting observation.

It does occur to me that this may be further invocation that edges should ideally be second-class citizens—mainly reflections of the set of paths, which are what really matter.