cucapra / pollen

generating hardware accelerators for pangenomic graph queries
MIT License

`slow-odgi inject` #51

Closed anshumanmohan closed 1 year ago

anshumanmohan commented 1 year ago

This PR will add functionality for odgi inject. I am making it a draft so I can comment on my commits as they come in; I'll mark it ready when done.

anshumanmohan commented 1 year ago

odgi inject works as follows.

You have graph.gfa:

S 1 AAAA
S 2 TTTT
S 3 GGGG
P x 1+,2+,3+ *

And new_paths.bed:

x    0    8    y    

Running odgi inject will give you:

S 1 AAAA
S 2 TTTT
S 3 GGGG
P x 1+,2+,3+ *
P y 1+,2+ *

That is, each line of the .bed file names an existing path, the range of that path to track, and a name for the new path. The result is that a new path is inserted, essentially a subpath of the original.
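
For concreteness, here is a minimal Python sketch of this easy case. It is not the slow-odgi code: the function name `inject_aligned`, the dict-based graph, and the forward-only steps are assumptions made for illustration. It treats the range as half-open, as BED does, which matches the example above (0 to 8 covers the eight bases of segments 1 and 2).

```python
from typing import Dict, List, Optional


def inject_aligned(
    segments: Dict[str, str], steps: List[str], lo: int, hi: int
) -> Optional[List[str]]:
    """Return the steps of a path that cover [lo, hi) in path coordinates.

    Hypothetical helper: `steps` lists segment names, all assumed forward.
    Returns None when lo or hi falls strictly inside a segment, i.e. when
    the range does not line up with the segment boundaries.
    """
    new_steps = []
    pos = 0  # offset of the current segment's first base along the path
    for name in steps:
        end = pos + len(segments[name])
        if lo <= pos and end <= hi:
            new_steps.append(name)  # segment lies wholly inside the range
        elif pos < hi and end > lo:
            return None  # segment straddles lo or hi
        pos = end
    return new_steps


segments = {"1": "AAAA", "2": "TTTT", "3": "GGGG"}
path_x = ["1", "2", "3"]
print(inject_aligned(segments, path_x, 0, 8))  # ['1', '2']
```

Returning `None` for a misaligned range is just one way to flag the harder case; it says nothing about how odgi itself reports or handles it.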

Here's the rub: what if the .bed file describes a legal subpath, but one that does not happen to line up with the current segment boundaries?

x    1    6    y    

We need to split segments 1 and 2 in order to make this work.

S 1 A
S 2 AAA
S 3 TT
S 4 TT
S 5 GGGG
P x 1+,2+,3+,4+,5+ *
P y 2+,3+ *

As you can see, this required edits to the path x as well.
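
To see which segments need splitting for a given range, one can walk the path and check whether either endpoint lands strictly inside a segment. The sketch below does only that detection step. The name `cuts_needed` and the forward-only, dict-based representation are assumptions for illustration, not odgi's or slow-odgi's API.

```python
from typing import Dict, List


def cuts_needed(
    segments: Dict[str, str], steps: List[str], lo: int, hi: int
) -> Dict[str, List[int]]:
    """Map each segment that must be split to its local cut offsets.

    Hypothetical helper: `steps` lists segment names along one path, all
    assumed forward; [lo, hi) is a half-open range in path coordinates.
    """
    needed: Dict[str, List[int]] = {}
    pos = 0  # offset of the current segment's first base along the path
    for name in steps:
        end = pos + len(segments[name])
        for cut in (lo, hi):
            # A cut exactly on a boundary (pos or end) needs no split.
            if pos < cut < end:
                offsets = needed.setdefault(name, [])
                if cut - pos not in offsets:
                    offsets.append(cut - pos)
        pos = end
    return {name: sorted(offsets) for name, offsets in needed.items()}


segments = {"1": "AAAA", "2": "TTTT", "3": "GGGG"}
path_x = ["1", "2", "3"]
print(cuts_needed(segments, path_x, 1, 6))  # {'1': [1], '2': [2]}
print(cuts_needed(segments, path_x, 0, 8))  # {}
```

For the range 1 to 6 this reports a cut one base into segment 1 and two bases into segment 2, which is where the split graph above breaks them.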

anshumanmohan commented 1 year ago

I say it works as above, but I think there is something amiss either in my understanding or in the odgi code. See here if curious. I will proceed as per my understanding for now!

anshumanmohan commented 1 year ago

Given that things are up in the air re: our oracle, I have moved temporarily to an expect-testing style. The expect files, *.inj, line up with my explanation above.

  1. The h and i families show that the path range given in the .bed file is treated as relative to the path, not relative to some global FASTA-style record of the sequences of the segments.
  2. The j family tests the first, relatively easy situation shown above: the path-range to be tracked lines up nicely with the existing seams of the segments.
  3. The l family tests the second situation, where the path-range is legal but does not line up with existing segments. The segments need to be chopped and renumbered, and previously existing paths need to be reworked. I use the word "chop" with some care: I totally anticipate some code reuse between this bit and odgi chop.
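
To make point 1 concrete, here is a small Python sketch. The graph and the step order are made up for illustration and are not taken from the h or i test files. The path visits the segments in a different order from the one in which the file lists them, so the same range picks out different bases depending on which coordinate system is used.

```python
from typing import Dict, List


def path_sequence(segments: Dict[str, str], steps: List[str]) -> str:
    """Spell out the bases along a path (steps assumed forward)."""
    return "".join(segments[name] for name in steps)


segments = {"1": "AAAA", "2": "TTTT", "3": "GGGG"}
path_x = ["3", "1"]  # hypothetical path: segment 3 first, then segment 1
lo, hi = 0, 4

# Range read relative to the path: the first four bases that x walks over.
path_relative = path_sequence(segments, path_x)[lo:hi]

# Range read relative to the segments concatenated in file order.
file_relative = "".join(segments.values())[lo:hi]

print(path_relative)  # GGGG
print(file_relative)  # AAAA
```

Under the path-relative reading, a request for 0 to 4 on this path should yield a new path over segment 3, not segment 1.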

anshumanmohan commented 1 year ago

Basic functionality is in. We support the cases laid out by h, i, and j.

anshumanmohan commented 1 year ago

And, done with inject!

The last bit was tricky, but here's the basic plan:

  1. Run a transformation (in two passes) over the graph that performs two chops on the underlying graph if needed, without trying to add the new paths yet. This transformation does the housekeeping necessary to get the existing paths and such looking functionally the same as before.
  2. Now the hard case, where inject quietly requires a chop, devolves into the easy case that I have already solved. Just run that old code.

I was able to exploit chop's chop_paths method to do the heavy lifting.
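
The plan above can be sketched roughly as follows. This is an illustration under simplifying assumptions, not the code in this PR: it does not use chop's chop_paths, every step is assumed forward, segments are renumbered from 1 in file order, and the names `chop_for_range`, `inject_aligned`, and `inject` are hypothetical.

```python
from typing import Dict, List, Set, Tuple

Segments = Dict[str, str]     # segment name -> sequence
Paths = Dict[str, List[str]]  # path name -> segment names (forward only)


def chop_for_range(
    segments: Segments, paths: Paths, path_name: str, lo: int, hi: int
) -> Tuple[Segments, Paths]:
    """Step 1: split segments so lo and hi sit on boundaries of path_name."""
    # Find the local offsets at which each segment must be cut.
    local: Dict[str, Set[int]] = {name: set() for name in segments}
    pos = 0
    for name in paths[path_name]:
        end = pos + len(segments[name])
        for cut in (lo, hi):
            if pos < cut < end:
                local[name].add(cut - pos)
        pos = end
    # Cut the segments, renumbering the pieces as we go.
    new_segments: Segments = {}
    pieces: Dict[str, List[str]] = {}
    next_id = 1
    for name, seq in segments.items():
        bounds = [0, *sorted(local[name]), len(seq)]
        pieces[name] = []
        for a, b in zip(bounds, bounds[1:]):
            new_segments[str(next_id)] = seq[a:b]
            pieces[name].append(str(next_id))
            next_id += 1
    # Rework every existing path to walk the new pieces.
    new_paths = {
        p: [piece for old in steps for piece in pieces[old]]
        for p, steps in paths.items()
    }
    return new_segments, new_paths


def inject_aligned(
    segments: Segments, paths: Paths, path_name: str,
    lo: int, hi: int, new_name: str,
) -> Paths:
    """Step 2: the easy case; assumes lo and hi already sit on boundaries."""
    steps = []
    pos = 0
    for name in paths[path_name]:
        end = pos + len(segments[name])
        if lo <= pos and end <= hi:
            steps.append(name)
        pos = end
    return {**paths, new_name: steps}


def inject(segments, paths, path_name, lo, hi, new_name):
    segments, paths = chop_for_range(segments, paths, path_name, lo, hi)
    return segments, inject_aligned(segments, paths, path_name, lo, hi, new_name)


segs, paths = inject(
    {"1": "AAAA", "2": "TTTT", "3": "GGGG"}, {"x": ["1", "2", "3"]},
    "x", 1, 6, "y",
)
print(segs)   # {'1': 'A', '2': 'AAA', '3': 'TT', '4': 'TT', '5': 'GGGG'}
print(paths)  # {'x': ['1', '2', '3', '4', '5'], 'y': ['2', '3']}
```

On the earlier example with the range 1 to 6, this reproduces the five-segment graph and the path y over segments 2 and 3. When the range already lines up, step 1 only renumbers, so on a graph already numbered from 1 it changes nothing.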

anshumanmohan commented 1 year ago

The odgi issue seems resolved, and anyway had to do more with build than with inject, so I just went ahead and changed to the real test files. The little handmade expect-test files are still around, under test/handmade.

With the exception of DRB1-3123.gfa, I diff out correctly against all the GFA files under relatively rigorous testing (I generate a buncha new path-name requests in inject_setup.py). DRB1-3123 is unreadable enough that I need more time to figure out the issue...