cucapra / pollen

generating hardware accelerators for pangenomic graph queries
MIT License

Slow odgi: `overlap` #31

Closed anshumanmohan closed 1 year ago

anshumanmohan commented 1 year ago

This PR adds support for odgi overlap.

The command is straightforward and we diff out cleanly against the present suite of test graphs.

One weird thing: the ODGI output is

#path path_touched
name of query path    number1    number2    name of touched path
name of query path    number1    number2    name of touched path

but the middle two columns are not labeled, and exploring the relevant source code does not tell me very much. Hilariously, I found, just "by eye", that the second column is always 0 and the third is always the length of the sequence charted by the query path.
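To make the "by eye" observation concrete, here is a minimal sketch on a made-up toy graph. The `segments` and `paths` dictionaries and both helper names are hypothetical, invented for illustration; this is not the code in overlap.py or in odgi. It only shows what "the length of the sequence charted by the query path" means: the sum of the lengths of the segments the path walks over.

```python
# Toy data, invented for illustration: segment name -> nucleotide sequence.
segments = {"1": "ACGT", "2": "GG", "3": "TTA"}
# Path name -> ordered list of segment names it walks (orientation ignored).
paths = {"x": ["1", "2", "3"], "y": ["1", "3"]}


def path_length(path_name):
    """Total number of bases charted by the named path."""
    return sum(len(segments[seg]) for seg in paths[path_name])


def observed_columns(path_name):
    """The two unlabeled columns as observed: always 0, then the path length."""
    return 0, path_length(path_name)


print(observed_columns("x"))  # (0, 9)
print(observed_columns("y"))  # (0, 7)
```

If the observation holds, the two middle columns carry no information about the touched path at all, which is what makes a bug seem plausible.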

I think there is a bug in the odgi code. They want something else, and are trying to compute it, but are inadvertently computing the length of the query path instead. I'll explore further and report back here.

There are some computations in this overlap.py that I'd like to lift into preprocess.py. Also, there is a somewhat ugly bit of .sh hacking in my .turnt file. Thoughts about cleaning that up appreciated!

anshumanmohan commented 1 year ago

Meanwhile, a quick heads up for @susan-garry!

I have learned most of what I know about turnt environments from copying off you, but I wonder if you'd like to take a look at the style I now have going. It may help you avoid

The main trick I'm leveraging was taught to me by @sampsyo here:

The way Turnt works is that it looks for turnt.toml by walking "upward" from where the test file itself lives. That means that Turnt works the same way regardless of where you are "standing": running turnt foo.t and turnt ../../../foo.t and turnt bar/baz/qux/foo.t from different working directories (assuming those are all different paths to the same file) are all guaranteed to do the same thing.
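A rough sketch of that upward walk, for intuition only. This is not Turnt's actual implementation, and `find_config` is a hypothetical helper name. The key point is that the search starts from the resolved location of the test file, not from the current working directory.

```python
from pathlib import Path
from typing import Optional


def find_config(test_file: str, name: str = "turnt.toml") -> Optional[Path]:
    """Return the nearest `name` at or above the test file's directory."""
    # Resolve first, so that different relative spellings of the same file
    # all start the search from the same absolute directory.
    start = Path(test_file).resolve().parent
    for directory in [start, *start.parents]:
        candidate = directory / name
        if candidate.is_file():
            return candidate
    return None  # no config found anywhere up to the filesystem root
```

Because the starting point is derived from the file's own path, `find_config("foo.t")` and `find_config("bar/baz/qux/foo.t")` give the same answer whenever they name the same file.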

sampsyo commented 1 year ago

I think there is a bug in the odgi code. They want something else, and are trying to compute it, but are inadvertently computing the length of the query path instead. I'll explore further and report back here.

Very very interesting! I will say again that this is a really useful outcome of doing some slow reference implementations: we debug the fast implementations (and their documentation). Needless to say, the docs for odgi overlap do not clarify what the columns are supposed to mean; hypothetically, a Python implementation should make that crystal clear.

anshumanmohan commented 1 year ago

Thanks a bunch Adrian, I learned a lot of hacky Python!! The turnt trick is real neat too, though now I have a really silly issue that I'll work on shortly. I need to figure out the bash/turnt incantation that will give me, for example, "q.chop.paths" from "q.chop.gfa" and "q.chop.og". My current solution is very silly and gives me "q.paths" (because I cut by "." and grab the first item).

anshumanmohan commented 1 year ago

What do you know, it was there all along. I needed the turnt-provided {base} instead of doing all kinds of magic to {filename}.
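For anyone who hits the same thing, the difference is easy to see in Python. This is a sketch under the assumption that {base} means "the test file's basename minus its final extension", which is how I read the Turnt docs; the two helper names are made up for the comparison.

```python
import os


def first_dot_stem(filename):
    """The old approach: cut on "." and keep only the first piece."""
    return filename.split(".")[0]


def base(filename):
    """Basename with only the final extension removed (assumed {base} behavior)."""
    return os.path.splitext(os.path.basename(filename))[0]


for name in ("q.chop.gfa", "q.chop.og"):
    print(first_dot_stem(name) + ".paths", base(name) + ".paths")
# q.paths q.chop.paths
# q.paths q.chop.paths
```

Only the last extension is stripped, so both "q.chop.gfa" and "q.chop.og" map to "q.chop.paths".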