cucapra / pollen

generating hardware accelerators for pangenomic graph queries
MIT License
24 stars 1 forks source link

Slow odgi edgy #29

Closed anshumanmohan closed 1 year ago

anshumanmohan commented 1 year ago

Currently a draft, but this PR will add a few edge-aware odgi functions to slow-odgi

anshumanmohan commented 1 year ago

This PR adds functionality for odgi matrix and odgi validate. These methods, along with odgi degree, are the primary users of edges (aka links). Other than that, edges can mostly be given second-class treatment: paths carry the same information in a nicer format, assuming we do not worry about unused edges that were not on any path to begin with.

These three algorithms all benefit from the same precomputation. In one pass through the graph's links, I construct the two dicts ins and outs. They have the same type; one holds in-edges and one holds out-edges.

Dict[Tuple[str, bool], List[Tuple[str, bool]]]
# where 
# key: (segment name, orientation)               (my details)
# value: list of (segment name, orientation)     (my neighbors' details)
for link in graph.links:
    ins[(link.to, link.to_orient)].append((link.from_, link.from_orient))
    outs[(link.from_, link.from_orient)].append((link.to, link.to_orient))

The three algorithms fall out straightforwardly from these precomputed dicts.

anshumanmohan commented 1 year ago

Unfortunately, matrix and validate both have one case each where they do not diff out correctly against the odgi oracle.

I will investigate both of these further!

Re: testing and SE, there is another silly issue: I am occasionally forced to create temp files, and those have to have the .gfa extension in order for odgi to accept them as inputs. Then, because they have the .gfa extension, turnt tries to run tests on these temp files as though they were bona fide test files. Silly and a bit gross, and I'd appreciate any thoughts about cleaning up this workflow.

anshumanmohan commented 1 year ago

Fixed the normalization issue in matrix easily enough.

The issue with validate was more subtle. When the GFA has the link L n1 + n2 + and a path needs to traverse ...n2-, n1-,..., odgi wants us to say that this is okay. I had been saying that this is not okay (and requiring a link that actually looked like L n2 - n1 -). I have weakened my check and I diff out correctly versus odgi validate now, but really I think this is kinda bananas and goes to show how silly edges are.

sampsyo commented 1 year ago

Wow; hella awesome! I'm glad this edge stuff didn't present overly much trouble!

These three algorithms all benefit from the same precomputation. In one pass through the graph's links, I construct the two dicts ins and outs. They have the same type; one holds in-edges and one holds out-edges.

FWIW, I might summarize this by calling it the "adjacency list" representation of a graph.

I am occasionally forced to create temp files, and those have to have the .gfa extension in order for odgi to accept them as inputs. Then, because they have the .gfa extension, turnt tries to run tests on these temp files as though they were bona fide test files. Silly and a bit gross, and I'd appreciate any thoughts about cleaning up this workflow.

Ah, that is annoying: to summarize, both odgi and Turnt are extension-sensitive, so they are competing for the namespace here. One option might be to build up a teensy bit of infrastructure to put those temporary files into a temporary directory… although using mktemp(1) to make a file with a special extension is going to be pretty annoying. I wonder if we might be happiest just implementing something like https://github.com/cucapra/turnt/issues/23 in Turnt?

susan-garry commented 1 year ago

Re: testing and SE, there is another silly issue: I am occasionally forced to create temp files, and those have to have the .gfa extension in order for odgi to accept them as inputs. Then, because they have the .gfa extension, turnt tries to run tests on these temp files as though they were bona fide test files. Silly and a bit gross, and I'd appreciate any thoughts about cleaning up this workflow.

Maybe the simplest solution is just to create a list of SLOW_OG_TEST_FILES and then run turnt on this list of actual test files. For example:

SLOW_OG_TEST_FILES := $(TEST_FILES:%=test/%.gfa)
// Equivalent to SLOW_OG_TEST_FILES = t.gfa k.gfa note5.gfa ... LPA.gfa

test-slow-validate: fetch
    test/perturb.sh
    -turnt -v --save --env validate_oracle $(SLOW_OG_TEST_FILES)
    turnt -v --env validate_test $(SLOW_OG_TEST_FILES)
anshumanmohan commented 1 year ago

Done with the tasks that fell out of your review @sampsyo! The solution for perturbing the graphs to generate new .gfas is rather naughty indeed, so please let me know what you think. If you think it's okay (given the alternatives), please hit merge!