Closed anshumanmohan closed 1 year ago
This PR adds functionality for odgi matrix
and odgi validate
. These methods, along with odgi degree
, are the primary users of edges (aka links). Other than that, edges can mostly be given second-class treatment: paths carry the same information in a nicer format, assuming we do not worry about unused edges that were not on any path to begin with.
These three algorithms all benefit from the same precomputation. In one pass through the graph's links, I construct the two dicts ins
and outs
. They have the same type; one holds in-edges and one holds out-edges.
Dict[Tuple[str, bool], List[Tuple[str, bool]]]
# where
# key: (segment name, orientation) (my details)
# value: list of (segment name, orientation) (my neighbors' details)
for link in graph.links:
ins[(link.to, link.to_orient)].append((link.from_, link.from_orient))
outs[(link.from_, link.from_orient)].append((link.to, link.to_orient))
The three algorithms fall out straightforwardly from these precomputed dicts.
Unfortunately, matrix
and validate
both have one case each where they do not diff out correctly against the odgi oracle.
matrix
has a simple normalization issue. In the sparse matrix representation of LPA.gfa, the order of nodes I present differs from the order that odgi presents in two instances (two rows of the matrix). The information itself is correct.validate
, my version reports a few more missing links than odgi does. This is only for the graph DRB1-3123.gfa. I will investigate both of these further!
Re: testing and SE, there is another silly issue: I am occasionally forced to create temp files, and those have to have the .gfa extension in order for odgi to accept them as inputs. Then, because they have the .gfa extension, turnt tries to run tests on these temp files as though they were bona fide test files. Silly and a bit gross, and I'd appreciate any thoughts about cleaning up this workflow.
Fixed the normalization issue in matrix
easily enough.
The issue with validate
was more subtle. When the GFA has the link L n1 + n2 +
and a path needs to traverse ...n2-, n1-,...
, odgi wants us to say that this is okay. I had been saying that this is not okay (and requiring a link that actually looked like L n2 - n1 -
). I have weakened my check and I diff out correctly versus odgi validate
now, but really I think this is kinda bananas and goes to show how silly edges are.
Wow; hella awesome! I'm glad this edge stuff didn't present overly much trouble!
These three algorithms all benefit from the same precomputation. In one pass through the graph's links, I construct the two dicts
ins
andouts
. They have the same type; one holds in-edges and one holds out-edges.
FWIW, I might summarize this by calling it the "adjacency list" representation of a graph.
I am occasionally forced to create temp files, and those have to have the .gfa extension in order for odgi to accept them as inputs. Then, because they have the .gfa extension, turnt tries to run tests on these temp files as though they were bona fide test files. Silly and a bit gross, and I'd appreciate any thoughts about cleaning up this workflow.
Ah, that is annoying: to summarize, both odgi and Turnt are extension-sensitive, so they are competing for the namespace here. One option might be to build up a teensy bit of infrastructure to put those temporary files into a temporary directory… although using mktemp(1) to make a file with a special extension is going to be pretty annoying. I wonder if we might be happiest just implementing something like https://github.com/cucapra/turnt/issues/23 in Turnt?
Re: testing and SE, there is another silly issue: I am occasionally forced to create temp files, and those have to have the .gfa extension in order for odgi to accept them as inputs. Then, because they have the .gfa extension, turnt tries to run tests on these temp files as though they were bona fide test files. Silly and a bit gross, and I'd appreciate any thoughts about cleaning up this workflow.
Maybe the simplest solution is just to create a list of SLOW_OG_TEST_FILES
and then run turnt
on this list of actual test files. For example:
SLOW_OG_TEST_FILES := $(TEST_FILES:%=test/%.gfa)
// Equivalent to SLOW_OG_TEST_FILES = t.gfa k.gfa note5.gfa ... LPA.gfa
test-slow-validate: fetch
test/perturb.sh
-turnt -v --save --env validate_oracle $(SLOW_OG_TEST_FILES)
turnt -v --env validate_test $(SLOW_OG_TEST_FILES)
Done with the tasks that fell out of your review @sampsyo! The solution for perturbing the graphs to generate new .gfas is rather naughty indeed, so please let me know what you think. If you think it's okay (given the alternatives), please hit merge!
Currently a draft, but this PR will add a few edge-aware odgi functions to slow-odgi