Closed anshumanmohan closed 1 year ago
Sounds good; we should do something about this indeed, and normalizing the test outputs so they can be byte-for-byte compared sounds like the right thing.
See also the note in https://github.com/cucapra/pollen/pull/25#pullrequestreview-1322849570, however, about a general desire to avoid committing (most) GFA files to this repo.
I've cleaned up the repo, no longer keeping gfa/og files around for the purposes of testing `emit`. I use a Turnt environment to serve as the oracle for `emit` instead of having a bespoke shell script. An interesting thing is that ODGI seems to throw away overlap information, replacing it with a `*`. I've matched this behavior in `mygfa`; see here.
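To make the overlap behavior concrete, here is a minimal sketch of that kind of rewrite. It is not the code in `mygfa`; the function name `drop_overlaps` and the table `OVERLAP_COLUMN` are hypothetical. The comment above does not say which record types ODGI blanks, so the sketch assumes both GFA1 records that carry an overlap field (link lines and path lines) and lets the caller restrict that.

```python
# Column holding the overlap field for each GFA1 record type
# (0-indexed, tab-separated). Hypothetical helper table, not part of mygfa.
OVERLAP_COLUMN = {"L": 5, "P": 3}


def drop_overlaps(line: str, record_types=("L", "P")) -> str:
    """Return `line` with its overlap field replaced by '*'.

    Lines whose record type is not in `record_types`, or that are too
    short to have an overlap field, are returned unchanged.
    """
    fields = line.rstrip("\n").split("\t")
    col = OVERLAP_COLUMN.get(fields[0])
    if fields[0] in record_types and col is not None and len(fields) > col:
        fields[col] = "*"
    return "\t".join(fields)
```

Passing `record_types=("P",)` would restrict the rewrite to path lines only, if that turns out to be what ODGI does.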
The testing works as follows:
- `make fetch` grabs .gfa files from the web
- `make og` runs `odgi build` on these to generate corresponding .og files
- `odgi view` to get a .gfa from it, and then pipes this to mygfa's `emit` method.
- `emit` method.

Run all of these with `make test-emit`.
This PR seeks to point out a potential problem and pose a relatively straightforward solution.
It would be nice if we could parse a GFA and then emit it unchanged, but the present `emit` method fails to do this for a silly reason.

Many of our GFA files follow the format
although they do not strictly need to. Examples are k.gfa, note5.gfa, overlap.gfa, and t.gfa. The `emit` method in our mygfa.py parser also likes to produce lines of output in some order such as this, and so, after a little tweaking, I can parse and emit these graphs unchanged.

Problems arise when some GFAs, like q.chop.gfa, have a more colorful intermix of these lines.
Indeed, `turnt` currently fails in `pollen/slow_odgi/test/emit` on the graphs DRB1-3123.gfa, LPA.gfa, and q.chop.gfa because our `emit` method prefers a nice normalized form.

One option would be to tweak the parser to keep track of these lines and then emit them in the order in which they were read, but I wonder if we could just add a normalization pass to the GFAs: replace GFAs with equivalent GFAs in the nice format sketched above. I'd like to check whether odgi would care and whether the rest of the development would be affected.
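A normalization pass of this kind could be as small as a stable sort on the record-type column. The sketch below is only an illustration: the name `normalize_gfa` is hypothetical, and the ordering `H, S, P, L` is an assumption, since the exact order that `emit` prefers is not spelled out here. The order is a parameter so it can be changed to match.

```python
# Assumed canonical order of GFA record types; adjust to match emit.
ASSUMED_ORDER = ("H", "S", "P", "L")


def normalize_gfa(text: str, order=ASSUMED_ORDER) -> str:
    """Group the lines of a GFA by record type, in the given order.

    Blank lines are dropped. Record types not listed in `order` are
    moved to the end. Within one record type, lines keep their
    original relative order.
    """
    rank = {tag: i for i, tag in enumerate(order)}
    lines = [ln for ln in text.splitlines() if ln.strip()]
    # list.sort() is stable, so equal-rank lines stay in input order.
    lines.sort(key=lambda ln: rank.get(ln.split("\t", 1)[0], len(order)))
    return "".join(ln + "\n" for ln in lines)
```

With both the committed GFA and the output of `emit` passed through the same function, a byte-for-byte comparison no longer depends on how the input file happened to interleave its lines. The pass is idempotent, so files that are already in the normalized form are left as they are.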