cucapra / pollen

generating hardware accelerators for pangenomic graph queries
MIT License

Normalize GFAs? #26

Closed anshumanmohan closed 1 year ago

anshumanmohan commented 1 year ago

This PR seeks to point out a potential problem and pose a relatively straightforward solution.

It would be nice if we could parse a GFA and then emit it unchanged, but the present emit method fails to do this for a silly reason.

Many of our GFA files follow the format

1. header lines beginning with H
2. segment lines beginning with S
3. path lines beginning with P
4. link lines beginning with L

although they do not strictly need to. Examples are k.gfa, note5.gfa, overlap.gfa, and t.gfa. The emit method in our mygfa.py parser also likes to produce lines of output in some order such as this, and so, after a little tweaking, I can parse and emit these graphs unchanged.

Problems arise when some GFAs, like q.chop.gfa, have a more colorful intermix of these lines.

Indeed, turnt currently fails in pollen/slow_odgi/test/emit on the graphs DRB1-3123.gfa, LPA.gfa, and q.chop.gfa because our emit method prefers a nice normalized form.

One option would be to tweak the parser to keep track of these lines and then emit them in the order in which they were read, but I wonder if we could just add a normalization pass to the GFAs: replace GFAs with equivalent GFAs in the nice format sketched above. I'd like to check if odgi would care and if the rest of the development would be affected.
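To make the normalization idea concrete, here is a minimal sketch of such a pass. It assumes that "normalized" means only grouping lines by type in the order H, S, P, L while keeping the original relative order within each group; whether the emit method in mygfa.py also reorders lines within a group is not stated here. The names `normalize_gfa` and `ORDER` are hypothetical and not part of mygfa.py.

```python
# Hypothetical helper for illustration; not part of mygfa.py.
# Rank of each GFA line type in the normalized layout.
ORDER = {"H": 0, "S": 1, "P": 2, "L": 3}


def normalize_gfa(text: str) -> str:
    """Group GFA lines as H, S, P, L, preserving order within each group."""
    # Drop blank lines so they do not end up in the output.
    lines = [line for line in text.splitlines() if line.strip()]
    # list.sort() is stable, so lines of the same type keep their relative
    # order. Any other line type sorts after the four known ones.
    lines.sort(key=lambda line: ORDER.get(line[0], len(ORDER)))
    return "\n".join(lines) + "\n" if lines else ""
```

Because the sort is stable, running the pass on an already normalized file returns it unchanged, which is the property a byte-for-byte comparison relies on.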

sampsyo commented 1 year ago

Sounds good; we should do something about this indeed, and normalizing the test outputs so they can be byte-for-byte compared sounds like the right thing.

See also the note in https://github.com/cucapra/pollen/pull/25#pullrequestreview-1322849570, however, about a general desire to avoid committing (most) GFA files to this repo.

anshumanmohan commented 1 year ago

I've cleaned up the repo, no longer keeping gfa/og files around for the purposes of testing emit. I use a Turnt environment to serve as the oracle for emit instead of having a bespoke shell script. An interesting thing is that ODGI seems to throw away overlap information, replacing it with a *. I've matched this behavior in mygfa; see here.
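As a rough illustration of the overlap behavior, the sketch below rewrites a single GFA line. The comment above does not say which field ODGI rewrites, so this assumes it is the overlaps column of P lines (the fourth tab-separated field in GFA 1). The function name `drop_path_overlaps` is hypothetical and this is not the actual change made in mygfa.

```python
# Hypothetical helper for illustration; not the code in mygfa.
def drop_path_overlaps(line: str) -> str:
    """Replace the overlaps field of a GFA path (P) line with '*'."""
    fields = line.rstrip("\n").split("\t")
    # Assumption: only P lines are affected, and the overlaps field is
    # the fourth tab-separated column. Other lines pass through as-is.
    if fields[0] == "P" and len(fields) >= 4:
        fields[3] = "*"
    return "\t".join(fields)
```

Lines of any other type, and P lines that lack an overlaps column, are returned without modification.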

The testing works as follows:

Run all of these with make test-emit.