cucapra / pollen

generating hardware accelerators for pangenomic graph queries

Try out a minimal GFA text format parser #20

Closed sampsyo closed 1 year ago

sampsyo commented 1 year ago

Context: We have previously tried these ways of reading GFA files: pygfa, gfapy, going through odgi itself and interacting with its API through its Python FFI wrapper, and (at one point, I think) rolling our own ad hoc parser (my memory is hazy but I think @dc854 did this?). All of these have been kind of unsatisfactory in their own way—the Python libraries have broken in mysterious ways, talking to odgi has the ordinary practical challenges of any Python FFI, and our hand-rolled text parser was pretty algorithm-specific and didn't have a general Python data model.

What this is: I tossed together an extremely simple Python parser for GFA files. The goals are:

This seems to work in the sense that it doesn't crash while parsing the GFA files in our current test suite. Here's a quick test:

for fn in test/*.gfa ; do echo "$fn" ; python3 mygfa.py < "$fn" > /dev/null ; done

Most files go fast; chr8.pan also works but takes ~16 minutes on my laptop.
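
For readers unfamiliar with the format, here is a rough sketch of what a line-oriented GFA parser with a plain Python data model can look like. This is not the mygfa.py from this PR: the names `Segment`, `Link`, `Path`, `Graph`, and `parse_gfa` are hypothetical, and the sketch assumes GFA1 tab-separated records, handling only the H, S, L, and P line types and skipping everything else.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, List, Tuple


@dataclass
class Segment:
    name: str
    seq: str  # nucleotide sequence


@dataclass
class Link:
    from_seg: str
    from_fwd: bool  # True for "+", False for "-"
    to_seg: str
    to_fwd: bool
    overlap: str  # CIGAR string, kept as raw text


@dataclass
class Path:
    name: str
    steps: List[Tuple[str, bool]]  # (segment name, is_forward)


@dataclass
class Graph:
    headers: List[str] = field(default_factory=list)
    segments: Dict[str, Segment] = field(default_factory=dict)
    links: List[Link] = field(default_factory=list)
    paths: Dict[str, Path] = field(default_factory=dict)


def parse_gfa(lines: Iterable[str]) -> Graph:
    """Parse GFA1 text, one record per line (hypothetical helper)."""
    graph = Graph()
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        fields = line.split("\t")
        kind = fields[0]
        if kind == "H":
            graph.headers.append(line)
        elif kind == "S":
            graph.segments[fields[1]] = Segment(fields[1], fields[2])
        elif kind == "L":
            graph.links.append(
                Link(fields[1], fields[2] == "+",
                     fields[3], fields[4] == "+", fields[5])
            )
        elif kind == "P":
            # Each step is a segment name followed by "+" or "-".
            steps = [(s[:-1], s[-1] == "+") for s in fields[2].split(",")]
            graph.paths[fields[1]] = Path(fields[1], steps)
        # Other record types are ignored in this sketch.
    return graph
```

Passing `sys.stdin` to `parse_gfa` would match the way the shell loop above feeds each file in on standard input. Optional tags on records are dropped here for brevity.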

Why you should care: Two reasons:


Perhaps this needs to go into the pollen package; for now it's at the top level just to show you what it looks like.