cucapra / pollen

generating hardware accelerators for pangenomic graph queries
MIT License

A flattened binary format for GFAs #150

Closed sampsyo closed 3 months ago

sampsyo commented 3 months ago

This is something I've been meaning to sketch for a long time, and I finally hacked together a prototype. It is a Rust implementation of a flattened format for representing GFA data.

This MVP contains an in-memory format and some initial evidence that an on-disk binary file format is achievable without too much more work (by heavily relying on the zerocopy crate and a bunch of flat representation trickery). The prototype includes a byte-exact round-tripper for text GFA files (exploiting the rs-gfa crate for parsing and our own hand-rolled pretty-printer). The next steps are to finish off the reading and writing of binary files, and then to try implementing basic algorithms on top of this representation.
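To make "flattened" a bit more concrete, here is a minimal sketch of the general idea, not the code in this PR. Every name in it (`FlatGfa`, `Span`, `Segment`, `Step`, `Path`, and the methods) is hypothetical, and it uses only the standard library rather than zerocopy. The assumption is that all variable-length data lives in a few contiguous pools, and every entry refers to its data by integer index ranges instead of owning separate heap allocations.

```rust
/// A half-open range of indices into one of the flat pools below.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Span {
    pub start: u32,
    pub end: u32,
}

impl Span {
    fn range(self) -> std::ops::Range<usize> {
        self.start as usize..self.end as usize
    }
}

/// One GFA segment (an `S` line): a numeric name plus a span into `seq_data`.
#[derive(Debug, Clone, Copy)]
pub struct Segment {
    pub name: u32,
    pub seq: Span,
}

/// One oriented step of a path: a segment index and a direction.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Step {
    pub segment: u32,
    pub forward: bool,
}

/// One GFA path (a `P` line): spans into `name_data` and `steps`.
#[derive(Debug, Clone, Copy)]
pub struct Path {
    pub name: Span,
    pub steps: Span,
}

/// Hypothetical flat store: a handful of pools, no per-entry allocations.
#[derive(Default)]
pub struct FlatGfa {
    pub seq_data: Vec<u8>,  // all nucleotide sequences, back to back
    pub name_data: Vec<u8>, // all path names, back to back
    pub segments: Vec<Segment>,
    pub steps: Vec<Step>,   // the steps of all paths, back to back
    pub paths: Vec<Path>,
}

impl FlatGfa {
    /// Append a segment and return its index in `segments`.
    pub fn add_segment(&mut self, name: u32, seq: &[u8]) -> u32 {
        let start = self.seq_data.len() as u32;
        self.seq_data.extend_from_slice(seq);
        let end = self.seq_data.len() as u32;
        self.segments.push(Segment { name, seq: Span { start, end } });
        (self.segments.len() - 1) as u32
    }

    /// Append a path and return its index in `paths`.
    pub fn add_path(&mut self, name: &str, steps: &[Step]) -> u32 {
        let name_start = self.name_data.len() as u32;
        self.name_data.extend_from_slice(name.as_bytes());
        let name_end = self.name_data.len() as u32;
        let steps_start = self.steps.len() as u32;
        self.steps.extend_from_slice(steps);
        let steps_end = self.steps.len() as u32;
        self.paths.push(Path {
            name: Span { start: name_start, end: name_end },
            steps: Span { start: steps_start, end: steps_end },
        });
        (self.paths.len() - 1) as u32
    }

    /// Borrow a segment's sequence straight out of the shared pool.
    pub fn seq(&self, segment: u32) -> &[u8] {
        &self.seq_data[self.segments[segment as usize].seq.range()]
    }

    /// Borrow a path's name (valid UTF-8 because it was added from a `&str`).
    pub fn path_name(&self, path: u32) -> &str {
        let bytes = &self.name_data[self.paths[path as usize].name.range()];
        std::str::from_utf8(bytes).expect("names are added as &str")
    }

    /// Spell out a path's sequence, reverse-complementing backward steps.
    pub fn path_sequence(&self, path: u32) -> Vec<u8> {
        let mut out = Vec::new();
        for step in &self.steps[self.paths[path as usize].steps.range()] {
            let seq = self.seq(step.segment);
            if step.forward {
                out.extend_from_slice(seq);
            } else {
                out.extend(seq.iter().rev().map(|&b| match b {
                    b'A' => b'T',
                    b'C' => b'G',
                    b'G' => b'C',
                    b'T' => b'A',
                    other => other, // leave N and friends untouched
                }));
            }
        }
        out
    }
}
```

Because entries in a layout like this hold only integers and never pointers, each pool is in principle just a run of bytes, which is the property that would let an on-disk file mirror the in-memory form. Actually reinterpreting file bytes as these structs needs care around layout and alignment, which this sketch does not attempt.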

I will have more to say/write about this elsewhere when I get time, but through building this proof of concept, I am now hopeful that there is a path forward here to implement an interesting, flexible zero-copy binary format. And I suspect it will be fast.