cucapra / pollen

generating hardware accelerators for pangenomic graph queries
MIT License

polbin: Actually dump FlatGFA binary files #152

Closed (sampsyo closed 3 months ago)

sampsyo commented 3 months ago

The polbin binary can now both read and write two formats: text GFA files and "FlatGFA" binary files. That makes these four commands possible:

$ polbin < something.gfa  # round trip through in-memory FlatGFA, print GFA to stdout
$ polbin -o cool.flatgfa < something.gfa  # convert GFA to FlatGFA
$ polbin -i cool.flatgfa  # print a FlatGFA file out as plain ol' GFA, to stdout
$ polbin -i cool.flatgfa -o ice_cold.flatgfa  # glorified `cp`, no reason to do this
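For anyone skimming, here is a rough model of how the two flags combine into those four modes. This is only an illustrative Python sketch, not the actual polbin code; the `describe` helper and the `polbin-sketch` program name are made up for this example, and the only behavior it assumes is what the commands above show (`-i` switches the input to a FlatGFA file, `-o` switches the output to one, and the defaults are text GFA on stdin/stdout).

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical stand-in for the polbin command line: just -i and -o.
    parser = argparse.ArgumentParser(prog="polbin-sketch")
    parser.add_argument("-i", dest="input", metavar="FLATGFA",
                        help="read a FlatGFA binary file instead of text GFA on stdin")
    parser.add_argument("-o", dest="output", metavar="FLATGFA",
                        help="write a FlatGFA binary file instead of text GFA on stdout")
    return parser


def describe(argv: list[str]) -> tuple[str, str]:
    """Return (source, sink) descriptions for a given argument list."""
    args = build_parser().parse_args(argv)
    source = f"FlatGFA file {args.input}" if args.input else "text GFA on stdin"
    sink = f"FlatGFA file {args.output}" if args.output else "text GFA on stdout"
    return source, sink


if __name__ == "__main__":
    # Print the four combinations listed in the description.
    for argv in ([], ["-o", "cool.flatgfa"], ["-i", "cool.flatgfa"],
                 ["-i", "cool.flatgfa", "-o", "ice_cold.flatgfa"]):
        print(argv, "->", describe(argv))
```

The point is just that input and output format are chosen independently, which is why the fourth combination exists even though it is not useful.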

I also added test environments to check both kinds of round-tripping (through in-memory FlatGFA and through an on-disk file). It all works!!!!!

$ turnt -j -e polbin_mem -e polbin_file *.gfa
1..16
ok 1 - DRB1-3123.gfa polbin_mem
ok 2 - DRB1-3123.gfa polbin_file
ok 3 - LPA.gfa polbin_mem
ok 4 - LPA.gfa polbin_file
ok 5 - chr6.C4.gfa polbin_mem
ok 6 - chr6.C4.gfa polbin_file
ok 7 - k.gfa polbin_mem
ok 8 - k.gfa polbin_file
ok 9 - note5.gfa polbin_mem
ok 10 - note5.gfa polbin_file
ok 11 - overlap.gfa polbin_mem
ok 12 - overlap.gfa polbin_file
ok 13 - q.chop.gfa polbin_mem
ok 14 - q.chop.gfa polbin_file
ok 15 - t.gfa polbin_mem
ok 16 - t.gfa polbin_file
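
As a sketch of what the two kinds of round trip amount to, the checks could be written by hand roughly like this. This is not how the turnt environments are actually configured; the function names and the byte-for-byte comparison are assumptions for illustration, and the only polbin invocations used are the ones shown in the commands above.

```python
import subprocess
import tempfile
from pathlib import Path


def roundtrip_mem(gfa: Path, polbin=("polbin",)) -> bool:
    """GFA -> in-memory FlatGFA -> GFA, i.e. `polbin < file.gfa`."""
    original = gfa.read_bytes()
    result = subprocess.run(list(polbin), input=original,
                            stdout=subprocess.PIPE, check=True)
    return result.stdout == original


def roundtrip_file(gfa: Path, polbin=("polbin",)) -> bool:
    """GFA -> on-disk FlatGFA -> GFA, via `-o` and then `-i`."""
    original = gfa.read_bytes()
    with tempfile.TemporaryDirectory() as tmp:
        flat = Path(tmp) / "out.flatgfa"
        # Step 1: convert the text GFA on stdin to a FlatGFA file.
        subprocess.run([*polbin, "-o", str(flat)], input=original, check=True)
        # Step 2: print the FlatGFA file back out as text GFA.
        result = subprocess.run([*polbin, "-i", str(flat)],
                                stdout=subprocess.PIPE, check=True)
    return result.stdout == original
```

The `polbin` parameter is a command prefix so the same checks can be pointed at a different build of the binary.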

Conversion seems decently fast on these small examples. For our go-to big example, chr8.pan.gfa (4.2 GB), one conversion run on my rapidly aging Intel iMac took 1m8s for parsing (GFA -> FlatGFA) and 1m44s for pretty-printing (FlatGFA -> GFA). That seems within the ballpark of reasonableness. The GFA also seems to have round-tripped successfully. (FWIW, just running diff to check took 22s.)
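
To put those timings in throughput terms, here is a back-of-the-envelope calculation. It is only a sketch: it assumes 4.2 GB means 4200 MB (decimal units) and divides the text GFA size by the wall-clock time of a single run, so the results are rough. The helper names are made up for this example.

```python
import re


def to_seconds(duration: str) -> int:
    """Parse durations written like '1m8s' or '22s' into seconds."""
    match = re.fullmatch(r"(?:(\d+)m)?(\d+)s", duration)
    if match is None:
        raise ValueError(f"unrecognized duration: {duration!r}")
    minutes, seconds = match.groups()
    return int(minutes or 0) * 60 + int(seconds)


def throughput_mb_per_s(size_mb: float, duration: str) -> float:
    """Rough throughput in MB/s, rounded to one decimal place."""
    return round(size_mb / to_seconds(duration), 1)


GFA_SIZE_MB = 4200  # assumption: 4.2 GB in decimal megabytes

if __name__ == "__main__":
    print("parse (GFA -> FlatGFA):", throughput_mb_per_s(GFA_SIZE_MB, "1m8s"), "MB/s")   # 61.8
    print("print (FlatGFA -> GFA):", throughput_mb_per_s(GFA_SIZE_MB, "1m44s"), "MB/s")  # 40.4
```

So parsing runs at roughly 62 MB/s and pretty-printing at roughly 40 MB/s on this machine, under those assumptions.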