edawson / gfakluge

A C++ library and utilities for manipulating the Graphical Fragment Assembly format.
http://edawson.github.io/gfakluge/
MIT License

GFA file of only gap records segfaults #30

Open sjackman opened 6 years ago

sjackman commented 6 years ago

A GFA file ought to include both segment and gap records, but it'd be preferable if gfakluge didn't segfault when it encounters a file that contains only gap records, like this one:

H	VN:Z:2.0
G	*	6+	50+	121	58	FC:i:1
G	*	6+	225+	-57	58	FC:i:1
G	*	6+	298-	-83	8	FC:i:55
G	*	6-	62-	-80	9	FC:i:47
G	*	6-	171-	-67	41	FC:i:2

❯❯❯ gfak stats -A gaps.gfa
[1]    98433 segmentation fault

edawson commented 6 years ago

Yikes. I assume we'd prefer an error (e.g. "Segment not found for gap ")?
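As a rough illustration of what "an error instead of a segfault" could look like, here is a minimal C++ sketch of a checked segment lookup. The `Segment` and `Gap` structs and the `find_segment` and `check_gap` helpers are hypothetical names invented for this example; they are not gfakluge's actual types or API, and the exact message format is an assumption.

```cpp
#include <map>
#include <stdexcept>
#include <string>

// Hypothetical, simplified records; not gfakluge's actual types.
struct Segment {
    std::string name;
    std::string sequence;
};

struct Gap {
    std::string id;
    std::string source;  // segment name without the orientation sign
    std::string sink;    // segment name without the orientation sign
};

// Look up a segment by name, throwing instead of dereferencing a missing entry.
const Segment& find_segment(const std::map<std::string, Segment>& segments,
                            const std::string& name,
                            const std::string& gap_id) {
    auto it = segments.find(name);
    if (it == segments.end()) {
        throw std::runtime_error("Segment not found for gap " + gap_id +
                                 ": " + name);
    }
    return it->second;
}

// Validate both endpoints of a gap record against the known segments.
void check_gap(const std::map<std::string, Segment>& segments, const Gap& gap) {
    find_segment(segments, gap.source, gap.id);
    find_segment(segments, gap.sink, gap.id);
}
```

A command-line tool could catch the exception at the top level, print the message, and exit with a non-zero status.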

Just to verify I understand correctly: this is not valid GFA, and we should never get GFA that has the records spread across multiple files like this, right?

sjackman commented 6 years ago

Short answer, yes. It's not valid GFA.

Long answer. ABySS produces a GFA file of the segment records and edge records. For large genomes this file can be quite large. In a second step, ABySS then uses the paired-end and mate-pair reads to estimate the distances between segments and outputs the gap records. Rather than make a copy of the potentially large S+E records, it outputs only the gap records. ABySS can handle reading a GFA file spread across multiple files for this reason. It'd be useful to me if Gfakluge could also read these split files. Your call of course whether you want to support that or not. It's easy enough to use either awk or abyss-todot (a misnomer now since it handles more than just GraphViz files) to combine these two GFA files into a single file for Gfakluge.

edawson commented 6 years ago

Interesting. How big are these two files?

I have been thinking about restructuring the command line tools to not build the GFAKluge object when the graph isn't being modified. When I get around to this I'll add support for breaking the graph into multiple files (with a stern warning, of course).

Concatenating (`cat`-ing) the gaps file onto the seqs/edges file sounds like it might work as-is, unless I missed something.

edawson commented 6 years ago

I guess I should mention: tools that don't modify the graph are:

These tools would support ABySS's split-file format, with a warning. The rest of the tools should support the complete ((S + E) + (G)) file, even if it is very large, and should handle its records in any order. I didn't intend to enforce a record order on GFA files in GFAKluge, but it seems I've done so by accident for gap records (and probably edges as well).
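One way to avoid depending on record order is to read every record first and resolve gap endpoints only afterwards. The following C++ sketch shows that idea under simplifying assumptions: `GapRef`, `strip_orientation`, and `unresolved_gap_segments` are hypothetical names, fields are split on any whitespace, and only `S` and `G` records are inspected. It is not how GFAKluge is implemented.

```cpp
#include <istream>
#include <set>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical minimal gap record: only the two endpoint fields are kept.
struct GapRef {
    std::string source;  // e.g. "6+"
    std::string sink;    // e.g. "50+"
};

// Strip the trailing orientation sign ('+' or '-') from a GFA2 reference.
std::string strip_orientation(const std::string& ref) {
    if (!ref.empty() && (ref.back() == '+' || ref.back() == '-')) {
        return ref.substr(0, ref.size() - 1);
    }
    return ref;
}

// Read all records, then check gaps against segments, so the result does not
// depend on whether S lines come before or after G lines.
// Returns the segment names that gaps refer to but that were never defined.
std::set<std::string> unresolved_gap_segments(std::istream& in) {
    std::set<std::string> segments;
    std::vector<GapRef> gaps;
    std::string line;
    while (std::getline(in, line)) {
        std::istringstream fields(line);
        std::string type, a, b, c;
        fields >> type;
        if (type == "S" && (fields >> a)) {
            segments.insert(a);      // a is the segment ID
        } else if (type == "G" && (fields >> a >> b >> c)) {
            gaps.push_back({b, c});  // a is the gap ID, b and c the endpoints
        }
    }

    // Second phase: every segment is known by now, so order no longer matters.
    std::set<std::string> missing;
    for (const GapRef& gap : gaps) {
        const std::string source = strip_orientation(gap.source);
        const std::string sink = strip_orientation(gap.sink);
        if (segments.count(source) == 0) {
            missing.insert(source);
        }
        if (segments.count(sink) == 0) {
            missing.insert(sink);
        }
    }
    return missing;
}
```

With a gaps-only file, the returned set lists every referenced segment. A tool could turn that into a warning for the split-file case or an error otherwise.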

sjackman commented 6 years ago

> Interesting. How big are these two files?

For a human genome:
FASTA: 2.9 GB
S+E with * for sequences: 137 MB
G: 10 MB

Thanks again, Eric!