Open sjackman opened 6 years ago
Yikes. I assume we'd prefer an error (e.g. "Segment not found for gap
Just to verify I understand correctly: this is not valid GFA, and we should never get GFA that has the records spread across multiple files like this, right?
Short answer, yes. It's not valid GFA.
Long answer. ABySS produces a GFA file of the segment records and edge records. For large genomes this file can be quite large. In a second step, ABySS then uses the paired-end and mate-pair reads to estimate the distances between segments and outputs the gap records. Rather than make a copy of the potentially large S+E records, it outputs only the gap records. ABySS can handle reading a GFA file spread across multiple files for this reason. It'd be useful to me if Gfakluge could also read these split files. Your call of course whether you want to support that or not. It's easy enough to use either awk or abyss-todot (a misnomer now, since it handles more than just GraphViz files) to combine these two GFA files into a single file for Gfakluge.
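The combining step described above can be sketched in Python. This is only a rough illustration, not ABySS or Gfakluge code: `merge_gfa` and `RECORD_ORDER` are hypothetical names, and writing gap records last is an assumption, chosen so that every gap follows the segments it refers to.

```python
from typing import Iterable, List

# Assumed output order (not mandated by GFA): header, segments, edges,
# any other record type, and gaps last.
RECORD_ORDER = {"H": 0, "S": 1, "E": 2, "G": 4}


def record_type(line: str) -> str:
    """Return the first tab-separated field, which is the GFA record type."""
    return line.split("\t", 1)[0]


def merge_gfa(*sources: Iterable[str]) -> List[str]:
    """Merge GFA records from several sources into one ordered list."""
    records = []
    seen_headers = set()
    for source in sources:
        for line in source:
            line = line.rstrip("\n")
            if not line:
                continue  # skip blank lines
            if record_type(line) == "H":
                if line in seen_headers:
                    continue  # keep only one copy of an identical header
                seen_headers.add(line)
            records.append(line)
    # sorted() is stable, so records of one type keep their input order.
    return sorted(records, key=lambda r: RECORD_ORDER.get(record_type(r), 3))
```

Each source can be an open file handle or a list of lines, for example `merge_gfa(open("seqs.gfa"), open("gaps.gfa"))`, where both file names are placeholders.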
Interesting. How big are these two files?
I have been thinking about restructuring the command line tools to not build the GFAKluge object when the graph isn't being modified. When I get around to this I'll add support for breaking the graph into multiple files (with a stern warning, of course).
cat-ing the gaps file onto the seqs/edges file sounds like it might work as-is, unless I missed something.
I guess I should mention: tools that don't modify the graph are:
These tools would support ABySS's split-file format, with a warning. The rest of the tools should support the complete ( (S + E) + (G) ) file, even if it is very large, and should be able to handle it regardless of record order. I didn't intend to enforce a record order on GFA files in GFAKluge, but it seems I've done so by accident for gap records (and probably edges as well).
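One way to read records regardless of order, and to report a missing segment rather than crash, is to collect every segment name first and resolve gap references only afterwards. The sketch below is in Python rather than GFAKluge's C++ and is not its actual implementation. It assumes GFA2-style lines (`S <sid> <slen> <seq>` and `G <gid> <sid1> <sid2> <dist> <var>`, tab-separated, with a trailing `+` or `-` orientation on each reference); `check_gap_references` and its message text are hypothetical.

```python
from typing import Iterable, List, Set


def strip_orientation(ref: str) -> str:
    """Drop a trailing '+' or '-' orientation sign from a segment reference."""
    return ref[:-1] if ref.endswith(("+", "-")) else ref


def check_gap_references(lines: Iterable[str]) -> List[str]:
    """Return one message per gap reference whose segment is never defined."""
    segments: Set[str] = set()
    gaps = []
    # First pass: remember segment names and set the gap records aside.
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if fields[0] == "S" and len(fields) > 1:
            segments.add(fields[1])
        elif fields[0] == "G" and len(fields) > 3:
            gaps.append(fields)
    # Second pass: every segment is known now, whatever the input order was.
    problems = []
    for fields in gaps:
        for ref in fields[2:4]:
            name = strip_orientation(ref)
            if name not in segments:
                problems.append(f"gap {fields[1]}: segment {name} not found")
    return problems
```

An empty list means every gap resolved. A gaps-only file yields one message per unresolved reference, which a tool could print as a warning or turn into a hard error.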
Interesting. How big are these two files?
For a human genome:
FASTA: 2.9 GB
S+E with * for sequences: 137 MB
G: 10 MB
Thanks again, Eric!
A GFA file ought to include both segments and gap records. It'd be preferable if gfakluge didn't segfault when encountering such a file.