broadinstitute / pyfrost

Python bindings for Bifrost with a NetworkX compatible API
BSD 3-Clause "New" or "Revised" License

Unitig orientations #4

aysunrhn closed this issue 4 years ago

aysunrhn commented 4 years ago

Hi, I have a question about the unitig sequences and their orientations stored in the pyfrost graph.

I have noticed that the unitig sequences in the pyfrost graph differ from those in the GFA output file. For instance, after loading a graph in pyfrost, I see a node with the sequence GTTATCTTTTCAGTTAATG on the reverse strand:

>>> g.nodes['GTTATCTTTTCAGTTAATG']['unitig_sequence']
'GTTATCTTTTCAGTTAATG'
>>> g.nodes['GTTATCTTTTCAGTTAATG']['strand']
Strand.REVERSE

However, the GFA file has no segment with this unitig sequence; only its reverse complement exists:

S   270118  CATTAACTGAAAAGATAAC DA:Z:1

So I think the GFA output contains only the forward strands, is that correct?

lrvdijk commented 4 years ago

That's correct. The links in the GFA file refer to a segment and an orientation, so you only have to store the segment in one orientation.
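The relationship between the two sequences quoted above can be checked with a few lines of plain Python. This is only a minimal sketch: it does not call pyfrost, and the helper name `reverse_complement` is an assumption made for illustration, not part of the pyfrost API.

```python
# Translation table mapping each DNA base to its complement (upper case only).
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an upper-case DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


# Sequence of the pyfrost node reported as Strand.REVERSE in the question.
node_seq = "GTTATCTTTTCAGTTAATG"
# Sequence of segment 270118 in the GFA file.
gfa_seq = "CATTAACTGAAAAGATAAC"

# The GFA segment is the same unitig, stored in the opposite orientation.
print(reverse_complement(node_seq) == gfa_seq)  # True
```

So the node seen in pyfrost and the segment in the GFA file describe the same unitig; the GFA file just stores it once and leaves the orientation to the link records.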

In the NetworkX API, the two orientations act as different nodes (because they are: one orientation has different in- and out-edges than the other). This ensures that the built-in NetworkX graph traversals work correctly, that you're not traversing the graph in a way that's inconsistent with strandedness, and that an edge actually represents a k-1 overlap between two nodes.
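To illustrate why the two orientations have different edges, here is a minimal sketch in plain Python. It does not use pyfrost; the helper names (`reverse_complement`, `has_overlap_edge`), the toy sequences, and the value k = 4 are all assumptions chosen for illustration.

```python
# Translation table mapping each DNA base to its complement (upper case only).
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an upper-case DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def has_overlap_edge(u: str, v: str, k: int) -> bool:
    """True if the last k-1 bases of u equal the first k-1 bases of v."""
    return u[-(k - 1):] == v[:k - 1]


k = 4            # hypothetical k-mer size, so edges are 3-base overlaps
u = "ACGTA"      # toy unitig, forward orientation
v = "GTACC"      # toy unitig, forward orientation

# Forward u ends in "GTA" and forward v starts with "GTA": an edge u -> v.
print(has_overlap_edge(u, v, k))                      # True

# The reverse orientation of u ends in "CGT", so it has no edge to v.
print(has_overlap_edge(reverse_complement(u), v, k))  # False

# The same overlap seen from the other strand: rc(v) -> rc(u).
print(has_overlap_edge(reverse_complement(v), reverse_complement(u), k))  # True
```

Treating each orientation as its own node means that a traversal can only follow edges that are valid for the strand it is currently on.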

aysunrhn commented 4 years ago

Ah ok I see. Thank you for the explanation! That clears up some other questions I had as well.