GATB / bcalm

compacted de Bruijn graph construction in low memory
MIT License

Change output representation #42

Closed pashadag closed 5 years ago

pashadag commented 5 years ago

In some cases, the output edge representation of Bcalm2 is hard to work with. I will describe another natural representation below, and this "issue" is a feature request to add a flag to support this type of output.

Per this description here, there are four types of connections between a node x and y. Each connection corresponds to two directed edges, with a +/- sign at each node. Instead, let's have a representation where the connection type corresponds to an undirected edge with a 0/1 bit at each vertex. So:

Type 1: (u,v) with the bit at u = 1 and the bit at v = 0
Type 2: (u,v) with the bit at u = 0 and the bit at v = 1
Type 3: (u,v) with the bit at u = 1 and the bit at v = 1
Type 4: (u,v) with the bit at u = 0 and the bit at v = 0

This representation is consistent with the bidirected graph representation introduced by Kececioglu & Myers.

In terms of the fasta file header in bcalm output, that means converting L:<e.fromSign>:<e.to>:<e.toSign> in the header to J:<e.fromBit>:<e.to>:<e.toBit>, where the mapping is

e.fromSign  e.toSign  e.fromBit  e.toBit
+           +         1          0
+           -         1          1
-           +         0          0
-           -         0          1

(I suggest using J instead of L as the tag, to avoid confusing downstream parsers. The J is arbitrary; it could be something else.)
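To make the proposed conversion concrete, here is a minimal Python sketch that rewrites L fields into J fields using the sign-to-bit table above. It is only an illustration and not the eventual bcalm implementation: the function name `l_to_j` and the sample header are hypothetical, and the sketch assumes link fields are whitespace-separated tokens with an integer node id in the `<e.to>` position.

```python
import re

# (fromSign, toSign) -> (fromBit, toBit), copied from the mapping table in this comment.
SIGN_TO_BITS = {
    ("+", "+"): (1, 0),
    ("+", "-"): (1, 1),
    ("-", "+"): (0, 0),
    ("-", "-"): (0, 1),
}

# One link field: L:<fromSign>:<to>:<toSign>, starting at a token boundary
# so that other tags beginning with "L" (such as LN:i:31) are left alone.
LINK_FIELD = re.compile(r"(?<!\S)L:([+-]):(\d+):([+-])")


def l_to_j(header: str) -> str:
    """Rewrite every L field of one header line as the matching J field (hypothetical helper)."""
    def rewrite(match: re.Match) -> str:
        from_bit, to_bit = SIGN_TO_BITS[(match.group(1), match.group(3))]
        return f"J:{from_bit}:{match.group(2)}:{to_bit}"

    return LINK_FIELD.sub(rewrite, header)


# Illustrative header only; real headers may carry additional tags.
print(l_to_j(">0 LN:i:31 L:+:1:- L:-:2:+"))
```

Fields that are not L links pass through unchanged, so the sketch can be applied to each header line of a FASTA file while sequence lines are copied as-is.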

pashadag commented 5 years ago

Amatur (from my lab) was going to work on this

rchikhi commented 5 years ago

Hi Paul, sure, I have no objection to this feature request. Would you like me to code it, or are you going to ask Amatur to do it? Either way works.

pashadag commented 5 years ago

Amatur will work on it -- I didn't find a way to assign it formally on GitHub

pashadag commented 5 years ago

Rayan, Amatur will soon make a pull request with the changes -- we weren't sure how to proceed, so just let us know if the changes look OK. We can then update the README with the KM (Kececioglu-Myers) representation option. We weren't sure about the flag name either, so feel free to change it.

amatur commented 5 years ago

@rchikhi, to add this feature, I only had to make changes in the gatb-core submodule (https://github.com/GATB/gatb-core), so I made a pull request there. Please check whether this is how you planned to add this feature. To get the KM (Kececioglu-Myers) edge representation in the output, we need to run bcalm with the "-edge-km 1" option.
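For anyone consuming the new output, a rough Python sketch of reading the KM edges might look like the following. It assumes the headers carry fields in the J:<e.fromBit>:<e.to>:<e.toBit> form proposed earlier in this thread and that the first token after ">" is the integer unitig id; the function name `undirected_edges` and the sample headers are made up for illustration. It also relies on the usual property that a connection is listed once from each endpoint with both signs flipped, which under the mapping table gives the same bit at each vertex, so both listings collapse into one undirected edge.

```python
import re

# One KM link field: J:<fromBit>:<to>:<toBit>, starting at a token boundary.
J_FIELD = re.compile(r"(?<!\S)J:([01]):(\d+):([01])")


def undirected_edges(headers):
    """Collect J fields from header lines into a set of undirected edges.

    Each edge is stored as a sorted pair ((node, bit), (node, bit)), so the two
    directed listings of one connection map to the same set element.
    """
    edges = set()
    for header in headers:
        # Assumption: the unitig id is the first token after '>'.
        u = int(header.lstrip(">").split()[0])
        for match in J_FIELD.finditer(header):
            end_u = (u, int(match.group(1)))
            end_v = (int(match.group(2)), int(match.group(3)))
            edges.add(tuple(sorted((end_u, end_v))))
    return edges


# Illustrative headers: unitig 0 lists the edge to 1, and unitig 1 lists it back.
sample = [">0 LN:i:31 J:1:1:0", ">1 LN:i:31 J:0:0:1"]
print(undirected_edges(sample))
```

Storing each edge as a sorted pair of (node, bit) endpoints is just one convenient choice; any canonical ordering of the two endpoints would work equally well.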

rchikhi commented 5 years ago

Hi Amatur, thanks, the changes look good, and good job navigating gatb-core to find where to make the patch. I'll merge that pull request once I'm done fixing a different bug in bcalm. Hopefully you can for now work with a local copy. Rayan