bioinf / Sibelia

Genome comparison via de Bruijn graph. To get the latest stable version, please visit our site.
http://bioinf.spbau.ru/sibelia

About the manual: resulting de Bruijn graph and custom set explanation #183

Open TsorngWeiWu opened 6 years ago

TsorngWeiWu commented 6 years ago

(1)

I have read your manual and found that the format of "de_bruijn_graph.dot" should be explained in more detail.

Suppose I give a simple input:

A.fasta : ATC
B.fasta : AT

and a custom parameter set as follows:

1 2 2

After executing the command Sibelia -k paraset -m 2 A.fasta B.fasta, the content of "de_bruijn_graph.dot" is:

[image: content of de_bruijn_graph.dot]

I know what the colors mean, but what about the rest, for example:

0->2? 1->0? The content in curly brackets?

(2) This example in the manual:

1st pair: ... K1 ABCD K2 ...
2nd pair: ... K1 FGHE K2 ...

If the distance between K1 and K2 within each pair is less than D, then "Sibelia" replaces FGHE with ABCD to obtain longer "synteny block":

1st pair: ... K1 ABCD K2 ...
2nd pair: ... K1 ABCD K2 ...

More concrete example. Suppose that K = 3, D = 5 and somewhere in the genome we find:

1st pair: ... act gaga ggc ...
2nd pair: ... act gatg ggc ...

As we see, distance between "act" and "ggc" is less than 5 nucleotides so we replace "gatg" by "gaga":

1st pair: ... act gaga ggc ...
2nd pair: ... act gaga ggc ...

My question is: what happens if there is a third pair, ... act gattag ggc ...?

iminkin commented 6 years ago

Hi,

> If I give a simple format : A.fasta : ATC B.fasta : AT and custom parameter set as the following

The sequences in the input files should be at least $k + 1$ long. My bad that Sibelia does not check for this; I will add the check later. In your example, the second file contains a string of length 2.
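Until such a check exists in Sibelia itself, the length requirement can be verified by hand before running the tool. The following is only a minimal sketch: `read_fasta` and `too_short` are hypothetical helper names, not part of Sibelia, and the value k = 2 in the usage lines is an assumption made for illustration.

```python
def read_fasta(text):
    """Parse FASTA-formatted text into a dict {header: sequence}."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record starts; the header is everything after '>'.
            header = line[1:]
            records[header] = ""
        elif header is not None:
            # Sequence lines may be wrapped, so concatenate them.
            records[header] += line
    return records


def too_short(records, k):
    """Return the headers of all sequences shorter than k + 1."""
    return [name for name, seq in records.items() if len(seq) < k + 1]


# Usage with the two inputs from this thread, assuming k = 2.
a = read_fasta(">A\nATC\n")
b = read_fasta(">B\nAT\n")
print(too_short(a, 2))  # ATC has length 3, which satisfies k + 1 = 3
print(too_short(b, 2))  # AT has length 2, which is too short
```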

> I know the color meaning but how come the other like content in curly brackets??

The language used in the output is DOT. There are online manuals describing it, for example: https://en.wikipedia.org/wiki/DOT_(graph_description_language)
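In DOT, the curly brackets enclose the body of a graph, a statement such as `0 -> 2` is a directed edge from vertex 0 to vertex 2, and square brackets after an edge hold its attributes (for example a color). The sketch below reads such edge lines with a regular expression. The `SAMPLE_DOT` text is a made-up example, not actual Sibelia output, and `parse_edges` is a hypothetical helper that handles only the simple one-edge-per-line form.

```python
import re

# Hypothetical DOT text for illustration only; not real Sibelia output.
SAMPLE_DOT = """
digraph G {
    0 -> 2 [color=blue];
    1 -> 0 [color=red];
}
"""

# Matches lines of the form: tail -> head [attributes];
EDGE = re.compile(r'^\s*(\w+)\s*->\s*(\w+)\s*(?:\[(.*?)\])?\s*;?\s*$')


def parse_edges(dot_text):
    """Return a list of (tail, head, attributes) for each edge line."""
    edges = []
    for line in dot_text.splitlines():
        match = EDGE.match(line)
        if match:
            tail, head, attrs = match.groups()
            edges.append((tail, head, attrs or ""))
    return edges


for tail, head, attrs in parse_edges(SAMPLE_DOT):
    print(f"edge from {tail} to {head}, attributes: {attrs}")
```

For anything beyond a quick look, a real DOT parser or the Graphviz tools would be a safer choice than a regular expression.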

> My question is how come if there is another 3rd pair ... act gattag ggc ... ??

Then the bubbles will be simplified one by one. Suppose that in your example D is 7:

Initially:

act gaga ggc
act gatg ggc
act gattag ggc

First step:

act gaga ggc
act gaga ggc
act gattag ggc

Second step:

act gaga ggc
act gaga ggc
act gaga ggc

The choice of sequence to fill the branches of a bubble is arbitrary, but in this case it will result in the same synteny blocks no matter which branch is chosen.
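The step-by-step replacement above can be mimicked with a toy model on plain strings. This is only a sketch and not how Sibelia is implemented (Sibelia works on the de Bruijn graph). It assumes that the "distance" is the length of the branch between the two flanking k-mers, and it always takes the first branch as the replacement, which is one arbitrary choice. The name `simplify_bubbles` is hypothetical.

```python
def simplify_bubbles(occurrences, d):
    """Replace branches one by one and record every intermediate state.

    occurrences: list of (left_kmer, branch, right_kmer) tuples.
    d: maximum bubble size; only branches shorter than d are replaced.
    Returns the list of states, starting with the initial one.
    """
    current = list(occurrences)
    steps = [list(current)]
    left, reference, right = current[0]
    if len(reference) >= d:
        # The reference branch itself is too long to form a small bubble.
        return steps
    for i in range(1, len(current)):
        l, branch, r = current[i]
        same_flanks = (l, r) == (left, right)
        if same_flanks and branch != reference and len(branch) < d:
            current[i] = (l, reference, r)
            steps.append(list(current))
    return steps


pairs = [("act", "gaga", "ggc"), ("act", "gatg", "ggc"), ("act", "gattag", "ggc")]
for number, state in enumerate(simplify_bubbles(pairs, 7)):
    print("step", number)
    for l, branch, r in state:
        print(" ", l, branch, r)
```

Under these assumptions, running the same input with D = 5 leaves "gattag" untouched, because that branch has 6 nucleotides.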

iminkin commented 6 years ago

I will think about how to improve the manual, thanks for your suggestions.