broadinstitute / pyfrost

Python bindings for Bifrost with a NetworkX compatible API
BSD 3-Clause "New" or "Revised" License
27 stars 1 forks source link

unitig colors #7

Closed yezhenqing closed 2 years ago

yezhenqing commented 2 years ago

Greetings, I have a question about colours on unitig.

Here, I would like to use the toy case from the link (https://kamimrcht.github.io/webpage/tigs.html) to illustrate the issues as below:

image

Firstly, I created three fasta files:

blue.fa 
>blue1
ATAA

red.fa
>red1
ATAACAATTG

green.fa
>green1
ATAACAATTG
>green2
AACACC

Then, I generated the GFA file and the color file via the command:

   Bifrost build -k 4 -c -r blue.fa -r red.fa -r green.fa -o mytest_original

I didn't put other parameters like (-i -d) to avoid graph simplification.

By loading the GFA and color files into the python environment:

import pyfrost
g = pyfrost.load('mytest_original.gfa')

To my understanding, we have three colors here:

>>> g.graph
{'color_names': ['blue.fa', 'red.fa', 'green.fa'], 'k': 4, 'g': 2}

And we also can see there are six nodes in the graph g:

>>> for n, data in g.nodes(data=True):
...     print(n, data['strand'])
... 
ATAA Strand.FORWARD
TGTT Strand.REVERSE
ACAA Strand.FORWARD
CAAT Strand.REVERSE
ACAC Strand.FORWARD
GGTG Strand.REVERSE

Just ignore those nodes from Strand.REVERSE now, we should have three unitigs. And I tried to get the colors for each unitig:

>>> for n, data in g.nodes(data=True):
    strand = str(data['strand'])
    if(strand == 'Strand.FORWARD'):
        print('=======================')
        for k,v in data.items():
            if(k=='colors'):
                print(k, set(data['colors']))
            else:
                print(k, v)
... 
=======================
colors {0, 1, 2}
is_full_mapping True
unitig_length 3
mapped_sequence ATAACA
tail AACA
head ATAA
strand Strand.FORWARD
pos 0
unitig_sequence ATAACA
length 3
=======================
colors {1}
is_full_mapping True
unitig_length 4
mapped_sequence ACAATTG
tail ATTG
head ACAA
strand Strand.FORWARD
pos 0
unitig_sequence ACAATTG
length 4
=======================
colors {2, 3, 4, 5}
is_full_mapping True
unitig_length 2
mapped_sequence ACACC
tail CACC
head ACAC
strand Strand.FORWARD
pos 0
unitig_sequence ACACC
length 2  

What I'm quite confused about is those colors assigned to these unitigs. For example, I would expect the unitig with head-ATAA should have two colors (green and red) only. And furthermore, we only have three colors in total, why the unitig with head-ACAC has four colors {2,3,4,5}.

Maybe I misunderstood something about colors. Can you kindly explain a little bit how these colors been calculated?

Bifrost 1.0.6 pyfrost 0.1.0b1

Thank you for you work on this and any input would be appreciated.

yezhenqing commented 2 years ago

Sorry. need to correct something about my previous post.

To my understanding, the unitig with head-ATAA should have three colors, and the unitig with head-ACAA should have two colors, and the unitig with head-ACAC should have one color only...

yezhenqing commented 2 years ago

I noticed the Bifrost 1.0.6 and pyfrost 0.1.0b1 might be the old version and out of date.

So I git clone the newest version and compiled it locally. Now it works well.

But I noticed a minor issue as below:

>>> for n, data in g.nodes(data=True):
...     strand = str(data['strand'])
...     if(strand == 'Strand.FORWARD'):
...         print('=======================')
...         for k,v in data.items():
...             if(k=='colors'):
...                 print(k, set(v))
...             else:
...                 print(k, v)
... 
=======================
colors {0, 1, 2}
unitig_sequence ATAACA
tail AACA
length 3
strand Strand.FORWARD
mapped_sequence ATAACA
pos 0
is_full_mapping True
unitig_length 3
head ATAA
=======================
colors {1, 2}
unitig_sequence AATTGT         <----here I expected it would be "ACAATTG"
tail TTGT
length 3
strand Strand.FORWARD
mapped_sequence AATTGT
pos 0
is_full_mapping True
unitig_length 3
head AATT
=======================
colors {2}
unitig_sequence ACACC
tail CACC
length 2
strand Strand.FORWARD
mapped_sequence ACACC
pos 0
is_full_mapping True
unitig_length 2
head ACAC

And another small issue I also would like to point out for reference. I generated the ccDBG graph in two steps:

  Bifrost build -k 4 -c -r blue.fa -r red.fa -o B_R.step1

By loading B_R.step1 into python, and it output two unitags:

>>> for n, data in g1.nodes(data=True):
...     strand = str(data['strand'])
...     if(strand == 'Strand.FORWARD'):
...         print('=======================')
...         for k,v in data.items():
...             if(k=='colors'):
...                 print(k, set(v))
...             else:
...                 print(k, v)
... 
=======================
colors {0, 1}
unitig_sequence ATAACAATT
tail AATT
length 6
strand Strand.FORWARD
mapped_sequence ATAACAATT
pos 0
is_full_mapping True
unitig_length 6
head ATAA
=======================
colors {0, 1}
unitig_sequence ATAACAATT
tail AATT
length 6
strand Strand.FORWARD
mapped_sequence ATAACAATT
pos 0
is_full_mapping True
unitig_length 6
head ATAA

But these two unitags actually are same, probably it can be merged.

And then updated the B_R.step1 further:

  Bifrost update -k 4 -c -r green.fa -g B_R.step1.gfa -f B_R.step1.bfg_colors -o B_R_G.step2

"update" works fine. Very nice! Thanks for the cool tools!