algbio / themisto

Space-efficient pseudoalignment with a colored de Bruijn graph
GNU General Public License v2.0

Sequence and reverse complement generating different unitigs #10

Open krobison13 opened 2 years ago

krobison13 commented 2 years ago

If I give themisto-build a file with just a sequence and its reverse complement, extract-unitigs is generating two different unitigs -- is this the expected behavior?

e.g. if 80.fna contains

>k80
ATCAGCAGCGACATGGCGGTCATCACCGTAGTCGAGGCAAGCAATAATGGACGGCGCCCG
ACGTGGTCGATGATCGCAGA
>rc.k80
TCTGCGATCATCGACCACGTCGGGCGCCGTCCATTATTGCTTGCCTCGACTACGGTGATG
ACCGCCATGTCGCTGCTGAT

and then run

themisto build -k 31 -i 80.fna -o 80.k31 --temp-dir .
themisto extract-unitigs -i 80.k31 --colors-out 80.k31.colors --gfa-out 80.k31.gfa

I get two lines in the colors file and two segments in the GFA file:

H VN:Z:1.0
S 86 ATCAGCAGCGACATGGCGGTCATCACCGTAGTCGAGGCAAGCAATAATGGACGGCGCCCGACGTGGTCGATGATCGCAGA
S 77 TCTGCGATCATCGACCACGTCGGGCGCCGTCCATTATTGCTTGCCTCGACTACGGTGATGACCGCCATGTCGCTGCTGAT

jnalanko commented 2 years ago

Yes, this is expected. Our index structure is not aware of reverse complements.

We could add a flag to extract-unitigs to compute the bidirected de Bruijn graph for better interoperability with other tools. Meanwhile, you can work around this by concatenating the input with its reverse complement before building the index. This produces two copies of each unitig, one for the forward strand and one for the reverse complement (except for unitigs that are their own reverse complement). You can then extract the bidirected de Bruijn graph from this with some post-processing.
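
For anyone trying the workaround, here is a minimal Python sketch of both halves: appending reverse complements to the FASTA records before the build, and collapsing strand-duplicate unitigs afterwards. None of this is part of Themisto. The function names and the `.rc` header suffix are made up for illustration, and it only merges the segment sequences. Rebuilding the edges of the bidirected graph, and deciding how colors should be assigned to the added records, is not covered here.

```python
# Sketch only: hypothetical helpers, not Themisto code.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Reverse complement of a DNA string; characters other than ACGT are left as-is."""
    return seq.translate(_COMPLEMENT)[::-1]


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def with_reverse_complements(records):
    """Yield each record followed by its reverse complement (header gets a '.rc' suffix)."""
    for header, seq in records:
        yield header, seq
        yield header + ".rc", reverse_complement(seq)


def canonical(seq):
    """Lexicographically smaller of a sequence and its reverse complement."""
    return min(seq, reverse_complement(seq))


def merge_strands(unitigs):
    """Collapse unitigs that are reverse complements of each other.

    Returns one canonical sequence per strand pair, in first-seen order.
    """
    seen = {}
    for unitig in unitigs:
        seen.setdefault(canonical(unitig), None)
    return list(seen)
```

Writing the output of `with_reverse_complements(read_fasta(open("80.fna")))` back out as FASTA gives the concatenated input for the build. Passing the two segment sequences from the GFA above to `merge_strands` leaves a single sequence, because the second segment is the reverse complement of the first.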