bbuchfink / diamond

Accelerated BLAST compatible local sequence aligner.
GNU General Public License v3.0
1.04k stars 182 forks source link

Using diamond to find frameshifts in assemblies generated from IonTorrent data #435

Closed fplazaonate closed 3 years ago

fplazaonate commented 3 years ago

Hello,

I want diamond to find frameshifts in assemblies generated from IonTorrent data that occur at homopolymers.

I have generated a dummy example with a dna sequence and its corresponding protein.

>original_dna_sequence
ATGTCCGGTGCGCGTCCGAAAGTCAGAGGTCCGTACCATGTCGGTCGCCTTGTTGTGCTA
GCACAACTAAGATCCTCTGGTCTGGTGCAAGGCATGATAAGTCGTGCTAAAGTTACTCAT
AGCAAATTACAAAGTCTAAGATCATCGCGTTGCGCTGGGGTCCTTAAAGGAATGTTATTT
CAAATTGTCGTGATCCCCGGACGTGTTGGGTCACCAAAAAAGGTGTGCCCATACTTAATC
CTCCTTGTATATGGTAGGTGGGACCTGATCTCACAACCTGGGGTTGGGTATTGCAAATAA
>original_prot_sequence
MSGARPKVRGPYHVGRLVVLAQLRSSGLVQGMISRAKVTHSKLQSLRSSRCAGVLKGMLFQIVVIPGRVGSPKKVCPYLILLVYGRWDLISQPGVGYCK*

Then, I have generated a +1 frameshift by adding a C in the original nucleotide sequence

>plus_1_frameshift
ATGTCCGGTGCGCGTCCGAAAGTCAGAGGTCCGTACCATGTCGGTCGCCTTGTTGTGCTA
GCACAACTAAGATCCTCTGGTCTGGTGCAAGGCATGATAAGTCGTGCTAAAGTTACTCAT
AGCAAATTACAAAGTCTAAGATCATCGCGTTGCGCTGGGGTCCTTAAAGGAATGTTATTT
CAAATTGTCGTGATCCCCCGGACGTGTTGGGTCACCAAAAAAGGTGTGCCCATACTTAATC
CTCCTTGTATATGGTAGGTGGGACCTGATCTCACAACCTGGGGTTGGGTATTGCAAATAA

In this case diamond fails to detect the frameshift:

Score = 132.1 bits (331),  Expect = 1.7e-36
 Identities = 70/79 (88%), Positives = 70/79 (88%), Gaps = 0/79 (0%)
 Frame = 1

Query    1  MSGARPKVRGPYHVGRLVVLAQLRSSGLVQGMISRAKVTHSKLQSLRSSRCAGVLKGMLF 180
            MSGARPKVRGPYHVGRLVVLAQLRSSGLVQGMISRAKVTHSKLQSLRSSRCAGVLKGMLF
Sbjct    1  MSGARPKVRGPYHVGRLVVLAQLRSSGLVQGMISRAKVTHSKLQSLRSSRCAGVLKGMLF 60

Query  181  QIVVIPRTCWVTKKGVPIL 237
            QIVVIP      KK  P L
Sbjct   61  QIVVIPGRVGSPKKVCPYL 79

I have tried to add another C to generate a +2 frameshift:

>plus_2_frameshift
ATGTCCGGTGCGCGTCCGAAAGTCAGAGGTCCGTACCATGTCGGTCGCCTTGTTGTGCTA
GCACAACTAAGATCCTCTGGTCTGGTGCAAGGCATGATAAGTCGTGCTAAAGTTACTCAT
AGCAAATTACAAAGTCTAAGATCATCGCGTTGCGCTGGGGTCCTTAAAGGAATGTTATTT
CAAATTGTCGTGATCCCCCCGGACGTGTTGGGTCACCAAAAAAGGTGTGCCCATACTTAATC
CTCCTTGTATATGGTAGGTGGGACCTGATCTCACAACCTGGGGTTGGGTATTGCAAATAA

In this case diamond detects the frameshift:

 Score = 192.6 bits (488),  Expect = 1.1e-54
 Identities = 100/101 (99%), Positives = 100/101 (99%), Gaps = 1/101 (0%)
 Frame = 1

Query    1  MSGARPKVRGPYHVGRLVVLAQLRSSGLVQGMISRAKVTHSKLQSLRSSRCAGVLKGMLF 180
            MSGARPKVRGPYHVGRLVVLAQLRSSGLVQGMISRAKVTHSKLQSLRSSRCAGVLKGMLF
Sbjct    1  MSGARPKVRGPYHVGRLVVLAQLRSSGLVQGMISRAKVTHSKLQSLRSSRCAGVLKGMLF 60

Query  181  QIVVIP/PGRVGSPKKVCPYLILLVYGRWDLISQPGVGYCK* 302
            QIVVI  PGRVGSPKKVCPYLILLVYGRWDLISQPGVGYCK*
Sbjct   61  QIVVI--PGRVGSPKKVCPYLILLVYGRWDLISQPGVGYCK* 100

Could you explain me the rationale behind this and how to fix this issue?

Thank you very much for your help, Florian

bbuchfink commented 3 years ago

The problem occurs due to masking of short ORFs. Use --min-orf 1 to get the correct alignment:

Length=100

 Score = 195 bits (495),  Expect = 8.90e-72
 Identities = 100/100 (100%), Positives = 100/100 (100%), Gaps = 0/100 (0%)
 Frame = 1

Query    1  MSGARPKVRGPYHVGRLVVLAQLRSSGLVQGMISRAKVTHSKLQSLRSSRCAGVLKGMLF 180
            MSGARPKVRGPYHVGRLVVLAQLRSSGLVQGMISRAKVTHSKLQSLRSSRCAGVLKGMLF
Sbjct    1  MSGARPKVRGPYHVGRLVVLAQLRSSGLVQGMISRAKVTHSKLQSLRSSRCAGVLKGMLF 60

Query  181  QIVVI\PGRVGSPKKVCPYLILLVYGRWDLISQPGVGYCK* 301
            QIVVI PGRVGSPKKVCPYLILLVYGRWDLISQPGVGYCK*
Sbjct   61  QIVVI-PGRVGSPKKVCPYLILLVYGRWDLISQPGVGYCK* 100
fplazaonate commented 3 years ago

Thanks Benjamin. I guess this should not happen by working with full contigs? Best, Florian

bbuchfink commented 3 years ago

It can still happen if the segment between stop codons is too short, safer to use the option I suggested.