indels at ends of DNA sequence

csmiguel commented 3 months ago

I have a pool of noisy sequences with two variants that differ at indels in their ends (see figure).

dada with default parameters only retrieves one variant but misses the other. I think it has to do with the ends-free concept in the alignments. After playing with the alignment parameters, setting "BAND_SIZE" = 0, seems to trigger nwalign_gapless in the C++ code and both variants are retrieved, along with other variants that are false positives at low frequencies and very low p.values (birt_pval). For comparison, I run in R the Needleman-Wunsch global alignment of these two sequences, which detects the end-gaps.

pwalign::pairwiseAlignment(p2rx1seqs[1], p2rx1seqs[2],
                  substitutionMatrix = mat,  type = "global",
                  gapOpening = 5, gapExtension = 2)

I run AmpliCI and it also finds both variants:

>H0;size=900.000;DiagP=0.00e+00;ee=0.265;
CCACCCTTCTCCTGGACCCACAGGGGAGGTAACATAATGGAATCCCCTTGGACCTTGGTGGTAATGTGGGATGAGCTTTCTGATTATGTTAATGATGTAATGTTTGACAGAGAGTCATACAAAATAACCCCCAGACACTGACAGTGCCATTGTTGGGCACTAGGCCTCTCCATCCCAACCTCCACTCCCCTGCCAGTGGACACTGGCACTTGTCAGCTAGGCAGAGACATAGATAAATGTGAAAGAAGGAGGCCGAGAGCAGAGGCCTCC
>H1;size=771.388;DiagP=0.00e+00;ee=0.384;
CCCACCCTTCTCCTGGACCCACAGGGGAGGTAACATAATGGAATCCCCTTGGACCTTGGTGGTAATGTGGGATGAGCTTTCTGATTATGTTAATGATGTAATGTTTGACAGAGAGTCATACAAAATAACCCCCAGACACTGACAGTGCCATTGTTGGGCACTAGGCCTCTCCATCCCAACCTCCACTCCCCTGCCAGTGGACACTGGCACTTGTCAGCTAGGCAGAGACATAGATAAATGTGAAAGAAGGAGGCCGAGAGCAGAGGCCTC

What exactly does "BAND_SIZE" = 0? how could I tune dada parameters to account for this variation at the end of the sequences without compromising the false negatives and false positives for other variants? Thanks, Miguel

csmiguel commented 3 months ago

I also encountered the same situation when discriminating these two variants in the read pool:

csmiguel commented 2 months ago

Band size == 0 solves the issue.

benjjneb / dada2

indels at ends of DNA sequence #1982