
removePrimers misses valid primer matches in certain circumstances #989

Open gmagoon opened 4 years ago

gmagoon commented 4 years ago

Currently, removePrimers with orient=TRUE appears to miss valid primer matches in certain circumstances. Specifically, in the current removePrimers logic, if there is a fwd-only primer match on the original (non-reverse-complemented) sequence, a full (fwd+rev) match on the reverse complement is never considered. This can lead to unexpected behavior: e.g., loosening the matching thresholds can result in fewer passing read sequences. I believe #956 should fix this.
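For illustration, here is a schematic sketch of the orientation logic as I understand it. This is not dada2's actual internals: has_hit() and orient_and_match() are illustrative names of mine, and the matching step is approximated with Biostrings::matchPattern().

```r
library(Biostrings)

# Stand-in for the primer-matching step (illustrative, not dada2 internals).
# fixed=FALSE lets IUPAC ambiguity codes in the primers (R, Y, M, ...) match.
has_hit <- function(primer, seq, max.mismatch = 2) {
  length(matchPattern(DNAString(primer), DNAString(seq),
                      max.mismatch = max.mismatch, fixed = FALSE)) > 0
}

orient_and_match <- function(seq, fwd, rev, max.mismatch = 2) {
  if (has_hit(fwd, seq, max.mismatch) && has_hit(rev, seq, max.mismatch))
    return(list(seq = seq, flipped = FALSE))   # full match in fwd orientation
  # Current behavior (as I understand it): a fwd-only hit is treated as a
  # failed read, and the reverse complement is never tried. The fix in #956
  # falls through to the reverse complement on ANY incomplete match:
  rcseq <- as.character(reverseComplement(DNAString(seq)))
  if (has_hit(fwd, rcseq, max.mismatch) && has_hit(rev, rcseq, max.mismatch))
    return(list(seq = rcseq, flipped = TRUE))  # full match after flipping
  NULL                                         # no full match in either orientation
}
```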

benjjneb commented 4 years ago

Thanks for opening this issue; somehow your pull request slipped through my to-do list.

Request: Can you provide a minimal example of the specific problem being addressed here? E.g. fastq file(s) with like 3 sequences, one of which should be handled better with this new logic?

gmagoon commented 4 years ago

Sure @benjjneb ... I'll plan to get this to you, hopefully this weekend.

benjjneb commented 4 years ago

Did we follow up on this over email?

gmagoon commented 4 years ago

No, sorry, I haven't gotten around to this yet. I'll probably return to it in a month or so and try to follow up then...

gmagoon commented 8 months ago

Very sorry for the long delay, @benjjneb. I've finally gotten around to putting together some test sets that show the issue. I'm hoping this falls under the "better late than never" category!

To demonstrate, I'm following the outline of this procedure, using a subset of the data.

```r
> library(dada2);packageVersion("dada2")
Loading required package: Rcpp
Warning message:
package ‘Rcpp’ was built under R version 4.2.3 
[1] ‘1.26.0’
> F27 <- "AGRGTTYGATYMTGGCTCAG"
> R1492 <- "RGYTACCTTGTTACGACTT"
> rc <- dada2:::rc
> path <- "~/dada2_issue989_test"
> fns <- list.files(path, pattern="fastq.gz", full.names=TRUE)
> fns
[1] "C:\\Users\\Gregory.Magoon\\Documents/dada2_issue989_test/Callahan_16S_R_3.1.ccs99.9.fastq.gz.1"
> nopsA <- file.path(path, "noprimersA", basename(fns))
> primA <- removePrimers(fns, nopsA, primer.fwd=F27, primer.rev=dada2:::rc(R1492), orient=TRUE, max.mismatch=2, verbose=TRUE, compress=FALSE)
Multiple matches to the primer(s) in some sequences. Using the longest possible match.
15162 sequences out of 30983 are being reverse-complemented.
Overwriting file:C:\Users\Gregory.Magoon\Documents\dada2_issue989_test\noprimersA\Callahan_16S_R_3.1.ccs99.9.fastq.gz.1
Read in 30983, output 26605 (85.9%) filtered sequences.
> nopsB2 <- file.path(path, "noprimersB2", basename(fns))
> primB2 <- removePrimers(fns, nopsB2, primer.fwd=F27, primer.rev=dada2:::rc(R1492), orient=TRUE, max.mismatch=4, verbose=TRUE, compress=FALSE, allow.indel=TRUE)
Primer matching with indels allowed is currently significantly (~4x) slower.
Creating output directory: C:/Users/Gregory.Magoon/Documents/dada2_issue989_test/noprimersB2
Multiple matches to the primer(s) in some sequences. Using the longest possible match.
11506 sequences out of 30983 are being reverse-complemented.
Read in 30983, output 25469 (82.2%) filtered sequences.
> nopsD <- file.path(path, "noprimersD", basename(fns))
> primD <- removePrimers(fns, nopsD, primer.fwd=F27, primer.rev=dada2:::rc(R1492), orient=TRUE, max.mismatch=5, verbose=TRUE, compress=FALSE)
Creating output directory: C:/Users/Gregory.Magoon/Documents/dada2_issue989_test/noprimersD
Multiple matches to the primer(s) in some sequences. Using the longest possible match.
16094 sequences out of 30983 are being reverse-complemented.
Read in 30983, output 29564 (95.4%) filtered sequences.
>
```

The above set of commands filters with three different parameter sets. The first, noprimersA, uses max.mismatch=2 and allow.indel=FALSE (the default values). The other two represent relaxed matching that, ideally, should increase sensitivity by identifying and removing primers from more reads (and so yield more reads in the output). noprimersB2 uses max.mismatch=4 with allow.indel=TRUE, while noprimersD uses max.mismatch=5 and leaves allow.indel=FALSE.

Inspecting the output, I found 4370 reads missing from noprimersB2 that were properly processed in noprimersA. Likewise, there are 7 reads missing from noprimersD that were properly processed in noprimersA. These are the reads represented in the attached test sets. I believe #956 should allow most of these reads to be properly trimmed at the corresponding parameters, although I can't easily test this patch at the moment. That is, most of the "in_A_not_B2" set should be properly trimmed by the noprimersB2 parameters, and most of the "in_A_not_D" set should be properly trimmed by the noprimersD parameters.
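For anyone who wants to reproduce the comparison, something like the following should work. This is a sketch rather than the exact commands I ran; it assumes the ShortRead package and the path/fns variables from the session above.

```r
library(ShortRead)

# Read the primer-trimmed outputs from two of the runs above
fqA  <- readFastq(file.path(path, "noprimersA",  basename(fns)))
fqB2 <- readFastq(file.path(path, "noprimersB2", basename(fns)))

# IDs of reads that passed the stricter A parameters but were dropped by B2
missing <- setdiff(as.character(id(fqA)), as.character(id(fqB2)))
length(missing)  # 4370 in my run

# Pull the corresponding raw reads from the original file into a test set
raw <- readFastq(fns)
writeFastq(raw[as.character(id(raw)) %in% missing],
           file.path(path, "Callahan_16S_R_3.1.ccs99.9.in_A_not_B2.fastq.gz"))
```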

It seems the issue is that, as the matching parameters are loosened, spurious (coincidental) matches can be identified, leading to counterintuitive and undesirable results. For cases where the true match is on the reverse-complement sequence, I believe what happens is that, when inspecting the forward read sequence, one of the primers gives a spurious hit while the other does not (as expected, since the true match is on the reverse complement). The current algorithm quits at this point and omits the read from the output, whereas my patch prompts it to continue by trying the reverse complement (as it would if neither primer gave a match on the forward read sequence).

I should note that there can still be issues, when loosening the matching parameters, if BOTH primers give spurious hits to the forward read sequence. This leads to the still-undesirable (though more intuitive) result of mis-trimmed sequences, generally of unexpected length. (A small fraction of the reads appear to have this issue with the noprimersB2 parameters.) The most straightforward fix I can think of would be to have the algorithm consider both the forward and reverse-complement orientations for ALL reads, then choose whichever, if any, gives the best-quality matches; a sketch of that idea follows below. I'd expect that to increase run time by roughly 50%.
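To make that concrete, here is a rough sketch of what "check both orientations for every read and keep the better match" could look like. The names are mine, and this ignores indels, trimming, and multiple hits; it just scores each orientation by the smallest mismatch budget at which both primers hit.

```r
library(Biostrings)

best_orientation <- function(seq, fwd, rev, max.mm = 4) {
  # Smallest mismatch budget at which BOTH primers hit, or Inf if none
  budget <- function(s) {
    s <- DNAString(s)
    for (mm in 0:max.mm) {
      if (countPattern(fwd, s, max.mismatch = mm, fixed = FALSE) > 0 &&
          countPattern(rev, s, max.mismatch = mm, fixed = FALSE) > 0)
        return(mm)
    }
    Inf
  }
  rcs <- as.character(reverseComplement(DNAString(seq)))
  b_fwd <- budget(seq)
  b_rc  <- budget(rcs)
  if (is.infinite(b_fwd) && is.infinite(b_rc)) return(NULL)  # drop the read
  if (b_fwd <= b_rc) list(seq = seq, flipped = FALSE)        # prefer fwd on ties
  else               list(seq = rcs, flipped = TRUE)
}
```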

Callahan_16S_R_3.1.ccs99.9.in_A_not_B2.fastq.gz
Callahan_16S_R_3.1.ccs99.9.in_A_not_D.fastq.gz