claraqin / neonMicrobe

Processing NEON soil microbe marker gene sequence data into ASV tables.
GNU Lesser General Public License v3.0
9 stars 4 forks source link

Error tolerance of cutadapt should also be a parameter #6

Open claraqin opened 4 years ago

claraqin commented 4 years ago

Currently, error tolerance is assumed to be default -e 0.1. Changing the error tolerance could allow more primer sequences to be identified (and thus removed) by cutadapt.

See https://cutadapt.readthedocs.io/en/stable/guide.html#error-tolerance

To be fair, this might not be the most consequential parameter. Error tolerance of 0.1 corresponds to maximum mismatch of 2 bases. Allowing up to 3 mismatched bases would have removed an additional 138 reverse-primer matches from the reverse reads – approximately 10% of all the matches (with max.mismatch=3) in the pre-cutadapt reverse reads.

runB69PP, first R1 sample
                   Forward Complement Reverse RevComp
FWDPrimer.R1.reads     423          0       0       3
FWDPrimer.R2.reads      26          0       0      13
REVPrimer.R1.reads     647          0       0     298
REVPrimer.R2.reads     737          0       0       4

runB69PP, first R1 sample, setting max.mismatch = 2
                   Forward Complement Reverse RevComp
FWDPrimer.R1.reads     731          0       0       4
FWDPrimer.R2.reads      31          0       0     314
REVPrimer.R1.reads    1518          0       0     807
REVPrimer.R2.reads    1196          0       0      13

runB69PP, first R1 sample, setting max.mismatch = 3
                   Forward Complement Reverse RevComp
FWDPrimer.R1.reads     757          0       0       4
FWDPrimer.R2.reads      31          0       0     343
REVPrimer.R1.reads    1528          0       0     944
REVPrimer.R2.reads    1342          0       0      19

runB69PP, post cutadapt (1 of 2), first R1 sample
                   Forward Complement Reverse RevComp
FWDPrimer.R1.reads       0          0       0       3
FWDPrimer.R2.reads      26          0       0       0
REVPrimer.R1.reads     646          0       0       0
REVPrimer.R2.reads       0          0       0       4

runB69PP, post cutadapt (1 of 2), first R1 sample, setting max.mismatch = 2
                   Forward Complement Reverse RevComp
FWDPrimer.R1.reads       0          0       0       4
FWDPrimer.R2.reads      31          0       0       0
REVPrimer.R1.reads    1509          0       0       0
REVPrimer.R2.reads       0          0       0      13

runB69PP, post cutadapt (1 of 2), first R1 sample, setting max.mismatch = 3
                   Forward Complement Reverse RevComp
FWDPrimer.R1.reads       1          0       0       4
FWDPrimer.R2.reads      31          0       0       5
REVPrimer.R1.reads    1519          0       0     132
REVPrimer.R2.reads     138          0       0      19
mykophile commented 4 years ago

Hi Clara - one of the graduate students in my lab, Glade, found that one of the keys for him was setting the TruncQ parameter higher than the default. While this may not seem critical, his reasoning was that with ITS seqs there is a long, low quality tail that can cause reads to get rejected later. If you are more aggressive with trimming early that actually saves more down the line. Here are the settings he found worked best:

out <- filterAndTrim(cutFs, filtFs, cutRs, filtRs, maxN = 0, maxEE = c(2, 2), truncQ = 9, minLen = 50, rm.phix = TRUE, compress = TRUE, multithread = TRUE) Best, Kabir

On Mar 26, 2020, at 9:00 PM, Clara Qin notifications@github.com<mailto:notifications@github.com> wrote:

Currently, error tolerance is assumed to be default -e 0.1. Changing the error tolerance could allow more primer sequences to be identified (and thus removed) by cutadapt.

See https://cutadapt.readthedocs.io/en/stable/guide.html#error-tolerance

To be fair, this might not be the most consequential parameter. Error tolerance of 0.1 corresponds to maximum mismatch of 2 bases. Allowing up to 3 mismatched bases would have removed an additional 138 reverse-primer matches from the reverse reads – approximately 10% of all the matches (with max.mismatch=3) in the pre-cutadapt reverse reads.

runB69PP, first R1 sample Forward Complement Reverse RevComp FWDPrimer.R1.reads 423 0 0 3 FWDPrimer.R2.reads 26 0 0 13 REVPrimer.R1.reads 647 0 0 298 REVPrimer.R2.reads 737 0 0 4

runB69PP, first R1 sample, setting max.mismatch = 2 Forward Complement Reverse RevComp FWDPrimer.R1.reads 731 0 0 4 FWDPrimer.R2.reads 31 0 0 314 REVPrimer.R1.reads 1518 0 0 807 REVPrimer.R2.reads 1196 0 0 13

runB69PP, first R1 sample, setting max.mismatch = 3 Forward Complement Reverse RevComp FWDPrimer.R1.reads 757 0 0 4 FWDPrimer.R2.reads 31 0 0 343 REVPrimer.R1.reads 1528 0 0 944 REVPrimer.R2.reads 1342 0 0 19

runB69PP, post cutadapt (1 of 2), first R1 sample Forward Complement Reverse RevComp FWDPrimer.R1.reads 0 0 0 3 FWDPrimer.R2.reads 26 0 0 0 REVPrimer.R1.reads 646 0 0 0 REVPrimer.R2.reads 0 0 0 4

runB69PP, post cutadapt (1 of 2), first R1 sample, setting max.mismatch = 2 Forward Complement Reverse RevComp FWDPrimer.R1.reads 0 0 0 4 FWDPrimer.R2.reads 31 0 0 0 REVPrimer.R1.reads 1509 0 0 0 REVPrimer.R2.reads 0 0 0 13

runB69PP, post cutadapt (1 of 2), first R1 sample, setting max.mismatch = 3 Forward Complement Reverse RevComp FWDPrimer.R1.reads 1 0 0 4 FWDPrimer.R2.reads 31 0 0 5 REVPrimer.R1.reads 1519 0 0 132 REVPrimer.R2.reads 138 0 0 19

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHubhttps://github.com/claraqin/NEON_soil_microbe_processing/issues/6, or unsubscribehttps://github.com/notifications/unsubscribe-auth/AC5DJWWSQ6BBRMX2JX7C6QLRJQQFFANCNFSM4LUW33NQ.

Kabir Peay Associate Professor Dept. of Biology Stanford University (650) 723-0552

claraqin commented 4 years ago

Hi Kabir,

If I'm not mistaken, I think you're bringing up a separate issue, so I've just opened up a new issue (#7) and tagged you in it.

Best, Clara

claraqin commented 4 years ago

Just a note that while I haven't made the cutadapt error tolerance into another parameter, I did modify utils.R in commit f3c1119 so that the hard-coded error tolerance -e was 0.2 instead of the default 0.1. This allows up to 4 mismatched bases instead of the default 2, which is necessary because of the low primer identification rates under the default error tolerance.

I'm actually not sure whether it would be a good idea to make this into a parameter, so I'll leave this issue open until we can reach more of a consensus.