Unpaired reads due to random orientations in F and R reads

droush commented 7 years ago

I am processing 2x300 MiSeq 16s data using the 341F-806R primers. Initially I removed the sequencing primers using BBduk as ambiguous bases were causing issues with chimera checking (17 and 20 bp respectively. My initial trimming and filtering after import into DADA2 was a 10 bp on the 5' end of both reads, and then truncation at 275, and 235 of the forward and reverse reads due to quality. I end up with a nice 45 bp overlap for pairing further down the pipeline.

Here is where the problems begin. After running sample inference on the forward and reverse reads, I end up with ~800 variants for ~90,000 unique sequences for the forward, and only 120 variants for ~90,000 uniques for the reverse, which seems quite odd. When I move forward to merge pairs, we end up with only a quarter of the reads in each sample are successfully paired, and if I increase mismatches that increases as expected, but past a threshold I end up with a large amount of sequences as chimeras (as expected).

Do you guys have any insight into why the discrepency between the forward and reverse files is occuring, or have any suggestions on dealing with this issue. I think I might need to just use the forward reads as the reverse might have too low of quality in the overlapping region.

Thank you!

benjjneb commented 7 years ago

I have a couple of guesses, but they would just be guesses. The easiest way to proceed would be if you could share say 1 or 2 of your fastqs (both F & R). Is that possible?

droush commented 7 years ago

I sent a link to the email listed on the DADA2 website. Thank you for the help.

On Thu, Mar 9, 2017 at 12:20 PM -0700, "Daniel Roush" dwroush@asu.edu wrote:

Yes no problem. https://www.dropbox.com/sh/ua3h5m7c3kq1euj/AADsp_ywEc8HW8gBMqwLu0Exa?dl=0

pw= dada2 On Thu, Mar 9, 2017 at 12:03 PM, benjjneb notifications@github.com wrote:

I have a couple of guesses, but they would just be guesses. The easiest way to proceed would be if you could share say 1 or 2 of your fastqs (both F & R). Is that possible?

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub, or mute the thread.

-- Daniel Roush | PhD Student, MicrobiologyGarcia-Pichel Lab Arizona State University(505) 444-0238 | dwroush@asu.edu | @Roushomics

benjjneb commented 7 years ago

So, at first look it appears that your reads are in a mixture of orientations.

DADA2 expects the Forward reads to be in one direction and the Reverse read to be in the opposite direction. However both the forward and reverse reads are in a mixture of orientations, and that is interfering with merging (and sample inference).

Do you have some ideas on why this might be happening?

droush commented 7 years ago

I guess my first question is how did you figure that out so quickly!

The workflow I have was download full R1/R2 files from basespace, split into per sample FASTQ, trim off primers, and import. I can drop the raw basespace data into the dropbox folder as well. I'm not sure how the reads could get mixed in that workflow.

On Thu, Mar 9, 2017 at 7:35 PM, benjjneb notifications@github.com wrote:

So, at first look it appears that your reads are in a mixture of orientations.

DADA2 expects the Forward reads to be in one direction and the Reverse read to be in the opposite direction. However both the forward and reverse reads are in a mixture of orientations, and that is interfering with merging (and sample inference).

Do you have some ideas on why this might be happening?

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/benjjneb/dada2/issues/205#issuecomment-285556140, or mute the thread https://github.com/notifications/unsubscribe-auth/AWPV53ijuCdjB0GlDuXdqbgQQubyxsJ6ks5rkLbkgaJpZM4MYb_1 .

-- Daniel Roush | PhD Student, Microbiology Garcia-Pichel Lab Arizona State University (505) 444-0238 | dwroush@asu.edu | @Roushomics

benjjneb commented 7 years ago

Everything is done in R so everything can be inspected along the way, provided you wrote the package and know where everything is stored!

I'm going to label this as enhancement for now, to see if there is a simple way for us to add the ability to correct mixed orientations.

benjjneb commented 7 years ago

341F primer: CCTACGGGNGGCWGCAG 805R primer: GACTACHVGGGTATCTAATCC

droush commented 7 years ago

Thanks Ben this is really helpful. I took a look at some of the data before trimming the primers and it still had issues with pairing. The raw data I recieved off basespace had had some demultiplexing done to at least put our samples in one fastq file, but not split per sample.

droush commented 7 years ago

Found a post about this very issue on the Qiime forums. I think I'll try to do some reorientation and try to get it up and working over the weekend.

https://groups.google.com/forum/#!topic/qiime-forum/Ukxlr2O9hqo

Thanks for all the help!

On Fri, Mar 10, 2017 at 9:23 AM, benjjneb notifications@github.com wrote:

341F primer: CCTACGGGNGGCWGCAG 805R primer: GACTACHVGGGTATCTAATCC

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/benjjneb/dada2/issues/205#issuecomment-285713821, or mute the thread https://github.com/notifications/unsubscribe-auth/AWPV5_Fjr_iqkwgFE8aEG94wY4qKW5Thks5rkXkNgaJpZM4MYb_1 .

-- Daniel Roush | PhD Student, Microbiology Garcia-Pichel Lab Arizona State University (505) 444-0238 | dwroush@asu.edu | @Roushomics

giriarteS commented 7 years ago

When I was constructing the tree and I got a duplicated sequence:

IMPORTANT WARNING: Sequences SV1 and SV2892 are exactly identical

I checked back the duplicated SVs and notice that they have different orientation. Maybe, the reads can be reoriented before the inference? I reoriented all the SVs before the alignment to build the tree. seqs <- OrientNucleotides(seqs) alignment <- AlignSeqs(seqs, anchor=NA)

I just discarded SV2892 because the abundance was really low.

benjjneb commented 7 years ago

Going to close this as wont-fix from our end.

It looks like @giriarteS had some good suggestions though, the DECIPHER package has some useful tools in it.

csmiguel commented 6 years ago

I am having the exactly same problem to @droush with MiSeq 300PE: a ~80-90% drop in variants found in R2 reads compared to R1. I suspect this has to do with the high expected errors (output from learnErrors) for R2 reads. Therefore, dada2 would need higher divergence between sequences to determine high-confident variants (@benjjneb does it sound right?). Regarding the low merging rates, I am having the same issue using external workflows, such as PEAR, so I guess it is related to something intrinsic to the sequences. Probably, due to the low homology at the low-quality 3' ends. I believe it has nothing to do with sequence orientation. After running cutadapt on R1 and R2 reads with 341F and 805R primers, I find them as expected on the R1 and R2, respectively, and get no trimming when using their reverse-complements.

So: Should I use only variants from R1 for downstream diversity analysis? Am I missing something?

Please, find the R1 and R2 here with primers already removed with cutadapt: https://www.dropbox.com/sh/6o5g48h583fxvwp/AABz4Og3Sug_1LIpgMGu0Pmfa?dl=0

benjjneb commented 6 years ago

Therefore, dada2 would need higher divergence between sequences to determine high-confident variants (@benjjneb does it sound right?)

Yep, or higher numbers of reads from the variants. The result is that you lose resolution of low frequency variants in the high-error reverse reads, and you end up with a lot fewer variants in R2.

Should I use only variants from R1 for downstream diversity analysis? Am I missing something?

In practice, Illumina 2x300 sequencing has very poor quality over the last 100 nucleotides of the reverse reads. Ideally you want to cut off the low quality tails to improve inference (see above), but when you have to merge reads you are limited in how much you can cut off.

Using just the high-quality forward reads is usually preferable when the reverse read sequence quality simply isn't good enough to get successful merging.

Some tips when going after longer amplicons on Illumina: (1) Think of 2x300 as more like 300F and 200R bases, as in most cases that is closer to what you'll get in high-quality bases. (2) Look for high-quality sequencing providers. There is huge variation between different places. (3) Prefer library setups that don't include the primers in the sequenced reads, as that wastes bases on the primers that you'd rather have extending the amplicon into the overlap region.

csmiguel commented 6 years ago

Thank you @benjjneb for your fast and accurate reply and your sequencing tips. I am working with your great workflow (10.12688/f1000research.8986.2) only on R1 now. Anyways, I am bit surprised I was having so much trouble merging (16S amplicon without primers is ~430nt) or finding variants in R2 since quality is not so bad after filterAndTrim(fnFs[1], filtFs[1], fnRs[1], filtRs[1],maxN=0,maxEE=c(3,6), truncQ=2,compress=TRUE, multithread=TRUE,truncLen=c(275,180)) (or even setting more stringent maxEE).

However, apparently, using shorter sequences (ie. truncating R1 to 250nt) instead of R1-R2 merged (~430nt), should yield accurate results, as pointed by previous research (https://www.nature.com/articles/nature24621/figures/7).

PD: which library setup avoids sequencing 16S primers?

benjjneb commented 6 years ago

I am bit surprised I was having so much trouble merging (16S amplicon without primers is ~430nt) or finding variants in R2 since quality is not so bad ...

Agreed, I'm somewhat surprised too. Not sure what's going on there. Could the amplicon be longer than expected, at least in some major taxa?

PD: which library setup avoids sequencing 16S primers?

One such is the EMP protocol: sequencing reads start from the end of the primers, so primers aren't included on the reads.

csmiguel commented 6 years ago

Thanks Benjamin

sghignone commented 6 years ago

Hello, and sorry if I rise again this thread, but it's the most similar to my question mark. I recently came across DADA2 after years of QIIME experience. I am probably still too attached to the QIIME workflow, so I am a bit perplexed...In my latest experiments, my workflow was: extract_barcodes ->> assemble paired reads with pear ->> convert_fastaqual_fastq.py ->> split_libraries.py [in two steps: 1) using FWD primer on pear assembled reads; 2) using REV primer to assign reads to each samples] ->> adjust_seq_orientation.py to fix orientation of sequences where REV primer was found, so to get all sequences oriented in the same direction ->> proceed into qiime/vsearch

I am wondering if there's a way to have all sequences with the same direction (based on primer recognition) before to run primer trimming with cutadapt (bearing in mind the ITS DADA2 workflow)

benjjneb commented 6 years ago

@sghignone

There may be. The orient.fwd parameter in filterAndTrim (renamed from primer.fwd in 1.8) allows you to specify a fixed string that appears at the start of the forward read, and the filterAndTrim function will re-orient reads in which it is found at the start of the reverse read rather than the start of the forward read (it will also filter out reads where is isn't found in either place). This is meant to be used when your primers were sequenced. If that works, you should be OK. If not, you may need to go outside the dada2 package for re-orientation.

sghignone commented 6 years ago

Thanks for the hint! unfortunately my fwd primer has ambiguities, and the function you cited "Only allows unambiguous nucleotides." I will look for something based on vmatchPattern, which recognises the IUPAC ambiguity code, to have the reads matching the REV primer at 5' complemented and reversed.

benjjneb commented 6 years ago

@sghignone

You can use just the start of the forward primer, before any abiguities come up. That's how I usually use it.

benjjneb / dada2

Unpaired reads due to random orientations in F and R reads #205