benjjneb / dada2

Accurate sample inference from amplicon data with single nucleotide resolution
http://benjjneb.github.io/dada2/
GNU Lesser General Public License v3.0

Running just R1 reads?? #933

Closed: afletch00 closed this issue 4 years ago

afletch00 commented 4 years ago

I am trying to analyze a data set for which only the R1 reads are available, although it was sequenced as paired-end. Specifically (besides not creating any fnRs, etc., i.e. anything reverse-read related), what do I need to consider when doing this? I have 18 samples. The sequencing qualities are also extremely poor (attached: input quality plot.pdf). I am losing >90% of my reads through the pipeline, even with very loose filters (maxEE=19, no other quality filtering except of course maxN=0 and truncQ=2; image attached). Any ideas out there?
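For reference, the R1-only filtering call is essentially the following (a minimal sketch; directory names are placeholders):

```r
library(dada2)
# Forward reads only: no rev/filt.rev arguments to filterAndTrim
fnFs <- sort(list.files("raw", pattern = "_R1_001.fastq.gz", full.names = TRUE))
filtFs <- file.path("filtered", basename(fnFs))
# Very loose filtering as described above: maxEE=19, maxN=0, truncQ=2
out <- filterAndTrim(fnFs, filtFs, maxN = 0, maxEE = 19, truncQ = 2,
                     multithread = TRUE)
```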

benjjneb commented 4 years ago

Wow that quality looks pretty terrible. I think the key first thing you need to do here is some data forensics on what exactly was sequenced. That is, what amplicon was sequenced (e.g. 16S, ITS?). What primers were used, and are the primers on these sequences? What is the expected length of the sequenced amplicon, and are some (many?) of the reads reading through the amplicon into the reverse primer and junk DNA beyond (which could explain some of the poor quality)?

Also, what is the output of plotComplexity(fn) on an example sample? That would check for possible low-complexity infiltration.
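For example (a minimal sketch; the file path is a placeholder):

```r
library(dada2)
fn <- "path/to/sample_R1.fastq.gz"  # placeholder: one raw R1 fastq
plotComplexity(fn)  # histogram of per-read sequence complexity
```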

afletch00 commented 4 years ago

Hi Ben, thank you for your response. The primers were FWD <- "CTTGGTCATTTAGAGGAAGTAA" (ITS1F) and REV <- "GCTGCGTTCTTCATCGATGC" (ITS2), for ITS1 sequencing. I checked for primers and, after filtering Ns, there were only 58 hits, so I did not run cutadapt. Also, they removed the primers before demultiplexing; since I have the demultiplexed data, I assume the primers are removed. I am re-analyzing some published data and just have the SRA files they uploaded to NCBI. They only provided the R1 reads. Since it is ITS, reads can be from 100-301 bp. I attached the quality plots (HT filtFs plot complexity.pdf).
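For reference, the primer check was along these lines (a minimal sketch; the file name is a placeholder, and fixed=FALSE lets IUPAC ambiguity codes match):

```r
library(Biostrings)
library(ShortRead)
FWD <- "CTTGGTCATTTAGAGGAAGTAA"  # ITS1F
REV <- "GCTGCGTTCTTCATCGATGC"    # ITS2
reads <- sread(readFastq("sample_R1.fastq.gz"))  # placeholder file name
# Count reads containing each primer; a hit for the reverse complement of
# REV would indicate read-through into the reverse primer
sum(vcountPattern(FWD, reads, fixed = FALSE) > 0)
sum(vcountPattern(REV, reads, fixed = FALSE) > 0)
sum(vcountPattern(reverseComplement(DNAString(REV)), reads, fixed = FALSE) > 0)
```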

afletch00 commented 4 years ago

Let me add that they never matched the paired ends for merging. Apparently Read 2 was never analyzed at all. I'm really trying to figure out how they got the data they did, because it builds on the hypothesis for a project I am working on.

benjjneb commented 4 years ago

Could you share a single fastq file from a representative sample?

A couple of things to look at would be to simply dereplicate the raw data file and then inspect the most abundant sequences, e.g.:

```r
drp <- derepFastq(fn)             # dereplicate; uniques are sorted by abundance
top10 <- getSequences(drp)[1:10]  # the ten most abundant unique sequences
dada2:::pfasta(top10)             # print them in FASTA format
```

From there, just to get a sense of whether this makes any sense at all, I would BLAST the results: https://blast.ncbi.nlm.nih.gov/Blast.cgi?PROGRAM=blastn&PAGE_TYPE=BlastSearch&LINK_LOC=blasthome

It might also be interesting to try assigning taxonomy to the raw (dereplicated) data from a single sample to see what that looks like too. Is it mostly fungi as expected?
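A minimal sketch of that check, assuming a local copy of a UNITE reference fasta (both file paths are placeholders):

```r
library(dada2)
drp <- derepFastq("sample_R1.fastq.gz")  # placeholder: one raw R1 fastq
# Classify the raw dereplicated sequences against UNITE (placeholder path)
taxa <- assignTaxonomy(getSequences(drp),
                       "sh_general_release_dynamic.fasta.gz",
                       multithread = TRUE)
table(taxa[, "Kingdom"], useNA = "ifany")  # mostly Fungi, as expected?
```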

afletch00 commented 4 years ago

Thanks again, Ben. Here is the file with the most raw reads: HT01_S63_L001_R1_001.fastq.gz. I derep'ed all raw files. The top 10 sequences BLASTed came back with only 5 hits: 4 for Homo sapiens and 1 for Lactobacillus (standard nt BLAST). Nothing returned from the ITS BLAST. Top 10:

```
dada2:::pfasta(top10)
1  NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
2  CTTGGTCATTTAGAGGCAGATGGGTTATGCTGAAATCGGGGTCTTCGTATTCGTAATACTTATTGTCGCGGACCACGATTGGTGCGTTGTAGCCCAGGCGGTCGCTTCGCATCAGAACAAGGTCGTTCTGGAACGTGCGACGGCTCACAGGATTGCTTCGCCCTTCAAATTCAGCCAGCGCATCGGTACAGGCATCGATGAAGAACGCAGC
3  CTTGGTGCGGATCATGGTTCGCCCGTGGTGTTTGCCGAGGACGGCGACTACCTGGGAGCGTTTGCCGAAGAGTTCGGCGTGGATGCAATCAGCCTGAAGGGGGAATTTCAGATGAAGAACGCAGC
4  CTTGGTCATTTAGAGGCACACACAGCCACCACACTTCCTCCTTTACCTGCTGTATTCAAACATCGATGAAGAACGCAGC
5  CTTGGTCATTGTGCATGGAAGAAATTTCAAATCAAAGTGACGTTCTGATTGGATTTATCGATGAAGAACGCAGC
6  CTTGGTCATATTCTCCATGAGGGGAATGAGTGGCTCGTAGGCCTTGTCGAGCATCGATGAAGAACGCAGC
7  CTTGGAAGCAGCTCTTTCCCATTAAGCTTTGGATGAAGAACGCAGC
8  CTTGGTCATTGTTGGTGTATAGCAGTGCTACTAATTTGTGTACACTGATTTTGTATCCTGAAACTCTACTGAATTCATTTATCAGATCTAGGATCTTTTTGGATGAATCTTGAGGGTTTTCTGGGTATATGATCATATCATCGATGAAGAACGCAGC
9  CTTGGTCATTTAGAGGGGAGAGGATGGAAAGTCTGTGTGTGTAGGGCCGCTCCCGGGGACTCCAGGCAGAGGTGAAGATCCAGGGGCTTGGTCTTCTAGGAACTCAGCGCTGTCACTCCTTTCTCCTGGCGCAAAGCTCTCCACTGCCCAATTCTGTCCCCAGGGGACATTTGGCAGTATTTGAGCCAGGCATGCTGCTCAGTGCCCCCACTGTGCCCAGAACGGCCCAGTACAGAACGCAGC
10 CTTGGTTCATTGATAGTATCAAGACTGCTAATAATGGGAGAAAATAATGACTGATTTAGTCTATCGCTCAGTAATGCGCGATATTAAACAAAAAATACTTAATAATGAATATGAAGATATGCGTTTGCCTGATGAGCGAAGCCTAAGTGACTATTATCATGTAAGTCGTTCATCGATGAAGAACGCAGC
```

When trimming and filtering, I eventually ran: derep, removal of Ns, maxEE=2, truncQ=2, and minLen=50 (since that was about the average sequence length), and then BLASTed the FASTA files written from the dadaF objects.
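For reference, the FASTA-writing step was essentially (a minimal sketch, assuming dadaF holds the denoised result for one sample):

```r
library(dada2)
# Write the denoised unique sequences to FASTA for BLASTing
uniquesToFasta(getUniques(dadaF), fout = "dadaF_seqs.fasta")
```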

I honestly just want to make sure it is faulty data and not my faulty code, because I am only trying to look at R1 reads!

Thanks again, Ben.

benjjneb commented 4 years ago

This just looks like junk DNA. The by-far most repeated sequence is just all Ns, and after that almost no sequences are repeated (see drp$uniques[1:20]). If you look at the sequence complexity, there is a large mode at low complexities that includes most of the sequences (hist(seqComplexity(drp), 100)). And if you BLAST random sequences, it's junk, or human, or hits some random bacteria.

Maybe there is some actual signal there if you pull out only sequences that are from the ITS region by comparing to an ITS database? I dunno, but at a minimum the vast majority of this data is junk.
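One way to attempt that first-pass cleanup is to drop low-complexity reads at the filtering step; a hedged sketch (the rm.lowcomplex threshold of 8 is an assumption to be tuned against hist(seqComplexity(drp), 100)):

```r
library(dada2)
fn <- "HT01_S63_L001_R1_001.fastq.gz"  # the sample shared above
filt <- file.path("filtered", basename(fn))
# rm.lowcomplex discards reads whose complexity (effective kmer number,
# max 16 for the default kmerSize=2) falls below the given threshold.
# The cutoff of 8 is an assumption; pick one from the complexity histogram.
filterAndTrim(fn, filt, maxN = 0, maxEE = 2, truncQ = 2, minLen = 50,
              rm.lowcomplex = 8)
```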

afletch00 commented 4 years ago

I ran it against UNITE and barely got any correct hits, and almost nothing beyond the phylum designation...

Thank you so much Ben.