benjjneb / dada2

Accurate sample inference from amplicon data with single nucleotide resolution
http://benjjneb.github.io/dada2/
GNU Lesser General Public License v3.0

Using dada2 for single end reads #795

Closed s160398j closed 5 years ago

s160398j commented 5 years ago

Can I use dada2 for analysis if I have single end reads?

benjjneb commented 5 years ago

Yes. Everything will run the same as the tutorial workflow except skip the reverse read parts and read merging, and form the sequence table out of the dadaFs object, i.e. makeSequenceTable(dadaFs).
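As a sketch, the single-end version of the tutorial workflow looks roughly like this (file paths and the truncation length are placeholders to adapt to your own data and quality profiles):

```r
library(dada2)

# fnFs: vector of paths to your single-end FASTQ files (placeholder)
filtFs <- file.path("filtered", basename(fnFs))

out <- filterAndTrim(fnFs, filtFs, truncLen = 240,  # choose from your quality profiles
                     maxN = 0, maxEE = 2, truncQ = 2,
                     rm.phix = TRUE, compress = TRUE, multithread = TRUE)

errF <- learnErrors(filtFs, multithread = TRUE)
dadaFs <- dada(filtFs, err = errF, multithread = TRUE)

# No mergePairs step: build the sequence table directly from dadaFs
seqtab <- makeSequenceTable(dadaFs)
seqtab.nochim <- removeBimeraDenovo(seqtab, method = "consensus", multithread = TRUE)
```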

s160398j commented 5 years ago

Thanks a lot!


AnaMariaCabello commented 4 years ago

Dear Ben, I'm using dada2 for the analysis of 454 sequencing data. I have 3 questions:

1) My data comes from 3 different runs; should I run dada2 separately for each run? I know that is the way to proceed with Illumina data, but I don't know whether it would affect the error-learning process in the case of 454 too.

2) When demultiplexing in QIIME (split_libraries.py), my data already passed a quality control based on Phred scores. Should I also run filterAndTrim with the quality-related parameters (maxEE, truncQ)?

outF <- filterAndTrim(fnFs, filtFs, truncLen=xxx, maxN=0, maxEE=2, truncQ=2, rm.phix=TRUE, compress=TRUE, multithread=TRUE)

I was thinking of using filterAndTrim just to filter by length and maxN=0.

3) From your tutorials, I saw that only minor modifications are needed in the dada2 script for 454 data: dada(..., HOMOPOLYMER_GAP_PENALTY=-1, BAND_SIZE=32). Is there something else to take into account? In my case, I'm dealing with V9+ITS1 amplicons, so I will skip HOMOPOLYMER_GAP_PENALTY because my ITS sequences have natural homopolymers of more than 3 bases. Is that correct?

Thank you very much, Ana Cabello

benjjneb commented 4 years ago
  1. Yes.

  2. Yes. Arguably even better would be to turn off the QIIME filtering, but it's still worth doing the dada2 filtering step either way.

  3. No, you still want to include the HOMOPOLYMER_GAP_PENALTY=-1 parameter. This is because of the high homopolymer error rate in 454 data, and that gap penalty improves the alignments of different sequences when that is the case.
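For reference, a sketch of the 454 denoising call described above (assuming `derepFs` and `errF` come from the usual dereplication and error-learning steps):

```r
# 454-specific options; keep HOMOPOLYMER_GAP_PENALTY even for ITS amplicons
dadaFs <- dada(derepFs, err = errF,
               HOMOPOLYMER_GAP_PENALTY = -1,  # handles 454's frequent homopolymer indels
               BAND_SIZE = 32,
               multithread = TRUE)
```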

bellahrabe commented 4 years ago

Dear Ben, I am working with reverse single-end reads. Can those be used in the same way as forward single-end reads? Thanks for your help! Isabella

benjjneb commented 4 years ago

Yep, no different really. You should probably include the assignTaxonomy(..., tryRC=TRUE) flag for taxonomy though, as the reverse reads will commonly be in the reverse complement orientation relative to the reference database.
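For example (the reference database file name here is only a placeholder for whichever training set you use):

```r
# tryRC = TRUE also tries the reverse complement of each sequence against the
# reference, which matters when reads are reverse-oriented relative to the database
taxa <- assignTaxonomy(seqtab.nochim, "silva_nr99_v138.1_train_set.fa.gz",
                       tryRC = TRUE, multithread = TRUE)
```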

bellahrabe commented 4 years ago

Great! Thank you so much for this swift response. Works perfectly.

AnaMariaCabello commented 4 years ago

Dear Ben, I'm using dada2 for the analysis of 454 sequencing data. I sequenced the V9 region of the 18S plus the ITS1. The sequencing was performed from the 5.8S region upwards, so we got a variety of read lengths (200-600 bp) and some reads did not cover the V9 region completely. Learning errors with this dataset takes longer and gives worse error plots (error_plots_raw) than learning errors with just the subset of reads that completely cover the V9 region (error_plots_long_sequences). I also tried learning errors using only the V9 region of those long reads to see if I could get even lower error frequencies, but I'm not sure how to interpret this last error plot (error_plot_v9region). I would appreciate your opinion about which would be the best option for the error-learning step. Thank you very much! Ana

ps: usually, long reads (containing the complete V9) represent ca. 50% of the reads that passed quality filtering. I hope that amount of reads is still a representative sample of the dataset for learning errors.

            filtered   covering V9
sample_1       13151          6042
sample_2        5500          2193
sample_3       11566          5403
sample_4        2308          1063
sample_5        5668          2298
sample_6        2706           459

error_plots_raw.pdf Error_plots_long_sequences.pdf Error_plot_V9region.pdf

benjjneb commented 4 years ago

In general, subsetting data or choosing samples of better quality or lower complexity is a valid approach to learnErrors. So your approach here is fine in principle. The one concern is having enough data once you subset down to your V9 spanning reads, but it seems like you have already thought of that. However, what I see in your error-rate plots is a much higher density of data and a more consistent set of observations (points) and fit to those points by the model (black line), so I think I would lean towards using the full data for that reason.

454 data often had a pretty flat error-profile because of the error modes of that technology which were dominated by indel errors.

AnaMariaCabello commented 4 years ago

Thank you so much Ben for helping me to understand. Sorry, I noticed I attached the wrong "error_plot_raw"; please find attached the correct one, although I think you will suggest the same. So, is it not that bad that the error frequency doesn't go below 10^-3 around the highest quality scores? Is it more important to have a better fit of the points to the model rather than to get lower error frequencies? I thought that the error plot using the subset of long sequences was a good approach (but I guess there is no black-or-white answer for this). Error_Plot_DadaRun7.pdf

benjjneb commented 4 years ago

So, is it not that bad that the error frequency doesn't go below 10^-3 around the highest quality scores?

That was typical for 454 data in my experience.

Is it more important to have a better fit of the points to the model rather than to get lower error frequencies?

Yes, having a good model fit is the first thing to look for. The second is that the overall model looks like what is expected from that sequencing technology.

AnaMariaCabello commented 4 years ago

Great! Thank you so much!

nmor683 commented 4 years ago

Hi there, I’m attempting to analyse some 454 sequence data (V3 region of 16S rRNA), but I’m having trouble with the filtering step: none of my reads pass the filter. These are the parameters that I am using:

out <- filterAndTrim(fnFs, filtFs, truncLen=0, maxN=0, maxEE=2, truncQ=2, maxLen=200, minLen= Inf, rm.phix=TRUE, compress=TRUE, verbose = T, multithread=FALSE)

I have tried increasing the maxEE value, but this didn't make a difference. I am unsure which other parameters could be changed, or what to change them to.

Thank you, Natalie

quality profiles for raw reads

benjjneb commented 4 years ago

@nmor683

minLen= Inf

That flag is requiring each read to be at least Infinity bases long to be kept. AKA no reads will pass the filter.
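A corrected call might look like this (the length bounds are illustrative; `minLen` should be a finite minimum, or simply left at its default):

```r
out <- filterAndTrim(fnFs, filtFs, truncLen = 0, maxN = 0, maxEE = 2, truncQ = 2,
                     minLen = 50, maxLen = 200,  # keep reads between 50 and 200 nt
                     rm.phix = TRUE, compress = TRUE, verbose = TRUE,
                     multithread = FALSE)
```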

nmor683 commented 4 years ago

Thank you so much!

AnaMariaCabello commented 3 years ago

Dear Ben, is the parameter HOMOPOLYMER_GAP_PENALTY=-1 still available in dada2 version 1.14? I was trying to retrieve it when typing dada(derepFs, err = errF, HOMOPOLYMER_GAP_PENALTY=-1, BAND_SIZE=32, ...) but it didn't come up; no results are returned by ??HOMOPOLYMER_GAP_PENALTY either. I'm asking because the dada function took 45 min to run on my 454 plate. Is there any problem with continuing to use dada2 version 1.14?

Also, I would like to apply the collapseNoMismatch function to my data. I have ITS sequences of variable length (ca. 300-600 bp), with minLen set to 300 bp. Would you recommend applying this function?

Thank you!

benjjneb commented 3 years ago

@AnaMariaCabello

It exists in a special class of algorithmic arguments that you can inspect using the ?setDadaOpt documentation.
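That is, the option is still accepted by dada() even though it is not a named argument in the function signature, which is why tab completion and ?? don't find it. It can be inspected like so:

```r
?setDadaOpt                               # documentation for the algorithmic options
getDadaOpt("HOMOPOLYMER_GAP_PENALTY")     # current value of a single option
dadaFs <- dada(derepFs, err = errF,
               HOMOPOLYMER_GAP_PENALTY = -1, BAND_SIZE = 32)
```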

Also, I would like to apply the collapseNoMismatch function to my data. I have ITS sequences of variable length (ca. 300-600 bp), with minLen set to 300 bp. Would you recommend applying this function?

Was the data trimmed to span from the forward to reverse primers? If so, then the length variation is biological, and doesn't require collapseNoMismatch. But if the length variation is technical (e.g. variation in 454 sequence lengths) then yes it would be worth running.
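Running it is a one-liner (assuming `seqtab` is your sequence table):

```r
# Collapses ASVs that are identical except for length differences,
# e.g. technical length variation in 454 reads
seqtab.collapsed <- collapseNoMismatch(seqtab, verbose = TRUE)
```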

AnaMariaCabello commented 3 years ago

Great! Thank you so much! The length variation is technical!

AnaMariaCabello commented 3 years ago

Dear Ben, I'm using dada2 for the analysis of 454 sequencing data. My data comes from 3 different runs (sequencing plates), so I run dada2 separately for each one. After filtering, error learning and dereplication I obtained the sequence table for each run (“makeSequenceTable”), and then I merged the 3 tables: “st.all <- mergeSequenceTables(st1, st2, st3)”. After that I proceed to remove chimeras “removeBimeraDenovo”.

I noticed that the sum of the total final reads obtained in each run (run1+run2+run3 = 231499 reads) was higher than the total reads in the merged table (198723 reads), i.e. the merged table had 15% fewer reads. The merged table contained all the samples included in the 3 runs, so no problems with sample names… is there any explanation for this? Thank you, Ana Cabello

benjjneb commented 3 years ago

I think the reason is that you removed chimeras on the merged table?

Prior to chimera removal and immediately after mergeSequenceTables the numbers should be equal, but after chimera removal on the merged table there will be fewer reads remaining.
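One way to check this is to count reads at each step (assuming `st1`, `st2`, `st3` are the per-run sequence tables from the earlier steps):

```r
sum(st1) + sum(st2) + sum(st3)       # total reads across the per-run tables

st.all <- mergeSequenceTables(st1, st2, st3)
sum(st.all)                          # should equal the sum above

st.nochim <- removeBimeraDenovo(st.all, method = "consensus", multithread = TRUE)
sum(st.nochim)                       # lower, since chimeric reads are removed
```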

AnaMariaCabello commented 3 years ago

That's the problem. The numbers are not equal immediately after "mergeSequenceTables" (prior to chimera removal).

benjjneb commented 3 years ago

Can you isolate the code that leads to this issue, including the output at each step?

I.e. post the code that counts the reads in each table, then the code to merge them, then the code that counts the reads in that merged table. Include the outputs, and any other messages that arise.

AnaMariaCabello commented 3 years ago

Dear Ben, I found an error in my code; it is now fixed and the read numbers make sense. I have one last question: should I expect the number of ASVs in the merged table to be the same as the sum of the ASV counts in the 3 previous tables? I do not get the same number (1206 ASVs vs. 1185 ASVs in the merged table), but I thought that the function "mergeSequenceTables" was collapsing identical ASV sequences. Is this correct?

Sorry for any inconvenience and thanks for your time.

benjjneb commented 3 years ago

The sum of the reads should be the same.

I do not get the same number (1206 ASVs vs. 1185 ASVs in the merged table), but I thought that the function "mergeSequenceTables" was collapsing identical ASV sequences. Is this correct?

Correct. My concern now is actually that so few (1206-1185=21) common ASVs are being found between your 3 runs. Is that expected? Or am I misinterpreting the numbers you posted?
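One quick sanity check on those numbers (tables named as in the earlier steps):

```r
ncol(st1) + ncol(st2) + ncol(st3)  # ASVs summed across runs (shared ASVs counted per run)
st.all <- mergeSequenceTables(st1, st2, st3)
ncol(st.all)                       # unique ASVs after identical sequences are collapsed
# the difference between the two counts reflects ASVs shared between runs
```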

aff30 commented 2 years ago

Hi Ben,

I am new to bioinformatics and I am running my first samples. I have been having a problem with the quality of my data: the reverse reads are very bad. I tried to find an explanation online for why only the reverse reads are bad. Would you mind briefly explaining that to me, if you know the answer? Or perhaps sending me links to articles where I could find this info? Thank you very much for your help.

benjjneb commented 2 years ago

@aff30 I don't have any documentation to send you, but it is broadly understood that the reverse reads on Illumina are typically of worse quality, and that depending on how the library is prepared and the instrument is performing, they can be much worse. So what you are seeing isn't that unusual, although it would make me reach out to the sequencing provider and start to consider alternatives.

aff30 commented 2 years ago

Yeah, I also did not find anything, but thank you very much for your answer!