benjjneb / dada2

Accurate sample inference from amplicon data with single nucleotide resolution
http://benjjneb.github.io/dada2/
GNU Lesser General Public License v3.0

Avoiding bad plotQualityProfile results and long error-rate calculations #1574

Closed MDSHAHADATHOSSAIN closed 1 month ago

MDSHAHADATHOSSAIN commented 2 years ago

Hello, I have four fastq files: 18S191096-CZ_R1.fastq, 18S191096-CZ_R2.fastq, 18S191096-NEW_R1.fastq and 18S191096-NEW_R2.fastq, with sizes of 227 MB, 227 MB, 2.5 GB and 2.5 GB, respectively. I have attached the plotQualityProfile figures here; you can see that I already have far too many reads.

[Attached figures: plotQualityProfile_R1, plotQualityProfile_R2]

```r
mydata_errF <- learnErrors(mydata_filtFs, nbases=1e+08, multithread=TRUE, verbose=TRUE)
mydata_errR <- learnErrors(mydata_filtRs, nbases=1e+08, multithread=TRUE, verbose=TRUE)
```

When I try to estimate the error rates with the commands above, it takes too long because of the large number of bases and reads. For example, the R session ran continuously for a day and then terminated without finishing the task.

What should I do now? How can I improve the read quality shown in plotQualityProfile? How can I estimate the error rates more quickly? I would be grateful for any suggestions. Thank you.

benjjneb commented 2 years ago

The relevant data size for learnErrors is the size after filtering and trimming. What filterAndTrim command did you apply to the data? One option is to make filtering more stringent, which trades away lower-quality reads for reduced processing time later.

That said, learnErrors is often the most compute-intensive part of the DADA2 workflow, especially on high-diversity data. (What environment/primer set are you using?)
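For example, a minimal sketch of both knobs (the values are illustrative rather than a recommendation, and the file-path variables follow the tutorial's naming):

```r
# Sketch only: stricter maxEE discards more low-quality reads during filtering,
# and a smaller nbases caps how much data learnErrors samples (default is 1e8).
out <- filterAndTrim(fnFs, filtFs, fnRs, filtRs,
                     truncLen=c(130,120), maxN=0,
                     maxEE=c(1,1),          # stricter than the common c(2,2)
                     truncQ=2, rm.phix=TRUE,
                     compress=TRUE, multithread=TRUE)
errF <- learnErrors(filtFs, nbases=1e7, multithread=TRUE)  # fewer bases, faster
```

Lowering nbases speeds up error learning at some cost to how well the error model is fit.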

MDSHAHADATHOSSAIN commented 2 years ago

Thank you for your reply. I used this filterAndTrim command:

```r
mydata_out <- filterAndTrim(mydata_fnFs, mydata_filtFs, mydata_fnRs, mydata_filtRs,
                            truncLen=c(130,120), maxN=0, maxEE=c(2,2), truncQ=2,
                            rm.phix=TRUE, compress=TRUE, multithread=FALSE)
```

The PCR primer pairs (forward/reverse) were 341F/805R and NS1/GCfung, targeting the 5' end of the eukaryote 18S rRNA.

benjjneb commented 2 years ago

Is the data a mixture of V3V4 16S sequences and 18S sequences from fungi? That makes things more complicated.

What is the length of the sequenced eukaryotic amplicon? This should be considered when choosing truncLen parameters: you need to keep enough sequence that the forward and reverse reads still overlap. Your current truncLen values are too short to get overlapping 16S sequence, as the V3V4 amplicon is ~420 nts long (~460 nts if the primers were sequenced).
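To make that concrete, a back-of-the-envelope check (illustrative; mergePairs requires at least 12 nt of overlap by default):

```r
# Overlap after truncation = truncLenF + truncLenR - amplicon length.
amplicon  <- 420                    # approximate V3V4 length without primers
truncLenF <- 130; truncLenR <- 120
truncLenF + truncLenR - amplicon    # -170 nt: no overlap is possible
# mergePairs() needs >= 12 nt of overlap by default, so you would need
# truncLenF + truncLenR >= 420 + 12 = 432 for this amplicon.
```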

MDSHAHADATHOSSAIN commented 2 years ago

No, it isn't. I understand now; the data comes from 18S.

I have another question for you. I need to work with new fastq files, like below.

[1] "18S191096_A1_CCAACA_55154.fastq" "18S191096_A2_TGATGC_36059.fastq" [3] "18S191096_A3_TAGCTT_56529.fastq" "18S191096_A4_AGCGCT_225780.fastq" [5] "18S191096_A5_GGTACA_75314.fastq" "18S191096_A6_ATCCTC_77083.fastq" [7] "18S191096_A7_CCTCAT_41177.fastq" "18S191096_A8_AATACA_52776.fastq"

These fastq files are from merged data. Can you tell me which DADA2 Pipeline Tutorial I should follow for these kinds of fastq files?

Thank you.

benjjneb commented 2 years ago

The regular tutorial will work. Just skip the reverse-read parts and the merging step, and make the sequence table at the end with makeSequenceTable(dadaFs).
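In outline, a minimal sketch of that single-read workflow (directory and variable names here are hypothetical, following the tutorial's conventions):

```r
library(dada2)

# Already-merged reads, one fastq per sample; no reverse reads to handle.
fns   <- list.files("merged_reads", pattern=".fastq", full.names=TRUE)
filts <- file.path("filtered", basename(fns))

out <- filterAndTrim(fns, filts, maxN=0, maxEE=2, truncQ=2,
                     rm.phix=TRUE, compress=TRUE, multithread=TRUE)

errF   <- learnErrors(filts, multithread=TRUE)
dadaFs <- dada(filts, err=errF, multithread=TRUE)

seqtab <- makeSequenceTable(dadaFs)   # no mergePairs step for merged reads
```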

MDSHAHADATHOSSAIN commented 2 years ago

Thank you so much for your help.