benjjneb / dada2

Accurate sample inference from amplicon data with single nucleotide resolution
http://benjjneb.github.io/dada2/
GNU Lesser General Public License v3.0

removePrimers() not removing all primers in my reads and causing issues with filterAndTrim() #1714

Closed JohnBiggs153 closed 4 months ago

JohnBiggs153 commented 1 year ago

I'm trying to analyze a set of paired FASTQ files with dada2 in R. In the analysis I was doing before, I realized that the raw reads still had their primers attached, so I went through the ITS workflow to remove them. But I'm currently unable to use cutadapt. So I searched the dada2 functions, found removePrimers(), and decided to use it to remove my primers.

In my testing I'm running into an odd issue: removePrimers() doesn't remove all of the primers in my FASTQ files, and it creates an imbalance in read counts between the forward and reverse files, so I can't use filterAndTrim() after it.

Notes about data:

Code and outputs

Following the ITS workflow, I searched my raw reads to check whether the primers are present:

original <- rbind(FWD.ForwardReads = sapply(FWD.orients, primerHits, fn = amp_F[[1]]), 
                  FWD.ReverseReads = sapply(FWD.orients, primerHits, fn = amp_R[[1]]), 
                  REV.ForwardReads = sapply(REV.orients, primerHits, fn = amp_F[[1]]), 
                  REV.ReverseReads = sapply(REV.orients, primerHits, fn = amp_R[[1]]))
original

                 Forward Complement Reverse RevComp
FWD.ForwardReads    8840          8       8       9
FWD.ReverseReads      12         12      11      89
REV.ForwardReads       8          9       8     483
REV.ReverseReads    8936         11      11      12

amp_F is the variable where I store all the forward read files, and amp_R is where I store all the reverse read files.

So from here we can see the original counts of primers in my raw reads. Below I show two separate methods of removing the primers, to see which one works best. Neither method removes all of the primers from my data, and I don't know what might be causing this.
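For reference, FWD.orients, REV.orients, and primerHits() come from the ITS workflow. A minimal sketch of how they are defined there (assuming FWD and REV hold the primer sequences as character strings):

library(ShortRead)
library(Biostrings)

# All orientations of a primer sequence (as in the DADA2 ITS workflow):
allOrients <- function(primer) {
    dna <- DNAString(primer)
    orients <- c(Forward = dna, Complement = Biostrings::complement(dna),
                 Reverse = Biostrings::reverse(dna),
                 RevComp = Biostrings::reverseComplement(dna))
    return(sapply(orients, toString))
}
FWD.orients <- allOrients(FWD)
REV.orients <- allOrients(REV)

# Count the reads in a fastq file that contain at least one primer hit:
primerHits <- function(primer, fn) {
    nhits <- vcountPattern(primer, sread(readFastq(fn)), fixed = FALSE)
    return(sum(nhits > 0))
}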

Test 1: using removePrimers()

no_Primers_F <- file.path(fastq_path, "rem_Pri_F", basename(amp_F))
no_Primers_R <- file.path(fastq_path, "rem_Pri_R", basename(amp_R))
removePrimers(amp_F, no_Primers_F, primer.fwd = FWD)
removePrimers(amp_R, no_Primers_R, primer.fwd = REV)
rem_Pri_test <- rbind(FWD.ForwardReads = sapply(FWD.orients, primerHits, fn = no_Primers_F[[1]]), 
                      FWD.ReverseReads = sapply(FWD.orients, primerHits, fn = no_Primers_R[[1]]), 
                      REV.ForwardReads = sapply(REV.orients, primerHits, fn = no_Primers_F[[1]]), 
                      REV.ReverseReads = sapply(REV.orients, primerHits, fn = no_Primers_R[[1]]))
rem_Pri_test

                Forward Complement Reverse RevComp
FWD.ForwardReads       9          8       8       9
FWD.ReverseReads      11         11      11      81
REV.ForwardReads       8          9       8      76
REV.ReverseReads      11         11      11      11

Above are my code block and output, showing how I'm using the function and my results. I called removePrimers() this way, without the reverse primer in its parameters, because including it caused a huge deletion of my raw reads (file sizes dropped from ~1000-2000 kb to ~1-5 kb) after the function finished. As you can see, the function doesn't remove all of the primers from my raw reads, and after testing other parameter settings, like changing the size of the mismatch parameter, I still can't remove all of the primers from my data. See the tables and the sketch below.

First test, max.mismatch (MM) = 2:

|             | Forward | Complement | Reverse | RevComp |
|-------------|---------|------------|---------|---------|
| FWD forward | 9       | 8          | 8       | 9       |
| FWD reverse | 11      | 11         | 11      | 81      |
| REV forward | 8       | 9          | 8       | 76      |
| REV reverse | 11      | 11         | 11      | 11      |

Second test, MM = 3:

|             | Forward | Complement | Reverse | RevComp |
|-------------|---------|------------|---------|---------|
| FWD forward | 9       | 8          | 8       | 9       |
| FWD reverse | 11      | 11         | 11      | 83      |
| REV forward | 8       | 9          | 8       | 87      |
| REV reverse | 11      | 11         | 11      | 11      |

Third test, MM = 1:

|             | Forward | Complement | Reverse | RevComp |
|-------------|---------|------------|---------|---------|
| FWD forward | 9       | 8          | 8       | 9       |
| FWD reverse | 11      | 11         | 11      | 81      |
| REV forward | 8       | 9          | 8       | 73      |
| REV reverse | 11      | 11         | 11      | 11      |
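For context, a sketch of the calls behind these tests (FWD, REV, and the file paths as defined earlier; max.mismatch is removePrimers()'s mismatch parameter, which defaults to 2, and dada2:::rc() reverse-complements a primer, as used in the DADA2 PacBio workflow):

# Varying the allowed mismatches:
removePrimers(amp_F, no_Primers_F, primer.fwd = FWD, max.mismatch = 3)

# The variant that also requires the reverse primer at the end of the read,
# presumably the call that discarded most of my reads:
removePrimers(amp_F, no_Primers_F, primer.fwd = FWD,
              primer.rev = dada2:::rc(REV), max.mismatch = 2)

If the amplicon is longer than the ~300 nt reads, the forward read never reaches the reverse primer, which would explain why requiring it discards nearly everything.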

The other issue with removePrimers() is that I can't use filterAndTrim() afterwards. The error says the number of reads in the edited forward and reverse files doesn't match; the forward files have a smaller read count than the reverse files. Error:

tnf_summary <- filterAndTrim(no_Primers_F, filt_F,
                             no_Primers_R, filt_R, 

                             #trimming
                             truncLen = c(240,180),

                             #filtering
                             maxEE = c(2,2), 
                             truncQ = 2)
tnf_summary

Error in (function (fn, fout, maxN = c(0, 0), truncQ = c(2, 2), truncLen = c(0,  : 
  Mismatched forward and reverse sequence files: 9051, 9193.
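For context: removePrimers() discards reads in which no primer is detected, and it does so independently when run on the forward and reverse files, so the paired outputs can end up with unequal read counts, which filterAndTrim() rejects. Below is a quick check, plus a possible workaround using filterAndTrim()'s matchIDs argument (a sketch, untested here; whether it helps depends on the read ID format):

library(ShortRead)

# Confirm the paired files have diverged in read counts:
length(readFastq(no_Primers_F[[1]]))  # forward reads remaining
length(readFastq(no_Primers_R[[1]]))  # reverse reads remaining

# matchIDs = TRUE re-pairs reads by their fastq ID lines instead of
# assuming the files are index-matched:
tnf_summary <- filterAndTrim(no_Primers_F, filt_F, no_Primers_R, filt_R,
                             truncLen = c(240, 180), maxEE = c(2, 2),
                             truncQ = 2, matchIDs = TRUE)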

Test 2: using filterAndTrim() to remove primers

I saw on this issue page that someone else was having trouble removing all of the primers from their data, and the solution for them, if I remember correctly, was to use filterAndTrim() with trimLeft to cut off the first 19 bases of the forward reads and 20 bases of the reverse reads. So I tried that with my data, since removePrimers() is giving me issues.

F_primers <- file.path(fastq_path, "f_and_t_F", basename(amp_F))
R_primers <- file.path(fastq_path, "f_and_t_R", basename(amp_R))
summary_test <- filterAndTrim(amp_F, F_primers, amp_R, R_primers, maxN=0, trimLeft = c(19,20))
store <- rbind(FWD.ForwardReads = sapply(FWD.orients, primerHits, fn = F_primers[[1]]),
               FWD.ReverseReads = sapply(FWD.orients, primerHits, fn = R_primers[[1]]), 
               REV.ForwardReads = sapply(REV.orients, primerHits, fn = F_primers[[1]]), 
               REV.ReverseReads = sapply(REV.orients, primerHits, fn = R_primers[[1]]))
store
summary_test

                Forward Complement Reverse RevComp
FWD.ForwardReads       1          0       0       0
FWD.ReverseReads       0          0       0      76
REV.ForwardReads       0          0       0     473
REV.ReverseReads       0          0       0       0

But we can see that this doesn't remove all of the primers from my data either; it even leaves me with more reads containing primer hits than removePrimers() did. So I'm currently stuck at a wall, where the methods I can use to remove my primers are limited to these two options, at least the two options I know of that work solely in R.

From my research I know that one of the assumptions of dada2 is that all non-biological nucleotides have been removed, so I was wondering: will having primers in ~3% of the reads (the fraction from test 2) break the dada2 algorithms that turn this raw data into something I can put into phyloseq for analysis?

JohnBiggs153 commented 1 year ago

EDIT 1:

Following the steps of the dada2 tutorial, it looks like the ~3% of reads with primers might be causing issues in my processing of the data.

            raw_reads filtered ASVs_F ASVs_R joined no_chimera final_perc_reads_retained
Amato170-C16      9591     7348   6862   7004   4265       2969                      31.0
Amato169-C12     11753     9016   8346   8683   5037       3713                      31.6
Amato168-G6      13588    10709   9853  10231   6228       3845                      28.3
Amato167-G2      11647     9258   8516   8785   5091       3333                      28.6
Amato166-L3      17799    14500  13582  14099   8278       5786                      32.5
Amato165-H2      15473    12174  11421  11843   7078       4350                      28.1
Amato192-G10     18482    14711  13740  14143   8641       5048                      27.3
Amato164-D1      17969    14315  13515  13938   8625       6146                      34.2
Amato191-G5       9896     8020   7313   7649   3670       3017                      30.5
Amato190-G1       9233     7403   6721   6973   3493       2662                      28.8
Amato189-L2       7076     5595   5191   5327   3018       2076                      29.3
Amato188-H1      11108     9265   8660   8939   4928       3650                      32.9
Amato186-C22      8906     7048   6437   6679   3532       2617                      29.4
Amato185-C14     11852     9706   8869   9301   4856       3322                      28.0
Amato184-G8      20392    16273  14961  15642   7915       5449                      26.7
Amato183-G4      10996     8420   7624   7952   3857       2916                      26.5
Amato182-L5       9173     7569   7005   7311   4530       3535                      38.5
Amato181-L1       9861     8090   7517   7858   4434       2723                      27.6
Amato163-C25     14043    11240  10284  10736   5937       3925                      27.9
Amato180-D3      12587    10247   9547   9899   5594       4127                      32.8
Amato179-C24     15949    13066  12196  12614   7710       5134                      32.2
Amato178-C21     12372     9833   8979   9349   5186       3617                      29.2
Amato177-C13     11875     6659   6006   6338   3127       2246                      18.9
Amato176-G7      18838    13911  12721  13343   7003       4422                      23.5
Amato175-G3      16275    13391  12166  12942   6252       4858                      29.8
Amato174-L4      16232    13186  12373  12820   6691       4802                      29.6
Amato173-H3      10321     8556   8123   8274   4985       3114                      30.2
Amato172-D2      12226     9676   8971   9255   4816       3397                      27.8
Amato171-C23     12480    10098   9218   9613   4813       3661                      29.3
Amato162-C15      9644     6558   5961   6218   3352       2519                      26.1
Amato161-C11      9977     6829   6169   6432   3132       2282                      22.9

This tracks my read counts as I process the raw reads. To me, the counts look fine until the merging and chimera-removal steps. When I processed this data initially, my tracking output looked very similar, which is what led me to discover that the primers were still attached to my reads. Other than the primers, I really don't know what could be causing the read counts to drop after merging and after chimera removal.

Here is the code for my filterAndTrim() along with my merging and chimera removal.

# Actual filtering and trimming part
tnf_summary <- filterAndTrim(F_primers, filt_F,
                             R_primers, filt_R,
                             # trimming
                             truncLen = c(220, 170),
                             # filtering
                             maxEE = c(2, 2),
                             truncQ = 2)
tnf_summary

# (Error learning, dereplication, and denoising, assumed from the standard
# DADA2 workflow, produce err_F/err_R, derep_F/derep_R, and dada_F/dada_R
# between these steps; they were not shown in the original post.)

# Merging sequences
merged_amp <- mergePairs(dadaF = dada_F,
                         derepF = derep_F,
                         dadaR = dada_R,
                         derepR = derep_R)

# Chimera removal; seqtab comes from makeSequenceTable(merged_amp)
seqtab_nochim <- removeBimeraDenovo(seqtab, method = "consensus", verbose = TRUE)
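The tracking tables in this thread follow the pattern from the DADA2 tutorial. A minimal sketch of how such a table can be assembled from the objects above, for a multi-sample run (column names chosen to match the output shown):

# Count the reads surviving each step (per the DADA2 tutorial):
getN <- function(x) sum(getUniques(x))
track <- cbind(tnf_summary,
               sapply(dada_F, getN), sapply(dada_R, getN),
               sapply(merged_amp, getN), rowSums(seqtab_nochim))
colnames(track) <- c("raw_reads", "filtered", "ASVs_F", "ASVs_R",
                     "joined", "no_chimera")
track <- cbind(track, final_perc_reads_retained =
                 round(100 * track[, "no_chimera"] / track[, "raw_reads"], 1))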

[Quality profile plots: forward and reverse reads before trimming, and forward and reverse reads after trimming]

benjjneb commented 1 year ago

The recommended way to remove primers of fixed length that appear at the start of your reads is to use trimLeft. removePrimers is not recommended for Illumina data at all; it was developed for long-read amplicon data.

To me, the outcome of your filterAndTrim(..., trimLeft=c(19,20)) looks fine. There are a small number of "read-through" primers being detected at the ends of reads from short amplicons. However, given the length of your sequenced amplicon (>300nts), these must represent unintended amplification events, and thus we're not worried about them being potentially lost along the way. Furthermore, they won't interfere with denoising in the rest of the data.

In terms of your read-tracking, there are two concerning patterns: the large fraction of reads lost at merging, and the large fraction lost as chimeras. What is the expected distribution of sequenced lengths of this amplicon? You'll need to make sure the truncLen parameter set during the filterAndTrim stage retains enough sequence for the reads to overlap, i.e. the forward + reverse truncation lengths should be at least ~20nts more than the sequenced amplicon length.

The chimera loss is usually related to incomplete removal of primers, but that doesn't seem to be the case here. Is there any possibility that unusual library construction strategies, such as the introduction of variable-length pads or "heterogeneity spacers", were used to construct these libraries?

JohnBiggs153 commented 1 year ago

How would I find out the sequenced length of my amplicons? When I was trying to figure out the truncLen values, the math I did was:

515F/926R: 926 - 515 = 411 bp amplicon length; primer lengths = 39 bp total, so 411 - 39 = 372 bp of sequence between the primers.

So for truncLen I chose cutoff values that would sum to more than 372 or 392. The example I showed above totals 390 (220 + 170), which I think is an acceptable range. I gathered my data and metadata from the NCBI SRA Run Selector and couldn't find the expected distribution of sequenced amplicon lengths there.

From the metadata and the research paper tied to it, it doesn't look like any unusual library construction strategies were used, so I don't know what else would cause this much chimera removal (library layout: paired; library selection: PCR; assay type: Amplicon; AvgSpotLen: 602). The BioProject number is PRJNA574440 and the paper is "The effects of captivity on the primate gut microbiome varies with host dietary niche".

JohnBiggs153 commented 1 year ago

I did some research on finding the average sequence length of my amplicons, and it looks like the AvgSpotLen value is the expected sequence length. So my average sequence length is ~600, which I guess is why my quality profiles go up to ~300 bases per read before trimming. Would I still need my truncLen values to sum to more than 372 or 392?

benjjneb commented 1 year ago

The relevant length is the sequenced amplicon length. If your primers are on your reads, they need to be included.

So your sequenced amplicon is more like 415 nts, and the sum of your truncation lengths should therefore be 435+.
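As an arithmetic check (a sketch; the 415 nt figure is from above, and 12 is mergePairs()'s default minOverlap):

# Overlap remaining after truncation = truncLen_F + truncLen_R - amplicon length.
# For example, with truncLen = c(230, 210):
amplicon_len <- 415
overlap <- 230 + 210 - amplicon_len  # 25 nts
overlap >= 12  # TRUE: clears the default minOverlap of 12, but leaves little
               # headroom if the amplicon length varies across taxa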

JohnBiggs153 commented 1 year ago

I ran my workflow again with truncLen values summing to more than 435.

For my first run I used truncLen = c(230, 210), since I wanted to keep my reverse reads from including too many bases with bad quality scores. Other than the truncLen parameter, I kept maxEE at c(2, 2). My read counts still looked bad after merging and chimera removal. I then repeated the run with maxEE = c(2, 5), since the quality scores of my reverse reads start to dip after ~180 bases. That showed similar results: a huge loss of reads after merging and chimera removal.

             raw_reads filtered ASVs_F ASVs_R joined no_chimera final_perc_reads_retained
Amato170-C16      9591     7276   6661   6374   3786       2938                      30.6
Amato169-C12     11753     8886   8146   7876   4374       3439                      29.3
Amato168-G6      13588    10589   9686   9279   5502       3914                      28.8
Amato167-G2      11647     9133   8285   7971   4706       3295                      28.3
Amato166-L3      17799    14477  13508  13675   7436       5754                      32.3
Amato165-H2      15473    12143  11314  11134   6247       4080                      26.4
Amato192-G10     18482    14577  13447  12657   7710       4785                      25.9
Amato164-D1      17969    14383  13490  13235   7747       5840                      32.5
Amato191-G5       9896     8021   7212   6830   3412       2902                      29.3
Amato190-G1       9233     7360   6593   6014   3043       2539                      27.5
Amato189-L2       7076     5588   5091   4893   2764       2059                      29.1
Amato188-H1      11108     9221   8511   8369   4466       3565                      32.1
Amato186-C22      8906     6994   6304   5931   3293       2622                      29.4
Amato185-C14     11852     9624   8727   8122   4400       3247                      27.4
Amato184-G8      20392    16166  14713  14599   7066       5435                      26.7
Amato183-G4      10996     8404   7473   6999   3506       2811                      25.6
Amato182-L5       9173     7532   6934   6716   4103       3427                      37.4
Amato181-L1       9861     8052   7397   7346   3565       2723                      27.6
Amato163-C25     14043    11078  10013   9766   5215       3646                      26.0
Amato180-D3      12587    10268   9484   9212   5123       4177                      33.2
Amato179-C24     15949    13005  11972  11687   6761       4700                      29.5
Amato178-C21     12372     9767   8808   8464   4764       3376                      27.3
Amato177-C13     11875     6565   5870   5678   2839       2179                      18.3
Amato176-G7      18838    13827  12416  11921   6121       4158                      22.1
Amato175-G3      16275    13353  11959  11606   5494       4488                      27.6
Amato174-L4      16232    13196  12270  12111   5860       4629                      28.5
Amato173-H3      10321     8527   7998   8061   4691       3030                      29.4
Amato172-D2      12226     9641   8837   8559   4399       3188                      26.1
Amato171-C23     12480    10066   9030   8833   4497       3526                      28.3
Amato162-C15      9644     6501   5853   5537   2984       2272                      23.6
Amato161-C11      9977     6765   5926   5561   2724       2124                      21.3

After getting these results I looked into the researchers' amplification methods and saw that they used a two-step PCR protocol in which Fluidigm CS1 and CS2 linker sequences were added along with the primers. They then used Fluidigm Access Array primers, which contained sample-specific barcodes and Illumina sequencing adapters, to finish out the two-step protocol.

The researchers said that the CS1 (5'-ACACTGACGACATGGTTCTACA) and CS2 (5'-TACGGTAGCAGAGACTTGGTCT) linkers were added to the forward and reverse primer sequences. After finding this out, I checked whether the linker sequences were in my data, using the primer-detection code block from the ITS tutorial with the linker sequences and their orientations swapped in for the primers. It looks like they are present, but only in very small numbers.

First, here is the code for detecting the linker sequences, followed by the results:

original.cs <- rbind(CS1.ForwardReads = sapply(CS1.orients, primerHits, fn = amp_F[[1]]), 
                  CS1.ReverseReads = sapply(CS1.orients, primerHits, fn = amp_R[[1]]), 
                  CS2.ForwardReads = sapply(CS2.orients, primerHits, fn = amp_F[[1]]), 
                  CS2.ReverseReads = sapply(CS2.orients, primerHits, fn = amp_R[[1]]))
original.cs

                 Forward Complement Reverse RevComp
CS1.ForwardReads       7          8       7       8
CS1.ReverseReads      11         12      11     522
CS2.ForwardReads       7          8       8     510
CS2.ReverseReads      11         11      11      11
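For completeness, the orientation objects used above can be built with the same allOrients() helper from the ITS workflow (a sketch; the linker strings are those quoted from the paper):

CS1 <- "ACACTGACGACATGGTTCTACA"
CS2 <- "TACGGTAGCAGAGACTTGGTCT"
CS1.orients <- allOrients(CS1)
CS2.orients <- allOrients(CS2)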

I also tried removing the first 41 and 42 bases from my reads to see if that would remove the linker sequences, but when I searched the trimmed data the same counts appeared again. After getting these results I am at a loss for how to improve my read counts after processing the data.
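A sketch of that trimming attempt (the output paths F_noLink/R_noLink are hypothetical; 41/42 = 22 nt linker + 19/20 nt primer):

F_noLink <- file.path(fastq_path, "no_link_F", basename(amp_F))
R_noLink <- file.path(fastq_path, "no_link_R", basename(amp_R))
filterAndTrim(amp_F, F_noLink, amp_R, R_noLink,
              maxN = 0, trimLeft = c(41, 42))

Note that the remaining hits are in the RevComp orientation, i.e. read-through at the 3' end of the reads, so a 5' trimLeft would not be expected to change those counts.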

benjjneb commented 1 year ago

There are two problems here, and maybe we can solve them one at a time.

First problem: a low fraction (<90%) of denoised reads are merging. Does the merging percentage increase if you look at even larger truncation lengths?

JohnBiggs153 commented 1 year ago

I am away from my workstation, so I will be able to give a better response on Monday. But I think increasing the truncLen values might've actually decreased the number of reads passing. Looking at the read-tracking tables above, the first one I posted used truncLen values smaller than (230, 210), and after merging, the smaller values gave me a higher merge count.

As I said, I'm away from my workstation so I'm unable to see the notes I wrote down on the runs I did, but I do believe that the smaller truncLen values gave me a bigger return from the merging step.

JohnBiggs153 commented 1 year ago

I checked some saved outputs that I created while trying to solve this issue, and it looks like increasing the truncLen values decreases the reads carried over after merging. The truncLen parameters I've tested are (220, 170) and (230, 210). For the (230, 210) test, I increased maxEE to (2, 5), since at a length of 210 the reverse-read quality scores are really low, below 20. I thought increasing maxEE would help, since extending the reverse truncation length incorporates a lot of errors into those reads.

Below are my read counts for both of these tests, showing that increasing the truncLen values does appear to decrease the number of reads kept after merging.

(220, 170)

             raw_reads filtered ASVs_F ASVs_R joined no_chimera final_perc_reads_retained
Amato170-C16      9591     7355   6867   7006   4280       2956                      30.8
Amato169-C12     11753     9028   8369   8691   5040       3677                      31.3
Amato168-G6      13588    10727   9902  10250   6244       3940                      29.0
Amato167-G2      11647     9285   8542   8781   5068       3337                      28.7
Amato166-L3      17800    14515  13602  14083   8198       5739                      32.2
Amato165-H2      15473    12176  11415  11844   7035       4347                      28.1
Amato192-G10     18482    14738  13779  14216   8687       5051                      27.3
Amato164-D1      17971    14320  13536  13883   8545       6015                      33.5
Amato191-G5       9897     8034   7333   7658   3700       3008                      30.4
Amato190-G1       9233     7408   6723   6963   3461       2645                      28.6
Amato189-L2       7076     5609   5196   5341   3012       2068                      29.2
Amato188-H1      11108     9273   8681   8931   4930       3629                      32.7
Amato186-C22      8906     7049   6438   6680   3547       2603                      29.2
Amato185-C14     11853     9721   8873   9316   4753       3238                      27.3
Amato184-G8      20392    16287  14973  15713   7910       5455                      26.8
Amato183-G4      10996     8408   7623   7931   3827       2887                      26.3
Amato182-L5       9173     7574   7016   7309   4529       3541                      38.6
Amato181-L1       9861     8094   7514   7867   4471       2720                      27.6
Amato163-C25     14043    11253  10293  10742   5987       3902                      27.8
Amato180-D3      12587    10246   9551   9877   5710       4168                      33.1
Amato179-C24     15950    13082  12205  12616   7697       5123                      32.1
Amato178-C21     12372     9847   9004   9358   5185       3616                      29.2
Amato177-C13     11875     6666   5996   6372   3137       2218                      18.7
Amato176-G7      18838    13925  12728  13364   7010       4419                      23.5
Amato175-G3      16275    13416  12166  12921   6205       4843                      29.8
Amato174-L4      16232    13209  12401  12827   6622       4790                      29.5
Amato173-H3      10322     8568   8148   8338   5002       3207                      31.1
Amato172-D2      12226     9679   8989   9244   4834       3402                      27.8
Amato171-C23     12480    10115   9213   9649   4812       3686                      29.5
Amato162-C15      9644     6568   5964   6241   3327       2446                      25.4
Amato161-C11      9977     6838   6183   6418   3117       2281                      22.9

(230, 210), maxEE(2,5)

             raw_reads filtered ASVs_F ASVs_R joined no_chimera final_perc_reads_retained
Amato170-C16      9591     7323   6797   6411   3812       2926                      30.5
Amato169-C12     11753     8952   8225   7937   4399       3499                      29.8
Amato168-G6      13588    10635   9738   9319   5576       3971                      29.2
Amato167-G2      11647     9195   8378   8028   4715       3332                      28.6
Amato166-L3      17800    14537  13584  13622   7326       5699                      32.0
Amato165-H2      15473    12202  11378  11148   6301       4110                      26.6
Amato192-G10     18482    14653  13603  12681   7816       4817                      26.1
Amato164-D1      17971    14419  13514  13256   7743       5899                      32.8
Amato191-G5       9897     8064   7235   6799   3395       2905                      29.4
Amato190-G1       9233     7409   6653   6033   3055       2545                      27.6
Amato189-L2       7076     5621   5143   4922   2780       2082                      29.4
Amato188-H1      11108     9253   8600   8381   4496       3561                      32.1
Amato186-C22      8906     7032   6378   5892   3236       2543                      28.6
Amato185-C14     11853     9663   8786   8177   4343       3257                      27.5
Amato184-G8      20392    16248  14832  14653   7107       5409                      26.5
Amato183-G4      10996     8438   7518   7027   3522       2811                      25.6
Amato182-L5       9173     7570   6979   6785   4095       3394                      37.0
Amato181-L1       9861     8085   7489   7320   3689       2717                      27.6
Amato163-C25     14043    11176  10159   9848   5313       3662                      26.1
Amato180-D3      12587    10311   9559   9251   5111       4157                      33.0
Amato179-C24     15950    13088  12055  11767   6761       4714                      29.6
Amato178-C21     12372     9827   8889   8514   4803       3401                      27.5
Amato177-C13     11875     6604   5942   5688   2853       2212                      18.6
Amato176-G7      18838    13893  12515  11932   6083       4212                      22.4
Amato175-G3      16275    13407  12060  11649   5516       4507                      27.7
Amato174-L4      16232    13256  12361  12178   5909       4628                      28.5
Amato173-H3      10322     8558   8065   8115   4767       3075                      29.8
Amato172-D2      12226     9691   8963   8549   4423       3117                      25.5
Amato171-C23     12480    10110   9098   8815   4588       3536                      28.3
Amato162-C15      9644     6533   5899   5591   3009       2303                      23.9
Amato161-C11      9977     6798   6000   5535   2730       2128                      21.3

For chimera removal, the loss of reads looks pretty consistent between the two tests, with some slight differences (in some samples the chimera loss increases or decreases between the two tests).

JohnBiggs153 commented 1 year ago

I ran another test after thinking about my truncLen for filterAndTrim(), and I chose c(250, 200). This actually gave me my biggest increase in reads carried through merging and chimera removal, but it still seems off to me, since the loss of reads still looks large.

Here is my output

             raw_reads filtered ASVs_F ASVs_R joined no_chimera final_perc_reads_retained
Amato170-C16      9591     7883   7090   7179   4190       3109                      32.4
Amato169-C12     11753     9641   8578   8822   5029       3965                      33.7
Amato168-G6      13588    11515  10343  10567   6131       4248                      31.3
Amato167-G2      11647     9936   8676   8924   5246       3637                      31.2
Amato166-L3      17799    15501  14251  14835   8204       6119                      34.4
Amato165-H2      15473    13012  11789  12331   6819       4588                      29.7
Amato192-G10     18482    15891  14305  14684   8771       5603                      30.3
Amato164-D1      17969    15370  14146  14539   8282       6288                      35.0
Amato191-G5       9896     8603   7460   7695   3743       3260                      32.9
Amato190-G1       9233     7945   6828   6869   3403       2776                      30.1
Amato189-L2       7076     5990   5255   5426   3047       2320                      32.8
Amato188-H1      11108     9906   8945   9409   5007       3914                      35.2
Amato186-C22      8906     7561   6652   6779   3680       2928                      32.9
Amato185-C14     11852    10302   9044   9135   4678       3569                      30.1
Amato184-G8      20392    17386  15379  16119   8056       5939                      29.1
Amato183-G4      10996     9092   7801   7995   3933       3147                      28.6
Amato182-L5       9173     8075   7261   7427   4574       3774                      41.1
Amato181-L1       9861     8632   7753   8121   4280       2979                      30.2
Amato163-C25     14043    12023  10517  10947   5855       4142                      29.5
Amato180-D3      12587    10966   9936  10235   5693       4524                      35.9
Amato179-C24     15949    13960  12573  12926   7333       5178                      32.5
Amato178-C21     12372    10571   9222   9419   5306       3832                      31.0
Amato177-C13     11875     7111   6106   6376   3156       2465                      20.8
Amato176-G7      18838    14958  12987  13572   6923       4690                      24.9
Amato175-G3      16275    14342  12463  13231   6302       5116                      31.4
Amato174-L4      16232    14139  12860  13424   6450       5117                      31.5
Amato173-H3      10321     9047   8238   8677   4933       3364                      32.6
Amato172-D2      12226    10322   9279   9477   4861       3616                      29.6
Amato171-C23     12480    10861   9349   9830   4918       3915                      31.4
Amato162-C15      9644     7027   6118   6307   3405       2587                      26.8
Amato161-C11      9977     7300   6167   6355   3101       2508                      25.1

In this run I also increased maxEE to c(5, 5), since the quality profile scores look really bad after the initial trim of my raw data to remove the primers still in my sequences. I can't tell whether increasing the truncation lengths or the maxEE change caused the increase in kept reads. I think increasing maxEE is reasonable, since I'm including a lot of low-quality bases in my merges, but I don't know if that alone would increase the number of reads carried into merging.
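For reference, maxEE filters on the expected number of errors computed from the quality scores, EE = sum(10^(-Q/10)). A quick illustration of why long low-quality tails push reads over the threshold:

# A 250 nt read with a uniform quality score of Q = 20 (1% error per base):
q <- rep(20, 250)
sum(10^(-q / 10))  # 2.5 expected errors: fails maxEE = 2, passes maxEE = 5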

Here are the quality profile plots of my FWD and REV reads after the primer trimming, to show why I chose to increase the maxEE values. [Quality profile plots: FWD with primer trim, REV with primer trim]

benjjneb commented 1 year ago

I would recommend trying just the forward reads. They are clearly better quality, and the large drop-off in merging and chimera removal is concerning. This will take care of the merging issue, and will help clarify if the chimera drop-off is baked into the data, or some artefact of poor merging.

JohnBiggs153 commented 1 year ago

I ran my data again using truncLen = c(250, 180) and maxEE = c(5, 2) in filterAndTrim(), and increasing the truncation value for the forward reads did increase the reads carried over after merging, though not by as much as in my previous post. So it does look like I needed to increase my truncLen values to give more overlap when merging, but I'm still not keeping many reads through the merge.

Could the data itself be causing this issue, the same way it might be causing the issue with my chimera removal?

Reads output

             raw_reads filtered ASVs_F ASVs_R joined no_chimera final_perc_reads_retained
Amato170-C16      9591     7259   6541   6871   4055       3006                      31.3
Amato169-C12     11753     8938   7954   8551   4863       3650                      31.1
Amato168-G6      13588    10637   9539  10162   6046       3838                      28.2
Amato167-G2      11647     9198   8035   8646   4978       3316                      28.5
Amato166-L3      17799    14313  13141  13870   7962       5628                      31.6
Amato165-H2      15473    12057  10930  11762   6757       4361                      28.2
Amato192-G10     18482    14612  13125  13858   8422       5098                      27.6
Amato164-D1      17969    14095  12937  13623   7944       5782                      32.2
Amato191-G5       9896     7933   6870   7448   3575       3091                      31.2
Amato190-G1       9233     7313   6304   6781   3315       2564                      27.8
Amato189-L2       7076     5528   4849   5228   2923       2102                      29.7
Amato188-H1      11108     9128   8245   8809   4822       3637                      32.7
Amato186-C22      8906     6949   6078   6555   3483       2654                      29.8
Amato185-C14     11852     9604   8410   9079   4618       3330                      28.1
Amato184-G8      20392    16110  14195  15390   7705       5422                      26.6
Amato183-G4      10996     8265   7077   7790   3616       2818                      25.6
Amato182-L5       9173     7510   6757   7251   4429       3493                      38.1
Amato181-L1       9861     7939   7078   7691   4322       2705                      27.4
Amato163-C25     14043    11206   9796  10621   5774       3950                      28.1
Amato180-D3      12587    10080   9096   9633   5339       4136                      32.9
Amato179-C24     15949    12996  11686  12530   7554       5132                      32.2
Amato178-C21     12372     9745   8504   9155   4941       3577                      28.9
Amato177-C13     11875     6596   5697   6214   3059       2248                      18.9
Amato176-G7      18838    13624  11833  12969   6646       4337                      23.0
Amato175-G3      16275    13275  11478  12714   6026       4828                      29.7
Amato174-L4      16232    13034  11856  12575   6308       4772                      29.4
Amato173-H3      10321     8560   7781   8338   4886       3205                      31.1
Amato172-D2      12226     9593   8587   9125   4675       3437                      28.1
Amato171-C23     12480     9987   8600   9491   4675       3698                      29.6
Amato162-C15      9644     6505   5671   6127   3321       2557                      26.5
Amato161-C11      9977     6772   5754   6278   3022       2267                      22.7

benjjneb commented 1 year ago

Could you try running just the forward reads? That means no reverse reads, and no merging.

JohnBiggs153 commented 1 year ago

I ran filterAndTrim() on the forward and reverse reads separately. More reads survived the filter step for the forward reads than for the reverse reads. For these tests I kept the truncLen and maxEE values the same as before, i.e. truncLen = c(250, 180) and maxEE = c(5, 2). I also ran the forward reads with a maxEE value of 2 to see how the EE value affects them.
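A sketch of those single-file runs: filterAndTrim()'s rev and filt.rev arguments default to NULL, so omitting them filters one set of files with no pairing enforced (input and output paths as defined earlier):

out_F <- filterAndTrim(amp_F, filt_F, truncLen = 250, maxEE = 5, truncQ = 2)
out_R <- filterAndTrim(amp_R, filt_R, truncLen = 180, maxEE = 2, truncQ = 2)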

Forward read results EE = 5

                       reads.in reads.out
SRR10190028_1.fastq.gz     9591      8385
SRR10190029_1.fastq.gz    11753     10113
SRR10190030_1.fastq.gz    13588     12175
SRR10190031_1.fastq.gz    11647     10467
SRR10190032_1.fastq.gz    17799     16439
SRR10190033_1.fastq.gz    15473     13807
SRR10190034_1.fastq.gz    18482     16845
SRR10190035_1.fastq.gz    17969     16375
SRR10190036_1.fastq.gz     9896      9070
SRR10190037_1.fastq.gz     9233      8395
SRR10190038_1.fastq.gz     7076      6399
SRR10190039_1.fastq.gz    11108     10371
SRR10190040_1.fastq.gz     8906      8072
SRR10190041_1.fastq.gz    11852     10804
SRR10190042_1.fastq.gz    20392     18404
SRR10190043_1.fastq.gz    10996      9673
SRR10190044_1.fastq.gz     9173      8500
SRR10190045_1.fastq.gz     9861      9125
SRR10190046_1.fastq.gz    14043     12701
SRR10190047_1.fastq.gz    12587     11582
SRR10190048_1.fastq.gz    15949     14662
SRR10190049_1.fastq.gz    12372     11226
SRR10190050_1.fastq.gz    11875      7477
SRR10190051_1.fastq.gz    18838     15973
SRR10190052_1.fastq.gz    16275     15107
SRR10190053_1.fastq.gz    16232     14921
SRR10190054_1.fastq.gz    10321      9459
SRR10190055_1.fastq.gz    12226     10852
SRR10190056_1.fastq.gz    12480     11469
SRR10190057_1.fastq.gz     9644      7465
SRR10190058_1.fastq.gz     9977      7739

Forward read results EE = 2

                       reads.in reads.out
SRR10190028_1.fastq.gz     9591      6796
SRR10190029_1.fastq.gz    11753      8213
SRR10190030_1.fastq.gz    13588      9694
SRR10190031_1.fastq.gz    11647      8444
SRR10190032_1.fastq.gz    17799     13662
SRR10190033_1.fastq.gz    15473     11511
SRR10190034_1.fastq.gz    18482     13410
SRR10190035_1.fastq.gz    17969     13726
SRR10190036_1.fastq.gz     9896      7540
SRR10190037_1.fastq.gz     9233      6914
SRR10190038_1.fastq.gz     7076      5358
SRR10190039_1.fastq.gz    11108      8685
SRR10190040_1.fastq.gz     8906      6602
SRR10190041_1.fastq.gz    11852      8988
SRR10190042_1.fastq.gz    20392     15244
SRR10190043_1.fastq.gz    10996      7957
SRR10190044_1.fastq.gz     9173      7105
SRR10190045_1.fastq.gz     9861      7659
SRR10190046_1.fastq.gz    14043     10259
SRR10190047_1.fastq.gz    12587      9792
SRR10190048_1.fastq.gz    15949     12144
SRR10190049_1.fastq.gz    12372      9041
SRR10190050_1.fastq.gz    11875      6059
SRR10190051_1.fastq.gz    18838     13086
SRR10190052_1.fastq.gz    16275     12589
SRR10190053_1.fastq.gz    16232     12470
SRR10190054_1.fastq.gz    10321      8032
SRR10190055_1.fastq.gz    12226      9128
SRR10190056_1.fastq.gz    12480      9379
SRR10190057_1.fastq.gz     9644      6049
SRR10190058_1.fastq.gz     9977      6273

Reverse reads results

                       reads.in reads.out
SRR10190028_2.fastq.gz     9591      7374
SRR10190029_2.fastq.gz    11753      9105
SRR10190030_2.fastq.gz    13588     10836
SRR10190031_2.fastq.gz    11647      9360
SRR10190032_2.fastq.gz    17799     14538
SRR10190033_2.fastq.gz    15473     12324
SRR10190034_2.fastq.gz    18482     14845
SRR10190035_2.fastq.gz    17969     14279
SRR10190036_2.fastq.gz     9896      8079
SRR10190037_2.fastq.gz     9233      7433
SRR10190038_2.fastq.gz     7076      5633
SRR10190039_2.fastq.gz    11108      9328
SRR10190040_2.fastq.gz     8906      7058
SRR10190041_2.fastq.gz    11852      9760
SRR10190042_2.fastq.gz    20392     16361
SRR10190043_2.fastq.gz    10996      8386
SRR10190044_2.fastq.gz     9173      7657
SRR10190045_2.fastq.gz     9861      8101
SRR10190046_2.fastq.gz    14043     11412
SRR10190047_2.fastq.gz    12587     10234
SRR10190048_2.fastq.gz    15949     13224
SRR10190049_2.fastq.gz    12372      9932
SRR10190050_2.fastq.gz    11875      6738
SRR10190051_2.fastq.gz    18838     13858
SRR10190052_2.fastq.gz    16275     13443
SRR10190053_2.fastq.gz    16232     13259
SRR10190054_2.fastq.gz    10321      8674
SRR10190055_2.fastq.gz    12226      9768
SRR10190056_2.fastq.gz    12480     10151
SRR10190057_2.fastq.gz     9644      6624
SRR10190058_2.fastq.gz     9977      6908

I think the higher maxEE value for the forward reads is needed, since with the truncLen value I chose, the forward reads incorporate a lot of low quality scores (in the 20s) into the filtering process. But do these results show that the reverse reads might be the cause of our loss of counts when we merge?

benjjneb commented 1 year ago

> Could you try running just the forward reads? That means no reverse reads, and no merging.

To clarify, what I mean here is that you should run just the forward reads through the entire DADA2 workflow, from filtering all the way through chimera removal.

For the reasons described above, this could really help clarify what is going on with your data.
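A minimal sketch of that forward-only workflow, under the same parameter assumptions as the runs above (object names illustrative):

# Forward reads only: filter, learn errors, denoise, tabulate, remove chimeras.
out_F <- filterAndTrim(amp_F, filt_F, truncLen = 250, maxEE = 5, truncQ = 2)
err_F <- learnErrors(filt_F, multithread = TRUE)
dada_F <- dada(filt_F, err = err_F, multithread = TRUE)
seqtab <- makeSequenceTable(dada_F)
seqtab_nochim <- removeBimeraDenovo(seqtab, method = "consensus", verbose = TRUE)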

JohnBiggs153 commented 1 year ago

I ran through the FWD reads, and here are the read-count outputs.

Output for just FWD reads

            raw_reads filtered ASVs_F no_chimera final_perc_reads_retained
Amato170-C16      9591     8385   7526       6523                      68.0
Amato169-C12     11753    10113   8987       8077                      68.7
Amato168-G6      13588    12175  10874       9117                      67.1
Amato167-G2      11647    10467   9159       7550                      64.8
Amato166-L3      17799    16439  15052      12743                      71.6
Amato165-H2      15473    13807  12469      10555                      68.2
Amato192-G10     18482    16845  15118      12048                      65.2
Amato164-D1      17969    16375  15016      12740                      70.9
Amato191-G5       9896     9070   7875       7304                      73.8
Amato190-G1       9233     8395   7232       6449                      69.8
Amato189-L2       7076     6399   5588       5080                      71.8
Amato188-H1      11108    10371   9358       8580                      77.2
Amato186-C22      8906     8072   7034       6542                      73.5
Amato185-C14     11852    10804   9456       8473                      71.5
Amato184-G8      20392    18404  16187      13990                      68.6
Amato183-G4      10996     9673   8277       7397                      67.3
Amato182-L5       9173     8500   7617       6949                      75.8
Amato181-L1       9861     9125   8175       6729                      68.2
Amato163-C25     14043    12701  11068       9691                      69.0
Amato180-D3      12587    11582  10476       9111                      72.4
Amato179-C24     15949    14662  13171      11523                      72.2
Amato178-C21     12372    11226   9747       8468                      68.4
Amato177-C13     11875     7477   6426       5732                      48.3
Amato176-G7      18838    15973  13844      11509                      61.1
Amato175-G3      16275    15107  13103      11922                      73.3
Amato174-L4      16232    14921  13524      11939                      73.6
Amato173-H3      10321     9459   8593       7281                      70.5
Amato172-D2      12226    10852   9759       8565                      70.1
Amato171-C23     12480    11469   9850       9094                      72.9
Amato162-C15      9644     7465   6506       5919                      61.4
Amato161-C11      9977     7739   6549       6026                      60.4

Overall, the loss of reads in this run doesn't look that big. The biggest loss is in the filtering and trimming step, which lines up with what is said in the DADA2 16S workflow. I will run the same kind of test with the REV reads and post those results when it's complete.

JohnBiggs153 commented 1 year ago

The above test used truncLen = 250, maxEE = 5 (for the FWD-reads run). I also ran the REV reads by themselves so we can see what they look like as well, with truncLen = 180, maxEE = 2.

REV reads run

             raw_reads filtered ASVs_R no_chimera final_perc_reads_retained
Amato170-C16      9591     7374   6922       6024                      62.8
Amato169-C12     11753     9105   8603       7405                      63.0
Amato168-G6      13588    10836  10299       7804                      57.4
Amato167-G2      11647     9360   8773       6620                      56.8
Amato166-L3      17799    14538  14039      10978                      61.7
Amato165-H2      15473    12324  11895       9750                      63.0
Amato192-G10     18482    14845  14087      10465                      56.6
Amato164-D1      17969    14279  13810      11073                      61.6
Amato191-G5       9896     8079   7515       6834                      69.1
Amato190-G1       9233     7433   6872       6140                      66.5
Amato189-L2       7076     5633   5293       4662                      65.9
Amato188-H1      11108     9328   8927       7389                      66.5
Amato186-C22      8906     7058   6654       5744                      64.5
Amato185-C14     11852     9760   9199       7591                      64.0
Amato184-G8      20392    16361  15656      12633                      62.0
Amato183-G4      10996     8386   7861       6796                      61.8
Amato182-L5       9173     7657   7313       6034                      65.8
Amato181-L1       9861     8101   7774       5658                      57.4
Amato163-C25     14043    11412  10806       9074                      64.6
Amato180-D3      12587    10234   9771       8021                      63.7
Amato179-C24     15949    13224  12698      10637                      66.7
Amato178-C21     12372     9932   9353       8205                      66.3
Amato177-C13     11875     6738   6292       5427                      45.7
Amato176-G7      18838    13858  13133      10161                      53.9
Amato175-G3      16275    13443  12883      11054                      67.9
Amato174-L4      16232    13259  12728      10458                      64.4
Amato173-H3      10321     8674   8412       7477                      72.4
Amato172-D2      12226     9768   9252       7532                      61.6
Amato171-C23     12480    10151   9635       8225                      65.9
Amato162-C15      9644     6624   6189       5269                      54.6
Amato161-C11      9977     6908   6384       5493                      55.1

Comparing the two, the FWD reads do retain more reads throughout this process, which lines up with what the DADA2 workflow says about Illumina sequencing. The filtering step is the one that removes the most reads.

benjjneb commented 1 year ago

Based on what you've shown here, I would recommend just going forward with the forward reads (no pun intended).

JohnBiggs153 commented 1 year ago

So 60-70% retention of reads from the raw data through preprocessing is an acceptable loss? I'm still new to microbiome analysis and want to make sure that this is acceptable.

If I continue with just the forward reads, would that change my results for taxonomic assignment and downstream analysis of this data? Would it just be considered single-direction sequencing instead of paired-end sequencing?

Thank you for all of your help benjjneb!

benjjneb commented 1 year ago

If your loss is mostly at the filtering stage, and not at any of the specific steps outlined in the dada2 tutorial, then yes, 60-70% retention from raw reads to final output is normal.

> If I continue with just the forward reads, would that change my results for taxonomic assignment and downstream analysis of this data? Would it just be considered single-direction sequencing instead of paired-end sequencing?

Yes, it would be considered single-direction sequencing. And given the issues described above, it would change your analysis of the data... for the better.

JohnBiggs153 commented 1 year ago

Thank you for all of your help! I'm still learning microbiome analysis and this definitely helped me out. This solved the issue I was running into.