Closed JohnBiggs153 closed 4 months ago
EDIT 1:
Following the steps of dada2 tutorial, it looks like that possibly that 3% of primers might be causing issues on my processing of the data.
raw_reads filtered ASVs_F ASVs_R joined no_chimera final_perc_reads_retained
Amato170-C16 9591 7348 6862 7004 4265 2969 31.0
Amato169-C12 11753 9016 8346 8683 5037 3713 31.6
Amato168-G6 13588 10709 9853 10231 6228 3845 28.3
Amato167-G2 11647 9258 8516 8785 5091 3333 28.6
Amato166-L3 17799 14500 13582 14099 8278 5786 32.5
Amato165-H2 15473 12174 11421 11843 7078 4350 28.1
Amato192-G10 18482 14711 13740 14143 8641 5048 27.3
Amato164-D1 17969 14315 13515 13938 8625 6146 34.2
Amato191-G5 9896 8020 7313 7649 3670 3017 30.5
Amato190-G1 9233 7403 6721 6973 3493 2662 28.8
Amato189-L2 7076 5595 5191 5327 3018 2076 29.3
Amato188-H1 11108 9265 8660 8939 4928 3650 32.9
Amato186-C22 8906 7048 6437 6679 3532 2617 29.4
Amato185-C14 11852 9706 8869 9301 4856 3322 28.0
Amato184-G8 20392 16273 14961 15642 7915 5449 26.7
Amato183-G4 10996 8420 7624 7952 3857 2916 26.5
Amato182-L5 9173 7569 7005 7311 4530 3535 38.5
Amato181-L1 9861 8090 7517 7858 4434 2723 27.6
Amato163-C25 14043 11240 10284 10736 5937 3925 27.9
Amato180-D3 12587 10247 9547 9899 5594 4127 32.8
Amato179-C24 15949 13066 12196 12614 7710 5134 32.2
Amato178-C21 12372 9833 8979 9349 5186 3617 29.2
Amato177-C13 11875 6659 6006 6338 3127 2246 18.9
Amato176-G7 18838 13911 12721 13343 7003 4422 23.5
Amato175-G3 16275 13391 12166 12942 6252 4858 29.8
Amato174-L4 16232 13186 12373 12820 6691 4802 29.6
Amato173-H3 10321 8556 8123 8274 4985 3114 30.2
Amato172-D2 12226 9676 8971 9255 4816 3397 27.8
Amato171-C23 12480 10098 9218 9613 4813 3661 29.3
Amato162-C15 9644 6558 5961 6218 3352 2519 26.1
Amato161-C11 9977 6829 6169 6432 3132 2282 22.9
This is my tracking of the read counts as I go through and process my raw reads. In my opinion it looks like that the read counts look great until we join and remove chimeras from the data. When I was processing this data initially my tracking output looked very similar to this, which lead me to find that I still had my primers attached to my reads. I really don't know what might be causing this issue of my read count dropping after merging and after removing chimeras other than the primers might be causing this issue.
Here is the code for my trimAndFilter along with my merging and chimera removal.
# Actual filtering and trimming part
tnf_summary <- filterAndTrim(F_primers, filt_F,
R_primers, filt_R,
#trimming
truncLen = c(220,170),
#filtering
maxEE = c(2,2),
truncQ = 2)
tnf_summary
# Merging sequences
merged_amp <- mergePairs(dadaF = dada_F,
derepF = derep_F,
dadaR = dada_R,
derepR = derep_R)
seqtab_nochim <- removeBimeraDenovo(seqtab, method = "consensus", verbose=TRUE)
[Forward before trimming]
[Reverse before trimming]
[Forward reads after trim]
[Reverse reads after trim]
The recommended way to remove primers of fixed length, that appear at the start of your reads, is to use trimLeft
. removePrimers
is not recommended for Illumina data at all, it is developed for long-read amplicon data.
To me, the outcome of your filterAndTrim(..., trimLeft=c(19,20))
looks fine. There are a small number of "read through" primers being detected at the ends of read from short amplicons. However, given the length of your sequenced amplicon (>300nts), these must represent unintended amplification events, and thus we're not worried about them being potentially lost along the way. Furthermore, they won't interfere with denonising int he rest of the data.
In terms of your read-tracking, there are two concerning patterns, the large fraction of reads lost at merging, and the large fraction lost as chimeras. What is the expected distribution of sequenced lengths of this amplicon? You'll need to make sure your truncLen
parameter set during the filterAndTrim
stage retains enough sequence for the reads to overlap, i.e. forward + reverse truncation legnths should be at least ~20nts more than the sequenced amlpicon length.
THe chimera loss is usually related to incomplete removal of primers, but that doesn't seem to be the case here. Is there any possibility that unusual library construction strategies, such as the introducing of variable length pads or "heterogeneity spacers" were used to construct these libraries?
How would I find out the sequence length for my amplicons? When I was trying to figure out the truncLen values the math I did was,
515f/926r -> 926-515 = 411 sequence length primers length = 39 bp -> 411 - 39 = 372 sequence length
So for my truncLen values I chose cut off values that would be greater than 372 or 392. The example I showed above had cut off values of 390 when totaled together (220+170) but I think thats an acceptable range. For my data and metadata I gathered it from NCBI SRA run selector and couldn't find the expected distribution of sequenced lengths of the amplicon.
From the metadata and research paper tied to the metadata it doesn't look like it has any unusual library construction strategies so I don't know what else would be causing this big of an chimera removal (Library layout is Paired, Library selection is PCR, Assay type is Amplicon, AvgSpotLen 602). Bioproj number is [PRJNA574440] Paper is: "The effects of captivity on the primate gut microbiome varies with host dietary niche"
I was doing some research on finding out my average sequence lengths for my amplicons and it looks like the AvgSpotLen values is the expected sequence length of my amplicons. So my average sequence length is 600 so I guess that's why my quality profiles go up to ~300 bases before trimming. So would I still need to get my truncLen
values to be greater than 372 or 392?
The relevant length is the sequenced amplicon length. If your primers are on your reads, they need to be included.
So your sequenced amplicon is more like 415nts, and the sum of your truncation lengths therefore should be 435+
I ran my run again with increasing my truncLen
values to go over 435+.
My first run I used truncLen( c(230,210))
since I wanted to keep my reverse reads from including too many reads that had a bad quality scores. Other than editing my truncLen
parameter I kept my maxEE
scores 2,2
. My read counts still looked bad after I would merge and remove my chimeras. I then tried to repeat this run with changing my maxEE
parameters to be 2,5
since my reverse reads after ~180 start to dip in their quality scores. Which showed similar results of having a huge loss of reads occur after merging and removal of chimeras.
raw_reads filtered ASVs_F ASVs_R joined no_chimera final_perc_reads_retained
Amato170-C16 9591 7276 6661 6374 3786 2938 30.6
Amato169-C12 11753 8886 8146 7876 4374 3439 29.3
Amato168-G6 13588 10589 9686 9279 5502 3914 28.8
Amato167-G2 11647 9133 8285 7971 4706 3295 28.3
Amato166-L3 17799 14477 13508 13675 7436 5754 32.3
Amato165-H2 15473 12143 11314 11134 6247 4080 26.4
Amato192-G10 18482 14577 13447 12657 7710 4785 25.9
Amato164-D1 17969 14383 13490 13235 7747 5840 32.5
Amato191-G5 9896 8021 7212 6830 3412 2902 29.3
Amato190-G1 9233 7360 6593 6014 3043 2539 27.5
Amato189-L2 7076 5588 5091 4893 2764 2059 29.1
Amato188-H1 11108 9221 8511 8369 4466 3565 32.1
Amato186-C22 8906 6994 6304 5931 3293 2622 29.4
Amato185-C14 11852 9624 8727 8122 4400 3247 27.4
Amato184-G8 20392 16166 14713 14599 7066 5435 26.7
Amato183-G4 10996 8404 7473 6999 3506 2811 25.6
Amato182-L5 9173 7532 6934 6716 4103 3427 37.4
Amato181-L1 9861 8052 7397 7346 3565 2723 27.6
Amato163-C25 14043 11078 10013 9766 5215 3646 26.0
Amato180-D3 12587 10268 9484 9212 5123 4177 33.2
Amato179-C24 15949 13005 11972 11687 6761 4700 29.5
Amato178-C21 12372 9767 8808 8464 4764 3376 27.3
Amato177-C13 11875 6565 5870 5678 2839 2179 18.3
Amato176-G7 18838 13827 12416 11921 6121 4158 22.1
Amato175-G3 16275 13353 11959 11606 5494 4488 27.6
Amato174-L4 16232 13196 12270 12111 5860 4629 28.5
Amato173-H3 10321 8527 7998 8061 4691 3030 29.4
Amato172-D2 12226 9641 8837 8559 4399 3188 26.1
Amato171-C23 12480 10066 9030 8833 4497 3526 28.3
Amato162-C15 9644 6501 5853 5537 2984 2272 23.6
Amato161-C11 9977 6765 5926 5561 2724 2124 21.3
After getting these results I looked into the researchers methods of amplification and saw that they used a 2 step PCR protocol where they introduced Fludigm CS1 and CS2 linker sequences added with the primers. Then used a Fluidigm AccessArray primers which contained sample-specific barcodes and Illumina sequence adapters as well to finish out their 2 step protocol.
The researchers said that the CS1 (5'- ACACTGACGACATGGTTCTACA)
and CS2 (5'- TACGGTAGCAGAGACTTGGTCT)
linkers were added to the forward and reverse primer sequences. After find out this information I tried to see if the linker sequences was in my data and it looks like they are but in a small amount. I used the block of code thats in the ITS tutorial to see if primers are in your data but swapping the primers and orienting them to see if the linker sequences were in there and it looks like they are but to a very small amount.
First here is the code for finding if the linker sequences were in there and then the results
original.cs <- rbind(CS1.ForwardReads = sapply(CS1.orients, primerHits, fn = amp_F[[1]]),
CS1.ReverseReads = sapply(CS1.orients, primerHits, fn = amp_R[[1]]),
CS2.ForwardReads = sapply(CS2V.orients, primerHits, fn = amp_F[[1]]),
CS2.ReverseReads = sapply(CS2V.orients, primerHits, fn = amp_R[[1]]))
original.cs
Forward Complement Reverse RevComp
CS1.ForwardReads 7 8 7 8
CS1.ReverseReads 11 12 11 522
CS2.ForwardReads 7 8 8 510
CS2.ReverseReads 11 11 11 11
I also tried to see if removing the first 41 and 42 bases from my reads would change the outcome and remove the linker sequences but when I checked again to see if that removed them the same counts appeared again after I searched my data after trimming. After getting these results I am at a loss to figuring out how to improve my read counts after processing the data.
There's two problems here, and maybe we can solve one at a time.
First problem, low (<90%) of denoised reads are merging. Does the merging percentage increase if you look at even larger truncation lengths?
I am away from my work station so I will be able to give a better response on Monday. But I think increasing the truncLen values might’ve actually decreased the number of reads passed. Looking at the read tracking values above, the first one I posted the truncLen values were smaller than (230, 210) and after the merging went through, the smaller values increased my merge count.
As I said I’m away from my work station so I’m unable to see my notes that I wrote done on the runs I did but I do believe that the smaller values for truncLen gave me a bigger return on the merging function.
I checked some saved outputs that I created while trying to solve this issue and it looks like the increasing the values for truncLen
decreases the reads carried over after merging. The parameters I've tested for truncLen
were (220, 170)
and (230, 210)
. For the truncLen(230, 210)
test, I did increase the values for maxEE
to be (2, 5)
since at the 210 length for the reverse reads the quality score is really low, bellow a score of 20. So I thought it would help to increase the maxEE
value since I'm incorporating a lot of errors into my reads by extending the reverse read trunc length.
Below I have my read counts for both of these test to show that it does look like increasing the values for truncLen
does decrease the values of reads kept after merging.
(220, 170)
raw_reads filtered ASVs_F ASVs_R joined no_chimera final_perc_reads_retained
Amato170-C16 9591 7355 6867 7006 4280 2956 30.8
Amato169-C12 11753 9028 8369 8691 5040 3677 31.3
Amato168-G6 13588 10727 9902 10250 6244 3940 29.0
Amato167-G2 11647 9285 8542 8781 5068 3337 28.7
Amato166-L3 17800 14515 13602 14083 8198 5739 32.2
Amato165-H2 15473 12176 11415 11844 7035 4347 28.1
Amato192-G10 18482 14738 13779 14216 8687 5051 27.3
Amato164-D1 17971 14320 13536 13883 8545 6015 33.5
Amato191-G5 9897 8034 7333 7658 3700 3008 30.4
Amato190-G1 9233 7408 6723 6963 3461 2645 28.6
Amato189-L2 7076 5609 5196 5341 3012 2068 29.2
Amato188-H1 11108 9273 8681 8931 4930 3629 32.7
Amato186-C22 8906 7049 6438 6680 3547 2603 29.2
Amato185-C14 11853 9721 8873 9316 4753 3238 27.3
Amato184-G8 20392 16287 14973 15713 7910 5455 26.8
Amato183-G4 10996 8408 7623 7931 3827 2887 26.3
Amato182-L5 9173 7574 7016 7309 4529 3541 38.6
Amato181-L1 9861 8094 7514 7867 4471 2720 27.6
Amato163-C25 14043 11253 10293 10742 5987 3902 27.8
Amato180-D3 12587 10246 9551 9877 5710 4168 33.1
Amato179-C24 15950 13082 12205 12616 7697 5123 32.1
Amato178-C21 12372 9847 9004 9358 5185 3616 29.2
Amato177-C13 11875 6666 5996 6372 3137 2218 18.7
Amato176-G7 18838 13925 12728 13364 7010 4419 23.5
Amato175-G3 16275 13416 12166 12921 6205 4843 29.8
Amato174-L4 16232 13209 12401 12827 6622 4790 29.5
Amato173-H3 10322 8568 8148 8338 5002 3207 31.1
Amato172-D2 12226 9679 8989 9244 4834 3402 27.8
Amato171-C23 12480 10115 9213 9649 4812 3686 29.5
Amato162-C15 9644 6568 5964 6241 3327 2446 25.4
Amato161-C11 9977 6838 6183 6418 3117 2281 22.9
(230, 210), maxEE(2,5)
raw_reads filtered ASVs_F ASVs_R joined no_chimera final_perc_reads_retained
Amato170-C16 9591 7323 6797 6411 3812 2926 30.5
Amato169-C12 11753 8952 8225 7937 4399 3499 29.8
Amato168-G6 13588 10635 9738 9319 5576 3971 29.2
Amato167-G2 11647 9195 8378 8028 4715 3332 28.6
Amato166-L3 17800 14537 13584 13622 7326 5699 32.0
Amato165-H2 15473 12202 11378 11148 6301 4110 26.6
Amato192-G10 18482 14653 13603 12681 7816 4817 26.1
Amato164-D1 17971 14419 13514 13256 7743 5899 32.8
Amato191-G5 9897 8064 7235 6799 3395 2905 29.4
Amato190-G1 9233 7409 6653 6033 3055 2545 27.6
Amato189-L2 7076 5621 5143 4922 2780 2082 29.4
Amato188-H1 11108 9253 8600 8381 4496 3561 32.1
Amato186-C22 8906 7032 6378 5892 3236 2543 28.6
Amato185-C14 11853 9663 8786 8177 4343 3257 27.5
Amato184-G8 20392 16248 14832 14653 7107 5409 26.5
Amato183-G4 10996 8438 7518 7027 3522 2811 25.6
Amato182-L5 9173 7570 6979 6785 4095 3394 37.0
Amato181-L1 9861 8085 7489 7320 3689 2717 27.6
Amato163-C25 14043 11176 10159 9848 5313 3662 26.1
Amato180-D3 12587 10311 9559 9251 5111 4157 33.0
Amato179-C24 15950 13088 12055 11767 6761 4714 29.6
Amato178-C21 12372 9827 8889 8514 4803 3401 27.5
Amato177-C13 11875 6604 5942 5688 2853 2212 18.6
Amato176-G7 18838 13893 12515 11932 6083 4212 22.4
Amato175-G3 16275 13407 12060 11649 5516 4507 27.7
Amato174-L4 16232 13256 12361 12178 5909 4628 28.5
Amato173-H3 10322 8558 8065 8115 4767 3075 29.8
Amato172-D2 12226 9691 8963 8549 4423 3117 25.5
Amato171-C23 12480 10110 9098 8815 4588 3536 28.3
Amato162-C15 9644 6533 5899 5591 3009 2303 23.9
Amato161-C11 9977 6798 6000 5535 2730 2128 21.3
For the removal of chimeras it does look like the loss of reads is pretty consistent between the 2 tests that I did with some slight differences between both tests (looks like some samples the loss of chimeras increases or decreases between the 2 tests).
I re-ran another test after thinking about my trunLen
for the filterAndTrim()
and I chose to use c(250, 200)
this actually gave me my biggest increase in reads carried over to merging and chimera removal but it still seems off to me since the loss of reads still looks big.
Here is my output
raw_reads filtered ASVs_F ASVs_R joined no_chimera final_perc_reads_retained
Amato170-C16 9591 7883 7090 7179 4190 3109 32.4
Amato169-C12 11753 9641 8578 8822 5029 3965 33.7
Amato168-G6 13588 11515 10343 10567 6131 4248 31.3
Amato167-G2 11647 9936 8676 8924 5246 3637 31.2
Amato166-L3 17799 15501 14251 14835 8204 6119 34.4
Amato165-H2 15473 13012 11789 12331 6819 4588 29.7
Amato192-G10 18482 15891 14305 14684 8771 5603 30.3
Amato164-D1 17969 15370 14146 14539 8282 6288 35.0
Amato191-G5 9896 8603 7460 7695 3743 3260 32.9
Amato190-G1 9233 7945 6828 6869 3403 2776 30.1
Amato189-L2 7076 5990 5255 5426 3047 2320 32.8
Amato188-H1 11108 9906 8945 9409 5007 3914 35.2
Amato186-C22 8906 7561 6652 6779 3680 2928 32.9
Amato185-C14 11852 10302 9044 9135 4678 3569 30.1
Amato184-G8 20392 17386 15379 16119 8056 5939 29.1
Amato183-G4 10996 9092 7801 7995 3933 3147 28.6
Amato182-L5 9173 8075 7261 7427 4574 3774 41.1
Amato181-L1 9861 8632 7753 8121 4280 2979 30.2
Amato163-C25 14043 12023 10517 10947 5855 4142 29.5
Amato180-D3 12587 10966 9936 10235 5693 4524 35.9
Amato179-C24 15949 13960 12573 12926 7333 5178 32.5
Amato178-C21 12372 10571 9222 9419 5306 3832 31.0
Amato177-C13 11875 7111 6106 6376 3156 2465 20.8
Amato176-G7 18838 14958 12987 13572 6923 4690 24.9
Amato175-G3 16275 14342 12463 13231 6302 5116 31.4
Amato174-L4 16232 14139 12860 13424 6450 5117 31.5
Amato173-H3 10321 9047 8238 8677 4933 3364 32.6
Amato172-D2 12226 10322 9279 9477 4861 3616 29.6
Amato171-C23 12480 10861 9349 9830 4918 3915 31.4
Amato162-C15 9644 7027 6118 6307 3405 2587 26.8
Amato161-C11 9977 7300 6167 6355 3101 2508 25.1
In this run I also increased the maxEE
values to maxEE = c(5,5)
since the raw data quality profile scores look really bad after doing an initial trim of my raw data to try to remove the primers that are still in my sequences. I can't tell if increasing the size of the trunc score caused this increase of kept reads or if the maxEE
change was the cause of it. I think increasing the maxEE
value is good since I'm introducing a lot of bad quality scores into my merges but I don't know if that would increase the amount of reads carried over into my merging.
Here are the quality profile plots of my FWD and REV reads after the primer trimming to show why I chose to increase the maxEE scores.
I would recommend trying just the forward reads. They are clearly better quality, and the large drop-off in merging and chimera removal is concerning. This will take care of the merging issue, and will help clarify if the chimera drop-off is baked into the data, or some artefact of poor merging.
I ran my data again with using truncLen = c(250,180)
and maxEE = c(5,2)
for my filterAndTrim()
and it does look like increasing the trunc value for the forward reads did increase the reads carried over after merging. But not as much as the reads carried over in my pervious post. So it does look like I did need to increase the value of my truncLen parameter to give me more overhang when merging but it still doesn't give me a lot of reads carried over after merging.
Could the data itself be causing this issue like it might be causing an issue with my chimera removal?
Reads output
raw_reads filtered ASVs_F ASVs_R joined no_chimera final_perc_reads_retained
Amato170-C16 9591 7259 6541 6871 4055 3006 31.3
Amato169-C12 11753 8938 7954 8551 4863 3650 31.1
Amato168-G6 13588 10637 9539 10162 6046 3838 28.2
Amato167-G2 11647 9198 8035 8646 4978 3316 28.5
Amato166-L3 17799 14313 13141 13870 7962 5628 31.6
Amato165-H2 15473 12057 10930 11762 6757 4361 28.2
Amato192-G10 18482 14612 13125 13858 8422 5098 27.6
Amato164-D1 17969 14095 12937 13623 7944 5782 32.2
Amato191-G5 9896 7933 6870 7448 3575 3091 31.2
Amato190-G1 9233 7313 6304 6781 3315 2564 27.8
Amato189-L2 7076 5528 4849 5228 2923 2102 29.7
Amato188-H1 11108 9128 8245 8809 4822 3637 32.7
Amato186-C22 8906 6949 6078 6555 3483 2654 29.8
Amato185-C14 11852 9604 8410 9079 4618 3330 28.1
Amato184-G8 20392 16110 14195 15390 7705 5422 26.6
Amato183-G4 10996 8265 7077 7790 3616 2818 25.6
Amato182-L5 9173 7510 6757 7251 4429 3493 38.1
Amato181-L1 9861 7939 7078 7691 4322 2705 27.4
Amato163-C25 14043 11206 9796 10621 5774 3950 28.1
Amato180-D3 12587 10080 9096 9633 5339 4136 32.9
Amato179-C24 15949 12996 11686 12530 7554 5132 32.2
Amato178-C21 12372 9745 8504 9155 4941 3577 28.9
Amato177-C13 11875 6596 5697 6214 3059 2248 18.9
Amato176-G7 18838 13624 11833 12969 6646 4337 23.0
Amato175-G3 16275 13275 11478 12714 6026 4828 29.7
Amato174-L4 16232 13034 11856 12575 6308 4772 29.4
Amato173-H3 10321 8560 7781 8338 4886 3205 31.1
Amato172-D2 12226 9593 8587 9125 4675 3437 28.1
Amato171-C23 12480 9987 8600 9491 4675 3698 29.6
Amato162-C15 9644 6505 5671 6127 3321 2557 26.5
Amato161-C11 9977 6772 5754 6278 3022 2267 22.7
Could you try running just the forward reads? That means no reverse reads, and no merging.
I ran filterAndTrim
for just the forward and reverse reads by themselves. I got more reads saved from the filter step when I ran the forward reads alone compared to the reverse reads. For these tests I kept the maxEE value for them the same and the truncLens the same, so its truncLen = c(250,180) maxxEE = c(5,2)
. I also ran the forward reads with a maxEE value of 2 to see how the EE value would affect the forward reads.
Forward read results EE = 5
reads.in reads.out
SRR10190028_1.fastq.gz 9591 8385
SRR10190029_1.fastq.gz 11753 10113
SRR10190030_1.fastq.gz 13588 12175
SRR10190031_1.fastq.gz 11647 10467
SRR10190032_1.fastq.gz 17799 16439
SRR10190033_1.fastq.gz 15473 13807
SRR10190034_1.fastq.gz 18482 16845
SRR10190035_1.fastq.gz 17969 16375
SRR10190036_1.fastq.gz 9896 9070
SRR10190037_1.fastq.gz 9233 8395
SRR10190038_1.fastq.gz 7076 6399
SRR10190039_1.fastq.gz 11108 10371
SRR10190040_1.fastq.gz 8906 8072
SRR10190041_1.fastq.gz 11852 10804
SRR10190042_1.fastq.gz 20392 18404
SRR10190043_1.fastq.gz 10996 9673
SRR10190044_1.fastq.gz 9173 8500
SRR10190045_1.fastq.gz 9861 9125
SRR10190046_1.fastq.gz 14043 12701
SRR10190047_1.fastq.gz 12587 11582
SRR10190048_1.fastq.gz 15949 14662
SRR10190049_1.fastq.gz 12372 11226
SRR10190050_1.fastq.gz 11875 7477
SRR10190051_1.fastq.gz 18838 15973
SRR10190052_1.fastq.gz 16275 15107
SRR10190053_1.fastq.gz 16232 14921
SRR10190054_1.fastq.gz 10321 9459
SRR10190055_1.fastq.gz 12226 10852
SRR10190056_1.fastq.gz 12480 11469
SRR10190057_1.fastq.gz 9644 7465
SRR10190058_1.fastq.gz 9977 7739
Forward read results EE = 2
reads.in reads.out
SRR10190028_1.fastq.gz 9591 6796
SRR10190029_1.fastq.gz 11753 8213
SRR10190030_1.fastq.gz 13588 9694
SRR10190031_1.fastq.gz 11647 8444
SRR10190032_1.fastq.gz 17799 13662
SRR10190033_1.fastq.gz 15473 11511
SRR10190034_1.fastq.gz 18482 13410
SRR10190035_1.fastq.gz 17969 13726
SRR10190036_1.fastq.gz 9896 7540
SRR10190037_1.fastq.gz 9233 6914
SRR10190038_1.fastq.gz 7076 5358
SRR10190039_1.fastq.gz 11108 8685
SRR10190040_1.fastq.gz 8906 6602
SRR10190041_1.fastq.gz 11852 8988
SRR10190042_1.fastq.gz 20392 15244
SRR10190043_1.fastq.gz 10996 7957
SRR10190044_1.fastq.gz 9173 7105
SRR10190045_1.fastq.gz 9861 7659
SRR10190046_1.fastq.gz 14043 10259
SRR10190047_1.fastq.gz 12587 9792
SRR10190048_1.fastq.gz 15949 12144
SRR10190049_1.fastq.gz 12372 9041
SRR10190050_1.fastq.gz 11875 6059
SRR10190051_1.fastq.gz 18838 13086
SRR10190052_1.fastq.gz 16275 12589
SRR10190053_1.fastq.gz 16232 12470
SRR10190054_1.fastq.gz 10321 8032
SRR10190055_1.fastq.gz 12226 9128
SRR10190056_1.fastq.gz 12480 9379
SRR10190057_1.fastq.gz 9644 6049
SRR10190058_1.fastq.gz 9977 6273
Reverse reads results
reads.in reads.out
SRR10190028_2.fastq.gz 9591 7374
SRR10190029_2.fastq.gz 11753 9105
SRR10190030_2.fastq.gz 13588 10836
SRR10190031_2.fastq.gz 11647 9360
SRR10190032_2.fastq.gz 17799 14538
SRR10190033_2.fastq.gz 15473 12324
SRR10190034_2.fastq.gz 18482 14845
SRR10190035_2.fastq.gz 17969 14279
SRR10190036_2.fastq.gz 9896 8079
SRR10190037_2.fastq.gz 9233 7433
SRR10190038_2.fastq.gz 7076 5633
SRR10190039_2.fastq.gz 11108 9328
SRR10190040_2.fastq.gz 8906 7058
SRR10190041_2.fastq.gz 11852 9760
SRR10190042_2.fastq.gz 20392 16361
SRR10190043_2.fastq.gz 10996 8386
SRR10190044_2.fastq.gz 9173 7657
SRR10190045_2.fastq.gz 9861 8101
SRR10190046_2.fastq.gz 14043 11412
SRR10190047_2.fastq.gz 12587 10234
SRR10190048_2.fastq.gz 15949 13224
SRR10190049_2.fastq.gz 12372 9932
SRR10190050_2.fastq.gz 11875 6738
SRR10190051_2.fastq.gz 18838 13858
SRR10190052_2.fastq.gz 16275 13443
SRR10190053_2.fastq.gz 16232 13259
SRR10190054_2.fastq.gz 10321 8674
SRR10190055_2.fastq.gz 12226 9768
SRR10190056_2.fastq.gz 12480 10151
SRR10190057_2.fastq.gz 9644 6624
SRR10190058_2.fastq.gz 9977 6908
I think the maxEE value for the forward reads is needed since with the truncLen value I chose the forward reads incorporates a lot of low quality scores (in the 20s) into my filter process. But with these results seen here does it show that the reverse reads might be the issue for our loss in counts when we merge?
Could you try running just the forward reads? That means no reverse reads, and no merging.
To clarify, what I mean here is that you should run just the forward reads through the entire DADA2 workflow. From filtering all the way through to chimera removal.
For the reasons described above, this could really help clarify what is going on with your data.
I ran through the FWD reads and here is the outputs of the read counts.
Output for just FWD reads
raw_reads filtered ASVs_F no_chimera final_perc_reads_retained
Amato170-C16 9591 8385 7526 6523 68.0
Amato169-C12 11753 10113 8987 8077 68.7
Amato168-G6 13588 12175 10874 9117 67.1
Amato167-G2 11647 10467 9159 7550 64.8
Amato166-L3 17799 16439 15052 12743 71.6
Amato165-H2 15473 13807 12469 10555 68.2
Amato192-G10 18482 16845 15118 12048 65.2
Amato164-D1 17969 16375 15016 12740 70.9
Amato191-G5 9896 9070 7875 7304 73.8
Amato190-G1 9233 8395 7232 6449 69.8
Amato189-L2 7076 6399 5588 5080 71.8
Amato188-H1 11108 10371 9358 8580 77.2
Amato186-C22 8906 8072 7034 6542 73.5
Amato185-C14 11852 10804 9456 8473 71.5
Amato184-G8 20392 18404 16187 13990 68.6
Amato183-G4 10996 9673 8277 7397 67.3
Amato182-L5 9173 8500 7617 6949 75.8
Amato181-L1 9861 9125 8175 6729 68.2
Amato163-C25 14043 12701 11068 9691 69.0
Amato180-D3 12587 11582 10476 9111 72.4
Amato179-C24 15949 14662 13171 11523 72.2
Amato178-C21 12372 11226 9747 8468 68.4
Amato177-C13 11875 7477 6426 5732 48.3
Amato176-G7 18838 15973 13844 11509 61.1
Amato175-G3 16275 15107 13103 11922 73.3
Amato174-L4 16232 14921 13524 11939 73.6
Amato173-H3 10321 9459 8593 7281 70.5
Amato172-D2 12226 10852 9759 8565 70.1
Amato171-C23 12480 11469 9850 9094 72.9
Amato162-C15 9644 7465 6506 5919 61.4
Amato161-C11 9977 7739 6549 6026 60.4
Overall from this run it doesn't look like the loss of reads is that big. Looks like the biggest loss of reads is just in the filter and trimming step, which lines up with what is said in the DADA2 16S workflow. I will run through this type of test with the REV reads and post it those results when its complete.
The above test the truncLen = 250
, maxEE = 5
(for the FWD reads run). I ran the REV reads just by themselves so we can see what they look like as well. truncLen = 180
, maxEE = 2
.
REV reads run
raw_reads filtered ASVs_R no_chimera final_perc_reads_retained
Amato170-C16 9591 7374 6922 6024 62.8
Amato169-C12 11753 9105 8603 7405 63.0
Amato168-G6 13588 10836 10299 7804 57.4
Amato167-G2 11647 9360 8773 6620 56.8
Amato166-L3 17799 14538 14039 10978 61.7
Amato165-H2 15473 12324 11895 9750 63.0
Amato192-G10 18482 14845 14087 10465 56.6
Amato164-D1 17969 14279 13810 11073 61.6
Amato191-G5 9896 8079 7515 6834 69.1
Amato190-G1 9233 7433 6872 6140 66.5
Amato189-L2 7076 5633 5293 4662 65.9
Amato188-H1 11108 9328 8927 7389 66.5
Amato186-C22 8906 7058 6654 5744 64.5
Amato185-C14 11852 9760 9199 7591 64.0
Amato184-G8 20392 16361 15656 12633 62.0
Amato183-G4 10996 8386 7861 6796 61.8
Amato182-L5 9173 7657 7313 6034 65.8
Amato181-L1 9861 8101 7774 5658 57.4
Amato163-C25 14043 11412 10806 9074 64.6
Amato180-D3 12587 10234 9771 8021 63.7
Amato179-C24 15949 13224 12698 10637 66.7
Amato178-C21 12372 9932 9353 8205 66.3
Amato177-C13 11875 6738 6292 5427 45.7
Amato176-G7 18838 13858 13133 10161 53.9
Amato175-G3 16275 13443 12883 11054 67.9
Amato174-L4 16232 13259 12728 10458 64.4
Amato173-H3 10321 8674 8412 7477 72.4
Amato172-D2 12226 9768 9252 7532 61.6
Amato171-C23 12480 10151 9635 8225 65.9
Amato162-C15 9644 6624 6189 5269 54.6
Amato161-C11 9977 6908 6384 5493 55.1
Comparing the 2, it does look like the FWD reads do save more reads throughout this process, which lines up with what was said about Illumina sequencing in the DADA2 workflow. It looks like the filter step is the step that removes the most amount of reads.
Based on what you've shown here, I would recommend just going forward with the forward reads (no pun intended).
So the 60-70% retention of reads carried over from the raw data to the preprocessing of this data is an acceptable loss? I’m still new to microbiome analysis and want to make sure that this is an acceptable loss.
With continuing with just the forward reads would that change my results of taxonomic assignment and downstream analysis of this data? Would it just be considered doing a single-direction sequencing instead of a paired end sequencing?
Thank you for all of your help benjjneb!
If your loss is mostly at the filtering stage, and not at any of the specific steps outlined in the dada2 tutorial, then yes, 60-70% retention from raw reads to final output is normal.
With continuing with just the forward reads would that change my results of taxonomic assignment and downstream analysis of this data? Would it just be considered doing a single-direction sequencing instead of a paired end sequencing?
Yes it would be considered single-direction sequencing. Given the issues described above, it would be changing your analysis of the data... for the better.
Thank you for all of your help! I'm still learning microbiome analysis and this definitely helped me out. Solved my issue I was running into.
I'm trying to analyze a set of Paired FastQ files with dada2 in R. In my analysis I was doing before I realized that the raw reads still had their primers attached. So I went through the ITS workflow to remove my primers. But I'm limited to not using cutadapt function currently. So I was searching the dada2 functions and found the removePrimers function and decided to use that function to help me remove my primers.
With my testing I'm running into interesting issues where removePrimers function doesn't remove all my primers in my fastQ files and creates an imbalance of read counts where I can't use filterAndTrim function after using it.
Notes about data:
Code and Outputs I follow the workflow of the ITS analysis to search my raw reads to see if the primers are in there,
amp_f is a variable where I store all the forward reads and amp_R is a variable where I store all the reverse reads.
So from here we can see the original count of primers in my raw reads. After I get this data I'll show 2 separate methods of removing primers to see which one is the best. Both methods don't remove all primers in my data and I don't know what might be causing this issue.
Test 1, using the removePrimers function
Above is my code block and output to show how I'm using the function and my results. I decided to use the removePrimers function in this way. not including the reverse primer in its parameters since that was causing a huge deletion of my raw reads (file size of my raw reads would go from ~2000-1000 kb in size to 5-1 kb in size files) after the function was done. As you can see this function doesn't even remove all of the primers in my raw reads and after doing other tests with the parameters, like changing the size of the mismatch parameter I can't seem to remove all of the primers in my data.
The other issue when I use removePrimers() is that I can't use filterAndTrim() afterwards, the error that it throws up to me says that the number of reads between the edited forward and reverse files don't match up, like the forward reads have a smaller count compared to the reverse reads Error:
Test 2, using filterAndTrim() to remove primers I saw on this issue page that someone else was having issues with removing all of the primers form their data and the solution for them, if I remember correctly, was using filterAndTrim to remove their primers just be using trimLeft to cut off the first 19 bases for the forward reads and 20 bases for the reverse reads. So I tried to implement that into my data since removePrimers() is giving me some issues.
But we can see that it doesn't remove all of the primers from my data either, it even leaves me with more reads with primers in them than removePrimers(). So I'm currently stuck at a wall where the methods I can use to remove my primers are limited to these 2 options, at least these are the 2 options that I know of that are solely in R.
In my research I know that one of the assumptions for using dada2 is that I need to remove all non-biological nucleotides and I was wondering if having ~3% of the reads (gathered from test 2) will break the dada2 algorithms to help me turn this raw data into data I can put into phyloseq to analyze?