Closed Mompel226 closed 3 years ago
Is there a way to merge different learnErrors outputs and then run the dada function with pseudo-pooling activated?
So I think what you want to do is to denoise different sample sets with their set-specific error models, and then "pseudo-pool" across all the sets of samples. If so, yes, that is possible, albeit not with quite the convenience of `pool="pseudo"`.
`pool="pseudo"` is really just a convenient way to use the dada2 priors functionality: https://benjjneb.github.io/dada2/pseudo.html
What pseudo-pooling does is run `dada` once in the default sample-independent fashion, pick a set of ASVs from the resulting table to serve as priors (by default, all ASVs that appear in 2+ samples), and then run `dada` again with those priors specified. This can easily be done by hand, which is useful in your case where you are using different error models. The code would look something like:
dd.skin <- dada(filt.skin, err.skin, ...)
dd.h20 <- dada(filt.h20, err.h20, ...)
# and so forth
# mergeSequenceTables() expects sequence tables, so convert each dada result first
st.intermediate <- mergeSequenceTables(makeSequenceTable(dd.skin), makeSequenceTable(dd.h20), ...)
# Perhaps remove chimeras
# Keep ASVs present in 2+ samples; any other criterion could be used to choose the sequences to pseudo-pool
priors <- getSequences(st.intermediate)[colSums(st.intermediate > 0) >= 2]
dd.skin.pseudo_pooled <- dada(filt.skin, err.skin, priors=priors, ...)
# and so forth
st.pseudo_pooled <- mergeSequenceTables(makeSequenceTable(dd.skin.pseudo_pooled), ...)
Is that enough to go on? It's basically just doing exactly what pseudo-pooling does, but by hand, so the error models can also be modulated between sets of samples.
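The prior-selection step can be tried in isolation on a toy table. The following is a minimal sketch in base R, assuming a merged sequence table laid out the way dada2 sequence tables are (samples in rows, ASV sequences as column names). The sequences, sample names, and counts are invented for illustration, and the 2+ samples threshold mirrors the default described above rather than being a requirement.

```r
# Toy merged sequence table: rows are samples, columns are ASV sequences.
# All sequences and counts here are invented for illustration only.
st.toy <- matrix(
  c(10, 0, 3,
     5, 2, 0,
     0, 0, 7,
     8, 0, 1),
  nrow = 4, byrow = TRUE,
  dimnames = list(c("skin1", "skin2", "h2o1", "h2o2"),
                  c("ACGT", "TTGA", "GGCA"))
)

# Prevalence: the number of samples in which each ASV was detected.
prevalence <- colSums(st.toy > 0)

# Keep ASVs detected in at least 2 samples as candidate priors.
priors.toy <- colnames(st.toy)[prevalence >= 2]

print(prevalence)
print(priors.toy)
```

Raising or lowering the threshold, or filtering on total abundance instead, only changes the logical index on the last assignment.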
Dear Ben,
Thank you very much for your answer. It is just what I was asking for.
As an extra step, after removing chimeras, I combined sequences that are identical (`collapseNoMismatch`). After obtaining the merged sequence table (`st.pseudo_pooled`), I once again removed chimeras and combined identical sequences.
Below you can find the summary table, which shows that denoising different sample sets with their set-specific error models and then pseudo-pooling with all the samples' data (general pseudo-pooling) increases the number of reads retained, mainly for samples with a high proportion of low-abundance species, such as sediment or water.
Denoising different sample sets with their set-specific error models and then pseudo-pooling using only the data of the biological replicates (replicate pseudo-pooling) also tends to improve the results, but not always.
Interestingly, denoising the sample sets with a general error model and then general pseudo-pooling also provides good results just after the denoising step. However, after removing chimeras and collapsing the sequences, the number of reads is considerably reduced, making the set-specific error approach a better option.
Thank you for your time, Ben. Best wishes, Daniel
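The clean-up described in this reply (chimera removal followed by collapsing identical sequences) could be wrapped roughly as below. This is a sketch under assumptions: `clean_seqtab` is a hypothetical helper name, the consensus chimera method is an assumed choice that the thread does not state, and the toy table contains invented counts that only show how reads per sample can be tallied.

```r
# Hypothetical helper (clean_seqtab is not a dada2 function).
# It is only defined here; calling it requires the dada2 package.
clean_seqtab <- function(st, multithread = FALSE) {
  # Remove chimeric sequences; method = "consensus" is an assumed choice.
  st.nochim <- dada2::removeBimeraDenovo(st, method = "consensus",
                                         multithread = multithread)
  # Combine sequences that are identical apart from shifts or length.
  dada2::collapseNoMismatch(st.nochim)
}

# Reads remaining per sample are the row sums of a sequence table.
# Toy table with invented counts, used only to show the bookkeeping.
st.toy <- matrix(c(10, 0, 3,
                    5, 2, 0),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(c("Sed1", "W1"),
                                 c("ACGT", "TTGA", "GGCA")))
reads.per.sample <- rowSums(st.toy)
print(reads.per.sample)
```

Taking `rowSums()` before and after each step is one way to fill in a read-tracking table like the one in this thread.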
Sample | Input | Filtered | General denoising without pseudo-pooling | Final reads (from previous column) | General denoising with general pseudo-pooling | Final reads (from previous column) | Specific denoising without pseudo-pooling | Final reads (from previous column) | Specific denoising with replicate pseudo-pooling | Final reads (from previous column) | Specific denoising with general pseudo-pooling (BEST RESULTS) | Final reads (from previous column) (BEST RESULTS) |
---|---|---|---|---|---|---|---|---|---|---|---|---|
S1 | 57753 | 20405 | 20264 | 20111 | 20326 | 20077 | 20267 | 20111 | 20273 | 20094 | 20278 | 20120 |
S10 | 24299 | 15913 | 15832 | 15723 | 15852 | 15740 | 15831 | 15723 | 15843 | 15754 | 15845 | 15759 |
S11 | 26918 | 11918 | 11859 | 11786 | 11883 | 11807 | 11859 | 11786 | 11860 | 11788 | 11883 | 11801 |
S3 | 26819 | 14584 | 14479 | 14258 | 14499 | 14305 | 14475 | 14257 | 14465 | 14264 | 14488 | 14295 |
S4 | 28148 | 18099 | 17991 | 17899 | 18023 | 17935 | 17990 | 17898 | 18005 | 17914 | 18021 | 17928 |
S5 | 24039 | 20280 | 19852 | 16026 | 19947 | 15944 | 19861 | 16003 | 19905 | 15953 | 19871 | 16025 |
S7 | 22802 | 20224 | 19866 | 13975 | 19957 | 13932 | 19860 | 13950 | 19860 | 13882 | 19867 | 13973 |
S8 | 16044 | 12478 | 12197 | 9852 | 12244 | 9876 | 12196 | 9851 | 12156 | 9807 | 12207 | 9912 |
S9 | 20767 | 12349 | 12323 | 12234 | 12333 | 12269 | 12323 | 12234 | 12330 | 12241 | 12332 | 12243 |
D1 | 30048 | 28766 | 28340 | 26300 | 28434 | 26386 | 28320 | 26313 | 28403 | 26358 | 28381 | 26414 |
D10 | 16073 | 14793 | 13272 | 11049 | 13521 | 11279 | 13307 | 11073 | 13335 | 11118 | 13441 | 11281 |
D11 | 31425 | 29110 | 28893 | 28403 | 28938 | 28412 | 28896 | 28417 | 28916 | 28388 | 28919 | 28456 |
D2 | 30839 | 27979 | 27019 | 23731 | 27168 | 23996 | 27026 | 23774 | 27106 | 23839 | 27144 | 23962 |
D3 | 29722 | 28344 | 26804 | 24376 | 26947 | 24367 | 26807 | 24378 | 26720 | 24183 | 26878 | 24423 |
D4 | 28384 | 26204 | 24777 | 22302 | 24971 | 22508 | 24796 | 22311 | 24829 | 22433 | 24876 | 22563 |
D5 | 46799 | 44943 | 44861 | 44096 | 44878 | 44108 | 44862 | 44099 | 44890 | 44152 | 44873 | 44104 |
D6 | 49874 | 48184 | 48114 | 47678 | 48125 | 47632 | 48114 | 47696 | 48119 | 47662 | 48122 | 47691 |
D7 | 38111 | 36691 | 36529 | 35879 | 36671 | 35893 | 36652 | 36003 | 36658 | 35805 | 36656 | 35992 |
D9 | 19180 | 13089 | 12810 | 12467 | 12874 | 12507 | 12812 | 12458 | 12835 | 12486 | 12848 | 12502 |
M1 | 31380 | 21154 | 20912 | 20538 | 20978 | 20521 | 20852 | 20469 | 20855 | 20495 | 20903 | 20544 |
M10 | 26060 | 19220 | 18937 | 18705 | 19021 | 18783 | 18960 | 18587 | 18955 | 18606 | 19004 | 18633 |
M11 | 35005 | 32354 | 32324 | 32291 | 32334 | 32302 | 32324 | 32291 | 32332 | 32302 | 32332 | 32303 |
M2 | 30007 | 22299 | 22235 | 22162 | 22259 | 22172 | 22233 | 22158 | 22232 | 22161 | 22247 | 22175 |
M3 | 28610 | 23212 | 23052 | 22631 | 23092 | 22672 | 23059 | 22633 | 23101 | 22699 | 23084 | 22663 |
M4 | 22670 | 18442 | 18325 | 17734 | 18349 | 17767 | 18322 | 17731 | 18337 | 17726 | 18342 | 17769 |
M5 | 40751 | 38521 | 38407 | 36814 | 38423 | 36793 | 38410 | 36787 | 38405 | 36788 | 38418 | 36797 |
M6 | 61221 | 59246 | 59218 | 58877 | 59230 | 58849 | 59221 | 58860 | 59217 | 58825 | 59222 | 58847 |
M9 | 27145 | 19116 | 18911 | 18551 | 18992 | 18610 | 18936 | 18485 | 18957 | 18511 | 18977 | 18597 |
W1 | 21760 | 20902 | 17930 | 11808 | 18196 | 12015 | 18826 | 12270 | 18761 | 12174 | 18918 | 12426 |
W2 | 23099 | 22275 | 19261 | 13446 | 19565 | 13532 | 20179 | 13709 | 20094 | 13560 | 20267 | 13844 |
W3 | 20952 | 20097 | 16321 | 12117 | 16703 | 12052 | 17356 | 12751 | 17316 | 12285 | 17508 | 12940 |
W4 | 20754 | 19828 | 16309 | 12040 | 16614 | 11764 | 17228 | 12584 | 17210 | 12146 | 17340 | 12660 |
W5 | 19616 | 18874 | 15792 | 11953 | 16124 | 11832 | 16627 | 12348 | 16693 | 12043 | 16740 | 12405 |
Sed1 | 19864 | 18880 | 11736 | 7995 | 12250 | 8929 | 13909 | 9014 | 13773 | 9494 | 14135 | 9543 |
Sed2 | 20763 | 19945 | 12675 | 9482 | 13086 | 9646 | 14939 | 10844 | 14795 | 10545 | 15076 | 10938 |
Sed3 | 18521 | 17688 | 10600 | 7629 | 11096 | 8015 | 12669 | 8734 | 12650 | 9013 | 12971 | 9355 |
Sed4 | 22430 | 21331 | 13197 | 8695 | 13760 | 9500 | 15535 | 9675 | 15510 | 10149 | 15813 | 10247 |
Sed5 | 25263 | 24194 | 16138 | 11821 | 16616 | 12131 | 18690 | 13422 | 18500 | 13161 | 18849 | 13583 |
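To make the size of the effect concrete, the final read counts of three rows (S1, W1, Sed1) can be compared directly. The sketch below copies the filtered reads, the final reads for general denoising without pseudo-pooling, and the final reads for specific denoising with general pseudo-pooling from the table. It assumes the first data column is the sample name and the next two are input and filtered reads. The data frame column names are made up for this example.

```r
# Values copied from the S1, W1 and Sed1 rows of the table above.
# Column names are invented for this illustration.
reads <- data.frame(
  sample            = c("S1", "W1", "Sed1"),
  filtered          = c(20405, 20902, 18880),
  general_final     = c(20111, 11808, 7995),   # general denoising, no pseudo-pooling
  specific_pp_final = c(20120, 12426, 9543)    # specific denoising, general pseudo-pooling
)

# Absolute gain in final reads from the set-specific approach.
reads$gain <- reads$specific_pp_final - reads$general_final

# Share of filtered reads that survive to the final table, in percent.
reads$pct_general  <- round(100 * reads$general_final / reads$filtered, 1)
reads$pct_specific <- round(100 * reads$specific_pp_final / reads$filtered, 1)

print(reads)
```

For these three rows the gain is 9 reads for S1, 618 for W1, and 1548 for Sed1, which matches the observation that sediment and water benefit most.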
Dear Ben,
Firstly, I would like to thank you for all your amazing work. I really appreciate it. You might be able to help with an issue I have found with my samples. I have 16S amplicons obtained with a 2-step Nextera approach and Walters modified EMP primers. All of the samples were analysed in the same sequencing run. The samples are of different origins: marine sediment (SED), seawater (W), fish skin (S), fish mucosa (M) and fish digesta (gut content) (D). When running the usual DADA2 pipeline with all my samples together, the estimated error plot looked normal. However, the sample inference step removes a large number of reads from my sediment samples, which seems odd since it doesn't happen to the other samples, except to a lesser extent with the water samples.
Filter parameters: truncLen=c(250,240), trimLeft=c(0,0), maxN=0, maxEE=c(5,5), truncQ=2
Considering that the sediment samples are the ones with the largest number of different known and unknown bacterial species, the high number of low-abundance species is probably affecting the error model or the inference step. Furthermore, my samples are really different and the quality of the sequences varies considerably between them. Nonetheless, the sediment samples (Sed1-5) all have excellent quality up to the 250th bp. In addition, when all the samples are processed together, the software learns the error model from only a limited number of randomly selected samples (15 in my case).
[Figure: quality profile of reads before filtering]
To solve this issue, I thought about pre-processing the reads in sample-type groups (Sediment, Water, Skin, Mucosa, and Digesta). This allows the learnErrors method to use more similar samples (with reads of similar quality) and to estimate error rates better adjusted to the type of sample. The error-rate plots clearly show that different types of samples have different estimated error rates.
[Figures: estimated error-rate plots for SEDIMENT, WATER, SKIN, MUCOSA, and DIGESTA]
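Grouping the samples by type before learning error rates could look roughly like the sketch below. It assumes the sample type can be read from the sample-name prefix used in this thread (S, D, M, W, Sed). The file paths and the helper name `learn_errors_by_type` are hypothetical, and the helper is only defined here, since calling it requires dada2 and real filtered FASTQ files.

```r
# A few sample names in the style used in this thread.
samples <- c("S1", "S10", "D1", "M1", "W1", "Sed1", "Sed2")

# Strip the trailing replicate number to obtain the sample-type prefix.
sample.type <- sub("[0-9]+$", "", samples)

# Hypothetical file layout: one filtered forward-read file per sample.
filt.files <- file.path("filtered", paste0(samples, "_F_filt.fastq.gz"))
names(filt.files) <- samples

# One character vector of file paths per sample type.
files.by.type <- split(filt.files, sample.type)
print(lengths(files.by.type))

# Hypothetical helper: learn one error model per sample type.
# Not called here because it needs dada2 and the actual files.
learn_errors_by_type <- function(files.by.type, multithread = FALSE) {
  lapply(files.by.type, function(f) {
    dada2::learnErrors(f, multithread = multithread)
  })
}
```

The resulting list of error models, named by sample type, can then be passed group by group to `dada` together with the matching files.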
Using sample-specific estimated error rates, the number of reads that passed the inference step increased considerably. However, this is the point where I don't understand what to do next. I would like to pseudo-pool ALL of my samples to increase sensitivity, since the samples are all more or less related to each other: they were all collected in the same geographic area. I am afraid that pseudo-pooling each sample type independently and then merging the ASV tables at the end might not be the best approach. I might be introducing a bias and making the samples with the same origin even more similar to each other. Nonetheless, as I understand it, if I run the learnErrors function with a certain type of sample, then the dada function will only use the error estimates of these samples. Is there a way to merge different learnErrors outputs and then run the dada function with pseudo-pooling activated?
I apologise for such a lengthy message, I just wanted to make sure everything is clear. Thank you very much in advance. No matter the outcome I appreciate your time and help. Have a nice day.