benjjneb / dada2

Accurate sample inference from amplicon data with single nucleotide resolution
http://benjjneb.github.io/dada2/
GNU Lesser General Public License v3.0

Merge different _learnErrors_ outputs to pseudo-pool samples when estimated error rates are highly variable? #1317

Closed: Mompel226 closed this issue 3 years ago

Mompel226 commented 3 years ago

Dear Ben, firstly I would like to thank you for all your amazing work; I really appreciate it. You might be able to help with an issue I have found with my samples. I have 16S amplicons obtained with a 2-step Nextera approach and Walters' modified EMP primers. All of the samples were analysed in the same sequencing run. The samples are of different origins: marine sediment (SED), seawater (W), fish skin (S), fish mucosa (M) and fish digesta (gut content) (D).

When running the usual DADA2 pipeline with all my samples together, the estimated error plot looked normal (figure: All_samples). However, the sample inference step removes a large number of reads in my sediment samples, which seems odd since it doesn't happen to the other samples, except perhaps slightly with the water samples.

Filter parameters: truncLen=c(250,240), trimLeft=c(0,0), maxN=0, maxEE=c(5,5), truncQ=2
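(For reference, a minimal sketch of that filtering step with the stated parameters; the file vectors fnFs, fnRs, filtFs, filtRs are hypothetical placeholders for my forward/reverse input and output files:)

library(dada2)
# Filtering with the parameters quoted above; file vectors are placeholders
out <- filterAndTrim(fnFs, filtFs, fnRs, filtRs,
                     truncLen=c(250,240), trimLeft=c(0,0),
                     maxN=0, maxEE=c(5,5), truncQ=2,
                     multithread=TRUE)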

sample  input filtered denoised merged tabled nonchim
D1 30048 28766 28340 27834 27834 26300
D10 16073 14793 13272 12153 12153 11049
D11 31425 29110 28893 28564 28564 28403
D2 30839 27979 27019 25807 25807 23731
D3 29722 28344 26804 25496 25496 24376
D4 28384 26204 24777 23142 23142 22302
D5 46799 44943 44861 44694 44694 44096
D6 49874 48184 48114 47869 47869 47678
D7 38111 36691 36529 36374 36374 35879
D9 19180 13089 12810 12488 12488 12467
M1 31380 21154 20912 20634 20634 20538
M10 26060 19220 18937 18733 18733 18705
M11 35005 32354 32324 32291 32291 32291
M2 30007 22299 22235 22175 22175 22162
M3 28610 23212 23052 22843 22843 22631
M4 22670 18442 18325 18114 18114 17734
M5 40751 38521 38407 38105 38105 36814
M6 61221 59246 59218 58956 58956 58877
M9 27145 19116 18911 18642 18642 18551
S1 57753 20405 20264 20139 20139 20111
S10 24299 15913 15832 15745 15745 15723
S11 26918 11918 11859 11818 11818 11786
S3 26819 14584 14479 14358 14358 14258
S4 28148 18099 17991 17922 17922 17899
S5 24039 20280 19852 19146 19146 16026
S7 22802 20224 19866 18884 18884 13975
S8 16044 12478 12197 11677 11677 9852
S9 20767 12349 12323 12251 12251 12234
Sed1 19864 **18880 11736** 8204 8204 7995
Sed2 20763 **19945 12675** 9584 9584 9482
Sed3 18521 **17688 10600** 7642 7642 7629
Sed4 22430 **21331 13197** 8921 8921 8695
Sed5 25263 **24194 16138** 12308 12308 11821
W1 21760 20902 17930 15256 15256 11808
W2 23099 22275 19261 17034 17034 13446
W3 20952 20097 16321 13851 13851 12117
W4 20754 19828 16309 13543 13543 12040
W5 19616 18874 15792 13728 13728 11953

Considering that the sediment samples harbour the largest number of different known and unknown bacterial species, the high number of low-abundance species is probably affecting the error model or the inference step. Furthermore, my samples are really different from one another, and sequence quality varies considerably between them. Nonetheless, the sediment samples (Sed1-5) all have excellent quality up to the 250th bp. Not to mention that, in order to learn the error model when processing all the samples together, the software only uses a subset of the samples (15 in my case).
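(As a sketch, this is the error-learning call I mean; by default learnErrors only reads samples until roughly nbases=1e8 total bases have been accumulated, which is why only a subset of samples contributes:)

# Error learning over all samples together; only enough samples to
# reach nbases (1e8 by default) are actually used
err <- learnErrors(filtFs, nbases=1e8, multithread=TRUE)
plotErrors(err, nominalQ=TRUE)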

QUALITY PROFILE OF READS BEFORE FILTERING (figure: Reads_quality)

To solve this issue, I thought about pre-processing the reads by sample-type group (Sediment, Water, Skin, Mucosa, and Digesta). This lets the learnErrors method work on more similar samples (with similar-quality reads) and estimate error rates better adjusted to each type of sample (see the sketch after the plots below). Looking at the error-rate plots, we can clearly see that different types of samples have different estimated error rates.

SEDIMENT (figure: Only_SED)

WATER (figure: Only_Water)

SKIN (figure: Only_Skin)

MUCOSA (figure: Only_Mucosa)

DIGESTA (figure: Only_Digesta)
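(The group-wise error learning looks something like this; the per-group file vectors filt.sed, filt.water, filt.skin, etc. are assumptions about how the filtered files are split:)

# One error model per sample-type group (file vectors are placeholders)
err.sed   <- learnErrors(filt.sed,   multithread=TRUE)
err.water <- learnErrors(filt.water, multithread=TRUE)
err.skin  <- learnErrors(filt.skin,  multithread=TRUE)
# and so forth for mucosa and digesta
plotErrors(err.sed, nominalQ=TRUE)  # inspect each group's fit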

Using sample-type-specific estimated error rates, the number of reads that pass the inference step increases considerably. However, this is the point where I don't understand what to do next. I would like to pseudo-pool ALL of my samples to increase sensitivity, since the samples are all more or less related to each other - they were all collected in the same geographic area. I am afraid that pseudo-pooling each sample type independently and then merging the ASV tables at the end might not be the best approach: I might be introducing a bias and making samples of the same origin even more similar to each other. Nonetheless, as I understand it, if I run the learnErrors function on a certain type of sample, then the dada function will only use the error estimates from those samples. Is there a way to merge different learnErrors outputs and then run the dada function with pseudo-pooling activated?

I apologise for such a lengthy message; I just wanted to make sure everything is clear. Thank you very much in advance. No matter the outcome, I appreciate your time and help. Have a nice day.

benjjneb commented 3 years ago

Is there a way to merge different learnErrors outputs and then run the dada function with pseudo-pooling activated?

So I think what you want to do is to denoise different sample sets with their set-specific error models, and then "pseudo-pool" across all the sets of samples. If so, yes that is possible, albeit not with quite the convenience of pool="pseudo".

pool="pseudo" is really just a convenience way to use the dada2 priors functionality: https://benjjneb.github.io/dada2/pseudo.html

What pseudo-pooling does is to run dada once in the default sample-independent fashion, pick a set of ASVs from the resulting table to serve as priors (by default, all ASVs that appear in 2+ samples), and then run dada again with those priors specified. This can easily be done by hand, which is useful in your case where you are using different error models. The code would look something like...

# First pass: denoise each sample set with its own error model
dd.skin <- dada(filt.skin, err.skin, ...)
dd.h20 <- dada(filt.h20, err.h20, ...)
# and so forth
st.intermediate <- mergeSequenceTables(makeSequenceTable(dd.skin), makeSequenceTable(dd.h20), ...)
# Perhaps remove chimeras here
# Pick priors: e.g. all ASVs present in 2+ samples (use whatever criteria
# you like to choose the sequences to pseudo-pool)
priors <- getSequences(st.intermediate)[colSums(st.intermediate > 0) >= 2]
# Second pass: re-run dada with those priors specified
dd.skin.pseudo_pooled <- dada(filt.skin, err.skin, priors=priors, ...)
# and so forth
st.pseudo_pooled <- mergeSequenceTables(makeSequenceTable(dd.skin.pseudo_pooled), ...)

Is that enough to go on? It's basically just doing exactly what pseudo-pooling does, but by hand, so the error models can also be modulated between sets of samples.

Mompel226 commented 3 years ago

Dear Ben,

Thank you very much for your answer. It is just what I was asking for.

As an extra step: after removing chimeras, I collapsed identical sequences (collapseNoMismatch), and after obtaining the merged sequence table (st.pseudo_pooled) I once again removed chimeras and collapsed identical sequences.
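(A minimal sketch of that post-processing, assuming st.pseudo_pooled from the code above:)

# Remove chimeras from the merged table, then collapse identical sequences
st.nochim <- removeBimeraDenovo(st.pseudo_pooled, method="consensus", multithread=TRUE)
st.final <- collapseNoMismatch(st.nochim)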

Below you can find a summary table that clearly shows how denoising different sample sets with their set-specific error models and then pseudo-pooling across all the samples' data (general pseudo-pooling) increases the number of reads obtained, mainly for samples with a high proportion of low-abundance species, such as sediment or water.

Denoising different sample sets with their set-specific error models and then pseudo-pooling using only the data from the biological replicates (replicate pseudo-pooling) also tends to improve the results, but not always.

Interestingly, denoising the sample sets with a general error model and then applying general pseudo-pooling also gives good results immediately after the denoising step. However, after removing chimeras and collapsing identical sequences, the number of reads is considerably reduced, making the set-specific error approach the better option.

Thank you for your time, Ben. Best wishes, Daniel

Columns: sample, input, filtered, then five strategies, each reported as denoised reads followed by final reads (derived from the previous column after chimera removal and sequence collapsing):
(1) general denoising without pseudo-pooling; (2) general denoising with general pseudo-pooling; (3) specific denoising without pseudo-pooling; (4) specific denoising with replicate pseudo-pooling; (5) specific denoising with general pseudo-pooling (BEST RESULTS, both columns).

sample  input filtered denoised-1 final-1 denoised-2 final-2 denoised-3 final-3 denoised-4 final-4 denoised-5 final-5
S1 57753 20405 20264 20111 20326 20077 20267 20111 20273 20094 20278 20120
S10 24299 15913 15832 15723 15852 15740 15831 15723 15843 15754 15845 15759
S11 26918 11918 11859 11786 11883 11807 11859 11786 11860 11788 11883 11801
S3 26819 14584 14479 14258 14499 14305 14475 14257 14465 14264 14488 14295
S4 28148 18099 17991 17899 18023 17935 17990 17898 18005 17914 18021 17928
S5 24039 20280 19852 16026 19947 15944 19861 16003 19905 15953 19871 16025
S7 22802 20224 19866 13975 19957 13932 19860 13950 19860 13882 19867 13973
S8 16044 12478 12197 9852 12244 9876 12196 9851 12156 9807 12207 9912
S9 20767 12349 12323 12234 12333 12269 12323 12234 12330 12241 12332 12243
D1 30048 28766 28340 26300 28434 26386 28320 26313 28403 26358 28381 26414
D10 16073 14793 13272 11049 13521 11279 13307 11073 13335 11118 13441 11281
D11 31425 29110 28893 28403 28938 28412 28896 28417 28916 28388 28919 28456
D2 30839 27979 27019 23731 27168 23996 27026 23774 27106 23839 27144 23962
D3 29722 28344 26804 24376 26947 24367 26807 24378 26720 24183 26878 24423
D4 28384 26204 24777 22302 24971 22508 24796 22311 24829 22433 24876 22563
D5 46799 44943 44861 44096 44878 44108 44862 44099 44890 44152 44873 44104
D6 49874 48184 48114 47678 48125 47632 48114 47696 48119 47662 48122 47691
D7 38111 36691 36529 35879 36671 35893 36652 36003 36658 35805 36656 35992
D9 19180 13089 12810 12467 12874 12507 12812 12458 12835 12486 12848 12502
M1 31380 21154 20912 20538 20978 20521 20852 20469 20855 20495 20903 20544
M10 26060 19220 18937 18705 19021 18783 18960 18587 18955 18606 19004 18633
M11 35005 32354 32324 32291 32334 32302 32324 32291 32332 32302 32332 32303
M2 30007 22299 22235 22162 22259 22172 22233 22158 22232 22161 22247 22175
M3 28610 23212 23052 22631 23092 22672 23059 22633 23101 22699 23084 22663
M4 22670 18442 18325 17734 18349 17767 18322 17731 18337 17726 18342 17769
M5 40751 38521 38407 36814 38423 36793 38410 36787 38405 36788 38418 36797
M6 61221 59246 59218 58877 59230 58849 59221 58860 59217 58825 59222 58847
M9 27145 19116 18911 18551 18992 18610 18936 18485 18957 18511 18977 18597
W1 21760 20902 17930 11808 18196 12015 18826 12270 18761 12174 18918 12426
W2 23099 22275 19261 13446 19565 13532 20179 13709 20094 13560 20267 13844
W3 20952 20097 16321 12117 16703 12052 17356 12751 17316 12285 17508 12940
W4 20754 19828 16309 12040 16614 11764 17228 12584 17210 12146 17340 12660
W5 19616 18874 15792 11953 16124 11832 16627 12348 16693 12043 16740 12405
Sed1 19864 18880 11736 7995 12250 8929 13909 9014 13773 9494 14135 9543
Sed2 20763 19945 12675 9482 13086 9646 14939 10844 14795 10545 15076 10938
Sed3 18521 17688 10600 7629 11096 8015 12669 8734 12650 9013 12971 9355
Sed4 22430 21331 13197 8695 13760 9500 15535 9675 15510 10149 15813 10247
Sed5 25263 24194 16138 11821 16616 12131 18690 13422 18500 13161 18849 13583