benjjneb / dada2

Accurate sample inference from amplicon data with single nucleotide resolution
http://benjjneb.github.io/dada2/
GNU Lesser General Public License v3.0

Merge different _learnErrors_ outputs to pseudo-pool samples when estimated error rates are highly variable? #1317

Closed: Mompel226 closed this issue 3 years ago

Mompel226 commented 3 years ago

Dear Ben, firstly I would like to thank you for all your amazing work; I really appreciate it. You might be able to help with an issue I have found with my samples. I have 16S amplicons obtained with a 2-step Nextera approach and Walters' modified EMP primers. All of the samples were analysed in the same sequencing run. The samples are of different origins: marine sediment (SED), seawater (W), fish skin (S), fish mucosa (M) and fish digesta (gut content) (D).

When running the usual DADA2 pipeline with all my samples together, the estimated error plot looked normal (figure: All_samples). However, the sample inference step removes a large number of reads in my sediment samples, which seems odd since it doesn't happen to the other samples, except perhaps slightly with the water samples.

Filter parameters: truncLen=c(250,240), trimLeft=c(0,0), maxN=0, maxEE=c(5,5), truncQ=2
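(For reference, a minimal sketch of that filtering step with the stated parameters; the file vectors fnFs, fnRs, filtFs, filtRs are hypothetical placeholders for my forward/reverse input and output files:)

library(dada2)
# Filtering with the parameters quoted above; file vectors are placeholders
out <- filterAndTrim(fnFs, filtFs, fnRs, filtRs,
                     truncLen=c(250,240), trimLeft=c(0,0),
                     maxN=0, maxEE=c(5,5), truncQ=2,
                     multithread=TRUE)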

sample  input filtered denoised merged tabled nonchim
D1 30048 28766 28340 27834 27834 26300
D10 16073 14793 13272 12153 12153 11049
D11 31425 29110 28893 28564 28564 28403
D2 30839 27979 27019 25807 25807 23731
D3 29722 28344 26804 25496 25496 24376
D4 28384 26204 24777 23142 23142 22302
D5 46799 44943 44861 44694 44694 44096
D6 49874 48184 48114 47869 47869 47678
D7 38111 36691 36529 36374 36374 35879
D9 19180 13089 12810 12488 12488 12467
M1 31380 21154 20912 20634 20634 20538
M10 26060 19220 18937 18733 18733 18705
M11 35005 32354 32324 32291 32291 32291
M2 30007 22299 22235 22175 22175 22162
M3 28610 23212 23052 22843 22843 22631
M4 22670 18442 18325 18114 18114 17734
M5 40751 38521 38407 38105 38105 36814
M6 61221 59246 59218 58956 58956 58877
M9 27145 19116 18911 18642 18642 18551
S1 57753 20405 20264 20139 20139 20111
S10 24299 15913 15832 15745 15745 15723
S11 26918 11918 11859 11818 11818 11786
S3 26819 14584 14479 14358 14358 14258
S4 28148 18099 17991 17922 17922 17899
S5 24039 20280 19852 19146 19146 16026
S7 22802 20224 19866 18884 18884 13975
S8 16044 12478 12197 11677 11677 9852
S9 20767 12349 12323 12251 12251 12234
Sed1 19864 **18880 11736** 8204 8204 7995
Sed2 20763 **19945 12675** 9584 9584 9482
Sed3 18521 **17688 10600** 7642 7642 7629
Sed4 22430 **21331 13197** 8921 8921 8695
Sed5 25263 **24194 16138** 12308 12308 11821
W1 21760 20902 17930 15256 15256 11808
W2 23099 22275 19261 17034 17034 13446
W3 20952 20097 16321 13851 13851 12117
W4 20754 19828 16309 13543 13543 12040
W5 19616 18874 15792 13728 13728 11953

Considering that the sediment samples harbour the largest number of different known and unknown bacterial species, the high number of low-abundance species is probably affecting the error model or the inference step. Furthermore, my samples are really different from one another, and sequence quality varies considerably between them. Nonetheless, the sediment samples (Sed1-5) all have excellent quality up to the 250th bp. Not to mention that, in order to learn the error model when processing all the samples together, the software only uses a subset of the samples (15 in my case).
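(As a sketch, this is the error-learning call I mean; by default learnErrors only reads samples until roughly nbases=1e8 total bases have been accumulated, which is why only a subset of samples contributes:)

# Error learning over all samples together; only enough samples to
# reach nbases (1e8 by default) are actually used
err <- learnErrors(filtFs, nbases=1e8, multithread=TRUE)
plotErrors(err, nominalQ=TRUE)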

QUALITY PROFILE OF READS BEFORE FILTERING (figure: Reads_quality)

To solve this issue, I thought about pre-processing the reads by sample-type group (Sediment, Water, Skin, Mucosa, and Digesta). This lets the learnErrors method work on more similar samples (with similar-quality reads) and estimate error rates better adjusted to each type of sample (see the sketch after the plots below). Looking at the error-rate plots, we can clearly see that different types of samples have different estimated error rates.

SEDIMENT (figure: Only_SED)

WATER (figure: Only_Water)

SKIN (figure: Only_Skin)

MUCOSA (figure: Only_Mucosa)

DIGESTA (figure: Only_Digesta)
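(The group-wise error learning looks something like this; the per-group file vectors filt.sed, filt.water, filt.skin, etc. are assumptions about how the filtered files are split:)

# One error model per sample-type group (file vectors are placeholders)
err.sed   <- learnErrors(filt.sed,   multithread=TRUE)
err.water <- learnErrors(filt.water, multithread=TRUE)
err.skin  <- learnErrors(filt.skin,  multithread=TRUE)
# and so forth for mucosa and digesta
plotErrors(err.sed, nominalQ=TRUE)  # inspect each group's fit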

Using sample-type-specific estimated error rates, the number of reads that pass the inference step increases considerably. However, this is the point where I don't understand what to do next. I would like to pseudo-pool ALL of my samples to increase sensitivity, since the samples are all more or less related to each other - they were all collected in the same geographic area. I am afraid that pseudo-pooling each sample type independently and then merging the ASV tables at the end might not be the best approach: I might be introducing a bias and making samples of the same origin even more similar to each other. Nonetheless, as I understand it, if I run the learnErrors function on a certain type of sample, then the dada function will only use the error estimates from those samples. Is there a way to merge different learnErrors outputs and then run the dada function with pseudo-pooling activated?

I apologise for such a lengthy message; I just wanted to make sure everything is clear. Thank you very much in advance. No matter the outcome, I appreciate your time and help. Have a nice day.

benjjneb commented 3 years ago

Is there a way to merge different learnErrors outputs and then run the dada function with pseudo-pooling activated?

So I think what you want to do is to denoise different sample sets with their set-specific error models, and then "pseudo-pool" across all the sets of samples. If so, yes that is possible, albeit not with quite the convenience of pool="pseudo".

pool="pseudo" is really just a convenience way to use the dada2 priors functionality: https://benjjneb.github.io/dada2/pseudo.html

What pseudo-pooling does is to run dada once in the default sample-independent fashion, pick a set of ASVs from the resulting table to serve as priors (by default, all ASVs that appear in 2+ samples), and then run dada again with those priors specified. This can easily be done by hand, which is useful in your case where you are using different error models. The code would look something like...

# First pass: denoise each sample set with its own error model
dd.skin <- dada(filt.skin, err.skin, ...)
dd.h20 <- dada(filt.h20, err.h20, ...)
# and so forth
st.intermediate <- mergeSequenceTables(makeSequenceTable(dd.skin), makeSequenceTable(dd.h20), ...)
# Perhaps remove chimeras here
# Pick priors: e.g. all ASVs present in 2+ samples (use whatever criteria
# you like to choose the sequences to pseudo-pool)
priors <- getSequences(st.intermediate)[colSums(st.intermediate > 0) >= 2]
# Second pass: re-run dada with those priors specified
dd.skin.pseudo_pooled <- dada(filt.skin, err.skin, priors=priors, ...)
# and so forth
st.pseudo_pooled <- mergeSequenceTables(makeSequenceTable(dd.skin.pseudo_pooled), ...)

Is that enough to go on? It's basically just doing exactly what pseudo-pooling does, but by hand, so the error models can also be modulated between sets of samples.

Mompel226 commented 3 years ago

Dear Ben,

Thank you very much for your answer. It is just what I was asking for.

As an extra step: after removing chimeras, I collapsed identical sequences (collapseNoMismatch), and after obtaining the merged sequence table (st.pseudo_pooled) I once again removed chimeras and collapsed identical sequences.
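(A minimal sketch of that post-processing, assuming st.pseudo_pooled from the code above:)

# Remove chimeras from the merged table, then collapse identical sequences
st.nochim <- removeBimeraDenovo(st.pseudo_pooled, method="consensus", multithread=TRUE)
st.final <- collapseNoMismatch(st.nochim)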

Below you can find a summary table that clearly shows how denoising different sample sets with their set-specific error models and then pseudo-pooling across all the samples' data (general pseudo-pooling) increases the number of reads obtained, mainly for samples with a high proportion of low-abundance species, such as sediment or water.

Denoising different sample sets with their set-specific error models and then pseudo-pooling using only the data from the biological replicates (replicate pseudo-pooling) also tends to improve the results, but not always.

Interestingly, denoising the sample sets with a general error model and then applying general pseudo-pooling also gives good results immediately after the denoising step. However, after removing chimeras and collapsing identical sequences, the number of reads is considerably reduced, making the set-specific error approach the better option.

Thank you for your time, Ben. Best wishes, Daniel

Columns: sample, input, filtered, then five strategies, each reported as denoised reads followed by final reads (derived from the previous column after chimera removal and sequence collapsing):
(1) general denoising without pseudo-pooling; (2) general denoising with general pseudo-pooling; (3) specific denoising without pseudo-pooling; (4) specific denoising with replicate pseudo-pooling; (5) specific denoising with general pseudo-pooling (BEST RESULTS, both columns).

sample  input filtered denoised-1 final-1 denoised-2 final-2 denoised-3 final-3 denoised-4 final-4 denoised-5 final-5
S1 57753 20405 20264 20111 20326 20077 20267 20111 20273 20094 20278 20120
S10 24299 15913 15832 15723 15852 15740 15831 15723 15843 15754 15845 15759
S11 26918 11918 11859 11786 11883 11807 11859 11786 11860 11788 11883 11801
S3 26819 14584 14479 14258 14499 14305 14475 14257 14465 14264 14488 14295
S4 28148 18099 17991 17899 18023 17935 17990 17898 18005 17914 18021 17928
S5 24039 20280 19852 16026 19947 15944 19861 16003 19905 15953 19871 16025
S7 22802 20224 19866 13975 19957 13932 19860 13950 19860 13882 19867 13973
S8 16044 12478 12197 9852 12244 9876 12196 9851 12156 9807 12207 9912
S9 20767 12349 12323 12234 12333 12269 12323 12234 12330 12241 12332 12243
D1 30048 28766 28340 26300 28434 26386 28320 26313 28403 26358 28381 26414
D10 16073 14793 13272 11049 13521 11279 13307 11073 13335 11118 13441 11281
D11 31425 29110 28893 28403 28938 28412 28896 28417 28916 28388 28919 28456
D2 30839 27979 27019 23731 27168 23996 27026 23774 27106 23839 27144 23962
D3 29722 28344 26804 24376 26947 24367 26807 24378 26720 24183 26878 24423
D4 28384 26204 24777 22302 24971 22508 24796 22311 24829 22433 24876 22563
D5 46799 44943 44861 44096 44878 44108 44862 44099 44890 44152 44873 44104
D6 49874 48184 48114 47678 48125 47632 48114 47696 48119 47662 48122 47691
D7 38111 36691 36529 35879 36671 35893 36652 36003 36658 35805 36656 35992
D9 19180 13089 12810 12467 12874 12507 12812 12458 12835 12486 12848 12502
M1 31380 21154 20912 20538 20978 20521 20852 20469 20855 20495 20903 20544
M10 26060 19220 18937 18705 19021 18783 18960 18587 18955 18606 19004 18633
M11 35005 32354 32324 32291 32334 32302 32324 32291 32332 32302 32332 32303
M2 30007 22299 22235 22162 22259 22172 22233 22158 22232 22161 22247 22175
M3 28610 23212 23052 22631 23092 22672 23059 22633 23101 22699 23084 22663
M4 22670 18442 18325 17734 18349 17767 18322 17731 18337 17726 18342 17769
M5 40751 38521 38407 36814 38423 36793 38410 36787 38405 36788 38418 36797
M6 61221 59246 59218 58877 59230 58849 59221 58860 59217 58825 59222 58847
M9 27145 19116 18911 18551 18992 18610 18936 18485 18957 18511 18977 18597
W1 21760 20902 17930 11808 18196 12015 18826 12270 18761 12174 18918 12426
W2 23099 22275 19261 13446 19565 13532 20179 13709 20094 13560 20267 13844
W3 20952 20097 16321 12117 16703 12052 17356 12751 17316 12285 17508 12940
W4 20754 19828 16309 12040 16614 11764 17228 12584 17210 12146 17340 12660
W5 19616 18874 15792 11953 16124 11832 16627 12348 16693 12043 16740 12405
Sed1 19864 18880 11736 7995 12250 8929 13909 9014 13773 9494 14135 9543
Sed2 20763 19945 12675 9482 13086 9646 14939 10844 14795 10545 15076 10938
Sed3 18521 17688 10600 7629 11096 8015 12669 8734 12650 9013 12971 9355
Sed4 22430 21331 13197 8695 13760 9500 15535 9675 15510 10149 15813 10247
Sed5 25263 24194 16138 11821 16616 12131 18690 13422 18500 13161 18849 13583