buckleylab / HTSSIP

An R package for analyzing high throughput sequence stable isotope probing data
GNU General Public License v2.0
3 stars 3 forks source link

Accessing DESEq2 results from HRSIP command #10

Open blancaverag opened 4 years ago

blancaverag commented 4 years ago

I was trying to inspect the results of DESeq2 call within HRSIP to track some OTUs that I expected to be labeled given some other results but are not caught as such by HTSSIP. Is there a direct way to retrieve those (I can't see it in the manual) or should I ran all commands in the function separately to access thsi result ("res")?

Thank you

seb369 commented 4 years ago

The output dataframe from the HRSIP command should contain most relevant data from the DESeq2 call (pvalue, padj, log2 fold change, etc). Any OTUs that are missing from that dataframe likely didn't pass the sparsity filtering.

blancaverag commented 4 years ago

Thank you, @seb369. I can access pvalue, padj, log2FC in this way, but not baseMean (the mean of normalize counts).

seb369 commented 4 years ago

At this time the baseMean data are not maintained through the HRSIP function. This may be something I will add to an updated version. I would recommend in the mean time to follow the basic protocol for normalizing counts from DESeq2 separately.

blancaverag commented 4 years ago

I have followed your advice in the meantime. Looking into DESeq2_l2fc.R function I found out that p values are not calculated in the same way as in the DESeq2 original function (negative binomial generalized linear models +Wald test). From what I have read, that modelisation step was introduced to increase statistical power of detection of genes (OTUs in this case). I was wondering if maybe this model employed in DESeq2 original function not suitable for OTUs instead of genes or, essentially, why HTSSIP does not calculate p values in this DESeq2-way.

Also, what is the advantage of employing gm_mean instead of the default behaviour?

Thank you very much in advance.

I can make this a separate thread if you consider it is a really different topic; I just thought it was related to my original question.

seb369 commented 4 years ago

To your first question, HRSIP does calculate pvalue pretty much the same way, we just do it manually ourselves. The real difference is that our method does not do any outlier filtering based on cooks distances. This makes this analysis a bit less conservative than DESeq2. However there is discussion and even suggestion to use DESeq without this outlier filtering. For that, check out the DESeq2 tutorial. Essentially our pvalue calculation should be the same as running:

DESeq2::results(dds, independentFiltering = FALSE, lfcThreshold = l2fc_threshold, altHypothesis="greater", cooksCutoff=FALSE)

for most cases (following the other earlier filtering and DESeq steps).

To answer your second question, the modified version of the geometric means for estimateSizeFactors is to take into account that your OTU read counts often have a lot of 0's, which the standard geometric means that DESeq usually uses for this step has a problem with. Check out the manual for estimateSizeFactors and a few of the posts online about this (e.g. https://github.com/joey711/phyloseq/issues/445).

blancaverag commented 4 years ago

Thank you very much again for your answer.

As I said, by accessing all these HTSSIP/DESeq2 internal calculations, I was trying to track some OTUs that I expected to be labeled with H218O given some other results (quantification of 16S rRNA genes in gradient fractions) but are not caught as such by HTSSIP. I am afraid I still do not fully understand why these two approaches do not agree.

Since you may have a better understanding of what’s going on and what to look at, I am posting the specifics.

Quantification of 16S rRNA genes in gradient fractions shows that most of the sequences slightly shifted towards higher densities in the treated samples (see figure attached) (red and orange), as compared to the control (in blue), with most of them falling in the middle heavy region (MH), suggesting that almost the whole community started to incorporate the isotope. In this region, the curves for control and treated samples overlapped. When I run HTSSIP for detecting significantly labeled OTUs in this MH region, it only highlights one OTU as significantly labeled.

The only thing I can think of is that, since the response is weak and we do not see a full separation of the curves of control and treated samples in the range considered (MH) and the whole community seems to shift, it is well possible that more or less the same bacterial community is similarly represented in the control and treated samples in this fraction, which may interfere with methods for detecting labeled OTUs. However, since HTSSIP (or DESeq2) does not directly work with relative abundances, I am not sure how well it handles this kind of situations.

Any hint of what to look at to understand this and conciliate results would be highly appreciated.

16S_problem.pdf

seb369 commented 4 years ago

It is difficult to determine why you are not finding many incorporators with HTSSIP without really digging into the data, but here are some things to consider or look at. Many of these you might have already considered.

1) Sparsity filtering: Think about how you are removing sparse OTUs. You might want to be more strict to remove OTUs that are rare across your samples. This can increase your statistical power.

2) The number of potential incorporators: It has been shown that with more incorporators, the potential false negatives goes up. I've never worked with H218O, but I would imagine that this might be a problem. (https://doi.org/10.3389/fmicb.2018.00570)

3) Window choices: Whether you are doing HR-SIP or MW-HR-SIP consider the BD windows you are using to compare between treatment and controls. The type of isotope used determines how far the DNA will shift. Most of the examples are based on 13C so you might want to choose a window that is specific to 18O. Check out the function BD_shift(). This might help you pick an appropriate window.

4) Number of fractions: How many fractions are included in each BD window you are comparing? The more fractions the better. You should have more than 3 fractions but I usually try to get 6-9 or more if I have enough resolution. If you are doing MW-HR-SIP, consider larger and fewer windows.

5) Unfractionated differences: If you sequenced the unfractionated community (DNA extracted from your samples but not passed through the SIP gradient steps). Compare the Bray-Curtis dissimilarity between the paired treatment and control unfractionated communities. They should be relatively close (<0.2) as in effect they were treated the same save for the 18O vs. 16O. If you have >0.2 that should be ok, but if you have very high differences that might indicate that something else is different between your treatments that might be interfering with the SIP results. (https://doi.org/10.3389/fmicb.2018.00570)

6) Replicate gradients: Based on the plot you posted, it seems like you have replicate gradients. If you haven't done so, consider running this analysis on the combined replicates (when subsetting phyloseq, keep replicates together).

7) Consider how you are visually identifying the OTUs you think should be incorporators: You mentioned you track the OTUs across the gradients and that is how you expect them to be incorporators. How the abundance of OTUs change over a gradient might not always be as you expect. Sequencing is compositional so in some fractions where you have a high abundance of a specific OTU, the other OTUs will look like they are being depressed. Consequently, if that high abundance OTU takes up the label and shifts in BD, I might expect it to look like those other non-incorporators might look like they increase in relative abundance in some windows. There are other variations of this that you can consider. This is just a thought, it might not look like that depending on how complex your community is. Check out the example in https://link.springer.com/protocol/10.1007/978-1-4939-9721-3_9. This was a very high abundant OTU, but you can see how the relative abundance distribution between treatment and control gradients is quite different.

I hope these help.