HCBravoLab / metagenomeSeq

Statistical analysis for sparse high-throughput sequencing

What is logFC reporting in MRcoefs output? #31

Closed CarlyMuletzWolz closed 8 years ago

CarlyMuletzWolz commented 8 years ago

Hello,

I am trying to understand what the values are that are being reported in logFC after using fitFeatureModel. My code is below.

I have a metagenomeSeq object called Park_ShenCat that looks like this:

```
> Park_ShenCat
MRexperiment (storageMode: environment)
assayData: 48 features, 33 samples
  element names: counts
protocolData: none
phenoData
  sampleNames: RSB2 LSB7 ... FDR.54 (33 total)
  varLabels: X.SampleID Run_no ... Humidity (34 total)
  varMetadata: labelDescription
featureData
  featureNames: OTU_21 OTU_38 ... OTU_40 (48 total)
  fvarLabels: OTUname
  fvarMetadata: labelDescription
experimentData: use 'experimentData(object)'
Annotation:
```

```r
pd_shencat <- pData(Park_ShenCat)
levels(pd_shencat$Park)
# [1] "Catoctin"   "Shenandoah"
```

```r
mod_shencat <- model.matrix(~Park, data = pd_shencat)
modres5 = fitFeatureModel(Park_ShenCat, mod_shencat)
# Warning messages:
# 1: Partial NA coefficients for 3 probe(s)
# 2: glm.fit: fitted probabilities numerically 0 or 1 occurred
# 3: glm.fit: algorithm did not converge
# 4: glm.fit: fitted probabilities numerically 0 or 1 occurred
```

I am assuming that these warning messages indicate that some of these OTUs do not have enough information for the model to converge. Is that correct?

```r
y_5 <- MRcoefs(modres5, number = 10, group = 3)
y_5
```

```
            logFC        se      pvalues   adjPvalues
OTU_449  4.493044 0.5524996 4.440892e-16 1.687539e-14
OTU_249  2.331554 0.3086840 4.241052e-14 8.057999e-13
OTU_165 -2.575309 0.3446273 7.860379e-14 9.956480e-13
OTU_8   -2.269474 0.3238382 2.416733e-12 2.295897e-11
OTU_5    3.829128 0.5488604 3.026024e-12 2.299778e-11
OTU_79   2.424200 0.4239055 1.073252e-08 6.797260e-08
OTU_2   -2.715030 0.6008155 6.215932e-06 3.374363e-05
OTU_19  -2.536399 0.5824391 1.331956e-05 6.326791e-05
OTU_59  -1.328322 0.3375506 8.313507e-05 3.510147e-04
OTU_3   -2.452399 0.6573659 1.909904e-04 6.603555e-04
```

Now here is my question: for OTU_2 it says there is a -2.71 logFC from Catoctin to Shenandoah (the levels of my comparison).

I assume that is a log2 fold change, since that is what most RNA-seq folks use, but I couldn't find any documentation saying so.

Now, I can go look at OTU_2 and I know it is much more abundant at Catoctin (level 1) compared to Shenandoah (level 2), but why is logFC reporting it as a negative number?

Also, is this log fold change calculated as log2(sum of the normalized counts for OTU_2 at Catoctin / sum of the normalized counts for OTU_2 at Shenandoah)?

Thank you for your time.

jnpaulson commented 8 years ago

Hi Carly,

Thanks - I should mention that, since this isn't really a bug report, it might be better to post these types of questions on support.bioconductor.org.

There are a number of edX courses and intro stats books that explain the basics of linear models. A nice intro is available for free here: https://leanpub.com/dataanalysisforthelifesciences/.

In your model, the first covariate estimate, the intercept, is the mean for Catoctin. The mean for OTU_2 drops by 2.71 going from Catoctin to Shenandoah. So your intuition is correct, and this can be shown mathematically. I suggest reviewing linear models or going to your favorite local statistician for advice.

Re: Q2, it is the difference of means of the log-transformed data.
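As a quick sketch of that point (in Python, with toy numbers, not the data from this thread): for a two-level factor fit by least squares, the intercept equals the reference-level mean and the second coefficient equals the difference of means.

```python
import numpy as np

# Toy log-scale abundances for one OTU (made-up numbers, not the thread's data)
catoctin = np.array([8.0, 7.5, 8.5, 8.0])      # level 1 (reference)
shenandoah = np.array([5.5, 5.0, 6.0, 5.3])    # level 2

y = np.concatenate([catoctin, shenandoah])
# Design matrix analogous to model.matrix(~Park): intercept + level-2 indicator
X = np.column_stack([
    np.ones(len(y)),
    np.concatenate([np.zeros(len(catoctin)), np.ones(len(shenandoah))]),
])

beta, _, _, _ = np.linalg.lstsq(X, y, rcond=None)
intercept, slope = beta
# intercept equals the level-1 mean; slope equals mean(level 2) - mean(level 1)
```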

CarlyMuletzWolz commented 8 years ago

Hi,

Thank you for the quick response.

I understand linear models. I have several basic and advanced statistics books sitting on my desk that I reference often.

My question is directed toward what logFC represents; it is not a standard output of a linear model. I now understand that the output takes the level 1 factor as the baseline and reports how the level 2 value compares to it. This is similar to the general output of a linear model, where the intercept is level 1 and the other levels are compared to the intercept.

My final question still remains on how is metagenomeSeq specifically calculating logFC. I find this to be a question specific for you all.

I know some people use log2 fold change, and I could not find what you all are using. I feel this is a documentation issue: you are reporting a value in a table with no indication in any metagenomeSeq help file of what that value represents. That is why I am asking it here on GitHub; I can switch over to Bioconductor if need be.

So your answer is: it is ln(mean of OTU_X at factor 2) - ln(mean of OTU_X at factor 1)? (Note: log in R defaults to the natural log.)

Is the mean of the OTU that of the normalized counts or the raw counts?

From calculating it myself it appears that it is using the normalized counts, but I could be wrong.

So the answer is: ln(mean of normalized(OTU_X at factor 2)) - ln(mean of normalized(OTU_X at factor 1))

When I extract these values for OTU_2 I get -2.77, so not too far off from the -2.71 reported (probably due to significant figures).

Then, to figure out what logFC actually means in terms of biology: I take e^(-2.77) in the case of OTU_2 and get the value 0.0626. Subtracting 1 gives the change from factor 1 to factor 2 = -0.9374.

In my case, the average count (raw or normalized?) of OTU_2 decreased by 93.74% from Catoctin to Shenandoah.

If the value is positive, for instance OTU_79 with a value of 2.42, I take e^(2.42) and get the value 11.246. Subtracting 1 gives 10.246. So, the average count (raw or normalized?) of OTU_79 has increased by 1024% from Catoctin to Shenandoah.

I played around with the data to figure out what the logFC is really telling me biologically, but couldn't tell exactly whether it was based on raw or normalized counts. In my case, OTU_2 raw gave me logFC = -3.02 and OTU_2 normalized gave me logFC = -2.77. Then OTU_79 raw gave me logFC = 2.40 and OTU_79 normalized gave me logFC = 2.29.

Thank you, and I look forward to the publication on the time series analysis; I have a time series experiment that I plan to analyze using metagenomeSeq. I am still waiting for a simple biological example of how normalization works (https://github.com/HCBravoLab/metagenomeSeq/issues/16). I can repost at Bioconductor if that will help the practicing biologist get an explanation. You also mention a publication may be coming out that will help a biologist understand what they are doing to their data, so they can explain it to other biologists and help others use your method.

Cheers,

Carly

hcorrada commented 8 years ago

Dear Carly,

Thank you for using metagenomeSeq; it is much appreciated. We are always keen on improving the usability and usefulness of this package, so your comments are welcome. We could, of course, always use help in translating what our method does into biological statements as much as possible.

So, as the Nature Methods paper describing this method explains, the inferences we produce come from a weighted linear regression on log2-transformed, normalized counts. In fact, in order not to reinvent the wheel, we use the Bioconductor limma package to fit the weighted linear regression models, and the estimates reported in metagenomeSeq are obtained directly from that. So, as in that case, you can obtain the fold change as 2^logFC. In the simple two-level factor design, it is the difference in means of log2 normalized counts, as you write, except that weights are included to correct for zero-inflation.
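In other words, the reported logFC converts back to a plain fold change via 2^logFC. A quick sketch (Python, just for the arithmetic) using the OTU_2 estimate from this thread; the -2.71 value is rounded, so the numbers below are approximate:

```python
log_fc = -2.71                       # logFC reported by MRcoefs for OTU_2
fold_change = 2 ** log_fc            # ratio of Shenandoah to Catoctin means
percent_change = 100 * (fold_change - 1)
# fold_change is about 0.15, i.e. roughly an 85% decrease
```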

HTH

CarlyMuletzWolz commented 8 years ago

Great, thank you for the reply, and for the work you all do.

That clarifies things for me as to what logFC is and also why the values I was calculating were a bit off from the values that I found in the output.

The Nature Methods paper is dense for a biologist, especially regarding what exactly CSS normalization is doing. A biological example with a small dataset, say 4 samples by 8 OTUs, explaining how the normalization constant, the median scaling factor, and the normalization scaling factor are calculated or chosen, how they are related, and how that maps onto the code one uses in R, would help a lot. With an example, it is much easier for a biologist to track what is happening; the written text is challenging to grasp.
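For instance, something along these lines is what I am after, if I understand the CSS formula correctly. This is only my guess at the published idea with a fixed quantile and an arbitrary scale constant of 1000, not metagenomeSeq's actual code (cumNormStatFast chooses the quantile adaptively per dataset):

```python
import numpy as np

def css_normalize(counts, quantile=0.5, scale=1000):
    """Cumulative-sum scaling sketch: divide each sample's counts by the
    sum of its counts at or below a chosen quantile (assumed fixed here)."""
    counts = np.asarray(counts, dtype=float)
    norm = np.zeros_like(counts)
    for j in range(counts.shape[1]):             # one column per sample
        col = counts[:, j]
        q = np.quantile(col[col > 0], quantile)  # quantile of nonzero counts
        s = col[col <= q].sum()                  # cumulative sum up to quantile
        norm[:, j] = col / s * scale
    return norm

# 4 OTUs x 2 samples; sample 2 was "sequenced" twice as deeply
toy = np.array([[10, 20],
                [10, 20],
                [10, 20],
                [10, 20]])
normalized = css_normalize(toy)   # both samples end up on the same scale
```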

I am at UMD, and am happy to meet and discuss anytime.

CarlyMuletzWolz commented 7 years ago

I wanted to add one more comment here that clarifies how fitFeatureModel calculates the logFC shown in the MRcoefs output, and how you can take the value and actually interpret it biologically (as someone asked me recently about this).

The calculation behind logFC for each OTU is: log2(mean of normalized count corrected for zero-inflation(OTU_X at factor 2)) - log2(mean of normalized count corrected for zero-inflation(OTU_X at factor 1))

######## CODE ###########

```r
# Create the metagenomeSeq object
MGS_ele = newMRexperiment(counts = OTU, phenoData = ADF, featureData = TDF)

p = cumNormStatFast(MGS_ele)
MGS_ele = cumNorm(MGS_ele, p = p)

# returns the normalization factors for each sample
head(normFactors(MGS_ele))
```

To export the normalized count matrix: I have not spent time figuring out how to extract the normalized counts corrected for zero-inflation, but it is straightforward to extract the normalized counts.

```r
normmybiom <- MRcounts(MGS_ele, norm = TRUE)
write.csv(normmybiom, "elevation_normalized.csv")
```

fitFeatureModel can only do pairwise comparisons. I had three levels of interest, so I did each comparison; below is the one comparing two localities. metagenomeSeq's fitZig function is supposed to handle multiple comparisons, but I got results that did not appear correct based on the data, so I just stuck with fitFeatureModel.

```r
# can only be pairwise comparisons with fitFeatureModel
# subset to Catoctin and Shenandoah (remove Mt. Rogers samples)
Park_ShenCat = MGS[, -which(pData(MGS)$Park == "Mt.Rogers")]
s_shencat <- normFactors(Park_ShenCat)
pd_shencat <- pData(Park_ShenCat)

# here are the levels:
levels(pd_shencat$Park)
# [1] "Catoctin"   "Shenandoah"

mod_shencat <- model.matrix(~Park, data = pd_shencat)
modres5 = fitFeatureModel(Park_ShenCat, mod_shencat)

y_5 <- MRcoefs(modres5, number = 30, group = 3)
y_5
```

OUTPUT:

```
             logFC        se      pvalues   adjPvalues
OTU_449  4.4101969 0.5785806 2.486900e-14 1.442402e-12
OTU_249  2.3669049 0.3364361 1.989298e-12 5.768963e-11
OTU_165 -2.2933512 0.3297663 3.538725e-12 6.841535e-11
OTU_5    3.9683826 0.6273578 2.523335e-10 3.658836e-09
OTU_79   2.7597166 0.4621748 2.355814e-09 2.732744e-08
OTU_8   -1.9867328 0.3538934 1.977820e-08 1.911893e-07
OTU_19  -2.4174034 0.6202862 9.729571e-05 8.061644e-04
```

Now we want to interpret the OTUs that are significant -- updated July 2017

MATHEMATICALLY -- You can calculate the mean of the normalized counts for each OTU at each factor level (extracted in the code above). BUT, what logFC calculates is based on the mean of the normalized counts corrected for zero-inflation. I did not figure out how to extract those values from the model, but I'm sure there is a way if you really wanted to.

Again, the calculation behind logFC for each OTU is: log2(mean of normalized count corrected for zero-inflation(OTU_X at factor 2)) - log2(mean of normalized count corrected for zero-inflation(OTU_X at factor 1))

I could never get exactly the same answer in Excel as metagenomeSeq reports for logFC, but I assume this is because I was not extracting the zero-inflation-corrected counts from the model. I am NOT a developer of the package, just a user trying to figure it out.

There is probably some way to do it in R. I really just did those calculations to try to figure out what the logFC was indicating and how to interpret as percent change.

BIOLOGICALLY -- you can just take the calculated logFC and do a little math to make sense of the number (percentage change = 100% x ((2^logFC) - 1)). For instance, OTU_19 above has logFC = -2.42. I take 2^(-2.42) and get the value 0.19 (in Excel: =POWER(2, -2.42)). Then I subtract 1 (0.19 - 1), giving the change from factor 1 to factor 2 = -0.81. The average normalized, zero-inflation-corrected count of OTU_19 decreased by 81% from Catoctin to Shenandoah.

If the value is positive, for instance OTU_79 with a value of 2.76, I take 2^(2.76) and get the value 6.77. Subtracting 1 (6.77 - 1) gives 5.77. So, the average normalized, zero-inflation-corrected count of OTU_79 increased by 577% from Catoctin to Shenandoah.
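That arithmetic can be wrapped up as one small helper (Python here, just to show the calculation; the inputs are the logFC values from the output above):

```python
def percent_change(log_fc):
    """Percent change implied by a log2 fold change: 100 * (2**logFC - 1)."""
    return 100 * (2 ** log_fc - 1)

otu_19 = percent_change(-2.42)   # about -81: an 81% decrease
otu_79 = percent_change(2.76)    # about +577: a 577% increase
```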

handibles commented 7 years ago

Can I ask for clarification:

@CarlyRae mentions above using ln, i.e. the natural log log_e(x), to interrogate her data. Additionally, calling ?log shows that log defaults to log_e(x) in R.

From the metagenomeSeq documentation, all functions which use or report log transformations or fold changes seem to use log2(x) (the binary log) to improve standardisation of the data; additionally, @hcorrada above indicates one would use 2^logFC to obtain the original fold change (modulo weightings etc.). A quick search suggests log2 is standard for fold-change expression.

In the absence of an obvious convention for log_e(x) in fold-change expression, could the authors confirm which logarithm is used for the fold change reported by MRcoefs etc.?

Apologies for not asking elsewhere; I thought it made most sense to append here in light of CarlyRae's thorough workthrough above.

jnpaulson commented 7 years ago

It follows the conventions of the package documentation, which explicitly mentions log2.

metagenomeSeq and many other very popular differential expression software packages use log2 in part because of the doubling of DNA content during the PCR process.


CarlyMuletzWolz commented 7 years ago

Well, that might explain why my math was always a little off when I was trying to calculate the logFC by hand. I thought it was off because of the zero-inflation correction the model applies, as Hector indicated when I first asked how it was calculated. I will look this over and correct my description above based on your correction that it is log2 and not ln.

chloelulu commented 7 years ago

@jnpaulson @hcorrada @CarlyRae Hi, I am also using this package, and I cannot understand how to interpret the result. Could you do me a favor and explain it? I would really appreciate it. Here is my result; I have a lot of p-values greater than 0.05. My main questions are: 1) Is there a need to exclude all the genes with pvalues > 0.05, or adjPvalues > 0.05? 2) What does logFC mean? E.g., for gene_Alcaligenes, if the logFC = 1.23, does that mean the log2 change of gene_Alcaligenes, and how is this number interpreted? The same way @CarlyRae explained above, converting the logFC into % change? I am really frustrated. Thanks in advance.

```
                                   logFC        se      pvalues   adjPvalues
Alcaligenes                    1.2377366 0.2355239 1.478256e-07 0.0000190695
Smithella                     -0.9190338 0.2570847 3.504570e-04 0.0226044745
Anaerovorax                   -0.7757875 0.2426930 1.390675e-03 0.0597990212
Carnobacterium                 1.0227220 0.3366185 2.379780e-03 0.0718963118
Planococcaceae_incertae_sedis  1.5204578 0.5084566 2.786679e-03 0.0718963118
Caldilinea                    -0.7391929 0.2638856 5.091507e-03 0.1094674076
Alistipes                      0.8456538 0.3256765 9.414956e-03 0.1735041872
Syntrophorhabdus              -0.6628266 0.2678901 1.335167e-02 0.1983488453
Thauera                       -1.0037165 0.4077772 1.383829e-02 0.1983488453
Rheinheimera                  -1.8639262 0.7807375 1.696840e-02 0.2188924082
Bellilinea                    -0.5252676 0.2294600 2.207085e-02 0.2588308349
Aciditerrimonas               -0.6620480 0.3115342 3.357649e-02 0.3609472875
Blastopirellula               -0.8330229 0.4112121 4.278789e-02 0.4117321334
Syntrophus                    -0.4617187 0.2299835 4.468411e-02 0.4117321334
Acidaminococcus                0.5663348 0.3000884 5.912989e-02 0.4991272992
Erysipelothrix                 0.9500383 0.5127024 6.388213e-02 0.4991272992
Bacteroides                    0.5262665 0.2860231 6.577647e-02 0.4991272992
Methanosaeta                  -0.3807460 0.2185914 8.154054e-02 0.5380502143
Sporobacter                   -0.9147967 0.5269766 8.257546e-02 0.5380502143
Truepera                       0.4558947 0.2649510 8.530974e-02 0.5380502143
Acholeplasma                   0.6343396 0.3713384 8.758957e-02 0.5380502143
Dethiosulfovibrio             -0.5094983 0.3042239 9.398369e-02 0.5424726542
Azoarcus                      -0.6361008 0.3829700 9.671993e-02 0.5424726542
Pseudochrobactrum             -0.6608524 0.4200431 1.156501e-01 0.6004000292
Proteiniclasticum             -0.3733107 0.2377394 1.163566e-01 0.6004000292
Psychrobacter                 -0.3379863 0.2191402 1.229937e-01 0.6102377839
Clostridium_XlVa               0.7896000 0.5279207 1.347373e-01 0.6388845029
Gp18                          -0.3817384 0.2578097 1.386870e-01 0.6388845029
Paludibacter                  -0.8605732 0.5884569 1.436252e-01 0.6388845029
Sedimentibacter               -0.3854983 0.2851934 1.764685e-01 0.7347565557
```

CarlyMuletzWolz commented 7 years ago

I updated my workflow above. It would be amazing if one of the developers of the package could confirm the mathematical and biological interpretation of logFC as I have written below. @hcorrada @jnpaulson

Now we want to interpret the OTUs that are significant -- updated July 2017

MATHEMATICALLY -- You can calculate the mean of the normalized counts for each OTU at each factor level (extracted in the code above). BUT, what logFC calculates is based on the mean of the normalized counts corrected for zero-inflation. I did not figure out how to extract those values from the model, but I'm sure there is a way if you really wanted to.

Again, the calculation behind logFC for each OTU is: log2(mean of normalized count corrected for zero-inflation(OTU_X at factor 2)) - log2(mean of normalized count corrected for zero-inflation(OTU_X at factor 1))

I could never get exactly the same answer in Excel as metagenomeSeq reports for logFC, but I assume this is because I was not extracting the zero-inflation-corrected counts from the model. I am NOT a developer of the package, just a user trying to figure it out.

There is probably some way to do it in R. I really just did those calculations to try to figure out what the logFC was indicating and how to interpret as percent change.

BIOLOGICALLY -- you can just take the calculated logFC and do a little math to make sense of the number (percentage change = 100% x ((2^logFC) - 1)). For instance, OTU_19 above has logFC = -2.42. I take 2^(-2.42) and get the value 0.19 (in Excel: =POWER(2, -2.42)). Then I subtract 1 (0.19 - 1), giving the change from factor 1 to factor 2 = -0.81. The average normalized, zero-inflation-corrected count of OTU_19 decreased by 81% from Catoctin to Shenandoah.

If the value is positive, for instance OTU_79 with a value of 2.76, I take 2^(2.76) and get the value 6.77. Subtracting 1 (6.77 - 1) gives 5.77. So, the average normalized, zero-inflation-corrected count of OTU_79 increased by 577% from Catoctin to Shenandoah.

chloelulu commented 7 years ago

@CarlyRae Hi Carly, thanks so much for your quick reply. I have read your post over several days. Could you help me with my question on p-values? Can I filter the final results based on p-value < 0.05? For my situation, I got the result I posted above, but I do not know which numbers to trust. Can I trust all the rows with p-value < 0.05, or is there another threshold I should use, like a threshold on logFC? Thanks in advance.

CarlyMuletzWolz commented 7 years ago

I only interpreted adjusted p-values < 0.05. The p-values are adjusted for multiple comparisons using FDR corrections (see the MRcoefs help file in R). I filtered my original OTU file to contain OTUs that were present in at least 5% of individuals.
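For reference, the FDR correction here presumably means Benjamini-Hochberg, which is what R's p.adjust(..., method = "fdr") computes; I am assuming that is what MRcoefs uses, per its help file. A minimal sketch of the procedure (Python, synthetic p-values):

```python
def bh_adjust(pvals):
    """Benjamini-Hochberg adjusted p-values (a sketch of R's
    p.adjust(..., method = "fdr"); assumed to match MRcoefs' adjPvalues)."""
    n = len(pvals)
    order = sorted(range(n), key=lambda i: pvals[i])  # indices, smallest p first
    adjusted = [0.0] * n
    running_min = 1.0
    for k, i in enumerate(reversed(order)):           # walk from largest p down
        rank = n - k
        running_min = min(running_min, pvals[i] * n / rank)
        adjusted[i] = running_min
    return adjusted

# e.g. bh_adjust([0.01, 0.04, 0.03, 0.005]) -> approximately [0.02, 0.04, 0.04, 0.02]
```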

Here is methods from a paper soon to be published in J of Animal Ecology. Will share doi once available.

To generate normalized sequence counts, we performed variance-stabilizing normalization on the raw sequence counts (Paulson et al. 2013) using the functions cumNormStatFast and cumNorm in the package ‘metagenomeSeq’ (Paulson et al. 2015). This normalization method corrects for biases associated with uneven sequencing depth (Paulson et al. 2013; McMurdie & Holmes 2014). For bacterial OTU abundance, we interpreted normalized sequence counts as bacterial abundance because sequence counts are approximately quantitative for microbial OTU abundance (Amend et al. 2010). Hereafter, we refer to normalized sequence counts as bacterial abundance. In bacterial abundance analyses, we filtered the data to contain OTUs that occurred in at least 5% of samples to reduce spurious significance from OTUs with low sequence counts, and reported false discovery rate (FDR) corrected p-values.

For bacterial abundance, we determined whether OTUs were differentially abundant between species and between sites using zero-inflated log-normal (ZILN) mixture models (Paulson et al. 2013) in the package ‘metagenomeSeq’ (fitFeatureModel function; Paulson et al. 2015), with the normalized sequence counts as the response variable. The ZILN mixture models remove testing biases associated with undersampling (Paulson et al. 2013). If sites within a locality had no OTUs that were differentially abundant, then we pooled the sites together at the locality level to increase statistical power.

Here is the top part of my code that I didn't include before:

```r
library(phyloseq)
library(metagenomeSeq)

# non-rarefied
biom_site_sp <- paste("biom_sides/Site.sp_compare/otu_table_left_site_sp_json.biom", sep = "")
tree = read.tree("SMP_rooted_all_Dec2015.tre")
map <- import_qiime_sample_data("biom_sides/Site.sp_compare/SMP_runall_Map_meta__Side_left__Site.sp_compare_Y.txt")
run2 <- import_biom(biom_site_sp, tree)

site_sp <- merge_phyloseq(run2, map)
site_sp

# remove OTUs with all-zero counts
site_sp = filter_taxa(site_sp, function(x) sum(x) != 0, TRUE)

# repair column names in tax table
colnames(tax_table(site_sp)) = c("Domain", "Phylum", "Class", "Order", "Family", "Genus", "Species")
rank_names(site_sp)

# keep OTUs that are present in at least 5% of individuals; can increase
# depending on sampling effort. I had 62 individuals in my dataset.
site_sp_r = filter_taxa(site_sp, function(x) sum(x > 1) > (0.05 * length(x)), TRUE)

# work only with the reduced dataset
OTU = as(otu_table(site_sp_r), "matrix")

# convert sample_data to AnnotatedDataFrame
ADF = AnnotatedDataFrame(data.frame(sample_data(site_sp_r)))

# define dummy 'feature' data for OTUs, using their names; helps with
# extraction and relating to taxonomy later on
TDF = AnnotatedDataFrame(data.frame(OTUname = taxa_names(site_sp_r), row.names = taxa_names(site_sp_r)))
```

chloelulu commented 7 years ago

@CarlyRae Awesome! Thanks so much for your reply and for sharing the code. Got it!