HCBravoLab / metagenomeSeq

Statistical analysis for sparse high-throughput sequencing

cumNormStatFast returning default value #16

Closed CarlyMuletzWolz closed 9 years ago

CarlyMuletzWolz commented 9 years ago

Hi, I am using the GitHub version of metagenomeSeq:

install_github("nosson/metagenomeSeq")
library("metagenomeSeq")

and I am sharing my R file with someone else so they can use metagenomeSeq. I am re-running my code to make sure everything makes sense. A .biom file that I normalized last month without any problem is now returning a warning message that it did not return last month.

mybiom = load_biom("uparse_SMP_clean.biom")
p = cumNormStatFast(mybiom)
Warning message:
In cumNormStatFast(mybiom) : Low quantile estimate. Default value being used.

p
[1] 0.5

I'm confused on what is happening. Do you have any insight?

jnpaulson commented 9 years ago

Hi CarlyRae,

Thanks for your interest.

Your previous version of metagenomeSeq was most likely 1.7.0 or lower. If you check our news file, http://www.bioconductor.org/packages/release/bioc/news/metagenomeSeq/NEWS, you will see that we changed the default minimum quantile estimate in version 1.7.10.

Using cumNorm, there is an option to predefine your 'p' to be whatever value you'd like.
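For example (a minimal sketch, not part of the original reply; mybiom is the object from the code above), you can bypass the automatic estimate and pass the quantile yourself:

mybiom = cumNorm(mybiom, p = 0.5)   # use the median (0.5), the fallback value the warning refers to
head(normFactors(mybiom))           # per-sample scaling factors computed at p = 0.5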

CarlyMuletzWolz commented 9 years ago

How do you know what you are supposed to predefine your 'p' as?

I thought the reason to use this package was for the program to calculate that normalization factor for you? I thought (p = cumNormStatFast(mybiom)) was finding a value by which your samples, on average, share the same microbiome structure. How do I figure that out if I don't meet the minimum quantile estimate that is now defined in the new package?

Then from that average, calculated as p, you run (mybiom = cumNorm(mybiom, p = p)), which then determines how much each of your samples has of that average community.

Then, you run (normmybiom <- MRcounts(mybiom, norm = T)) so that you multiply each of the OTUs in a sample by the proportion of the 'core community' that sample has.

Is that what is happening, or am I wrong? And, again how do I figure out that original normalization factor, without just guessing at it?

Thanks and happy new year!

Carly

jnpaulson commented 9 years ago

@CarlyRae - p is chosen according to the cumNormStat/cumNormStatFast functions. If the estimate is deemed too low a value, then we recommend the median.

I do not recommend this, but you can take a look at the cumNormStat/cumNormStatFast code and modify it by removing the following lines in your own function: https://github.com/nosson/metagenomeSeq/blob/master/R/cumNormStat.R#L59-L62

CarlyMuletzWolz commented 9 years ago

Hi, I'm coming back to this again. I am still trying to understand what is going on when running the cumNormStatFast command and a few other commands dealing with normalization. You did not specifically address this in your last reply.

Could you please explain, in layman's terms or preferably with an example, what these commands are doing:

p = cumNormStatFast(mybiom)
mybiom = cumNorm(mybiom, p = p)
normmybiom <- MRcounts(mybiom, norm = T)

Here are my guesses:

You find all the OTUs that have nonzero counts in each sample. You make a matrix that has all the samples and all the OTUs that have nonzero counts in ALL samples. Using these data, you calculate the percentile of OTU counts that are present in all of your samples compared to all OTU counts. Say there are 500 OTUs and only 20 have non-zero counts in all of your samples; then you count up those 20 OTUs' counts and divide by all 500 OTUs' counts. You find that 100,000 counts of 300,000 counts are those relatively invariant counts, so you would have 0.33 as your p. If you had this value, cumNormStatFast returns the warning message: In cumNormStatFast(mybiom): Low quantile estimate. Default value being used.

The default value you specified in your last reply was the median. So, if all the samples do not have 50% of the same 'invariant' OTU counts, then the default specifies that 0.5 (the median) should be the percentile by which to scale counts.

Then from that value calculated as p, you run (mybiom = cumNorm(mybiom, p = p)), which then determines how much each of your samples has of that p community? The invariant community. So, if in reality it was 0.33, you are forcing p to be set at the median. What is the median? Are you looking across every OTU and finding its median count? Or only the median count for the non-zero-count OTUs across all the samples?

I run head(normFactors(mybiom)) and get

 38B4  THA6  38B3  38B2  THA2 THA14
   60    48    70    70    64    41

So, 38B4 has 60% of that median community. THA6 has 48% of that median community?

Then, you run (normmybiom <- MRcounts(mybiom, norm = T)) so that you multiply each of the OTUs in a sample by the percentile (p) of the 'core community' that sample has.

There are a lot of questions here. Any insight would be greatly appreciated.

Cheers,

Carly

jnpaulson commented 9 years ago

@CarlyRae Hi, I was wondering if you were able to find the methods section of the metagenomeSeq paper describing CSS normalization? Here is a link: http://www.nature.com/nmeth/journal/v10/n12/full/nmeth.2658.html

I will adjust the documentation to point users to the paper from now on as I think it gives a clear explanation of the method.

Your description is wrong, but I highlight below the parts of the methods section that you can read to understand what the code is performing.

p = cumNormStatFast(obj) - see the paragraph starting with: "We use an adaptive, data-driven method to determine l-hat based on the observation above. We find a value l-hat where sample-specific count distributions deviate from an appropriately defined reference distribution." This calculates the percentile, l-hat. If you are able to read code, cumNormStat will be easier to read.

obj = cumNorm(obj, p = p) - see the paragraph beginning "We introduce a new normalization method...". This calculates the sample scaling factors by summing each sample's counts up to and including l-hat.

normalizedMatrix <- MRcounts(obj, norm = TRUE) - this scales each sample by its scaling factor, producing a new matrix.
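A minimal worked sketch (for illustration only, not part of the original reply; all counts, OTU names, and sample names are made up) tying the three commands together on a tiny matrix:

library(metagenomeSeq)

# Toy counts matrix: 5 OTUs (rows) by 3 samples (columns), values invented.
counts <- matrix(c(10,  0,  5,
                    2,  8,  0,
                    0,  4,  7,
                   30, 20, 15,
                    1,  3,  2),
                 nrow = 5, byrow = TRUE,
                 dimnames = list(paste0("OTU", 1:5), c("S1", "S2", "S3")))

obj <- newMRexperiment(counts)            # wrap the matrix in an MRexperiment
p <- cumNormStatFast(obj)                 # data-driven percentile l-hat; with this few OTUs it
                                          # may fall back to 0.5 and emit the warning discussed above
obj <- cumNorm(obj, p = p)                # scaling factor per sample: sum of that sample's counts
                                          # up to and including its p-th quantile
normFactors(obj)                          # inspect the per-sample scaling factors
normalized <- MRcounts(obj, norm = TRUE)  # raw counts scaled by each sample's factor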

CarlyMuletzWolz commented 9 years ago

I have read the paper, the methods, and the metagenomeSeq .pdf. The problem is that I am a biologist trying to use your methods. I am trying to understand the text, but as you can see, I do not understand it. I would like to use your methods and share them with our people, but I do not understand the mathematical language. If you could please provide a small example using real numbers, it would be greatly appreciated.

jnpaulson commented 9 years ago

@CarlyRae I would be happy to provide an example. Please give me a few days, I'm working on finishing the dissertation and will respond asap.

CarlyMuletzWolz commented 9 years ago

Thank you and no rush! Good luck with the dissertation!

werdnus commented 9 years ago

I'd like to chime in and echo @CarlyRae: it would be really nice to have a relatively simple example for the biologists trying to use your methods. I've found the documentation for the various functions in the metagenomeSeq package frustratingly vague, and the explanation of the methods in the Nature Methods paper frustratingly dense.

Good luck dissertating!

aravenscraft commented 9 years ago

I am getting the same error message: "Warning message: In cumNormStatFast(obj) : Low quantile estimate. Default value being used."

I'd like to echo CarlyRae and werdnus's requests for a simple example for biologists. I have also read carefully through the Nature Methods paper and the vignette, and I've read the code for the function cumNormStat, as well. I'm still confused how exactly the calculations are working. (Despite working through all these materials, I am still finding the definition for l-hat particularly hard to wrap my head around; it keeps slipping out of my head. A simple example using real numbers would help me grasp it better.)

A related, higher-level question: If it's not possible to estimate appropriate quantiles for our data, should we still be trying to use this package to normalize our data? The vignette says "The choice of the appropriate quantile given is crucial for ensuring that the normalization approach does not introduce normalization-related artifacts in the data." So I am especially uncomfortable trying to guess what quantile to use, or just taking the median of the "Quantile value" column from the output of exportStats (as I think was recommended above).
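(For later readers: a small sketch, not from the thread, of inspecting the per-sample statistics mentioned above; the exportStats call and the output column names follow my recollection of the function and should be checked against your installed version.)

exportStats(mybiom, file = "mybiom_stats.tsv")   # write per-sample stats to a tab-delimited file
stats <- read.delim("mybiom_stats.tsv")          # columns include the "Quantile value" mentioned above
head(stats)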

jnpaulson commented 9 years ago

Hi @aravenscraft,

Thank you for the message and interest. Just to clarify, this is not an error, and I will fix the code to reflect that. It should not be issuing a warning.

We do recommend CSS normalization, and we choose the median value because we find it works very well in practice; we have run experiments, described in a forthcoming publication, that explain the reasoning. This will clarify the point.

Best,

aravenscraft commented 9 years ago

Ok, thank you! That's good to know!

Hope the dissertation writing went/is going well!

jeffkimbrel commented 8 years ago

Hi all,

Has this "simple example for biologists" been posted anywhere? I too (a biologist) am having trouble getting going with this package, normalizing either an OTU table or, preferably, a phyloseq object.

Thanks.
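(Not an official answer, just a sketch under the assumption that you are starting from a phyloseq object; phyloseq provides a converter to metagenomeSeq's MRexperiment class, after which the usual steps apply. my_physeq is a placeholder name.)

library(phyloseq)
library(metagenomeSeq)

mrexp <- phyloseq_to_metagenomeSeq(my_physeq)   # convert the phyloseq object to an MRexperiment
p <- cumNormStatFast(mrexp)                     # estimate the CSS percentile
mrexp <- cumNorm(mrexp, p = p)                  # compute per-sample scaling factors
normed <- MRcounts(mrexp, norm = TRUE)          # normalized OTU table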

CarlyMuletzWolz commented 8 years ago

To my knowledge - no. @jnpaulson, it would be great to see such an example!

jeffkimbrel commented 8 years ago

I am walking through the vignette, and have successfully created the MRexperiment object called obj from section 2.5 using the provided files.

But, when I get to section 3.1, I can get cumNormStatFast to work with the lungData object, but not with my obj object. They are both the same type of object:

> typeof(obj)
[1] "S4"
> typeof(lungData)
[1] "S4"

However, I get the error that there are "samples with one or zero features":

> cumNormStatFast(obj)
Error in cumNormStatFast(obj) : Warning sample with one or zero features
> cumNormStatFast(lungData)
[1] 0.7014946

I haven't attempted to use my own files yet, since I can't seem to get the object I created to work using the provided files.

I know this is getting off-topic (especially for a closed thread), let me know if I should open a new issue.

hcorrada commented 8 years ago

The data used in Sec. 2 is, as stated in the vignette, a portion of the complete dataset used in Sec. 3. You can verify that the number of features in each is different.

Also, you should consider posting questions like these at http://support.bioconductor.org
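(A quick diagnostic sketch, not from the thread, for the "sample with one or zero features" error: count the non-zero features per sample and drop any sample with fewer than two before estimating the quantile; obj is the MRexperiment built in section 2.5.)

nonzero <- colSums(MRcounts(obj) > 0)   # number of non-zero OTUs in each sample
nonzero[nonzero <= 1]                   # samples that would trigger the error
objFiltered <- obj[, nonzero > 1]       # keep samples with at least two non-zero OTUs
cumNormStatFast(objFiltered)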