caporaso-lab / student-microbiome-project

Central repository for data and analysis tools for the StudentMicrobiomeProject.

Which body habitats are most/least variable through time? #2

Open floresg opened 11 years ago

floresg commented 11 years ago

A. Alpha diversity
   a) Metrics (richness, phylogenetic diversity, Shannon index)
      1) Coefficient of variation (CV = standard deviation / mean): useful for comparing the variation of two populations independent of the magnitude of their means.
   b) Look for differences within each body habitat based on:

B. Beta diversity
   a) Metrics (weighted/unweighted UniFrac)
      1) Metrics that contain abundance information are more appropriate for these data, because skin habitats are rich in low-abundance transient OTUs, which are weighted more heavily by a presence/absence metric.
      2) Median absolute deviation (MAD): not sensitive to outliers.
      3) Mean of pairwise comparisons.
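
As a rough illustration of the two spread measures named in this outline, here is a minimal Python sketch. The function names and the toy values are assumptions made up for illustration; they are not taken from the project's analysis code or data.

```python
from statistics import mean, median, stdev


def coefficient_of_variation(values):
    """CV = sample standard deviation / mean."""
    m = mean(values)
    if m == 0:
        raise ValueError("CV is undefined when the mean is zero")
    return stdev(values) / m


def median_absolute_deviation(values):
    """Median of the absolute deviations from the median."""
    center = median(values)
    return median([abs(v - center) for v in values])


# Hypothetical alpha diversity values for two habitats over three time points.
palm_alpha = [4.0, 5.0, 6.0]
forehead_alpha = [2.0, 2.5, 3.0]
cv_palm = coefficient_of_variation(palm_alpha)
cv_forehead = coefficient_of_variation(forehead_alpha)

# Hypothetical beta diversity distances with one unusually large value.
distances = [0.5, 0.6, 0.7, 0.6, 0.95]
mad = median_absolute_deviation(distances)
```

In this toy data the two habitats have the same CV even though their means differ by a factor of two, which is the property the outline points to. The MAD stays close to the typical deviation despite the single large distance.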

floresg commented 11 years ago

Question about the Beta diversity part of this analysis - instead of averaging all the pairwise comparisons for an individual, should we average only those from adjacent time points?
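
To make the two options concrete, here is a small sketch that averages either all pairwise distances or only those between adjacent time points. The `toy_distance` function is a stand-in assumption (distance growing with the time gap); in practice the values would come from a UniFrac distance matrix.

```python
from itertools import combinations
from statistics import mean


def mean_all_pairs(timepoints, dist):
    """Average distance over every pair of time points."""
    return mean([dist(a, b) for a, b in combinations(timepoints, 2)])


def mean_adjacent_pairs(timepoints, dist):
    """Average distance over consecutive time points only."""
    ordered = sorted(timepoints)
    return mean([dist(a, b) for a, b in zip(ordered, ordered[1:])])


def toy_distance(a, b):
    # Hypothetical: distance increases linearly with the gap in weeks.
    return 0.1 * abs(a - b)


weeks = [1, 2, 3, 4]
all_pairs = mean_all_pairs(weeks, toy_distance)
adjacent = mean_adjacent_pairs(weeks, toy_distance)
```

If distances do grow with the time gap, as assumed here, the all-pairs average will be larger than the adjacent-only average, and the size of that difference depends on how long the series is.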

gregcaporaso commented 11 years ago

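Added some data showing a comparison across individuals. See the analysis results here (https://github.com/gregcaporaso/student-microbiome-project/wiki/Overall-variability-across-body-sites). Working on within-individual comparisons now.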

floresg commented 11 years ago

One question I had about these analyses is how to normalize sampling effort across individuals, since some people have only 5 samples and others have up to 14. If there is a time-distance decay relationship, then you would expect individuals who turned in samples further apart in time to show greater variability than those who turned in samples closer together in time. Should we be randomly sampling five samples from each individual for these analyses?
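
A minimal sketch of the random subsampling idea, assuming samples are tracked as a list of week numbers per subject. The function name, the data layout, and the choice to skip subjects with fewer than `n` samples are all assumptions for illustration.

```python
import random


def subsample_timepoints(samples_by_subject, n=5, seed=0):
    """Randomly keep n time points per subject; skip subjects with fewer than n."""
    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    subsampled = {}
    for subject, weeks in samples_by_subject.items():
        if len(weeks) < n:
            continue
        subsampled[subject] = sorted(rng.sample(list(weeks), n))
    return subsampled


# Hypothetical subjects with 14, 5, and 3 samples.
samples_by_subject = {
    "s1": list(range(1, 15)),
    "s2": [1, 3, 5, 8, 10],
    "s3": [2, 4, 6],
}
subsampled = subsample_timepoints(samples_by_subject)
```

Note that a purely random draw equalizes the number of samples but not the spacing between them, so it does not by itself remove a time-distance decay effect.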

rob-knight commented 11 years ago

I would recommend doing some matched analyses testing the effect against the subset of subjects who returned all samples (i.e., compare against the same 5 time points from subjects with all time points).
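
One way to read this suggestion, sketched under assumptions: take a subject with a full series, restrict it to the weeks that a sparser subject returned, and compare the variability estimate before and after. The names and numbers below are hypothetical, and standard deviation is used only as a placeholder for whichever variability measure is chosen.

```python
from statistics import stdev


def restrict_to_weeks(values_by_week, weeks):
    """Keep only the values measured in the given weeks, in week order."""
    return [values_by_week[w] for w in sorted(weeks) if w in values_by_week]


# Hypothetical per-week values for a subject who returned all ten weeks.
complete = {1: 0.41, 2: 0.45, 3: 0.39, 4: 0.50, 5: 0.44,
            6: 0.47, 7: 0.42, 8: 0.52, 9: 0.40, 10: 0.46}

# Weeks returned by a hypothetical sparse subject.
sparse_weeks = {1, 3, 5, 7, 9}

matched = restrict_to_weeks(complete, sparse_weeks)
full_spread = stdev(complete.values())
matched_spread = stdev(matched)
effect_of_subsetting = matched_spread - full_spread
```

Repeating this over all complete subjects would show how much the estimate shifts purely because fewer time points were used.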

rob-knight commented 11 years ago

Thanks. Those are extremely significant t test values and I bet all the nonparametric values are 0 even if you do 10^9 iterations.

The fact that forehead is lower diversity/lower variability than palm was known in Costello et al. though not sure we reported it clearly.

It might be worth reopening the discussion about which measures of variability are useful and how we should apply and compare them?
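
For context on the nonparametric values mentioned above, here is a generic Monte Carlo permutation test sketch. It is not the test used for the wiki results; the function name, the test statistic (absolute difference in means), and the toy data are assumptions.

```python
import random
from statistics import mean


def permutation_p_value(a, b, iterations=999, seed=0):
    """Fraction of label shuffles whose mean difference is at least the observed one."""
    rng = random.Random(seed)
    observed = abs(mean(a) - mean(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(iterations):
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:len(a)]) - mean(pooled[len(a):]))
        if diff >= observed:
            hits += 1
    return hits / iterations


# Hypothetical, well separated distance values for two body sites.
palm = [0.70, 0.72, 0.74, 0.76, 0.78]
forehead = [0.40, 0.42, 0.44, 0.46, 0.48]
p = permutation_p_value(palm, forehead)
```

With this simple `hits / iterations` estimate, a reported p-value of 0 only means that no shuffle was as extreme as the observed split, so the smallest resolvable value is bounded by the number of iterations.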

gregcaporaso commented 11 years ago

From Rob's comment:

It might be worth reopening the discussion about which measures of variability are useful and how we should apply and compare them?

This is something that @jrrideout is actively working on for the microbiogeo analysis/paper and we'll feed the results into this analysis.

antgonza commented 11 years ago

I think that one of the most important questions we need to answer is what is the best way to characterize variation in bacterial communities: mean or median. Now, I'm not sure this is the perfect dataset to do this, but it will be good to keep it in mind while selecting analytical tools.

Antonio González Peña Research Assistant, Knight Lab University of Colorado at Boulder https://chem.colorado.edu/knightgroup/ http://scholar.google.com/citations?user=d5EXd78AAAAJ

gregcaporaso commented 11 years ago

I think just mean/median is not enough; a five-number summary (minimum, first quartile, median, third quartile, and maximum) would be better. An alternative would be median and median absolute deviation. Thoughts on this?

I really don't like mean for this, for the usual sensitivity-to-outliers reason. Outliers can pop up here all the time, e.g. if someone sneezed on their hands a couple of minutes before sampling at one of the time points (while these would look different, they are probably not different enough to be flagged as mislabeled).
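
A small sketch of the five-number summary and of the outlier sensitivity described above. The distances are made-up values with one deliberately large entry, and the quartile convention (`method="inclusive"`) is an arbitrary choice for illustration.

```python
from statistics import mean, median, quantiles


def five_number_summary(values):
    """Return (minimum, first quartile, median, third quartile, maximum)."""
    q1, q2, q3 = quantiles(values, n=4, method="inclusive")
    return (min(values), q1, q2, q3, max(values))


# Hypothetical distances for one individual; the last value mimics a one-off outlier.
distances = [0.30, 0.32, 0.34, 0.36, 0.90]

summary = five_number_summary(distances)
avg = mean(distances)
mid = median(distances)
```

Here the single large value pulls the mean well above the median, while the quartiles still describe the bulk of the distribution.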

gregcaporaso commented 11 years ago

From Rob's comment:

I would recommend doing some matched analyses testing the effect vs the subset of subjects who returned all samples (ie compare vs same 5 timepoints from subjects with all timepoints).

One relatively minor issue here is that we don't currently define what it means for someone to have turned in all samples. Technically the sampling period was 10 weeks, but if people kept providing samples, we kept taking them, so we have up to ~13 weeks of data from some individuals. Gilbert/Dan, you're most familiar with the metadata - would we be safe defining 10 weeks as "all"? If so, does anyone object to that definition?

floresg commented 11 years ago

We may want to define "all" as 8 weeks' worth of samples, because then more individuals will be included. One other thing to consider is consecutive time points. For some individuals those 8 samples could have been turned in over a 14-week period.
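
To make the definition question concrete, here is a sketch that reports the sample count, the span, and whether the weeks are consecutive for one individual. The function names, the `min_samples=8` default, and the example weeks are assumptions, not an agreed definition.

```python
def sampling_summary(weeks):
    """Summarize one individual's sampling weeks."""
    ordered = sorted(set(weeks))
    if not ordered:
        return {"n_samples": 0, "span_weeks": 0, "consecutive": False}
    gaps = [later - earlier for earlier, later in zip(ordered, ordered[1:])]
    return {
        "n_samples": len(ordered),
        "span_weeks": ordered[-1] - ordered[0] + 1,
        "consecutive": all(gap == 1 for gap in gaps),
    }


def is_complete(weeks, min_samples=8, require_consecutive=False):
    """Apply a candidate definition of having turned in 'all' samples."""
    info = sampling_summary(weeks)
    if info["n_samples"] < min_samples:
        return False
    return info["consecutive"] or not require_consecutive


# Hypothetical individual: 8 samples spread over 14 weeks.
spread_out = [1, 2, 4, 6, 8, 10, 12, 14]
info = sampling_summary(spread_out)
```

Under the count-only definition this individual qualifies, but not if consecutive weeks are also required.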

rob-knight commented 11 years ago

Sounds reasonable. If you're worried about outliers might it be worth looking at histograms of some/all of the distributions eg as thumbnails?

antgonza commented 11 years ago

I guess my comment wasn't clear enough. My concern about mean vs. median comes from introducing the median absolute deviation (MAD) here, when we have used histograms and means for other analyses before, and I just do not want this point to get lost.

gregcaporaso commented 11 years ago

I think the histograms cover what we'd show in a five-number summary. I think you're saying that it'd be worth mentioning in the paper why we're choosing to use the median, etc. rather than the mean - is that right? I agree that that's a good technical point to mention.

Also, I wanted to point out that Jai is working on a subsampling strategy relevant for time series analysis to address Gilbert's suggestion for subsampling. We're discussing this here (https://github.com/qiime/qiime/issues/446), and he is shooting to have a function in place that we could use to explore this by the end of this week.

rob-knight commented 11 years ago

There are two separate points here:

  1. mean vs median for comparisons of distances
  2. whether to use a measure of central tendency (mean or median or whatever) or a measure of spread (standard deviation or MAD or whatever)

In both cases, comparison and discussion would probably be a good idea.
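
The difference between the two points can be seen in a toy example: two sets of distances with the same center but very different spread. The values and the function name below are assumptions for illustration only.

```python
from statistics import mean, median, stdev


def center_and_spread(distances):
    """Report two measures of central tendency and two measures of spread."""
    med = median(distances)
    mad = median([abs(d - med) for d in distances])
    return {
        "mean": mean(distances),
        "median": med,
        "stdev": stdev(distances),
        "mad": mad,
    }


# Hypothetical distances for two body sites.
site_a = [0.48, 0.49, 0.50, 0.51, 0.52]
site_b = [0.30, 0.40, 0.50, 0.60, 0.70]

stats_a = center_and_spread(site_a)
stats_b = center_and_spread(site_b)
```

A comparison based only on mean or median would call these two sites equivalent, while a comparison based on standard deviation or MAD would not, so the choice of measure determines which question is being answered.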

Rob

floresg commented 11 years ago

Besides the moving pictures data and infant gut time-series, the other human microbiome time series studies involve the vagina and nares. Both used different metrics to quantify beta diversity variability. In the vaginal paper, they used the median of Jensen-Shannon divergence to represent "community deviation from constancy." The supplemental section of this manuscript describes this metric but it sounds like it is just another metric based on entropy. They do provide justification of this choice but it is not very clear. The nares paper used the index of multivariate dispersion (IMD) to measure "the variability of an individuals bacterial community structure among the months." I did a little digging on this metric but could not find anything very helpful. These two metrics might be something we want to look into for our work and at least should start a constructive conversation. I am not sure how to add the papers to GitHub so I will send them to Greg and maybe he can add them to my comment here?
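
For reference, here is the standard definition of Jensen-Shannon divergence between two relative-abundance vectors, with base-2 logarithms so the value lies between 0 and 1. Taking the median over consecutive time points, as done at the end, is only an assumption for illustration and may not match the procedure in the vaginal paper; the abundance profiles are made up.

```python
from math import log2
from statistics import median


def kl_divergence(p, q):
    """Kullback-Leibler divergence in bits; terms with p_i == 0 contribute 0."""
    return sum(pi * log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jensen_shannon_divergence(p, q):
    """JSD(P, Q) = 0.5 * KL(P || M) + 0.5 * KL(Q || M), with M = (P + Q) / 2."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


# Hypothetical relative abundances of three taxa at three time points.
profiles = [
    [0.70, 0.20, 0.10],
    [0.60, 0.30, 0.10],
    [0.20, 0.30, 0.50],
]
consecutive_jsd = [
    jensen_shannon_divergence(a, b) for a, b in zip(profiles, profiles[1:])
]
median_jsd = median(consecutive_jsd)
```

Because JSD can equivalently be written as the entropy of the mixture minus the average entropy of the two distributions, it is indeed an entropy-based measure, and unlike UniFrac it uses no phylogenetic information.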

gregcaporaso commented 11 years ago

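Here are links to those two papers: Camarinha-Silva (2012) (http://onlinelibrary.wiley.com/doi/10.1111/j.1758-2229.2011.00313.x/full) and Gajer (2012) (http://sciencemedicine.org/content/4/132/132ra52.short).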

rob-knight commented 11 years ago

We collaborate with Jacques/Pawel so let me know if methods clarifications needed: Jacques and I are on the NIH call right after Fri meeting so I can bug him then...

floresg commented 11 years ago

Added beta diversity dotplots for average values and MAD. For unweighted UniFrac, the results agree with Greg's boxplots and statistical analysis; that is, variability of palm > forehead > gut > tongue. However, weighted UniFrac and MAD tell a different story.

gregcaporaso commented 11 years ago

@floresg is going to look specifically at what was previously issue #8 here (Are individuals that reported having atopic diseases (allergies, asthma, eczema, etc) more or less stable than those that did not? Diversity higher or lower?)

floresg commented 11 years ago

I have added some text and tables but am still working on this issue.