biocore / American-Gut

American Gut open-access data and IPython notebooks

Picrust #144

Closed mortonjt closed 9 years ago

mortonjt commented 9 years ago

Submitting first draft of the picrust ipython notebook.

Based a bit of the plotting on the Alpha Diversity notebooks.

A few minor details that I'm a little stumped on:

  1. How do I import americangut.diversity_analysis? There is no setup.py, and I can't run the diversity analysis notebooks as is. I have my own solution in my notebook, but I'm not sure if this is the cleanest solution.
  2. It looks like americangut.diversity_analysis gives slightly different boxplots from the pandas boxplots. I'd like to display the sample group sizes for each bar. However, I'm not crazy about the wrapping behavior right now in the americangut boxplots. Should I submit another PR to fix this issue in americangut.diversity_analysis?

Also notice that I added an AG_100nt.zip and a zip_rural.csv. AG_100nt.zip is a zip file containing a JSON version of the AG_100nt biom file. Picrust right now doesn't take in the most recent biom files, so this file is necessary for this notebook for the time being.

zip_rural.csv is a text file for the Rural classifications for each zip code. I'm not doing any Rural analysis in this notebook, but I'll be doing some in the next notebook that I submit.

jwdebelius commented 9 years ago

I added the script directory to my python path. Note that you also need statsmodels for diversity_analysis.

I'm confused about the diversity_analysis boxplots issue. Can you clarify, please?

Also, for ease of future review, a rendered version of the notebook is here.

mortonjt commented 9 years ago

Gotcha. In that case, would my solution be good as is, since it's appending to the python path within the notebook?
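For reference, appending to the path from inside a notebook might look like the sketch below. The helper name `add_scripts_to_path` and the directory passed to it are hypothetical placeholders; the real location depends on where the repository was cloned.

```python
import os
import sys


def add_scripts_to_path(directory):
    """Append `directory` to sys.path once so modules inside it can be imported.

    `directory` is a hypothetical placeholder for wherever the repository's
    importable code lives on the local machine.
    """
    absolute = os.path.abspath(os.path.expanduser(directory))
    if absolute not in sys.path:
        sys.path.append(absolute)
    return absolute


# Placeholder location, not the real repository layout.
scripts_dir = add_scripts_to_path(os.path.join(os.getcwd(), "path", "to", "American-Gut"))
```

This only affects the running kernel, so the notebook stays self-contained, but every notebook that needs the import has to repeat it.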

The box plot issue only appears when the box plot is log scaled. If >25% of the abundances for an OTU are zero, the lower quartile appears weird, since it's log transformed.
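A toy illustration of why the lower quartile misbehaves, using made-up proportions (not American Gut data): once more than 25% of the values are zero, the first quartile is itself zero, and zero has no finite logarithm. The helper names are hypothetical.

```python
import math
import statistics


def lower_quartile(values):
    """First quartile, interpolating between the observed data points."""
    return statistics.quantiles(values, n=4, method="inclusive")[0]


def log10_or_neg_inf(value):
    """log10 for positive values; -inf for zero, where math.log10 would raise."""
    return math.log10(value) if value > 0 else float("-inf")


# Made-up proportions for one OTU: 4 of 10 samples (40%) are zero.
proportions = [0.0, 0.0, 0.0, 0.0, 0.01, 0.02, 0.05, 0.10, 0.20, 0.30]

q1 = lower_quartile(proportions)
log_q1 = log10_or_neg_inf(q1)
print(q1, log_q1)
```

How a plotting library draws a box edge that sits at zero on a log axis is up to that library, which may be one reason two tools render the same data differently.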

Pandas box plots don't seem to have this problem for some reason. Not sure why. It may not be a big deal. Just looks kinda ugly. I'll upload a picture as soon as I get on my computer.

jwdebelius commented 9 years ago

You just need to add it to your bash_profile.

My recommendation would actually be to consider alternative plotting methods. IMO, boxplots are not ideal for log data. And, you're looking at multiple lines - multiple boxplots - per category. I'd argue that's justifiable for 1-4 plots per category, but that you may want to look at something else as the number of significant groups increases. Although, that's just my opinion.

mortonjt commented 9 years ago

Butter fingers ugh.

Mind if I open up another PR to add setup information to the README? It may clarify how to run the notebooks for newcomers.

The data is not log - they are proportions. I just wanted to apply a log scale to make it easier to visualize. Here is the pandas plot:

[image: pandas]

Here is the americangut.diversity_analysis plot:

[image: pretty]

They are plotting the exact same thing on the exact same axis - it's not clear to me why they are plotting differently.

If you have other suggestions for visualizations, I'll be happy to follow up with them.

jwdebelius commented 9 years ago

The set up documentation is probably a good idea.

I'm not sure about differences in the figures. If you want to open an issue, I can look into it when I have time.

And, although I sort of hate to suggest them, because they're a non-optimal visualization method, a heat map would let you represent a lot of data very quickly. And then, individual data could be represented as scatter plots, histograms, boxplots, etc, if they're an area of focus. That way, you can show all the significant pathways in one figure. It doesn't give you as fine of detail about the proportions, but it's a quick and dirty way to look for patterns.

mortonjt commented 9 years ago

Good call about the heatmap.

I first averaged all of the samples within each month and then centered each KEGG around its mean and plotted it as shown below.

[image: month]

From this image, I'm sure that there are plenty of analyses to follow up on. What sort of analyses do you think would be most beneficial for this notebook and paper?
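A minimal sketch of the two preprocessing steps described above (average the samples within each month, then center each KEGG around its mean across months). It uses plain Python and made-up values so it runs without dependencies; the function names and data layout are assumptions, not the notebook's actual code.

```python
from collections import defaultdict


def mean_by_month(samples):
    """Average KEGG proportions over all samples collected in the same month.

    `samples` is a list of (month, {kegg: proportion}) pairs. A KEGG missing
    from a sample is treated as a proportion of 0 for that sample.
    """
    sums = defaultdict(lambda: defaultdict(float))
    counts = defaultdict(int)
    for month, profile in samples:
        counts[month] += 1
        for kegg, value in profile.items():
            sums[month][kegg] += value
    return {
        month: {kegg: total / counts[month] for kegg, total in keggs.items()}
        for month, keggs in sums.items()
    }


def center_keggs(monthly):
    """Subtract each KEGG's mean across months from its monthly averages."""
    keggs = sorted({kegg for profile in monthly.values() for kegg in profile})
    centered = {month: {} for month in monthly}
    for kegg in keggs:
        values = [monthly[month].get(kegg, 0.0) for month in monthly]
        mean = sum(values) / len(values)
        for month in monthly:
            centered[month][kegg] = monthly[month].get(kegg, 0.0) - mean
    return centered


# Made-up example: two January samples and one July sample.
samples = [
    ("Jan", {"K00001": 0.2, "K00002": 0.1}),
    ("Jan", {"K00001": 0.4, "K00002": 0.3}),
    ("Jul", {"K00001": 0.1, "K00002": 0.6}),
]
centered = center_keggs(mean_by_month(samples))
```

After centering, each KEGG's values sum to zero across months, so the heatmap colors show deviation from that KEGG's own average rather than absolute abundance.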

jwdebelius commented 9 years ago

That's more helpful, but there's still information lost in the interpretation. I'm seeing different patterns in different months (May - September seem different from November and January - March). But, it's hard to know what KEGGs are being represented. It could also be helpful to at least group the KEGGs by function, or somehow indicate that grouping.

I'd also strongly encourage multiple hypothesis correction. Have you considered using group_significance.py, which provides that functionality? It might help you limit the data more, which might make the results easier to interpret and potentially follow up on. Let me know if you want to discuss this more offline.

I'd like to see the heat map implemented in the notebook, if possible.

Also, a more general comment, which @wasade may be better able to address. My understanding was that we were submitting analysis notebooks for all the analyses we run, whether the notebooks be tutorial-style or not, rather than submitting one sample tutorial-style notebook?

mortonjt commented 9 years ago

Actually, I'm already running group comparisons to trim down the data. The heatmap that is displayed above is the result of trimming from >6000 KEGGs to ~1000 significant KEGGs.

Below is a heatmap with respect to level 3 pathways instead.

[image: month_pathways]

Thanks for looking over this!

jwdebelius commented 9 years ago

That is more manageable to me, although I'd be a bit concerned by the bladder cancer pathway.

I recognize that you're only looking at the 1000 significant KEGGs, but you haven't addressed the multiple hypothesis problem. Imagine randomly drawing samples from the same population repeatedly. As the number of times you draw the samples increases, so does the probability that you're going to see a difference, even if the difference doesn't exist. (For any single comparison, that probability is equal to your critical value.) Multiple hypothesis correction helps address this. That's why I still suggest using group_significance.py, which addresses the multiple hypothesis problem. There are alternatives, but they're a lot harder to implement.
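To make the argument concrete, here is a sketch of how the chance of at least one false positive grows with the number of tests, along with two standard corrections (Bonferroni and Benjamini-Hochberg). It is written from the textbook definitions, not taken from group_significance.py; the family-wise formula assumes independent tests, and the p-values are made up.

```python
def familywise_error_rate(alpha, n_tests):
    """Probability of at least one false positive among independent tests."""
    return 1.0 - (1.0 - alpha) ** n_tests


def bonferroni(pvalues):
    """Multiply each p-value by the number of tests, capped at 1."""
    m = len(pvalues)
    return [min(1.0, p * m) for p in pvalues]


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg FDR-adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, pvalues[index] * m / rank)
        adjusted[index] = running_min
    return adjusted


# Made-up p-values for four hypothetical KEGGs.
pvalues = [0.01, 0.04, 0.03, 0.005]
print(familywise_error_rate(0.05, 6000))
print(bonferroni(pvalues))
print(benjamini_hochberg(pvalues))
```

With thousands of KEGGs tested at a critical value of 0.05, the uncorrected family-wise rate is effectively 1, which is why the correction step matters here.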

As a general comment, I'm having trouble finding a flow to the way you've chosen to approach the analysis. There isn't a lot of structure or narrative to the notebook, so it's unclear why you're taking the approaches you're taking. Why look at TYPES_OF_PLANTS or COLLECTION_MONTH? How do you know there's only one significant KEGG in TYPES_OF_PLANTS? What information is provided by your KEGG list of doom? What conclusions should I draw from the figures you've generated?

mortonjt commented 9 years ago

I'm actually addressing the multiple hypothesis problem. It's under kw_test.

Just realized that some oral/skin samples were in the notebook. The heatmaps are actually a bit clearer now. Will upload them tonight.

jwdebelius commented 9 years ago

Sorry, I missed that line. But, I have trouble believing there are 1000 KEGGs that are biologically relevant. The Kruskal-Wallis test is typically pretty conservative, and 20% of your results being significant after correction seems very high to me.

wasade commented 9 years ago

First pass comments

last update is presumably 2015?

acronym is PICRUSt

suggested opening text revisions

" PICRUSt is a tool that can estimate functional metagenomic profiles given 16S sequencing data. PICRUSt works by first using an ancestral state reconstruction method to infer likely profiles for ancestral nodes on a phylogenetic tree from available whole genome annotations. It then predicts profiles for tips of the tree that are not associated with annotated genomes. The original profiles, as well as the predicted profiles, can then be used to infer a potential profile for an input sample by using the 16S gene abundance information in combination with the profile annotations.

PICRUSt tends to work reasonably well within human associated environments as it is within human associated environments that the majority of sequenced and annotated genomes have originated. This method is not as reliable in less well characterized environments, such as soil, as fewer sequenced genomes are available.

In this tutorial, we will be using 16S genes to predict the functional metagenomic profiles in the samples. The gene annotations are based on those provided by KEGG.

In this tutorial, we will be applying PICRUSt to the American Gut dataset to determine if there are significant functional differences between participants' gut bacteria with respect to the number of types of plants consumed and the collection month that the sample was obtained. This is used as a hypothesis-generating tool and cannot be used to make any conclusions - follow-up metagenomics surveys will need to be conducted to validate these hypotheses. * ACTUAL METAGENOMIC STUDIES STILL DON'T VALIDATE AS METAGENOMIC DATA BY ITSELF PROVIDES FUNCTIONAL POTENTIAL NOT ACTUAL EVIDENCE OF A PATHWAY OR GENE BEING USED * "
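As a rough illustration of the last step in that description (combining 16S abundances with per-organism gene profiles), the sketch below sums each OTU's gene copies weighted by its count in a sample. This is a simplification with made-up identifiers and values, not PICRUSt's code; in particular it leaves out the 16S copy-number normalization that the real tool also performs.

```python
def predict_sample_profile(otu_counts, otu_gene_content):
    """Weighted sum of per-OTU gene profiles for one sample.

    otu_counts:       {otu_id: 16S abundance in the sample}
    otu_gene_content: {otu_id: {gene_id: predicted copies per genome}}
    OTUs without a known or predicted profile contribute nothing.
    """
    profile = {}
    for otu, count in otu_counts.items():
        for gene, copies in otu_gene_content.get(otu, {}).items():
            profile[gene] = profile.get(gene, 0.0) + count * copies
    return profile


# Made-up sample and gene profiles.
otu_counts = {"otu_a": 10, "otu_b": 5, "otu_unknown": 3}
otu_gene_content = {
    "otu_a": {"K00001": 1, "K00002": 2},
    "otu_b": {"K00001": 3},
}
predicted = predict_sample_profile(otu_counts, otu_gene_content)
```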

Suggest noting at the beginning that picrust has not yet been adopted for biom 2

krustal-wallis -> Kruskal-Wallis

group_indeces -> group_indices

Krustal-Wallis -> Kruskal-Wallis

it potentially makes sense to rarefy after prediction.

...should truncate or capture the output on the list of significant KOs

Bladder cancer is odd

The traceback at the end should be removed

mortonjt commented 9 years ago

Thanks for the review!

@JWDebelius, I got a bit sidetracked on deciding which analyses to conduct, so I neglected documenting this notebook. Added in some content to clarify some of the analysis.

It's pretty crazy that there are that many significant KEGGs for collection month, right? I'm not sure what to think of it, but the heatmap seems pretty clear.

Still not sure how to interpret the last heatmap though ...

The bladder cancer pathway definitely appeared before and I remember discussing it in one of the AG meetings. I remember that the consensus was that it doesn't mean that bladder cancer was more prevalent, rather there are more KEGGs that are orthologous to proteins commonly expressed in cancerous bladder cells. ( ortholog != function right? )

@wasade :+1: on the picrust explanation.

About the rarefaction. From what I understand, rarefaction is just another form of normalization, best suited for calculating ecological statistics (e.g. diversity). However, in this scenario, I'm performing statistical tests directly on the sample proportions. The only thing rarefaction would be doing in this scenario is degrading the estimate of the sample proportion.
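For readers unfamiliar with the operation, rarefaction amounts to subsampling each sample's observations without replacement down to a fixed depth. The sketch below shows that with made-up counts and hypothetical helper names; it only illustrates the mechanics and does not settle the question either way.

```python
import random


def rarefy(counts, depth, seed=0):
    """Subsample `depth` observations without replacement from a count table."""
    pool = [feature for feature, n in counts.items() for _ in range(n)]
    if depth > len(pool):
        raise ValueError("depth exceeds the number of observations in the sample")
    rng = random.Random(seed)
    rarefied = {feature: 0 for feature in counts}
    for feature in rng.sample(pool, depth):
        rarefied[feature] += 1
    return rarefied


def proportions(counts):
    """Convert counts to proportions that sum to 1."""
    total = sum(counts.values())
    return {feature: n / total for feature, n in counts.items()}


# Made-up counts for one sample with 1000 observations.
counts = {"K00001": 700, "K00002": 250, "K00003": 50}
rarefied = rarefy(counts, depth=100)

full_proportions = proportions(counts)          # based on 1000 observations
rarefied_proportions = proportions(rarefied)    # based on only 100 observations
```

The rarefied proportions are estimated from a tenth of the observations, which is the loss of precision being referred to above.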

Within the next few weeks, I'll be conducting benchmarks on the side, comparing compositional statistical methods. Once these benchmarks are complete, then I'll give some hard evidence for/against rarefaction.

I'll be making some major changes in the next few days to this notebook. Namely, I'll revisit

mortonjt commented 9 years ago

Going to investigate some possible errors in this notebook, so I'll be temporarily closing this PR.