biocore / emp

Code repository of the Earth Microbiome Project.
http://www.earthmicrobiome.org
BSD 3-Clause "New" or "Revised" License

update of 'places to look for new diversity' plot #9

Closed: gregcaporaso closed this issue 12 years ago

gregcaporaso commented 12 years ago

Also, filter new OTUs to a new-only OTU table, summarize taxa, and determine how many of these new OTUs are uncharacterized at the phylum level, at the class level, ...

This is similar to the table here: http://www.nature.com/ismej/journal/vaop/ncurrent/fig_tab/ismej201279t1.html#figure-title

jairideout commented 12 years ago

@gregcaporaso are you wanting the unclassified summary table to have counts of novel OTUs that couldn't be classified at a specific taxonomic level (i.e. how many different novel OTUs were unclassifiable), or do you want the abundances added up across all samples for all unclassifiable novel OTUs? The ISME table seems to be the former, though I'm not 100% sure.
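
To make the two readings of the question concrete, here is a minimal sketch on toy data. It is not the committed script: the `novel_otus` layout, the rank list, and the `unclassified_summary` helper are all hypothetical, and an empty string is assumed to mark an unassigned rank.

```python
# Hypothetical toy data: novel OTU ID -> (taxonomy lineage, per-sample counts).
# An empty string marks a rank the classifier could not assign (an assumption).
RANKS = ["phylum", "class", "order"]

novel_otus = {
    "New.0": (["Proteobacteria", "", ""], {"S1": 5, "S2": 0}),
    "New.1": (["", "", ""], {"S1": 2, "S2": 7}),
    "New.2": (["Firmicutes", "Bacilli", ""], {"S1": 0, "S2": 1}),
}


def unclassified_summary(otus, ranks=RANKS):
    """Return {rank: (num_otus, total_abundance)} for OTUs unassigned at rank.

    num_otus is the first reading (how many distinct novel OTUs were
    unclassifiable); total_abundance is the second (their counts summed
    across all samples).
    """
    summary = {}
    for i, rank in enumerate(ranks):
        unassigned = [counts for lineage, counts in otus.values()
                      if not lineage[i]]
        num_otus = len(unassigned)
        total_abundance = sum(sum(c.values()) for c in unassigned)
        summary[rank] = (num_otus, total_abundance)
    return summary


summary = unclassified_summary(novel_otus)
for rank in RANKS:
    print(rank, summary[rank])
```

With this toy input the two numbers diverge quickly (one OTU but nine sequences at the phylum level), which is why the choice matters for the table.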

gregcaporaso commented 12 years ago

Sorry, I think I had combined a couple of different ideas here. To start out, we'll want an update of the plot from the AAAS meeting, which shows the number of new OTUs (with respect to Greengenes) per biome as a series of box plots. If we end up with taxonomic assignments for all of them (which we may or may not have, depending on compute time), then a table like the ISME one linked above would also be very useful.

jairideout commented 12 years ago

@gregcaporaso the AAAS code plots the percentage of sequences that failed to make it into a GG OTU by sample type, so I'll make it possible to produce this type of plot as well as the one you describe above.
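
As a rough illustration of the two per-sample quantities behind these plots, the sketch below computes the number of novel OTUs and the percentage of novel sequences for each sample, then groups them by sample type so each group can become one box. The table layout, the OTU IDs, the `reference_ids` set, and both helper functions are assumptions made for the example, not the actual script.

```python
from collections import defaultdict

# Hypothetical inputs. otu_table: OTU ID -> {sample ID: sequence count}.
otu_table = {
    "gg_101": {"S1": 80, "S2": 50, "S3": 10},
    "gg_102": {"S1": 10, "S2": 0, "S3": 30},
    "new_1": {"S1": 10, "S2": 25, "S3": 0},
    "new_2": {"S1": 0, "S2": 25, "S3": 10},
}
# OTUs that hit the reference collection (assumed); everything else is novel.
reference_ids = {"gg_101", "gg_102"}
sample_type = {"S1": "soil", "S2": "soil", "S3": "water"}


def per_sample_novelty(table, ref_ids):
    """Return {sample: (num_novel_otus, percent_novel_seqs)}."""
    total = defaultdict(int)
    novel_seqs = defaultdict(int)
    novel_otus = defaultdict(int)
    for otu_id, counts in table.items():
        is_novel = otu_id not in ref_ids
        for sample, count in counts.items():
            total[sample] += count
            if is_novel and count > 0:
                novel_seqs[sample] += count
                novel_otus[sample] += 1
    # Skip empty samples to avoid dividing by zero.
    return {s: (novel_otus[s], 100.0 * novel_seqs[s] / total[s])
            for s in total if total[s] > 0}


def group_by_category(per_sample, category_map):
    """Group per-sample values into one distribution per category value."""
    groups = defaultdict(list)
    for sample, values in sorted(per_sample.items()):
        groups[category_map[sample]].append(values)
    return dict(groups)


per_sample = per_sample_novelty(otu_table, reference_ids)
distributions = group_by_category(per_sample, sample_type)
print(distributions)
```

Each list in `distributions` holds one `(num_novel_otus, percent_novel_seqs)` pair per sample, so either element can be pulled out to draw the corresponding box plot.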

gregcaporaso commented 12 years ago

Perfect- thanks!

jairideout commented 12 years ago

Script has been committed, along with unit tests and script usage tests. Right now it generates the two types of plots discussed above, but doesn't do anything with taxonomy yet. Code has partially been written for the taxonomy tables.

Here are examples of the two types of plots that are generated. These are NOT based on real data, and suffer from the same issue found in the alpha diversity by sample type plots (i.e. they are very small datasets and the axes min/maxes aren't autoscaled very well). These can be easily tweaked as necessary once they are run on real data.

This first plot shows the number of unique novel OTUs (not by abundance or anything, just the number of new OTUs) by sample type:

https://github.com/EarthMicrobiomeProject/isme14/blob/master/code/script_test_data/new_diversity_places/new_diversity_out/num_novel_otus_by_Environment.pdf?raw=true

This plot shows the percentage of novel sequences by sample type:

https://github.com/EarthMicrobiomeProject/isme14/blob/master/code/script_test_data/new_diversity_places/new_diversity_out/percent_novel_seqs_by_Environment.pdf?raw=true

gregcaporaso commented 12 years ago

Looks good - thanks!

gregcaporaso commented 12 years ago

Metadata categories of interest here are ENV_MATTER, ENV_BIOME, ENV_FEATURE, STUDY_ID. Any other suggestions?

jairideout commented 12 years ago

Plots have now been committed to the repo under isme14/new_diversity_places/. The commands that I used are in analysis_notes.txt under that directory. The directory contains two plots for each mapping category (see my previous post for descriptions of what these plots are). Here are direct links to the plots of interest. I tweaked these a little bit, and it is easy for me to change plot size, labels, axes limits, etc. as needed. The raw data for each plot is also included in Python pickle format so that I can load it up and tweak the plots without rerunning the entire script.

ENV_MATTER:

https://github.com/EarthMicrobiomeProject/isme14/blob/master/new_diversity_places/num_novel_otus_by_ENV_MATTER_fixed.pdf?raw=true

https://github.com/EarthMicrobiomeProject/isme14/blob/master/new_diversity_places/percent_novel_seqs_by_ENV_MATTER_fixed.pdf?raw=true

ENV_BIOME:

https://github.com/EarthMicrobiomeProject/isme14/blob/master/new_diversity_places/num_novel_otus_by_ENV_BIOME_fixed.pdf?raw=true

https://github.com/EarthMicrobiomeProject/isme14/blob/master/new_diversity_places/percent_novel_seqs_by_ENV_BIOME_fixed.pdf?raw=true

ENV_FEATURE:

https://github.com/EarthMicrobiomeProject/isme14/blob/master/new_diversity_places/num_novel_otus_by_ENV_FEATURE_fixed.pdf?raw=true

https://github.com/EarthMicrobiomeProject/isme14/blob/master/new_diversity_places/percent_novel_seqs_by_ENV_FEATURE_fixed.pdf?raw=true

STUDY_ID:

https://github.com/EarthMicrobiomeProject/isme14/blob/master/new_diversity_places/num_novel_otus_by_STUDY_ID_fixed.pdf?raw=true

https://github.com/EarthMicrobiomeProject/isme14/blob/master/new_diversity_places/percent_novel_seqs_by_STUDY_ID_fixed.pdf?raw=true
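
For anyone unfamiliar with the pickle approach mentioned above, the round trip looks roughly like this. The data layout and the filename are hypothetical; the pickles in the repo may be structured differently.

```python
import pickle
import tempfile
from pathlib import Path

# Hypothetical plot data: category value -> per-sample values for one plot.
plot_data = {"soil": [12, 30, 7], "water": [4, 9]}

# Write to a temporary directory so the example does not touch the repo.
out_dir = Path(tempfile.mkdtemp())
pickle_path = out_dir / "num_novel_otus_by_ENV_MATTER.p"  # hypothetical name

# Save the plot data once, after the expensive computation.
with open(pickle_path, "wb") as handle:
    pickle.dump(plot_data, handle)

# Later: reload it and redraw the plot without rerunning the analysis.
with open(pickle_path, "rb") as handle:
    reloaded = pickle.load(handle)

print(reloaded == plot_data)
```

Pickle files should only be loaded from a trusted source, since unpickling can execute arbitrary code.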

gilbertjack commented 12 years ago

This is excellent, thanks @jrrideout

@gregcaporaso should I have someone (e.g. @dansmith01) start making generic slides for these - Daniel Smith is already making ones for the environmental gradient images.

gregcaporaso commented 12 years ago

These are looking great - thanks!

One feature request: would it be possible to filter categories that contain fewer than n samples, where n is user defined? Also, filtering the 'NA' category would be good.

Also, in addition to the "STUDY_ID" plot, could you add a plot for "TITLE"? Should be the same data, but the axis labels will be more interpretable.

jairideout commented 12 years ago

Sure, I'll filter out 'NA' now and add a script option for specifying any number of categories to filter out. I'll also add the minimum number of samples option.

What minimum number were you thinking of? I think for AAAS you filtered out distributions with 10 or fewer samples. Should I go ahead with this for now?

gregcaporaso commented 12 years ago

> What minimum number were you thinking of?

@jrrideout, I would regenerate with filtering NA and all categories with fewer than 10 samples. Thanks!

> @gregcaporaso should I have someone (e.g. @dansmith01) start making generic slides for these - Daniel Smith is already making ones for the environmental gradient images.

@gilbertjack, yes, that would be excellent, thanks!
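
The filtering requested here could look something like the sketch below: drop excluded category values such as 'NA', then drop any category with fewer than a user-defined minimum number of samples. The function name, its parameters, and the toy data are hypothetical and do not reflect the script's real options.

```python
def filter_distributions(distributions, min_samples=10, exclude=("NA",)):
    """Keep categories that are not excluded and have >= min_samples values.

    distributions: {category value: list of per-sample values}.
    """
    return {
        category: values
        for category, values in distributions.items()
        if category not in exclude and len(values) >= min_samples
    }


# Toy example: one well-sampled category, one sparse one, and 'NA'.
example = {
    "soil": list(range(12)),
    "sediment": [3, 5, 8],
    "NA": list(range(20)),
}
kept = filter_distributions(example, min_samples=10)
print(sorted(kept))
```

Using `>= min_samples` matches "fewer than 10 samples are removed"; filtering "10 or fewer", as mentioned for AAAS, would need `> min_samples` instead.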

gilbertjack commented 12 years ago

@dansmith01 - once @jrrideout sends us the updated figures can you make slides containing these?

jairideout commented 12 years ago

The updated plots (including the new TITLE plots) are now committed under isme14/new_diversity_places/. There are 10 plots (5 categories x 2 types of plots).

gilbertjack commented 12 years ago

Thanks @jrrideout

gregcaporaso commented 12 years ago

@jrrideout, are these all done? If so, can you make sure that @dansmith01 has the right files to build slides and then close the issue? Thanks!

jairideout commented 12 years ago

@gregcaporaso yep, these are done. I was waiting to get feedback in case the plots need to be tweaked or anything.

@dansmith01, all of the plots are in the repo under isme14/new_diversity_places/. There are 10 PDFs in that directory, one for each plot. Please let me know if you have any trouble accessing them or need any changes made to them as they are integrated into the slides.