biocore / emp

Code repository of the Earth Microbiome Project.
http://www.earthmicrobiome.org
BSD 3-Clause "New" or "Revised" License

are there any shared OTUs between all ecosystems surveyed #14

Open gregcaporaso opened 12 years ago

gregcaporaso commented 12 years ago

I would do this as follows:

1. Define a function that takes a biom-table object and returns a list of the OTU ids that have a count of at least n (where n is a parameter to that function) in all samples. This would be similar to qiime.filter.filter_otus_from_otu_table, where you define a filter function that gets passed to table.filter_observations.

2. Iterate over the list of OTU tables (see issue #25 for why that is necessary), parse each BIOM table with biom_format.parse.parse_biom_table, pass the table to the function defined in step 1, and store the list of OTUs that is returned.

3. Take the intersection of the results of step 2.
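The three steps above can be sketched roughly as follows. This is only an illustration under stated assumptions: plain dictionaries (OTU id mapped to per-sample counts) stand in for parsed BIOM tables so the example runs without QIIME or biom-format installed, and the function names `otus_in_all_samples` and `shared_otus` are hypothetical, not part of either library.

```python
def otus_in_all_samples(table, sample_ids, n=1):
    """Step 1: return the set of OTU ids with a count of at least n in every sample.

    Assumption: `table` is a dict mapping OTU id -> {sample id: count},
    a stand-in for a parsed BIOM table. Missing samples count as 0.
    """
    return {
        otu_id
        for otu_id, counts in table.items()
        if all(counts.get(sample, 0) >= n for sample in sample_ids)
    }


def shared_otus(tables, n=1):
    """Steps 2 and 3: apply the filter to each table, then intersect the results.

    `tables` is a list of (table, sample_ids) pairs, one per study.
    """
    per_table = [otus_in_all_samples(table, samples, n) for table, samples in tables]
    return set.intersection(*per_table) if per_table else set()


# Toy data for two hypothetical studies.
study_a = ({"OTU1": {"a1": 5, "a2": 3}, "OTU2": {"a1": 2}}, ["a1", "a2"])
study_b = ({"OTU1": {"b1": 1}, "OTU3": {"b1": 7}}, ["b1"])

core = shared_otus([study_a, study_b])
print(sorted(core))
```

With real data, each dictionary would be replaced by the table object returned from the BIOM parser, and the per-OTU test would become the filter function passed to the table's observation filter.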

I wouldn't be surprised if the answer was that no OTUs are shared across all of the samples. In that case it may be worth investigating whether there are OTUs that are present in at least 99% of the samples, etc, making the percentage parameterizable. You'd achieve this by building a different filter function that gets used in step 1.
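A relaxed filter along those lines might look like the sketch below. As before, it assumes a plain dictionary in place of a BIOM table, and the name `otus_in_fraction_of_samples` and its `min_fraction` parameter are made up for illustration.

```python
def otus_in_fraction_of_samples(table, sample_ids, min_fraction=0.99, n=1):
    """Return OTU ids with a count of at least n in at least min_fraction of samples.

    Assumption: `table` maps OTU id -> {sample id: count}; missing samples count as 0.
    """
    needed = min_fraction * len(sample_ids)
    result = set()
    for otu_id, counts in table.items():
        present = sum(1 for sample in sample_ids if counts.get(sample, 0) >= n)
        if present >= needed:
            result.add(otu_id)
    return result


# Toy data: OTU1 is in all four samples, OTU2 in three, OTU3 in one.
samples = ["s1", "s2", "s3", "s4"]
table = {
    "OTU1": {"s1": 4, "s2": 1, "s3": 9, "s4": 2},
    "OTU2": {"s1": 4, "s2": 1, "s3": 9},
    "OTU3": {"s1": 1},
}

strict = otus_in_fraction_of_samples(table, samples, min_fraction=1.0)
relaxed = otus_in_fraction_of_samples(table, samples, min_fraction=0.75)
print(sorted(strict), sorted(relaxed))
```

Setting `min_fraction=1.0` reproduces the strict "present in all samples" filter, so the same function could serve both cases.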

gregcaporaso commented 12 years ago

@lkursell, let me know if you need anything as you work up the test code today.

lkursell commented 12 years ago

@gregcaporaso, the biom tables in the dropbox per_study folder all have "metadata": null, am I looking at the wrong tables?

lukeursell ~/Desktop/code_emp/isme14/per_study_otu_tables $ grep -c taxonomy *.biom
otu_table_mc2_1031.biom:0
otu_table_mc2_1034.biom:0
otu_table_mc2_1035.biom:0
otu_table_mc2_1036.biom:0
otu_table_mc2_1037.biom:0
otu_table_mc2_1038.biom:0
otu_table_mc2_1039.biom:0
otu_table_mc2_1222.biom:0
otu_table_mc2_1235.biom:0
otu_table_mc2_1240.biom:0
otu_table_mc2_1242.biom:0
otu_table_mc2_1288.biom:0
otu_table_mc2_1289.biom:0
otu_table_mc2_1453.biom:0
otu_table_mc2_1526.biom:0
otu_table_mc2_550.biom:0
otu_table_mc2_632.biom:0
otu_table_mc2_638.biom:0
otu_table_mc2_659.biom:0
otu_table_mc2_662.biom:0
otu_table_mc2_678.biom:0
otu_table_mc2_722.biom:0
otu_table_mc2_723.biom:0
otu_table_mc2_808.biom:0
otu_table_mc2_809.biom:0
otu_table_mc2_810.biom:0
otu_table_mc2_925.biom:0
otu_table_mc2_933.biom:0

gregcaporaso commented 12 years ago

No, taxonomy assignment hasn't completed, but I don't think you need it for this - do you? Your analysis should be at the OTU level.

lkursell commented 12 years ago

OK, I'll just report these: emp.isme14..14.CleanUp.ReferenceOTU368968

gregcaporaso commented 12 years ago

Thanks. Report the sequences as well and we can classify those directly (which will obviously go much faster than running it for all of them). One easy way to do that would be to have your code output a file where each line contains tab-separated text. The first field should be the OTU id; subsequent fields can hold information such as how many samples/biomes the OTU showed up in. You can then call filter_fasta.py, passing that file as -s and new_refseqs.fna.gz as -f, to get the corresponding sequences, and classify the resulting fasta file with assign_taxonomy.py (retraining against Greengenes).

Will that work? Sorry for the round-about way of getting these data!

You can find new_refseqs.fna.gz here: https://github.com/EarthMicrobiomeProject/isme14/blob/master/new_refseqs.fna.gz?raw=true
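The tab-separated report described above could be written with something like the following sketch. The helper name `write_otu_report`, the choice of two extra columns (sample count, biome count), and the numbers in the toy rows are all assumptions for illustration; only the "OTU id in the first field" layout comes from the description.

```python
import csv
import io


def write_otu_report(rows, handle):
    """Write one tab-separated line per OTU: id first, then summary fields.

    `rows` is an iterable of (otu_id, n_samples, n_biomes) tuples.
    """
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    for otu_id, n_samples, n_biomes in rows:
        writer.writerow([otu_id, n_samples, n_biomes])


# Made-up values, written to an in-memory buffer so the sketch is self-contained.
rows = [("OTU1", 120, 7), ("OTU9", 98, 5)]
buffer = io.StringIO()
write_otu_report(rows, buffer)
report_text = buffer.getvalue()
print(report_text, end="")
```

In practice the handle would be a real file opened for writing, and that file is what would be passed to filter_fasta.py as -s.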

lkursell commented 12 years ago

I'm trying to get the new_refseqs file, but I get an 'Error: blob is too big' when clicking the link in Firefox or Chrome, or when using curl from the terminal.

jairideout commented 12 years ago

If you do a clone of the repo you'll have access to it:

git clone https://github.com/EarthMicrobiomeProject/isme14.git

gilbertjack commented 12 years ago

Hey Guys,

Do we have a list for this yet?

cuttlefishh commented 8 years ago

@mortonjt Are you interested in taking this one on?

cuttlefishh commented 8 years ago

Also want to assign @amnona but it's not letting me.

mariaasierra commented 5 years ago

Any update on this?

cuttlefishh commented 5 years ago

Hi @alehsierra, there was some discussion over on @biocore/american-gut-devs about doing this for the American Gut project (https://github.com/biocore/American-Gut), but that wouldn't address a diversity of environments. It would be easy to do with the EMP BIOM table and mapping file. In our analysis, we did look at the most prevalent sequences across the dataset (https://media.nature.com/original/nature-assets/nature/journal/v551/n7681/extref/nature24621-s4.xlsx). For this issue, it would simply be a matter of cross-referencing the BIOM table with the environment types (e.g. empo_3) in the mapping file, and seeing which sequences are found at least once in at least one sample of each sample type.
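A rough sketch of that cross-reference is below. It assumes the BIOM table has been reduced to a dictionary of per-sample counts and the mapping file to a dictionary from sample id to environment type (for example, the empo_3 column). The function name `sequences_in_every_environment` and the environment labels in the toy data are placeholders.

```python
from collections import defaultdict


def sequences_in_every_environment(table, sample_env):
    """Return sequence ids seen at least once in at least one sample of every environment.

    Assumptions: `table` maps sequence id -> {sample id: count};
    `sample_env` maps sample id -> environment type taken from the mapping file.
    """
    all_envs = set(sample_env.values())
    envs_per_seq = defaultdict(set)
    for seq_id, counts in table.items():
        for sample_id, count in counts.items():
            # Only count samples that are present and have a known environment.
            if count >= 1 and sample_id in sample_env:
                envs_per_seq[seq_id].add(sample_env[sample_id])
    return {seq_id for seq_id, envs in envs_per_seq.items() if envs == all_envs}


# Toy data with placeholder environment labels.
sample_env = {"s1": "soil", "s2": "water", "s3": "gut"}
table = {
    "SeqA": {"s1": 3, "s2": 1, "s3": 8},  # found in all three environments
    "SeqB": {"s1": 2, "s3": 4},           # missing from water
    "SeqC": {"s1": 1, "s2": 0},           # zero count in water, absent from gut
}

ubiquitous = sequences_in_every_environment(table, sample_env)
print(sorted(ubiquitous))
```

Raising the `count >= 1` threshold or requiring more than one sample per environment would be simple variations if the strict version returns too many or too few sequences.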