biocore / emp

Code repository of the Earth Microbiome Project.
http://www.earthmicrobiome.org
BSD 3-Clause "New" or "Revised" License

100 most wanted list #23

Open gregcaporaso opened 11 years ago

gregcaporaso commented 11 years ago

The OTUs that are abundant across many environment types and distant from sequences in Greengenes/NCBI. We'll have to develop a sorting scheme for this, but it would be a way to provide a list of the "most wanted" OTUs, i.e. the high-abundance cosmopolitan organisms that are not well characterized.

jairideout commented 11 years ago

Greg and I discussed this and decided on a sorting scheme. The most wanted list will only include "new" OTUs (i.e. ones that were created de novo, not from greengenes).

Sorting priorities:
1) Sort by the number of environments the OTU is found in.
2) Sort by the total count across all environments.
3) Sort by % dissimilarity to greengenes.
4) Sort by % dissimilarity to the NCBI nt database.
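A minimal sketch of what this multi-level sort could look like, assuming each OTU has already been summarized into a record. The `OtuSummary` type, its field names, and `sort_most_wanted` are hypothetical and only for illustration; they are not taken from the repo.

```python
from typing import NamedTuple


class OtuSummary(NamedTuple):
    # Hypothetical per-OTU summary record (field names are assumptions).
    otu_id: str
    num_environments: int     # number of environments the OTU is found in
    total_count: int          # total count across all environments
    gg_dissimilarity: float   # percent dissimilarity to Greengenes
    ncbi_dissimilarity: float  # percent dissimilarity to the NCBI database


def sort_most_wanted(otus):
    """Sort descending on each priority; later keys only break ties."""
    return sorted(
        otus,
        key=lambda o: (o.num_environments, o.total_count,
                       o.gg_dissimilarity, o.ncbi_dissimilarity),
        reverse=True,
    )
```

Because the key is a tuple, the dissimilarity values only matter when two OTUs tie on both environment count and total count.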

Output should include a tab-separated table containing the sorted most wanted OTU IDs, sequence, greengenes assigned taxonomy, and NCBI closest sequence link.

Additional output should be an HTML table (for easy integration into the EMP website) that contains the information above plus a piechart showing the abundance of the OTU in each environment.

jairideout commented 11 years ago

It looks like this approach probably won't work because the top N results will just be abundant OTUs found in many environments that will (most likely) be very similar to either gg or the nt database. The alternative is to first sort by % dissimilarity to the databases, but this could be really expensive (i.e. take a long time to complete, time we don't have).

A filter-based approach (as opposed to multiple levels of sorting) will probably work better. Greg previously did something similar and it seemed to work okay (though some of the parameters may need to be played with to get a good list).

1) Filter to only include novel OTUs.
2) Filter to only include high abundance OTUs, within a specified range (i.e. 100 < OTU count < 500).
3) Filter to only include OTUs that are in at least N environments/sample types.
4) Filter to only include OTUs that are at least some percent dissimilar from Greengenes (e.g. 20% dissimilar).
5) BLAST the rest against nt and sort by % dissimilarity.
6) Pick the top N from those.
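Steps 1-4 could be sketched roughly as below. This is only an illustration under assumptions: the input is a plain dict with hypothetical keys (`is_novel`, `env_counts`, `gg_dissimilarity`), the function name is made up, and the default thresholds are placeholders to be tuned. Step 5 needs a BLAST run and is not covered here.

```python
def filter_candidates(otus, min_count=100, max_count=500,
                      min_environments=4, min_gg_dissimilarity=20.0):
    """Apply steps 1-4 as successive filters and return surviving OTU IDs.

    `otus` maps OTU ID -> dict with hypothetical keys:
      'is_novel'          bool, True for de novo (non-Greengenes) OTUs
      'env_counts'        dict mapping environment name -> count
      'gg_dissimilarity'  percent dissimilarity to Greengenes
    """
    survivors = []
    for otu_id, info in otus.items():
        counts = info['env_counts'].values()
        total = sum(counts)
        num_envs = sum(1 for c in counts if c > 0)
        if not info['is_novel']:                        # step 1
            continue
        if not (min_count < total < max_count):         # step 2
            continue
        if num_envs < min_environments:                 # step 3
            continue
        if info['gg_dissimilarity'] < min_gg_dissimilarity:  # step 4
            continue
        survivors.append(otu_id)
    return survivors
```

The cheap filters run first so that only a small set of OTUs is left for the expensive BLAST step.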

We'll see how this works...

rob-knight commented 11 years ago

We could just look at the ones that were new clusters (i.e. don't have gg ids because they failed ref picking), right?

jairideout commented 11 years ago

Yes, that will be the first step in the process, but I think we'll need to do additional filtering (steps 2-5) to get a good list, because many of these novel OTUs might be very similar to either gg seqs or nt seqs.

gregcaporaso commented 11 years ago

@meganap, would you be able to help @jrrideout with some css magic to make the html table that he's putting together for this look a little nicer?

meganap commented 11 years ago

sure no prob

jairideout commented 11 years ago

@meganap awesome, thanks! I'm finishing up some changes tonight and will have the table in the repo sometime tomorrow. Will let you know when it is ready.

gregcaporaso commented 11 years ago

Once @meganap takes a crack at it, it'd be best to include her css in the html generation code for future runs.

jairideout commented 11 years ago

@meganap, the table is in the repo now under isme14/most_wanted_otus/most_wanted_otus.html. To view it, open it up in a web browser (I've tried out Chrome and Firefox) and it should find all of the other files it needs (they are all under that same directory).

I tried to keep styling to a minimum. The table has the id 'most_wanted_otus_table' and each of the subtables for the piechart legends has the class 'most_wanted_otus_legend'. If there's anything else I can do from my end to make this HTML easier to style, please let me know.

I think the goal was to add this table to one of the EMP webpages. Thus, I'm not sure if we should directly add the CSS to the table-generating code as @gregcaporaso suggested, because it may be better to just use the EMP CSS stylesheets that are already in use on the website. You may need to get in touch with @douginator2000 to get access to those if you don't have them already. If we go this route, the table-generating code will be able to create generic tables which can then be styled according to whatever website scheme they might be dropped into (thinking of additional uses for this table besides the EMP website).

Thanks again for your help with this, and please let me know if you come across any issues.

@gregcaporaso, this most wanted table does not include OTU tables 1288, 933, and 550 because they were too big to filter on an m2.4xlarge EC2 instance. You mentioned offline that there might be a way to get access to a node with more memory (>69GB). Do you still want to go this route, or just use the table that we have?

jairideout commented 11 years ago

@meganap, I forgot to mention that the second column in the HTML table needs to keep its contents formatted as-is (I'm using pre tags currently, maybe there is a better way to do this though). We just need to keep it formatted with fixed-width font and have those linebreaks respected.
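For reference, here is a rough sketch of how a table fragment like this might be generated. It is not the actual script in the repo: the function name and the two-column layout are assumptions for illustration, and only the table id and the use of pre tags come from the description above.

```python
from html import escape


def format_table_html(rows):
    """Return an HTML <table> fragment (no <html> or <head> wrapper).

    `rows` is a list of (otu_id, sequence_text) tuples; this two-column
    layout is a simplification for illustration.
    """
    lines = ['<table id="most_wanted_otus_table">',
             '<tr><th>OTU ID</th><th>Sequence</th></tr>']
    for otu_id, seq_text in rows:
        # <pre> keeps the fixed-width font and respects the linebreaks.
        lines.append('<tr><td>%s</td><td><pre>%s</pre></td></tr>'
                     % (escape(otu_id), escape(seq_text)))
    lines.append('</table>')
    return '\n'.join(lines)
```

An alternative to pre tags would be a CSS rule on that column such as `white-space: pre; font-family: monospace;`, which keeps the markup generic and leaves the presentation to the stylesheet.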

meganap commented 11 years ago

@jrrideout cool, I'll take a crack at this tomorrow

gregcaporaso commented 11 years ago

Thanks guys!

@gregcaporaso, this most wanted table does not include OTU tables 1288, 933, and 550 because they were too big to filter on an m2.4xlarge EC2 instance.

I think we just have to go with this for right now, but for the paper we'll get this running on a system with more memory.

gregcaporaso commented 11 years ago

@douginator2000, when this is ready could you add another collapsible section on the EMP login page (same place as the summary statistics, etc.)?

meganap commented 11 years ago

@gregcaporaso @jrrideout Sorry I didn't get a chance to work on this yet since I was working on figures for other isme stuff, but is there still time for this?

jairideout commented 11 years ago

I think there is (though I'm not 100% sure what the deadline was for this). I'm available to help however I can from my end of things.

rob-knight commented 11 years ago

Yes still useful, deadline sunday

meganap commented 11 years ago

hey @jrrideout I noticed that there aren't any html headers for the file and that it just starts off with divs. Is there a reason for this? Adding css styling is only possible if we have html headers.

jairideout commented 11 years ago

@meganap, @gregcaporaso requested that I only output the HTML table so that it could be easily dropped into a webpage. Please feel free to modify/add to the HTML as needed to style it (this table will ultimately need to be added to the EMP login page).

meganap commented 11 years ago

@jrrideout I've edited the script that writes the html so it writes some stuff in a different way, can you send me the full command you used to run that script (like where the test files are?) so that I can rerun it?

jairideout commented 11 years ago

@meganap I'll have to rerun it because it requires the entire nt database, and everything is already set up for this in an EC2 instance. Can you please update the accompanying unit tests and check in your changes? Once they're in, I'll rerun it and commit the latest results to the repo. It won't take long to run.

jairideout commented 11 years ago

@meganap The changes are in; please let me know if you run into any issues.

jairideout commented 11 years ago

@douginator2000 this is all ready to go. All relevant files are under isme14/most_wanted_otus/. The only file that you can exclude from there is 'analysis_notes.txt'. Thanks!

@meganap thanks for your help in spicing up the table- it looks really good!

gregcaporaso commented 11 years ago

Hey guys, This is awesome, thanks! Doug, could you get this accessible via the EMP site?

In the meantime I posted here to make it easier for everyone else to see: https://dl.dropbox.com/u/2868868/most_wanted_otus/most_wanted_otus.html

One thing we'll want to do is include the number of samples for each of the metadata categories in addition to the percentage, but I think that can wait. (Thanks for the suggestion Daniel!)

Greg

rob-knight commented 11 years ago

Yes this is spectacular -- thanks for putting this together! Could we get a tree showing where in the phylogeny the 100 most wanted are?

gilbertjack commented 11 years ago

Am I right to think that the criteria for this are those that @jrrideout came up with:

1) Filter to only include novel OTUs.
2) Filter to only include high abundance OTUs, within a specified range (i.e. 100 < OTU count < 500).
3) Filter to only include OTUs that are in at least N environments/sample types.
4) Filter to only include OTUs that are at least some percent dissimilar from Greengenes (e.g. 20% dissimilar).
5) BLAST the rest against nt and sort by % dissimilarity.

gregcaporaso commented 11 years ago

Yes, that's right. @jrrideout, correct us if we're wrong here.

gilbertjack commented 11 years ago

ok but what were the N's for these two filters: 3) Filter to only include OTUs that are in at least N environments/sample types.
 4) Filter to only include OTUs that are at least some percent dissimilar from Greengenes (e.g. 20% dissimilar).

jairideout commented 11 years ago

@gilbertjack Steps 1-5 listed above are what I used. Here are the parameters I ended up using:

1) filtered out against gg 97
2) abundance: 100 < OTU count < 500
3) at least 4 environments
4) included only OTUs that were at least 20% dissimilar (according to uclust) from gg 97
5) only included OTUs that were 97% similar or less compared to the NCBI nt database (according to blastall)

So we only ended up with 45 OTUs that were left over after all of that filtering. Please let me know if you have any additional questions regarding how this list was generated.
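As an illustration of step 5, the similarity cutoff could be applied to BLAST results along these lines. This sketch assumes the hits are available in the standard 12-column tabular format (query ID in the first column, percent identity in the third); whether the actual run used that output format is an assumption, and the function names are hypothetical.

```python
import csv


def best_identity_per_query(blast_tabular_lines):
    """Map each query ID to its highest percent identity among its hits."""
    best = {}
    for fields in csv.reader(blast_tabular_lines, delimiter='\t'):
        if not fields or fields[0].startswith('#'):
            continue  # skip blank lines and comment lines
        query_id, pct_identity = fields[0], float(fields[2])
        if pct_identity > best.get(query_id, 0.0):
            best[query_id] = pct_identity
    return best


def keep_dissimilar(best, max_identity=97.0):
    """Keep OTUs whose best hit is at most `max_identity` percent
    identical, ordered from most to least dissimilar."""
    kept = [(q, p) for q, p in best.items() if p <= max_identity]
    return sorted(kept, key=lambda pair: pair[1])
```

Two caveats with this sketch: queries with no hits at all never appear in tabular output and would need to be handled separately, and percent identity is computed over the aligned region only, so a short alignment can overstate similarity.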

@gregcaporaso @rob-knight I think these feature requests sound great, though I will not have time to work on them to meet the deadline today.

gregcaporaso commented 11 years ago

Thanks a lot!

gilbertjack commented 11 years ago

AWESOME, thanks

cuttlefishh commented 8 years ago

@rob-knight said: EMP most wanted and picrust definitely valuable this time around (i.e. are there “most wanted” that are in environments with “interesting” parameters?).