Generating proteomic databases from 16S rRNA / taxonomy data.

PratikDJagtap commented 7 years ago

TOOL IDEA: Most metagenomics studies rely on 16S ribosomal RNA-based taxonomic identification, so a tool that takes species names as input and retrieves the corresponding proteomes (if available) from the UniProt website would be desirable. Researchers we have spoken with in the field of metaproteomics agree that this would be a useful tool. Any ideas on the effort that would be required to build it?

Suggestions (from emails in November 2014):

A) Suggestion by Ira Cooke (@iracooke Australia):

Uniprot has a great API … so if you know the species identifier (or list of them) you can get a customized database direct from Uniprot by downloading using a special url that contains all the taxonomic identifiers. This negates the need for a merge step.

This is an example (Dog and Mouse)

http://www.uniprot.org/uniprot/?query=taxonomy%3a9615+OR+taxonomy%3a10090&force=yes&format=fasta
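A minimal sketch of how such a URL could be assembled from a list of taxon IDs. The base URL and the `force`/`format` parameters are copied from the example above; the function name `build_fasta_url` is hypothetical, and UniProt may have changed its endpoints or query syntax since this thread, so treat the URL layout as an assumption to verify.

```python
from urllib.parse import urlencode

# Base URL taken from the example in this thread (assumption: still valid).
UNIPROT_QUERY_URL = "http://www.uniprot.org/uniprot/"


def build_fasta_url(taxon_ids):
    """Return a query URL requesting FASTA for all given NCBI taxon IDs."""
    if not taxon_ids:
        raise ValueError("at least one taxon ID is required")
    # Join the IDs into one query: taxonomy:9615 OR taxonomy:10090 ...
    query = " OR ".join(f"taxonomy:{int(t)}" for t in taxon_ids)
    params = {"query": query, "force": "yes", "format": "fasta"}
    # urlencode percent-escapes ':' and turns spaces into '+'.
    return f"{UNIPROT_QUERY_URL}?{urlencode(params)}"


print(build_fasta_url([9615, 10090]))  # dog and mouse
```

Because all IDs go into a single query, no merge step is needed afterwards, as noted above.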

I guess the trick would be to go from species names to taxon ids … since this is inherently fuzzy (species might be listed under a different name from what you expect). For my purposes I just do this by hand using uniprot via the ncbi taxonomy database … but if you have a bulk list of species names I wouldn’t be sure how to do it in an automated way (unless all the species names had a perfect match in the database).

I believe this is the best option as it is simple (just a galaxy tool), it doesn’t require storing data locally and it will always give the latest data. It is also precise as there is no reliance on parsing names.

One missing piece is the “Species -> TaxonID” tool, but could be done using a local download of the NCBI Taxonomy data (or a web API .. I haven’t looked but Uniprot might even provide this too). I’d actually say that you’re better off getting away from using species names if possible … to be precise you need the taxon id’s at some point anyway.
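A rough sketch of the "Species -> TaxonID" step using a local copy of the NCBI Taxonomy dump. It assumes the `names.dmp` layout from the NCBI taxdump archive (fields separated by tab, pipe, tab, with each line ending in tab, pipe); the function names and the choice to accept only scientific names and synonyms are illustrative assumptions, not an existing tool. Exact matching like this will miss misspelled or unexpected names, which is the fuzziness mentioned above.

```python
def load_name_index(lines):
    """Map lower-cased taxon names to NCBI taxon IDs from names.dmp-style lines."""
    index = {}
    for line in lines:
        line = line.rstrip("\n")
        if line.endswith("\t|"):
            line = line[:-2]  # drop the trailing field terminator
        fields = line.split("\t|\t")
        if len(fields) < 4:
            continue  # skip malformed lines
        tax_id, name, _unique_name, name_class = fields[:4]
        # Assumption: only scientific names and synonyms are trusted.
        if name_class in ("scientific name", "synonym"):
            index.setdefault(name.lower(), int(tax_id))
    return index


def lookup(index, name):
    """Return the taxon ID for an exact (case-insensitive) name, or None."""
    return index.get(name.strip().lower())


# Usage with a local dump file (the path is hypothetical):
# with open("taxdump/names.dmp", encoding="utf-8") as handle:
#     index = load_name_index(handle)
# lookup(index, "Canis lupus familiaris")
```

Names that return `None` would still need to be resolved by hand.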

B) Suggestion by Lennart Martens (Belgium):

DBToolkit can do this from the local, complete UniProt file (in .dat format) for species as well as for entire taxons, specified as either the text string ('homo sapiens') or the TaxIDs (9606). As stated above, it does require a local version of the file, however.

Conclusion:

Most 16S rRNA studies produce lists of identified species (and strains). It would be a good idea to take such a list and either a) convert it into taxonomy identifiers or b) submit it as species names, and then use the UniProt API or features from DBToolkit to generate a FASTA file of the available proteomes.

bgruening commented 7 years ago

@PratikDJagtap is this a first step? https://github.com/bgruening/galaxytools/blob/master/tools/uniprot_rest_interface/uniprot.xml

PratikDJagtap commented 7 years ago

Yes, if this can take a taxon ID (or species name) as an input and generate a protein FASTA as an output. I am copying @iracooke to see if he has any inputs.

jj-umn commented 7 years ago

@PratikDJagtap Did you want this to take a dataset with an NCBI taxon ID column as input? I think the flow would be sixgill -> unipept -> uniprot. I could update https://github.com/galaxyproteomics/tools-galaxyp/tree/master/tools/uniprotxml_downloader to do that.

jj-umn commented 7 years ago

@bgruening Should uniprotxml_downloader be subsumed by uniprot_rest_interface? uniprot_rest_interface could just add the taxon queries as additional condition options and add the xml output format for the morpheus application.

bgruening commented 7 years ago

@jj-umn sounds good to me. Less overlap between tools is always good. I had a few problems with the REST interface when sending too many requests in a short time, so we should keep an eye on this.
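One common way to cope with a service that rejects bursts of requests is to retry with exponential backoff. The sketch below is a generic illustration, not what either tool currently does; `call_with_backoff`, its parameters, and the delay values are all assumptions. It catches `OSError`, which is the base class of `urllib.error.URLError`.

```python
import time


def call_with_backoff(fetch, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call fetch(); on OSError wait 1x, 2x, 4x ... base_delay and retry."""
    for attempt in range(max_attempts):
        try:
            return fetch()
        except OSError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)


# Usage (the URL variable is hypothetical):
# from urllib.request import urlopen
# data = call_with_backoff(lambda: urlopen(url).read())
```

The `sleep` argument is injectable so the waiting can be replaced in tests.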

PratikDJagtap commented 7 years ago

@jj-umn Yes, it would be a good idea to be able to supply taxon identifiers so that the application can fetch protein sequences from UniProt and generate a customized protein FASTA file. However, this would be a separate path from SixGill (which would use WGS data). As @iracooke stated: "One missing piece is the “Species -> TaxonID” tool, but could be done using a local download of the NCBI Taxonomy data (or a web API .. I haven’t looked but Uniprot might even provide this too)."

alessandrotanca commented 7 years ago

I'll try to outline the pipeline we currently use to generate 16S rRNA-based databases, hoping this helps:

1) 16S rRNA analysis results are usually in BIOM format (e.g. an OTU table generated by QIIME, like the one attached), so it would be great if the taxonomic information could be retrieved from such a file.
2) In our experience, considering taxa at the genus level is a good compromise between database size and completeness (see Tanca et al. Microbiome 2016), but the family level is also fine.
3) We also usually set an abundance threshold (e.g. 0.01%), as low-abundance taxon assignments are more prone to false positives.
4) We then retrieve from UniProt the FASTA file containing all protein sequences assigned to the taxa (e.g. genera) that reach this threshold, simply by typing into the search box something like: taxonomy:faecalibacterium OR taxonomy:roseburia OR taxonomy:alistipes

Attachment: raw_otu_table_example.txt
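The thresholding and query-building steps could be sketched roughly as follows. The function names and the example counts are invented for illustration, and the threshold of 0.01% is expressed as the fraction 0.0001; the `taxonomy:<name>` query form is the one typed into the search box above.

```python
def select_taxa(counts, min_fraction=0.0001):
    """Return taxon names whose relative abundance is at least min_fraction."""
    total = sum(counts.values())
    if total <= 0:
        return []
    return sorted(name for name, n in counts.items() if n / total >= min_fraction)


def build_taxonomy_query(names):
    """Join taxon names into a 'taxonomy:a OR taxonomy:b' search string."""
    return " OR ".join(f"taxonomy:{name.lower()}" for name in names)


# Hypothetical genus-level read counts summed from an OTU table.
genus_counts = {
    "Faecalibacterium": 12000,
    "Roseburia": 5000,
    "Alistipes": 2999,
    "Rarebug": 1,  # 0.005% of reads, below the 0.01% threshold
}
print(build_taxonomy_query(select_taxa(genus_counts)))
```

Summing counts per genus before thresholding, rather than per OTU, is a design choice this sketch assumes; the thread does not say at which level the threshold is applied.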

jj-umn commented 7 years ago

@PratikDJagtap https://github.com/galaxyproteomics/tools-galaxyp/tree/master/tools/uniprotxml_downloader has selection for commonly used uniprot organisms or one can enter the "Organism ID" (NCBI taxon ID) that can be looked up at: http://www.uniprot.org/proteomes/

PratikDJagtap commented 7 years ago

Jim Johnson @jj-umn 09:08 @alessandrotanca @PratikDJagtap I'm just trying the raw_otu_table_example.txt file for input organism names. Are there other formats from which we should also parse out organism names, such as biom?

alessandrotanca @alessandrotanca 09:17 @jj-umn @PratikDJagtap I've just asked my metagenomics colleagues, and they confirm that the standard format for the taxonomic classification of operational taxonomic units (OTUs) produced by the most widespread tool (QIIME) is the following: `k__kingdom; p__phylum; c__class; o__order; f__family; g__genus`. Therefore, by retrieving the text after `f__` one obtains the name of the family to which that particular OTU has been assigned.
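A small sketch of extracting a rank from such a lineage string, assuming each rank is written as a one-letter prefix followed by two underscores (so the family follows `f__`) and that ranks are separated by semicolons. The `s` (species) prefix and the handling of empty ranks are assumptions beyond what is stated above, and `parse_lineage` is a hypothetical helper.

```python
RANK_PREFIXES = {
    "k": "kingdom", "p": "phylum", "c": "class",
    "o": "order", "f": "family", "g": "genus",
    "s": "species",  # assumption: not listed in the comment above
}


def parse_lineage(lineage):
    """Split 'k__A; p__B; ...' into a {rank: name} dict, skipping empty ranks."""
    ranks = {}
    for part in lineage.split(";"):
        prefix, sep, name = part.strip().partition("__")
        if sep and prefix in RANK_PREFIXES and name:
            ranks[RANK_PREFIXES[prefix]] = name
    return ranks


lineage = ("k__Bacteria; p__Firmicutes; c__Clostridia; "
           "o__Clostridiales; f__Ruminococcaceae; g__Faecalibacterium")
print(parse_lineage(lineage)["family"])
```

An OTU assigned only down to family level (for example ending in an empty `g__`) simply has no `genus` key in the result.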

jj-umn commented 7 years ago

@PratikDJagtap @alessandrotanca I've used the Galaxy tool "Query Tabular" to select a distinct set of the rightmost taxon names from raw_otu_table_example.txt. Then I updated our locally installed UniProtXML downloader to accept a file containing a taxon name column. I'm checking its capacity limits now. It successfully downloaded 1,321,992 FASTA entries for a list of 22 taxon names in 12 minutes. I'm now trying with a list of 269 names.

PratikDJagtap commented 7 years ago

Awesome JJ! Once the basic tool is available, we can see whether it can be placed on JetStream for testing so that @alessandrotanca and others can provide input to make it a robust tool. Thanks!

jj-umn commented 7 years ago

@PratikDJagtap @alessandrotanca I gave UniProtXML downloader a list of 269 names, and it downloaded 36 million FASTA sequences in 2 hours.

alessandrotanca commented 7 years ago

@jj-umn @PratikDJagtap Wow!

PratikDJagtap commented 7 years ago

@jj-umn gave UniProtXML downloader a list of 269 names and it downloaded 36 million FASTA sequences in 2 hours! Awesome @jj-umn!!! We have the first prototype of a 16S rRNA database downloader. The list also included family names (hence the large size of the database), so it would be a good idea to restrict it to genus or species names. Comments welcome!

jj-umn commented 7 years ago

UniProtXML downloader is available on GitHub and the Tool Shed:

https://github.com/galaxyproteomics/tools-galaxyp/tree/master/tools/uniprotxml_downloader

https://toolshed.g2.bx.psu.edu/view/galaxyp/uniprotxml_downloader/e1abc9a35c64

Generating proteomic databases from 16S rRNA / taxonomy data. #86