Functional analysis of (meta)proteomics datasets

bmesuere commented 7 years ago

A next step for Unipept is adding support for a functional analysis next to the existing taxonomic analysis. While the problem sounds similar to the taxonomic one, there are a few problems:

Available data: each UniProt entry has an associated taxon, but the same isn't true for functional annotations. Only 12% of the entries are annotated with an EC number, and GO terms are present for 60% of the entries.
UniProt proteins are associated with a single taxon, but can have multiple functions and thus multiple GO terms (2.8 terms per entry on average). These could be aggregated at the UniProt entry level or at the peptide level. The quality of the annotations is also sometimes disappointing.
We can't reuse the LCA algorithm we use for the taxonomy. GO consists of 3 disjoint parts, each of which is a directed acyclic graph, whereas the taxonomy is a tree. This means that a GO node can have multiple parents, so aggregation must be done differently.
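To illustrate the last point, here is a minimal sketch of why a "lowest common ancestor" is not unique in a DAG. The term names and the PARENTS table are made up for illustration and are not real GO terms; this is not how Unipept does (or will do) the aggregation.

```python
from typing import Dict, List, Set

# Toy DAG (hypothetical terms): child -> set of direct parents.
PARENTS: Dict[str, Set[str]] = {
    "root": set(),
    "A": {"root"},
    "B": {"root"},
    "C": {"A", "B"},  # two parents: impossible in a tree
    "E": {"A", "B"},
}


def ancestors(term: str) -> Set[str]:
    """Return the term itself plus every term reachable via parent links."""
    seen: Set[str] = set()
    stack = [term]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(PARENTS[current])
    return seen


def lowest_common_ancestors(terms: List[str]) -> Set[str]:
    """Common ancestors that are not a proper ancestor of another common ancestor."""
    common = set.intersection(*(ancestors(t) for t in terms))
    return {
        c for c in common
        if not any(c != other and c in ancestors(other) for other in common)
    }


print(lowest_common_ancestors(["C", "E"]))  # both A and B qualify
```

In a tree the result would always be a single node; here both A and B are equally "lowest", so a different aggregation rule is needed.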

Finally, the question of "what is the expected output of a functional analysis" remains. Many articles pick a type of functional annotation and simply include a pie chart. If you ask biologists whether they learned something from the pie chart, the answer is almost always "no". We should be able to do better than that. Suggestions or good examples of such data visualisations are always welcome.

PratikDJagtap commented 7 years ago

@bmesuere and @NoctuaNivalis - we generally use MEGAN6 for functional analysis. However, the limiting steps for MEGAN6 analysis are the BLAST-P search and the relative expertise needed to perform the analysis. I think it is essential to have an independent tool that generates similar (if not identical) outputs, and the ability to perform functional analysis using the Unipept infrastructure would be great. Here are some examples (manuscript / links) that will help us define requirements. a) Can the tool allow functional analysis using InterPro2GO, eggNOG or SEED? b) Can the tool allow for pathway analysis using KEGG or iPath (http://letunic.com/publications/18276143.pdf)?

One good example of the kind of outputs expected is: http://biotechnologyforbiofuels.biomedcentral.com/articles/10.1186/s13068-016-0572-4

Going from simple to complex - a) Bar diagram b) Pie chart (both misleading, since proteins are distributed across various functional categories) c) KEGG pathway representation and d) Principal component analyses based on biological processes. I think starting with the simple outputs and moving to the more complex ones would be a good idea.

This is an important tool in this hackathon and will have an immediate use / demonstration at the ABRF 2017 workshop that Tim, JJ and I are planning to conduct (https://conf.abrf.org/the-galaxy-platform).

@alessandrotanca, @ Prof Rudney, Carolin Kolmeder, Tim Griffin might want to add on what features would be required for this tool.

alessandrotanca commented 7 years ago

@PratikDJagtap @bmesuere Concerning the type of functional annotation, we noticed that KO, eggNOG, InterPro, and UniProt protein families provide complementary information (in several cases a sequence is annotated by one database but not by another, and for other sequences it is the other way around). As I wrote yesterday, the optimum would be a tool that merges the outputs from several databases to maximize functional annotation yield. Concerning visual representation of the outputs, I completely agree with you about the uselessness and ambiguity of pie charts. A fairly good representation of abundance and dispersion can be provided by boxplots (e.g. using BoxPlotR, http://boxplot.bio.ed.ac.uk/). For complex metabolic visualization, KEGG pathway maps are really useful in my experience, but the free functions available on the KEGG website do not allow export of high-resolution images or SVG files (or do you have alternative ways to download them?). I also agree that function-based PCA and heatmaps are very useful for visualizing differential results. I'm also really happy with cladograms, which can be generated using a tool available on the Huttenhower lab's Galaxy website (GraPhlAn, https://huttenhower.sph.harvard.edu/galaxy/). ...and we are also open to newer and cleverer solutions for data visualization!
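As a rough illustration of what "maximizing annotation yield" by merging databases means, here is a small sketch that compares the fraction of proteins annotated by each source with the fraction annotated by at least one of them. The function name, protein IDs and per-source sets are hypothetical; real input would come from each tool's output files.

```python
from typing import Dict, Set


def annotation_yield(sources: Dict[str, Set[str]], all_proteins: Set[str]) -> Dict[str, float]:
    """Fraction of proteins annotated by each source, and by the union of all sources."""
    if not all_proteins:
        return {}
    total = len(all_proteins)
    result = {name: len(ids & all_proteins) / total for name, ids in sources.items()}
    # A protein counts towards the union if at least one source annotates it.
    union = set().union(*sources.values()) & all_proteins
    result["union"] = len(union) / total
    return result


# Hypothetical example: two sources that only partly overlap.
proteins = {"P1", "P2", "P3", "P4"}
sources = {"KO": {"P1", "P2"}, "eggNOG": {"P2", "P3"}}
print(annotation_yield(sources, proteins))
```

With complementary sources, the union is larger than either source alone, which is the effect described above.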

jrudney commented 7 years ago

I agree with Alessandro that the best strategy is to compare results obtained from different ontologies, which can be done in MEGAN. The key issue in adding the same capabilities to Unipept will be keeping up with changes to the databases. BLAST-nr changed dramatically in September (no more GI numbers), and MEGAN has already released several updates of its ontology mapping files for InterPro2GO and eggNOG since then. The problem with any online tool is going to be keeping it updated regularly so the results don't become obsolete. Another challenge with all ontologies is that there are many proteins which are assigned to multiple functional groups. This can cause considerable bias in display methods such as pie charts.
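A small sketch of the multiple-assignment bias: if every protein counts once for each of its categories, the slices add up to more than the number of proteins. Fractional counting (splitting one unit of weight over a protein's categories) is shown as one possible mitigation; it is my own assumption here, not something MEGAN or Unipept is known to do. The category names and protein IDs are made up.

```python
from collections import defaultdict
from typing import Dict, List


def category_counts(assignments: Dict[str, List[str]], fractional: bool = False) -> Dict[str, float]:
    """Count proteins per category, optionally splitting each protein's weight."""
    counts: Dict[str, float] = defaultdict(float)
    for categories in assignments.values():
        if not categories:
            continue  # unannotated proteins contribute nothing
        weight = 1.0 / len(categories) if fractional else 1.0
        for category in categories:
            counts[category] += weight
    return dict(counts)


assignments = {"P1": ["transport", "metabolism"], "P2": ["metabolism"]}
raw = category_counts(assignments)
split = category_counts(assignments, fractional=True)
print(raw, sum(raw.values()))      # total exceeds the 2 proteins
print(split, sum(split.values()))  # total equals the 2 proteins
```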

ckolm commented 7 years ago

In the past I have tried COG, eggNOG and KEGG for functional annotation of proteins derived from human faecal samples. Out of the three, eggNOG performed the best.

jrudney commented 7 years ago

My experience in MEGAN has been that the performance of different ontologies (InterPro2GO, eggNOG, SEED, and KEGG) relative to each other does not remain consistent. It can change each time a mapping file for any given ontology is updated. Since they are not all updated at the same time, results of a comparison between ontologies can vary over time.

ckolm commented 7 years ago

As Pratik mentioned earlier, KEGG is very repetitive and in my opinion not very suitable for microbes. With eggNOG (outside of MEGAN) I had a very good experience, getting a function for more than 90% of my proteins.

jhervey4 commented 7 years ago

This tool is quite useful for querying large numbers of sequences thru eggNOG: https://github.com/jhcepas/eggnog-mapper

bmesuere commented 7 years ago

Thanks for your input, this really helps us a lot in figuring out what to build. So far, we have been focussing on how to do the consensus calculation for a single peptide (i.e. peptide -> multiple proteins -> multiple potential annotations per protein -> aggregation to a single function) and think we have something workable for GO and EC. Using multiple annotation sources would complicate this process a bit, since it means adding (and eventually reducing) another dimension. Ideally we would like to precompute the consensus for each peptide in our database to achieve an acceptable performance.
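To make the peptide -> proteins -> annotations -> aggregation chain concrete, here is a minimal sketch. The rule used (keep the terms that occur in at least a given fraction of the annotated proteins) is only an assumption for illustration, not the consensus calculation we actually use; the function name, the min_fraction parameter and all identifiers below are placeholders.

```python
from collections import Counter
from typing import Dict, List, Set


def peptide_consensus(
    proteins: List[str],
    annotations: Dict[str, Set[str]],
    min_fraction: float = 0.5,
) -> List[str]:
    """Terms found in at least min_fraction of the annotated proteins of a peptide."""
    # Proteins without any annotation are ignored rather than counted as disagreement.
    annotated = [annotations[p] for p in proteins if annotations.get(p)]
    if not annotated:
        return []
    counts = Counter(term for terms in annotated for term in terms)
    threshold = min_fraction * len(annotated)
    return sorted(term for term, n in counts.items() if n >= threshold)


# Placeholder data: one peptide that occurs in three proteins.
annotations = {"P1": {"EC:1.1.1.1", "GO:0000001"}, "P2": {"EC:1.1.1.1"}}
print(peptide_consensus(["P1", "P2", "P3"], annotations))
print(peptide_consensus(["P1", "P2", "P3"], annotations, min_fraction=1.0))
```

Since the result depends only on the peptide's protein set, it could be computed once per peptide and stored, which is what precomputing would amount to.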

On the visualisation side, we think that we can create added value by combining the taxonomic information with the functional information in an interactive visualisation. This means that selecting a taxon in the visualisation would highlight the "functions" that are mapped to the corresponding peptides (this is usually called "brushing" in visualisation lingo).
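The data side of such brushing could look roughly like the sketch below: given a selected taxon, collect the functions of all peptides assigned to it. The two lookup tables and all names are hypothetical, and for simplicity the sketch matches the taxon exactly instead of also including its descendants in the taxonomy.

```python
from typing import Dict, Set


def functions_for_taxon(
    selected_taxon: str,
    peptide_taxon: Dict[str, str],
    peptide_functions: Dict[str, Set[str]],
) -> Set[str]:
    """Functions to highlight when the given taxon is selected."""
    highlighted: Set[str] = set()
    for peptide, taxon in peptide_taxon.items():
        if taxon == selected_taxon:
            # Peptides without functional annotation simply add nothing.
            highlighted |= peptide_functions.get(peptide, set())
    return highlighted


# Hypothetical peptides, taxa and function labels.
peptide_taxon = {"PEPA": "Bacteroides", "PEPB": "Bacteroides", "PEPC": "Homo sapiens"}
peptide_functions = {"PEPA": {"f1"}, "PEPB": {"f1", "f2"}, "PEPC": {"f3"}}
print(functions_for_taxon("Bacteroides", peptide_taxon, peptide_functions))
```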

/cc @NoctuaNivalis

galaxyproteomics / tools-galaxyp

Functional analysis of (meta)proteomics datasets #89