geneontology / go-annotation

This repository hosts the tracker for issues pertaining to GO annotations.
BSD 3-Clause "New" or "Revised" License

Annotation standards for HTP proteomics experiments #1773

Closed hattrill closed 5 years ago

hattrill commented 6 years ago

The annotation standards for proteomics experiments are not clear enough for curators.

Problems that are causing issues:

  1. Differentiating between a purification and an enrichment
  2. What to do if there is no FDR or an FDR >1.
  3. How do we critically judge when there are "too many contaminants"?

Related issues: #1469, #1655

rachhuntley commented 6 years ago

The heart of the issue for us is how far we can take author justification into account when:

  a) the experiment meets the current guideline criteria for FDR <1 and unique peptides >2, but there is a significant proportion of proteins that are unexpected in the purification
  b) the experiment is from a well-respected lab for proteomics and they consider the presence of the 'unexpected' proteins as reasonable (from personal communication)
  c) the respective scientific community want their results represented in GO, with the cell/tissue type, in order to help answer biological questions

@RLovering

hattrill commented 6 years ago

Ha! I've assigned you all! Not sure if that is appropriate github behaviour or not. Let's pretend it is.

So, I have an "easy" starter question:

For MSMS proteomics experiments, what should the average curator do if there is no FDR or the FDR is slightly higher than 1 (these generally represent older papers)?

  1. If other methodologies have been used to further refine the final list, allow curator discretion as to whether this increases the validity of the hits enough for annotation.
  2. No FDR or FDR >1, then do not annotate, ever!
  3. Allow the curator to assess the "validity" of the data using current knowledge.
  4. Don't know/don't care.

Either post your answer here or email me. I shall collate the results.

hattrill commented 6 years ago

Hi @rachhuntley, @RLovering I've been doing some thinking about the ECM issues discussed in #1655 and I would like you to think about the following suggestion:

Looking specifically at the ECM issue, I agree that the annotation of ECM components to specific tissues is valuable to researchers.

There are a number of issues:

  1. Defining the ECM. In GO: GO:0031012 extracellular matrix. A structure lying external to one or more cells, which provides structural support for cells or tissues. So, by definition, this should not include non-structural components.

  2. The ECM is quite sticky - lots of HS/GAGs and most of the protein components are pretty insoluble. This presents several problems for purification/MS experiments:

    • difficulty solubilising components for MS
    • protein modifications make identification harder
    • very sticky, which makes it difficult to remove soluble contaminants
    • contaminants may be disproportionately represented in the MS data as they stick to fewer things and are more easily identified due to fewer mods.
  3. All purification methods that I have looked at describe enrichment rather than purification. This will probably always be the case for ECM preps. The hits we see from the MS experiments are "true" but what do they represent?

    • extracellular matrix "A structure lying external to one or more cells, which provides structural support for cells or tissues."
    • soluble factors specifically bound to ECM components
    • secreted factors, non-specifically bound
    • exosomes
    • contaminants e.g. from incomplete decellularization

I have a suggestion: There is a project, http://matrisomeproject.mit.edu/ (PMID:26163349), that was initiated to help deal specifically with the issues pertaining to ECM profiling (note: these guys are scathing about GO annotation of ECM). I think that it would be quite instructive to get these people involved or at least use their resources. The issue they highlight with ECM preparation is that most of the components are insoluble and some very insoluble - hard to shift either by SDS or urea. The soluble components tend to be remodelling factors, signalling molecules, etc. that don't really constitute the ECM proper. So, they have developed a pipeline (PMID:28675934) that essentially attempts to remove as many cellular contaminants as possible, leaving the insoluble enriched ECM component (note: enriched, not purified). Then they do some funky stuff to optimize MSMS and end up with an utterly filthy list of proteins! They then use in silico analysis to sort the set into "contaminants" and "matrix" components. This allows them to compare different ECM profiling experiments to a similar standard.

PMID:28675934 "Scripts To Annotate Matrisome Proteins and Calculate Mass-Spectrometric Metrics To facilitate the annotations of matrisome proteins in large data sets, we developed a script called “Matrisome Annotator”. Providing that a data set contains Entrez or HUGO gene symbols for each entry, the script will return an output file in which each entry will be annotated as being part of the matrisome or not and will be tagged with matrisome division (core matrisome vs matrisome-associated) and category (ECM glycoproteins, collagens, proteoglycans, ECM-affiliated proteins, ECM regulators, or secreted factors). “Matrisome Annotator” can be used to annotate not only proteomic data but also any kind of list of genes/proteins. We also developed a second script, called “Matrisome Analyzer”, that calculates the proportion of ECM content in terms of number of spectra, number of unique peptides, number of proteins and peptide intensity (i.e., protein abundance) in proteomics data set. This script allows rapid evaluation of the abundance of matrisome versus non-matrisome proteins in any given data set input as a delimited text file and exports the calculation in tables and graphs. Both scripts are available as webtools and to download under the Analytical Tools section of the Matrisome Project Web site (http://matrisome.org/)."
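To make the quoted description a bit more concrete, here is a rough Python sketch of what the two scripts appear to do: tag each entry by gene symbol with a matrisome division and category, then work out what share of a metric (spectra, unique peptides) comes from matrisome entries. This is not the Matrisome Project's code. The lookup table, the function names and the example hit list are made up for illustration, and the real tool uses its own curated gene lists.

```python
# Toy lookup standing in for the Matrisome Project gene lists.
# Assumption: illustrative entries only; check the real lists before use.
MATRISOME = {
    "COL1A1": ("Core matrisome", "Collagens"),
    "FN1": ("Core matrisome", "ECM Glycoproteins"),
    "DCN": ("Core matrisome", "Proteoglycans"),
    "MMP2": ("Matrisome-associated", "ECM Regulators"),
    "TGFB1": ("Matrisome-associated", "Secreted Factors"),
}

NOT_MATRISOME = "Not matrisome"


def annotate(hits):
    """Tag each hit (a dict with a 'gene' key) with division and category."""
    annotated = []
    for hit in hits:
        division, category = MATRISOME.get(
            hit["gene"].upper(), (NOT_MATRISOME, None)
        )
        annotated.append({**hit, "division": division, "category": category})
    return annotated


def matrisome_fraction(annotated, metric):
    """Fraction of a summed metric (e.g. 'spectra') carried by matrisome hits."""
    total = sum(h[metric] for h in annotated)
    if total == 0:
        return 0.0
    in_matrisome = sum(
        h[metric] for h in annotated if h["division"] != NOT_MATRISOME
    )
    return in_matrisome / total


# Hypothetical MS hit list, not taken from any paper.
hits = [
    {"gene": "COL1A1", "spectra": 120, "unique_peptides": 30},
    {"gene": "MMP2", "spectra": 20, "unique_peptides": 5},
    {"gene": "ACTB", "spectra": 60, "unique_peptides": 15},
]
annotated = annotate(hits)
spectra_share = matrisome_fraction(annotated, "spectra")
```

Keeping the "core" and "associated" divisions separate in the output is what would let a curator treat them differently when deciding what to annotate.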

This seems a potentially good way forward, especially if you are aiming to systematically capture ECM profiling experiments. The website contains a large number of datasets already curated.

I think that the RCA evidence code (http://geneontology.org/page/rca-inferred-reviewed-computational-analysis) fits this pipeline perfectly, probably with a specific GO_REF.

The only reservation I have about RCA is that it has a note saying "Note: Annotations using the RCA code should be reviewed after one year, any older than this date will be deleted.". That sounds like a rule that needs to go.

Well, let me know what you think. It could be a plausible solution for other datasets that have troublesome contaminants e.g. plasma membrane preps but can be combined with computational analysis (e.g. TMpred) to yield a higher quality set for annotation.

rachhuntley commented 6 years ago

Hi Helen,

This sounds like an interesting proposition, however I would like to investigate this a bit more and I'm not sure this will be done before the call next week.

I've had a quick look at the matrisome tool and put a couple of our datasets through and I can already see a few proteins that are specifically mentioned in the paper as being expected in the ECM, e.g. glycoproteins, but they aren't coming up in the tool. So, just to be clear, are you suggesting we only annotate to ECM the proteins that are categorised by this tool and not annotate those that are not categorised?

What I'd like to do is do this more thoroughly and, 1. speak to our ECM expert to see if he agrees that this is a sensible approach with regards to his datasets (his data actually fares quite well in the tool) and 2. try to contact the matrisome people to see if they could look at the ECM studies we have and incorporate any proteins in these lists that they believe are missing from their database (they say they are open to additional suggestions and will update frequently).

However, looking at the output of the tool, we may have to consider an additional GO term to cover ECM-associated proteins. As the matrisome people say "We believe that the “core matrisome” categories (collagens, proteoglycans and ECM glycoproteins) are robust and not likely to change much with further analyses, at least for mammals and probably other vertebrates (other taxonomic groups clearly do contain additional ECM proteins). However, the “matrisome-associated” categories (secreted factors, regulators and affiliated proteins) are, by their nature, less firmly established and we suspect that they may well evolve in light of subsequent analyses. These latter categories were deliberately “inclusive” — although many proteins within those categories undoubtedly do bind reproducibly to ECM, others may not (see Fig. 1). Our aim was to define categories that would capture all candidate components of the ECM."

So, if we were to take this approach, we should separate the core-ECM proteins from the associated-ECM proteins and try to improve the GO representation of this component. How about having a parent to the existing term called "extracellular matrix region" and then the existing ECM term could be restricted to the core ECM proteins (collagens, proteoglycans and ECM glycoproteins)?

Also, I think the existing definition of ECM should be revised so it's not limited to structural proteins. Here are a couple of definitions I found that say it's more than structural: "The extracellular matrix (ECM) is a collection of extracellular molecules secreted by support cells that provides structural and biochemical support to the surrounding cells." (Wikipedia)

"The extracellular matrix (ECM) is the non-cellular component present within all tissues and organs, and provides not only essential physical scaffolding for the cellular constituents but also initiates crucial biochemical and biomechanical cues that are required for tissue morphogenesis, differentiation and homeostasis." (http://jcs.biologists.org/content/123/24/4195)

As for RCA, I think this might work (I can't speak for @RLovering though, but it's worth noting that we can ISS from RCA in Protein2GO - for the piggy proteins), even if we keep the 'reviewed every 12 months', as it's a fairly simple job to run these lists through the matrisome tool to see if anything has changed - assuming the tool continues to be maintained, of course.

I will update you when I've investigated this a bit more.

hattrill commented 6 years ago

Yes, don't expect you to have looked over it and made a decision before the call, as it does need a bit more investigation to see if it would work for the data you have. As they seem to be setting themselves up as the gatekeepers of ECM, it would be good to make sure that their list is as inclusive as possible. I would definitely consult with the ECM community and get their input. I guess that their "core ECM" list would be the structural part that you could annotate with "extracellular matrix component" and then perhaps the "matrisome-associated" component could be captured by another term - "extracellular matrix binding"? Or a new term?

I think that you are correct in saying that the current structure in GO isn't very satisfactory. 'extracellular matrix component' and 'extracellular matrix' seem somewhat synonymous with their current definitions. A parent term "extracellular matrix region" or even matrisome (!) could perhaps capture soluble and structural components.

hattrill commented 6 years ago

So, had a look over purification vs enrichment issues, talked to a couple of MS people and came up with a couple of paragraphs that might give a bit of help for the guidelines. We can discuss this on Tuesday. Feel free to comment before.

Assigning cellular component terms based on HTP analysis:

Although the quality of the mass spectrometry is relatively easy to determine, it can be more difficult to assess the quality of the purification. In general, the authors should have taken steps to reduce the contaminants in the sample, but it is up to the curator to judge whether the sample is merely an enrichment rather than a purification.

Methods that a curator should expect to see in a high-quality purification protocol include:

    • Multi-step purification (e.g. tandem affinity purification, PMID:20658971)
    • Purification optimisation
    • Verification of purity by assaying contaminants & known components

Strategies that couple purification techniques with data analysis can significantly decrease the number of false positives arising because of contaminants. For example:

    • Excluding components that do not appear in replicates and repeats
    • Multivariate data/principal component analysis (PMID:22472443, PMID:27278775, PMID:25165137) e.g. LOPIT (PMID:15295017)

For some cellular components, a high degree of purity may be difficult to achieve. In such cases there may be a sequence motif that may act as a good computational handle to pull out the most likely candidates. Plasma membrane protein purification is an example where the hydrophobicity of the sample is problematic and using a transmembrane prediction tool to reduce the list to those proteins likely to be integral to the membrane may substantially reduce false positives. In these cases the RCA (inferred from Reviewed Computational Analysis) evidence code should be used (http://geneontology.org/page/rca-inferred-reviewed-computational-analysis). If this analysis was performed outside of the publication, then a GO Reference should be made to outline the protocol used.
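As a rough illustration of the "computational handle" idea, the sketch below filters a plasma membrane hit list down to proteins with at least one predicted transmembrane helix. It assumes the helix counts have already been obtained from a prediction tool and parsed into a dictionary. The accessions, the counts and the `min_helices` cut-off are hypothetical, and no particular tool's output format is implied.

```python
def filter_integral_membrane(hits, tm_helix_counts, min_helices=1):
    """Keep hits with at least `min_helices` predicted transmembrane helices.

    Proteins with no prediction available are treated as having zero helices,
    so they are dropped rather than silently kept.
    """
    return [p for p in hits if tm_helix_counts.get(p, 0) >= min_helices]


# Hypothetical accessions and predicted helix counts.
hits = ["P1", "P2", "P3", "P4"]
tm_helix_counts = {"P1": 7, "P2": 0, "P3": 1}  # P4 has no prediction

candidates = filter_integral_membrane(hits, tm_helix_counts)
```

Only the proteins in `candidates` would be considered for annotation under this kind of protocol, and whatever protocol is used would be described in the GO Reference.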

rachhuntley commented 6 years ago

Note: this thread has split into two issues. I have moved all the comments regarding curating ECM proteins from proteomics experiments to a new issue #1796

rachhuntley commented 6 years ago

Hi Helen,

I think this list is a good place to start, but I think people need to look at their CC proteomics papers and work through this list to see if it works, or whether anything can be added or altered.

To make this simpler, I have created a spreadsheet so that we can try this out on some of our papers. We've added a couple of our papers so far: https://docs.google.com/spreadsheets/d/1lU82ErpJt59v-r1Ylp6QTi-XyS_qPqvUkqQdP7ZH85I/edit?usp=sharing

Just a couple of things have already come up whilst doing this:

  1. We can check off those items in the above list that are covered in the paper, but how many of these techniques should be present for it to be deemed acceptable to annotate?
  2. Can you clarify what you would expect to see from 'purification optimisation'?
  3. Multi-step purification may not always be necessary for some CCs, e.g. a lot of our HTP CC annotations are for extracellular space (e.g. from blood plasma), which is commonly 'purified/enriched' by a single-step ultracentrifugation to remove the cells. As long as they check for cellular contaminants, this should be acceptable?

hattrill commented 6 years ago

Hi Rachel,

Good idea to collect some examples.

I think my "should" was a little too strong a statement (as I don't expect all to be present): "Methods that a curator might expect to see in a high-quality purification protocol include:"

And, yes, some purifications can be a simple one-step and all done!

hattrill commented 6 years ago

"Can you clarify what you would expect to see from 'purification optimisation'?"

What I mean by this is that there has been some effort to improve the protocol to minimize the ratio of contaminants to expected proteins. Although this isn't shown in most publications, they may reference an earlier paper where this was done.

hattrill commented 6 years ago

So updated to reflect comments:

Assigning cellular component terms based on HTP analysis:

Although the quality of the mass spectrometry is relatively easy to determine, it can be more difficult to assess the quality of the purification. In general, the authors should have taken steps to reduce the contaminants in the sample, but it is up to the curator to judge whether the sample is merely an enrichment rather than a purification.

Methods that a curator might expect to see in a high-quality purification protocol include:

    • Multi-step purification (e.g. tandem affinity purification, PMID:20658971)
    • Purification protocol optimisation
    • Verification of purity by assaying contaminants & known components

Strategies that couple purification techniques with data analysis can significantly decrease the number of false positives arising because of contaminants. For example:

    • Excluding components that do not appear in replicates and repeats
    • Multivariate data/principal component analysis (PMID:22472443, PMID:27278775, PMID:25165137) e.g. LOPIT (PMID:15295017)

For some cellular components, a high degree of purification may be achieved with a relatively simple, one-step protocol e.g. separating plasma from plasma cells by centrifugation. Techniques such as principal component analysis can allow for good results to be achieved from simple enrichments.

For some cellular components, a high degree of purity may be difficult to achieve. In such cases there may be a sequence motif that may act as a good computational handle to pull out the most likely candidates. Plasma membrane protein purification is an example where the hydrophobicity of the sample is problematic and using a transmembrane prediction tool to reduce the list to those proteins likely to be integral to the membrane may substantially reduce false positives. In these cases the RCA (inferred from Reviewed Computational Analysis) evidence code should be used (http://geneontology.org/page/rca-inferred-reviewed-computational-analysis). If this analysis was performed outside of the publication, then a GO Reference should be made to outline the protocol used.
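For the "excluding components that do not appear in replicates and repeats" strategy mentioned in the text above, a minimal sketch might look like the following. It assumes each replicate has already been reduced to a list of protein identifiers. The function name and the `min_replicates` parameter are illustrative only, and how many replicates a hit must appear in is a judgement call that the guideline text does not fix.

```python
from collections import Counter


def reproducible_hits(replicates, min_replicates=None):
    """Return proteins seen in at least `min_replicates` replicate lists.

    By default a protein must be present in every replicate.
    """
    if min_replicates is None:
        min_replicates = len(replicates)
    # set(rep) ensures a protein is counted at most once per replicate.
    counts = Counter(p for rep in replicates for p in set(rep))
    return sorted(p for p, n in counts.items() if n >= min_replicates)


# Hypothetical identifiers from three replicate runs.
replicates = [["A", "B", "C"], ["A", "B", "D"], ["A", "C"]]

strict = reproducible_hits(replicates)                       # in all three
relaxed = reproducible_hits(replicates, min_replicates=2)    # in at least two
```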

RLovering commented 6 years ago

Hi Helen and all

Rachael and I had a long talk about this and I wanted to make the following comments:

  1. How do you decide whether something is a purification or an enrichment, i.e. what level of contamination is acceptable for a purification?

  2. I have a big problem with this HTP business. For example if someone does a co-IP with LRRK2 and they specifically look for 3 proteins on a western then we are happy to call this an IDA. If you do a MS on this sample you may get 450 proteins associated with LRRK2 (this is how many are listed in IntAct), and if you silver stain a gel (duplicate) as well as western, (with many different co-IPs this is true) you will get a smear of proteins visible. HTP potentially gives a much better idea of what is in your co-IP than westerns and yet we are down-grading MS because we see too much!

This comes back again and again to the idea of author intent. If the author has reasonable grounds to expect to see a result then we can annotate it to the term using IDA (unless it is plasma membrane; so many curators object to author intent for this CC domain). If the author is fishing then it is HTP or, worse, RCA. (I am no longer sure which is worse, HTP or RCA.)

  1. Many of the ECM papers we have annotated have been produced by our grant holder Manuel Mayr. He is a BHF professor at King's and has spent considerable time improving the methods he uses to extract only ECM proteins. Note that he also assesses the proteins wrt different protein types (e.g. collagens, proteoglycans, scaffold proteins etc.) and what percentage of these different classes are present in the matrix v the ECM 'space'. Note that he also points out that newly synthesised matrix proteins will appear in both lists because they have not yet been incorporated into the matrix. http://circ.ahajournals.org/content/circulationaha/125/6/789/F2.large.jpg When the protein lists Mayr identifies in the ECM are compared with the matrisome tool, 90-100% of the proteins align with this tool. In cases where this is slightly lower I think that the tool is actually missing data rather than Manuel's list being wrong.

  2. I think comparing these lists with the matrisome tool is a good idea but I think we need to not have a hard and fast rule here, how about:

    i: where the alignment is very poor, only annotate those proteins that are present in both the matrisome and the expt, with evidence RCA; proteins which are not on the matrisome list are not annotated.

    ii: where the alignment is very good, annotate those proteins that are present in both the matrisome and the expt with evidence HTP; contact the Matrisome providers and ask them to consider adding the non-aligned proteins to their list, or look at the proteins and use curator judgement to decide whether the non-aligned proteins are likely to be in this compartment.

  3. The other problem with the Matrisome tool, that I think we need to recognise, is that it is not 'computational' like the membrane domain because of the complexity of the ECM. It is in effect a summary of proteins that have appeared in multiple papers in the ECM, or in some cases only appeared once but seem to be good candidates for this location due to similarity with other ECM proteins (perhaps based on paralogs). So again author judgement is having an impact on the protein list. Why is this group's judgement considered more valid than that of Mayr?
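The alignment suggestion in point 2 above could be sketched roughly as below. The two cut-offs are hypothetical: 0.9 simply echoes the 90-100% figure mentioned for the Mayr lists, and 0.5 is an arbitrary stand-in for "very poor". Anything in between is left without an evidence code, since the suggestion only covers the two extremes and is explicitly not meant as a hard and fast rule.

```python
def alignment_plan(experiment_genes, matrisome_genes, good=0.9, poor=0.5):
    """Suggest an evidence code from the overlap with the matrisome list.

    `good` and `poor` are hypothetical cut-offs, not agreed guidance.
    """
    hits = set(experiment_genes)
    if not hits:
        raise ValueError("experiment list is empty")
    aligned = hits & set(matrisome_genes)
    fraction = len(aligned) / len(hits)
    if fraction >= good:
        # Very good alignment: HTP, and follow up on the non-aligned proteins.
        evidence, follow_up = "HTP", sorted(hits - aligned)
    elif fraction <= poor:
        # Very poor alignment: RCA, non-aligned proteins are not annotated.
        evidence, follow_up = "RCA", []
    else:
        # In between: no rule proposed, left to curator judgement.
        evidence, follow_up = None, sorted(hits - aligned)
    return {
        "fraction": fraction,
        "evidence": evidence,
        "annotate": sorted(aligned),
        "follow_up": follow_up,
    }


# Hypothetical stand-in for the matrisome list and two experiment lists.
matrisome = {"COL1A1", "COL1A2", "FN1", "DCN", "BGN", "LUM", "ELN", "FBN1", "LAMB1"}
good_prep = alignment_plan(list(matrisome) + ["ACTB"], matrisome)
poor_prep = alignment_plan(["COL1A1", "ACTB", "GAPDH", "TUBB"], matrisome)
```

The `follow_up` list is where the "contact the Matrisome providers or use curator judgement" step would apply.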

Ruth

rachhuntley commented 6 years ago

@RLovering, I've copied some of your comments to the related thread on ECM proteins #1796, as they are applicable to both threads.

hattrill commented 6 years ago

We decided not to do IPI with HTP codes and leave that to the expert database that can capture the more specific details and confidence levels.

hattrill commented 6 years ago

Summary of VC:

Q. proteomics experiments and FDR/unique peptide exceptions

If other methodologies have been used to further refine the final list, allow curator discretion as to whether this increases the validity of the hits enough for annotation. Potentially, other statistics may be used, e.g. anything with a Bayes factor greater than 10 will have “strong evidence”, fold change >30%? (e.g. PMID:24006456)

Action point: Examine this and summarize in documentation that FDR and >1 unique peptide are not hard rules, but guidance that can be applied to most modern MSMS experiments. Make unique peptide exceptions text more prominent. Clarify that FDR refers to peptide, rather than protein (or at least that is what most exps state).
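One way to picture "guidance, not hard rules" is as a triage step rather than a filter. The sketch below is only an illustration of that idea. The function name, the three outcome labels and the `extra_validation` flag are invented here, FDR is assumed to be reported as a percentage, and ">1 unique peptide" is read as "at least 2".

```python
def triage(fdr_percent, unique_peptides, extra_validation=False,
           max_fdr_percent=1.0, min_unique_peptides=2):
    """Sort an MSMS hit into one of three bins for the curator.

    fdr_percent is None when the paper reports no FDR at all.
    """
    meets_fdr = fdr_percent is not None and fdr_percent < max_fdr_percent
    if meets_fdr and unique_peptides >= min_unique_peptides:
        return "meets guidance"
    if extra_validation:
        # Other methods were used to refine the list: curator decides.
        return "curator discretion"
    return "outside guidance"


modern = triage(0.5, 3)
older_paper = triage(None, 3, extra_validation=True)
unsupported = triage(2.0, 3)
```

A hit in the "curator discretion" bin is where other statistics reported by the paper would come into play.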

Q. proteomics experiments: purity issues

Text ok to incorporate up until RCA.

Need to review RCA. It was originally only intended for use with publications where further validation was done computationally. It should be ok for HTP papers that do this analysis, but the guidance says that it should be reviewed each year, which sounds strange for publication-traceable annotations. Note: Most RCAs are much greater than one year old (as far back as 2004, Rachael has found). Kimberly opened a ticket for RCA review. https://github.com/geneontology/go-annotation/issues/1801

If we want to use RCA for analysis of HTP by ourselves then we need to do some thinking.

Kimberly has crystallized the issues in https://github.com/geneontology/go-annotation/issues/1796

“I like the suggestion of using an external validation tool to perhaps help with annotating HTP data. However, I think if we're going to pursue this, here are some questions we'd probably want to answer: What tool(s) or algorithms can be used? Are these tool(s) or algorithms maintained and updated based on new experimental data? Is the output of the tool manually reviewed by a curator? Is there a mechanism for curators to provide feedback on the tool(s) or algorithms? Is there a reviewed publication for the tool(s) or algorithm that validates its predictions? Will we need a new GO_REF for each combination of an HTP paper and an in silico method? Exactly what evidence code to use is still a question for me, but something like RCA is probably the best fit.”

tberardini commented 5 years ago

Why is this issue still open? Further action needed?

hattrill commented 5 years ago

Think we can close now.