geneontology / go-annotation

This repository hosts the tracker for issues pertaining to GO annotations.
BSD 3-Clause "New" or "Revised" License

Proposal for curating extracellular matrix proteins from proteomics experiments #1796

Closed rachhuntley closed 6 months ago

rachhuntley commented 6 years ago

The proposal that came out of #1773 regarding curating extracellular matrix proteins from proteomics experiments is detailed below.

rachhuntley commented 6 years ago

@hattrill comment:

Hi @rachhuntley, @RLovering I've been doing some thinking about the ECM issues discussed in #1655 and I would like you to think about the following suggestion:

Looking specifically at the ECM issue, I agree that the annotation of ECM components to specific tissues is valuable to researchers.

There are a number of issues:

  1. Defining the ECM. In GO: GO:0031012 extracellular matrix. A structure lying external to one or more cells, which provides structural support for cells or tissues. So, by definition, this should not include non-structural components.

  2. The ECM is quite sticky - lots of HS/GAGs and most of the protein components are pretty insoluble. This presents two problems for purification/MS experiments:

  1. All purification methods that I have looked at describe enrichment rather than purification. This will probably always be the case for ECM preps. The hits we see from the MS experiments are "true" but what do they represent?

I have a suggestion: There is a project, http://matrisomeproject.mit.edu/ (PMID:26163349), that was initiated to help deal specifically with the issues pertaining to ECM profiling (note: these guys are scathing about GO annotation of ECM). I think that it would be quite instructive to get these people involved or at least use their resources. The issue they highlight with ECM preparation is that most of the components are insoluble and some very insoluble - hard to shift either by SDS or urea. The soluble components tend to be remodelling factors, signalling molecules, etc. that don't really constitute the ECM proper. So, they have developed a pipeline (PMID:28675934) that essentially attempts to remove as many cellular contaminants as possible, leaving the insoluble enriched ECM component (note: enriched, not purified). Then they do some funky stuff to optimize MSMS and end up with an utterly filthy list of proteins! They then use in silico analysis to sort the set into "contaminants" and "matrix" components. This allows them to compare different ECM profiling experiments to a similar standard.

PMID:28675934 "Scripts To Annotate Matrisome Proteins and Calculate Mass-Spectrometric Metrics To facilitate the annotations of matrisome proteins in large data sets, we developed a script called “Matrisome Annotator”. Providing that a data set contains Entrez or HUGO gene symbols for each entry, the script will return an output file in which each entry will be annotated as being part of the matrisome or not and will be tagged with matrisome division (core matrisome vs matrisome-associated) and category (ECM glycoproteins, collagens, proteoglycans, ECM-affiliated proteins, ECM regulators, or secreted factors). “Matrisome Annotator” can be used to annotate not only proteomic data but also any kind of list of genes/proteins. We also developed a second script, called “Matrisome Analyzer”, that calculates the proportion of ECM content in terms of number of spectra, number of unique peptides, number of proteins and peptide intensity (i.e., protein abundance) in proteomics data set. This script allows rapid evaluation of the abundance of matrisome versus non-matrisome proteins in any given data set input as a delimited text file and exports the calculation in tables and graphs. Both scripts are available as webtools and to download under the Analytical Tools section of the Matrisome Project Web site (http://matrisome.org/)."

This seems a potentially good way forward, especially if you are aiming to systematically capture ECM profiling experiments. The website contains a large number of datasets already curated.
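
As a rough illustration of what the quoted passage describes, the sketch below tags gene symbols with a matrisome division and category and reports the share of a list that falls in the matrisome, counted by number of proteins. The lookup table is a tiny hand-made stand-in, not the real matrisome list, and the function names are hypothetical; the actual Matrisome Annotator and Matrisome Analyzer may well work differently.

```python
# Illustrative stand-in for the matrisome list: gene symbol -> (division, category).
# Only a handful of examples chosen for this sketch; NOT the real MatrisomeDB content.
TOY_MATRISOME = {
    "COL1A1": ("Core matrisome", "Collagens"),
    "FN1": ("Core matrisome", "ECM Glycoproteins"),
    "DCN": ("Core matrisome", "Proteoglycans"),
    "MMP2": ("Matrisome-associated", "ECM Regulators"),
}


def annotate(symbols, matrisome=TOY_MATRISOME):
    """Return one (symbol, division, category) row per input symbol."""
    rows = []
    for symbol in symbols:
        division, category = matrisome.get(
            symbol.upper(), ("Non-matrisome", "Non-matrisome")
        )
        rows.append((symbol, division, category))
    return rows


def matrisome_fraction(rows):
    """Fraction of rows tagged as matrisome, counted by number of proteins."""
    if not rows:
        return 0.0
    hits = sum(1 for _, division, _ in rows if division != "Non-matrisome")
    return hits / len(rows)


rows = annotate(["COL1A1", "FN1", "LMCD1", "MMP2"])
for row in rows:
    print(*row, sep="\t")
print(f"matrisome fraction: {matrisome_fraction(rows):.2f}")
```

Counting by proteins is only one of the metrics the quoted text mentions; spectra, unique peptides and peptide intensity would need the underlying MS data.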

I think that the RCA evidence code (http://geneontology.org/page/rca-inferred-reviewed-computational-analysis) fits this pipeline perfectly, probably with a specific GO_REF.

The only reservation I have about RCA is that it has a note saying "Note: Annotations using the RCA code should be reviewed after one year, any older than this date will be deleted.". That sounds like a rule that needs to go.

Well, let me know what you think. It could be a plausible solution for other datasets that have troublesome contaminants, e.g. plasma membrane preps, but can be combined with computational analysis (e.g. TMpred) to yield a higher quality set for annotation.

rachhuntley commented 6 years ago

Rachael Huntley comment:

Hi Helen,

This sounds like an interesting proposition, however I would like to investigate this a bit more and I'm not sure this will be done before the call next week.

I've had a quick look at the matrisome tool and put a couple of our datasets through and I can already see a few proteins that are specifically mentioned in the paper as being expected in the ECM, e.g. glycoproteins, but they aren't coming up in the tool. So, just to be clear, are you suggesting we only annotate to ECM the proteins that are categorised by this tool and not annotate those that are not categorised?

What I'd like to do is do this more thoroughly and, 1. speak to our ECM expert to see if he agrees that this is a sensible approach with regards to his datasets (his data actually fares quite well in the tool) and 2. try to contact the matrisome people to see if they could look at the ECM studies we have and incorporate any proteins in these lists that they believe are missing from their database (they say they are open to additional suggestions and will update frequently).

However, looking at the output of the tool, we may have to consider an additional GO term to cover ECM-associated proteins. As the matrisome people say "We believe that the “core matrisome” categories (collagens, proteoglycans and ECM glycoproteins) are robust and not likely to change much with further analyses, at least for mammals and probably other vertebrates (other taxonomic groups clearly do contain additional ECM proteins). However, the “matrisome-associated” categories (secreted factors, regulators and affiliated proteins) are, by their nature, less firmly established and we suspect that they may well evolve in light of subsequent analyses. These latter categories were deliberately “inclusive” — although many proteins within those categories undoubtedly do bind reproducibly to ECM, others may not (see Fig. 1). Our aim was to define categories that would capture all candidate components of the ECM."

So, if we were to take this approach, we should separate the core-ECM proteins from the associated-ECM proteins and try to improve the GO representation of this component. How about having a parent to the existing term called "extracellular matrix region" and then the existing ECM term could be restricted to the core ECM proteins (collagens, proteoglycans and ECM glycoproteins)?

Also, I think the existing definition of ECM should be revised so it's not limited to structural proteins. Here are a couple of definitions I found that say it's more than structural: "The extracellular matrix (ECM) is a collection of extracellular molecules secreted by support cells that provides structural and biochemical support to the surrounding cells." (Wikipedia)

"The extracellular matrix (ECM) is the non-cellular component present within all tissues and organs, and provides not only essential physical scaffolding for the cellular constituents but also initiates crucial biochemical and biomechanical cues that are required for tissue morphogenesis, differentiation and homeostasis." (http://jcs.biologists.org/content/123/24/4195)

As for RCA, I think this might work (I can't speak for @RLovering though, but it's worth noting that we can ISS from RCA in Protein2GO - for the piggy proteins), even if we keep the 'reviewed every 12 months', as it's a fairly simple job to run these lists through the matrisome tool to see if anything has changed - assuming the tool continues to be maintained, of course.

I will update you when I've investigated this a bit more.

rachhuntley commented 6 years ago

@hattrill comment:

Yes, don't expect you to have looked over it and made a decision before the call as it does need a bit more investigation to see if it would work for the data you have. As they seem to be setting themselves up as the gatekeepers of ECM, it would be good to make sure that their list is as inclusive as possible. I would definitely consult with the ECM community and get their input. I guess that their "core ECM" list would be the structural part that you could annotate with "extracellular matrix component" and then perhaps the "matrisome-associated" component could be captured by another term - "extracellular matrix binding"? "a new term".

I think that you are correct in saying that the current structure in GO isn't very satisfactory. 'extracellular matrix component' and 'extracellular matrix' seem somewhat synonymous with their current definitions. A parent term "extracellular matrix region" or even matrisome (!) could perhaps capture soluble and structural components.

RLovering commented 6 years ago

I agree that improvements could be made to this part of the ontology. I would rather we added 'matrisome' terms as synonyms for the ECM rather than as the term names.

I like Rachael's idea of:

new term: extracellular matrix region

GO:0031012 extracellular matrix renamed: GO:0031012 extracellular matrix core

new term: extracellular matrix core associated

Ruth

rachhuntley commented 6 years ago

(I'm adding comments from @RLovering from related ticket #1773 as they are directly relevant to this thread.)

Many of the ECM papers we have annotated have been produced by our grant holder Manuel Mayr. He is a BHF professor at Kings and has spent considerable time improving the methods he uses to extract only ECM proteins. Note that he also assesses the proteins wrt different protein types (eg collagens, proteoglycans, scaffold proteins etc) and what percentage of these different classes are present in the matrix v the ECM 'space'. Note that he also points out that newly synthesised matrix proteins will appear in both lists because they have not yet been incorporated into the matrix. http://circ.ahajournals.org/content/circulationaha/125/6/789/F2.large.jpg When the protein lists Mayr identifies in the ECM are compared with the matrisome tool 90-100% of the proteins align with this tool. In cases where this is slightly lower I think that the tool is actually missing data rather than Manuel's list being wrong.

I think comparing these lists with the matrisome tool is a good idea but I think we need to not have a hard and fast rule here, how about:

i: where the alignment is very poor, only annotate those proteins that are present in matrisome and the expt with evidence RCA

    proteins which are not on the matrisome list not annotated.

ii: where the alignment is very good annotate those proteins that are present in matrisome and the expt with evidence HTP

    contact Matrisome providers and ask them to consider adding the non-aligned proteins to their list
    or look at the proteins and use curator judgement to decide if the non-aligned proteins are likely or not to be in this compartment

The other problem with the Matrisome tool, that I think we need to recognise, is that it is not 'computational' like the membrane domain because of the complexity of the ECM. It is in effect a summary of proteins that have appeared in multiple papers in the ECM, or in some cases only appeared once but seem to be good candidates for this location due to similarity with other ECM proteins (perhaps based on paralogs). So again author judgement is having an impact on the protein list. Why is this group's judgement considered more valid than that of Mayr?

hattrill commented 6 years ago

Responding to comments on #1773:

If Mayr has made extensive efforts to purify ECM components and gets 90% in the matrisome tool then it should be a validation of his approach - fine with that to be a HDA. Looks like matrisome and Mayr align well. I note that the Matrisome tool is not a perfect list, but it was partly developed because of frustrations with the existing GO annotation. With the example dataset that we discussed, there were a considerable number of questionable hits - if this dataset only gets 30% in the matrisome tool and Mayr gets 90%, then we have to use some method to reconcile these differences.

Ultimately, it is up to you to look at the standards for validating ECM preps if you are intending to do a lot of work here. I think that the above proposal is fine.

vanaukenk commented 6 years ago

I like the suggestion of using an external validation tool to perhaps help with annotating HTP data. However, I think if we're going to pursue this, here are some questions we'd probably want to answer:

    What tool(s) or algorithms can be used?
    Are these tool(s) or algorithms maintained and updated based on new experimental data?
    Is the output of the tool manually reviewed by a curator?
    Is there a mechanism for curators to provide feedback on the tool(s) or algorithms?
    Is there a reviewed publication for the tool(s) or algorithm that validates its predictions?
    Will we need a new GO_REF for each combination of an HTP paper and an in silico method?

Exactly what evidence code to use is still a question for me, but something like RCA is probably the best fit.

rachhuntley commented 6 years ago

This is exactly what I'm doing for the matrisome tool. I'm running our proteomic datasets through it and any missing proteins that we think should be included I will contact the developers asking them to assess their eligibility (they use a combination of computational methods and manual curation). This way we'll get an idea of whether the tool is regularly maintained and if they are willing to update it based on our requests. For this case, I would be able to make a GO_REF that covers use of the matrisome tool to review multiple papers.

rachhuntley commented 6 years ago

I've made a project for reorganising the extracellular matrix branch here https://github.com/geneontology/go-ontology/projects/9 Please add further comments and suggestions.

bmeldal commented 6 years ago

Have you consulted with MatrixDB, the DB for ECM interactions and complexes and an IMEx partner? Sylvie is always very keen to help getting the ECMs right. I still have a list of complexes from her to curate as they are so, so tricky to define!!!

rachhuntley commented 6 years ago

Thanks Birgit, do you have an email or GitHub handle for her?

bmeldal commented 6 years ago

Sylvie Ricard-Blum: sylvie.ricard-blum AT univ-lyon1.fr

RLovering commented 6 years ago

Hi I have just looked at the annotations associated with PMID:20551380 https://www.ebi.ac.uk/QuickGO/annotations?reference=PMID:20551380&taxonId=9606&taxonUsage=descendants This paper is by our co-grant holder Manuel Mayr who is very well respected in the proteomics field. It is important to capture this data which confirms the aortic location of these ECM proteins.

I have compared these to the matrisome list of ECM proteins. Of the 84 ECM proteins 70 are on the matrisome list. I have looked at GO and UniProt records for the remaining 14 proteins and 12 of them have information that suggests ECM annotations are appropriate (I have just pasted small amount of text below so this does not necessarily make sense to everyone). For the remaining 3 proteins, 2 have not much information and 1 is a transcription factor so I think it should be deleted, or moved to PRIV in case new data emerges.

I have put this data below. I am therefore changing all of the annotations to HTP codes, and putting the 3 described above as PRIV (ie effectively deleting). I realise that this does not conform to the guidelines but I think this demonstrates the limitations of the matrisome tool

Hope this is helpful

Ruth

| UniProt ID | HGNC | comment |
|---|---|---|
| P02743 | APCS | Amyloid fiber formation. |
| P02749 | APOH | Binds to various kinds of negatively charged substances such as heparin, phospholipids, and dextran sulfate |
| P27918 | CFP | A positive regulator of the alternate pathway of complement. It binds to and stabilizes the C3- and C5-convertase enzyme complexes |
| P10909 | CLU | extracellular chaperone |
| P23946 | CMA1 | extracellular matrix degradation |
| Q8N436 | CPXM2 | not much detail in UniProt/GO May be involved in cell-cell interactions - delete |
| P52943 | CRIP2 | no data in UniProt/GO - Delete? |
| P59665 | DEFA1 | Defensins are thought to kill microbes by permeabilizing their plasma membrane |
| Q08380 | LGALS3BP | Promotes integrin-mediated cell adhesion, synonym Basement membrane autoantigen p105 |
| Q9NZU5 | LMCD1 | Transcription factor - DELETE |
| P24158 | PRTN3 | Serine protease that degrades elastin, fibronectin, laminin, vitronectin, and collagen types I, III, and IV |
| P08294 | SOD3 | Extracellular superoxide dismutase |
| Q9BQB4 | SOST | Heparin-binding, Negative regulator of bone growth |
| O43294 | TGFB1I1 | molecular adapter coordinating multiple protein-protein interactions at the focal adhesion complex and in the nucleus. Links various intracellular signaling modules to plasma membrane receptors and regulates the Wnt and TGFB signaling pathways. May also regulate SLC6A3 and SLC6A4 targeting to the plasma membrane hence regulating their activity. |
| Q15661 | TPSAB1 | Tryptase is the major neutral protease present in mast cells and is secreted upon the coupled activation-degranulation response of this cell type. May play a role in innate immunity. Isoform 2 cleaves large substrates, such as fibronectin |

RLovering commented 6 years ago

I have also just looked at the extracellular space annotations associated with the paper as listed above PMID:20551380 https://www.ebi.ac.uk/QuickGO/annotations?reference=PMID:20551380&taxonId=9606&taxonUsage=descendants

I have compared these to the matrisome list of ECM proteins. Of the 58 EC space proteins 42 are on the matrisome list. I have looked at GO and UniProt records for the remaining 16 proteins and 13 of them have information that suggests EC space annotations are appropriate (I have just pasted small amount of text below so this does not necessarily make sense to everyone). For the remaining 3 proteins, 1 has not much information, 1 involved in processing in ER and interacts transiently with almost all of the monoglucosylated glycoproteins and 1 is a transcription factor so I will move these 3 to PRIV in case new data emerges.

I have put this data below. I am therefore changing all of the annotations to HTP codes, and putting the 3 described above as PRIV (ie effectively deleting). I realise that this does not conform to the guidelines but I think this demonstrates the limitations of the matrisome tool

Hope this is helpful

Ruth

| UniProt ID | HGNC | comment |
|---|---|---|
| P02743 | APCS | Amyloid fiber formation. |
| Q9Y646 | CPQ | Carboxypeptidase that may play an important role in the hydrolysis of circulating peptides |
| Q9NZU5 | LMCD1 | DELETE Transcription factor |
| P27797 | CALR | DELETE? Calcium-binding chaperone that promotes folding, oligomeric assembly and quality control in the endoplasmic reticulum (ER) via the calreticulin/calnexin cycle. This lectin interacts transiently with almost all of the monoglucosylated glycoproteins that are synthesized in the ER. |
| Q8N436 | CPXM2 | Delete? not much detail in UniProt/GO May be involved in cell-cell interactions |
| P10909 | CLU | extracellular chaperone |
| P08294 | SOD3 | Extracellular superoxide dismutase |
| P00450 | CP | ferroxidase activity oxidizing Fe2+ to Fe3+ without releasing radical oxygen species. It is involved in iron transport across the cell membrane |
| P51858 | HDGF | growth factor, Heparin-binding protein |
| P02647 | APOA1 | plasma |
| P06727 | APOA4 | plasma |
| P05090 | APOD | plasma |
| P02649 | APOE | plasma |
| P02749 | APOH | plasma |
| P07602 | PSAP | protein cleaved into multiple products, some with enzyme functions and some act as growth factors |
| Q15661 | TPSAB1 | Tryptase is the major neutral protease present in mast cells and is secreted upon the coupled activation-degranulation response of this cell type. May play a role in innate immunity. Isoform 2 cleaves large substrates, such as fibronectin |

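
For reference, the overlap figures quoted in the two comments on PMID:20551380 can be expressed as percentages. This is only arithmetic on the counts given in the text (70 of 84 ECM proteins, 42 of 58 extracellular space proteins); the helper name is made up for the example.

```python
def percent_on_list(on_list, total):
    """Percentage of a dataset's proteins found on the matrisome list."""
    return 100.0 * on_list / total


ecm = percent_on_list(70, 84)       # ECM annotations from PMID:20551380
ec_space = percent_on_list(42, 58)  # extracellular space annotations, same paper
print(f"ECM: {ecm:.1f}%  EC space: {ec_space:.1f}%")
```

That gives roughly 83% for the ECM set and 72% for the extracellular space set.
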
rachhuntley commented 6 years ago

Hi @RLovering,

This is very useful and I think it is what we proposed we would do for a dataset that performs well in the matrisome tool.

When you say "limitations of the matrisome tool", do you mean that not all of the proteins are found in the tool? If so, we agreed we would contact the group and ask if they could include the missing proteins. I have asked them to do this with another of Manuel's proteomics sets, they said they would look but haven't got back to me yet. Hopefully we can rely on them to do this. The difference in coverage seems to be down to the types of ECMs they analyse, so it's not going to be complete yet.

hattrill commented 6 years ago

Yes, certainly seems a good way to validate the dataset. Given that it's a hand-built list, not bad. This is also quite a good way of seeing where the holes are and hopefully they will be receptive to looking into adding these classes of proteins.

RLovering commented 6 years ago

Yes, it would be great if the tool would add these to their dataset. You'll have to update me on progress on this Rachael and whether you made the decision about which ECM 'term' these should be associated with in their database or whether you left this for them to decide.

I guess I was also just pointing out this missing data and also how to incorporate this issue into guidelines, plus will curators want to make these decisions and send data to matrisome.

Ruth

rachhuntley commented 6 years ago

So I think the "core matrisome" proteins are pretty much known, it's the associated proteins that can vary, depending on cell type etc. So most of the missing proteins will be core-associated - but we don't have the terms yet. If the matrisome group agree to add proteins, then it should be included in the guidelines for curation of ECM proteins from proteomics experiments that this is what should be done (run through the tool and then contact the group for missing proteins). There aren't many papers such as this that have been annotated so far (only by BHF-UCL).

rachhuntley commented 5 years ago

Hello, I’ve annotated a few ECM proteomics papers now and thought we should finalise these guidelines. I’ve been in contact with the Matrisome tool people, Alexandra Naba and Richard Hynes, and told them we have improved the ontology around ECM and also annotated some of their papers. Alexandra has asked whether we could annotate some of their studies on ECMs from other species.

They have one study already published for zebrafish @doughowe (https://www.sciencedirect.com/science/article/pii/S0945053X17301555) and have two more coming out for C. elegans @vanaukenk and Drosophila @hattrill, so it would be good if these could be annotated as well.

What do you think about these guidelines?

Guidelines for curating extracellular matrix Cellular Component from proteomics papers.

Consider the guidelines for high-throughput papers (add link) before annotating an extracellular matrix proteomics paper. Note: the isolation techniques for ECM proteins are less stringent than isolation of internal cellular components; the isolation generally requires only decellularization, enrichment and solubilization of ECM proteins before mass-spectrometry to identify the proteins present. Often a three-step extraction protocol is used: NaCl (to isolate extracellular space proteins), SDS (removal of cellular proteins), Gu-HCl (to isolate extracellular matrix proteins).

If the experimental techniques are considered high quality, use the Matrisome Annotator (http://matrisomeproject.mit.edu/analytical-tools/matrisome-annotator/) or MatrisomeDB (http://matrisomeproject.mit.edu/proteins/) to categorise your list of proteins (using gene symbols) into ECM categories, as follows:

Core matrisome: ECM glycoproteins, collagens, proteoglycans

Matrisome-Associated: ECM-affiliated proteins, ECM regulators, secreted factors

If a high proportion of the protein list is identified as being in any of the ECM categories by the Matrisome tool (>70%), then all of the proteins in the list can be annotated to GO:0031012 extracellular matrix or GO:0062023 collagen-containing extracellular matrix (for Metazoan proteins) using the HDA evidence code. The curator should use their judgement and not annotate proteins that are probably contaminants and so unlikely to be located in the ECM. However, it is recommended that the curator contact the Matrisome developers to suggest including the proteins that were not recognised by the tool, providing details of the paper that describes the proteomics experiment. This is requested by the developers themselves: http://matrisomeproject.mit.edu/ecm-atlas/. Please contact matrisomeproject@gmail.com.

Where the alignment is very poor (<70%), annotate only those proteins that are identified in the Matrisome Annotator. The curator may use their judgement on individual proteins that weren’t identified by the tool as to whether they are likely to be present in the ECM.
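
The two cases above can be sketched as a small decision function. This is a minimal sketch under stated assumptions: the 70% cut-off is the one proposed in these draft guidelines, the function and parameter names are hypothetical, and `curator_rejects` / `curator_accepts` merely stand in for curator judgement, which a script cannot replace. A list at exactly 70% is treated as passing here; the draft text does not say which way that case should go.

```python
def proteins_to_annotate(proteins, matrisome_hits, curator_rejects=(),
                         curator_accepts=(), threshold=0.70):
    """Apply the draft rule: at or above the threshold, annotate the whole
    list minus likely contaminants; below it, annotate only matrisome hits
    plus any proteins the curator judges likely to be in the ECM."""
    proteins = set(proteins)
    hits = proteins & set(matrisome_hits)
    fraction = len(hits) / len(proteins) if proteins else 0.0
    if fraction >= threshold:
        selected = proteins - set(curator_rejects)
    else:
        selected = hits | (proteins & set(curator_accepts))
    return sorted(selected), fraction


# Example with a made-up dataset; the hit list is illustrative, not tool output.
selected, fraction = proteins_to_annotate(
    ["COL1A1", "FN1", "DCN", "LAMB1", "LMCD1"],
    matrisome_hits=["COL1A1", "FN1", "DCN", "LAMB1"],
    curator_rejects=["LMCD1"],
)
print(selected, f"{fraction:.0%}")
```

Every selected protein would then be annotated with the HDA evidence code, as the guideline text states.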

Proteins in the following categories should additionally be annotated with a Molecular Function term (also HDA evidence) as follows;

bmeldal commented 5 years ago

I haven't added the "extracellular matrix structural constituent..." terms to our ECM complexes except for 3 collagens. I must have missed them as I struggle with the "structural constituent of..." type terms in the MF. Is there a way of finding missing annotations beyond looking at the protein groups mentioned above?

hattrill commented 5 years ago

Thanks, Rachael. To me 70% seems a bit low, but you've been working with the data and so have a much better feel for it and you have to set it at a reasonable level for that field so that you can annotate it - so if you are sure, then we can go with that.

It would probably be good to go over what outliers there are in some of the good sets. I wonder if there are some common contaminants that come out?

GO:0005201 'extracellular matrix structural constituent' seems ok to infer as you've outlined but the more specific molecular function terms "....conferring elasticity" etc. seem a bit of a stretch (no pun intended) from the localization - these seem like more of an ISS/ISM to me.

rachhuntley commented 5 years ago

Hello, Thanks for your responses. Regarding the MF terms, I think there has always been a problem with finding evidence to annotate these terms, as these types of experiments are not done. Ideally we would have a GO_REF explaining how these annotations were made, i.e. combination of proteomics localization plus verification from the Matrisome tool. Also, I would be happier with IC rather than an ISS/ISM - as what is the similarity from? But then the problem with an IC is that we can't transfer them. Would you consider an HDA code with a GO_REF plus PMID (if this were possible)?

Regarding the 70%, I've had another look at the datasets and remembered that my Matrisome tool contact, Alexandra, had agreed to add some of the proteins that were identified in the proteomics experiment to the matrisome list. For the ones she did not agree were ECM-related, she added comments explaining why. Based on these comments I will go through and delete the annotations I have made to the proteins she doesn't believe are in the ECM. If curators can do the same for other datasets, then I think this will be the safest way to go.

So I can change the guidance to: "Annotate only those proteins that are identified as being in any of the ECM categories by the Matrisome tool to either GO:0031012 extracellular matrix or GO:0062023 collagen-containing extracellular matrix (for Metazoan proteins) using the HDA evidence code. The curator may use their judgement on individual proteins that weren’t identified by the tool as to whether they are likely to be present in the ECM. It is highly recommended that the curator contact the Matrisome developers to suggest including the proteins that were not recognised by the tool, providing details of the paper that describes the proteomics experiment. This is requested by the developers themselves: http://matrisomeproject.mit.edu/ecm-atlas/. Please contact matrisomeproject@gmail.com. If the developers agree to add any of the proteins to the matrisome list, the curator may then annotate those proteins to GO:0031012 extracellular matrix or GO:0062023 collagen-containing extracellular matrix (for Metazoan proteins)."
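
Read as a procedure, the revised guidance drops the percentage cut-off and splits a protein list into annotations to make and proteins to raise with the Matrisome developers. The sketch below is one possible reading, with hypothetical names throughout: `developer_approved` stands for proteins the developers have agreed to add, the choice of GO:0062023 for Metazoan proteins and GO:0031012 otherwise is an interpretation of the wording, and curator judgement on individual proteins is deliberately left out.

```python
ECM = "GO:0031012"           # extracellular matrix
COLLAGEN_ECM = "GO:0062023"  # collagen-containing extracellular matrix


def revised_plan(proteins, matrisome_hits, developer_approved=(), is_metazoan=True):
    """Return (annotations, to_report): HDA annotations for proteins the tool
    recognises or the developers have approved, and the rest to report."""
    term = COLLAGEN_ECM if is_metazoan else ECM
    accepted = set(matrisome_hits) | set(developer_approved)
    annotations = [(p, term, "HDA") for p in proteins if p in accepted]
    to_report = [p for p in proteins if p not in accepted]
    return annotations, to_report


# GENE_X and GENE_Y are made-up symbols used only for illustration.
annotations, to_report = revised_plan(
    ["COL1A1", "GENE_X", "GENE_Y"],
    matrisome_hits=["COL1A1"],
    developer_approved=["GENE_X"],
)
print(annotations)
print(to_report)
```
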

This is not really 'cherry-picking' but rather taking expert advice. What do you think?

Birgit, you could look for complexes that have been annotated to GO:0031012 extracellular matrix or GO:0062023 collagen-containing extracellular matrix. I haven't really thought how to deal with complexes and the MF term. Are there complexes with mixtures of those type of proteins?

hattrill commented 5 years ago

The change to the guidance seems good to me. I think that it's fine to use expert advice - many of us remove annotations that are not supported by current thinking, even if older papers support - I see this as a very similar practice. As you use P2GO, it can be clearly stated that this is what was done for future curators to see.

As for the more expansive MF terms - I wonder if this is a good case for using a NAS/TAS evidence code - perhaps attaching to a review or GO_REF to matrisome guys - otherwise you might keep on making these based on the ECM localization studies, but they do seem to be more biophysical properties. That way you can label these protein classes with their exciting MFs, without implying a biophysical experiment underlying the statement. If these are associated with very defined domains, InterPro2GO might also be a good solution - and the cross-species coverage would be "banging".

bmeldal commented 5 years ago

Ok, I have 144 complexes annotated to GO:0031012 extracellular matrix or child term and 42 of them don't have an annotation to GO:0005201 extracellular matrix structural constituent or children. I'll go through the list and add the appropriate terms.
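
The check described here amounts to a set difference: complexes with a Cellular Component annotation in the ECM branch but no Molecular Function annotation under GO:0005201. The sketch below assumes annotations are available as (complex ID, GO ID) pairs and that the caller supplies the relevant term sets, including child terms; the complex IDs are invented, and nothing here reflects how the complex data are actually stored or queried.

```python
def complexes_missing_mf(annotations, cc_terms, mf_terms):
    """Return complexes annotated to any term in cc_terms but to none in mf_terms."""
    with_cc = {cid for cid, go_id in annotations if go_id in cc_terms}
    with_mf = {cid for cid, go_id in annotations if go_id in mf_terms}
    return sorted(with_cc - with_mf)


# Made-up complex IDs; GO IDs are the ones named in this thread.
annotations = [
    ("complex_A", "GO:0031012"),
    ("complex_A", "GO:0005201"),
    ("complex_B", "GO:0062023"),
    ("complex_C", "GO:0005201"),
]
missing = complexes_missing_mf(
    annotations,
    cc_terms={"GO:0031012", "GO:0062023"},
    mf_terms={"GO:0005201"},
)
print(missing)
```
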

rachhuntley commented 5 years ago

I don't think InterPro2GO will help here as not all of the components will have identifiable functional domains. Only "extracellular matrix constituent, lubricant activity" currently has an InterPro mapping for the mucin proteins.

For the MF terms I think these are really an IC. How about this? We create a GO_REF to be used with an IC annotation for the MF terms, the GO ID in the 'with' field would be the ECM component term. The GO_REF will describe the process of verifying the ECM components with the Matrisome tool, which assists the curators in categorising the ECM components by their functions. The component terms can be transferred to other species as these are HDA evidence, then the MF terms could be added to the ECM gene products in the other species by IC'ing from the ISS annotations. The only problem I see with this at the moment is that in Protein2GO, when you IC from an ISS annotation, it adds GO_REF:0000111 (Gene Ontology annotations Inferred by Curator (IC) using at least one Inferred by Sequence Similarity (ISS) annotation to support the inference), so we would need to make sure that the new GO_REF could be accepted instead.
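
To make the proposal concrete, here is a sketch of what such an annotation pair might look like as simple records. The field names loosely follow annotation-file conventions but this is not a GAF writer and not how Protein2GO works internally; the accession, PMID and GO_REF values are placeholders, since the new GO_REF does not exist yet.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    gene_product: str
    go_id: str
    evidence: str
    reference: str
    with_from: str = ""


def infer_mf_from_cc(cc, mf_go_id, go_ref):
    """Build an IC annotation whose 'with' field points at the CC term
    that the inference is drawn from."""
    return Annotation(
        gene_product=cc.gene_product,
        go_id=mf_go_id,
        evidence="IC",
        reference=go_ref,
        with_from=cc.go_id,
    )


# Placeholders: the accession, PMID and GO_REF below are not real identifiers.
cc = Annotation("UniProtKB:EXAMPLE", "GO:0062023", "HDA", "PMID:EXAMPLE")
mf = infer_mf_from_cc(cc, "GO:0005201", "GO_REF:PLACEHOLDER")
print(mf)
```
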

hattrill commented 5 years ago

My big reservation is that these are pretty hardcore biophysical properties - would it not be better to look for the experimental data for the biophysical MF - we only have a handful of LTP exp codes supporting these - I would first ask, what is the evidence for "lubricant activity" and document that.

IMO these assertions are based purely on sequence characteristics, so I think that a new GO_REF with ISM evidence code is better.

Your IC solution is certainly better than an experimental code. What I think would be bad is that every time there is a CC ECM-HDA-matrisome alignment, there is a biophysical MF - the structural constituent parent is about as far as I would be comfortable making this assertion.

@vanaukenk what do you think?

bmeldal commented 5 years ago

Complexes updated.

While I don't use HT I agree that I wouldn't go further than GO:0005201 extracellular matrix structural constituent with some form of IC (I used ECO:0005547, the complex-specific IC term) unless I have proper experimental evidence of their structural activity, which I had for a few complexes where I could use IDA or the complex-specific ISS (ECO:0005610 or children).

rachhuntley commented 5 years ago

What I think you are missing is that the categorizations that the matrisome tool makes are based on a whole lot of evidence from experimental and in silico calculations, including domain-based evidence; see Supp. figure 1 in this paper that describes the pipeline https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3322572/ (it's the first supplemental file link). I don't see how this is worse than an InterPro prediction?

hattrill commented 5 years ago

[Screenshot: 2018-10-16 at 11:35:12]

I can see the justification for the top level MF term, but I can't see these associations:

Elastin and Emilin - elasticity
Proteoglycans - compression resistance
Collagens - tensile strength
Mucins - lubricant activity

rachhuntley commented 5 years ago

So, the matrisome paper doesn't talk about the functions of the groups of proteins; the function is known from other evidence. For example, for collagens I would annotate to "extracellular matrix structural constituent conferring tensile strength" because this is their function (see http://jcs.biologists.org/content/131/7/jcs203950.long "Disease pathogenesis typically involves genetic alterations of the triple helix, a unique structure that is a hallmark feature common to all collagens. The triple helix bestows exceptional mechanical resistance to tensile forces and a capacity to bind a plethora of macromolecules."). For proteoglycans I would annotate to "extracellular matrix structural constituent conferring compression resistance" because this is their function (see https://www.ncbi.nlm.nih.gov/pubmed/8346704 "The major biological function of proteoglycans derives from the physicochemical characteristics of the glycosaminoglycan component of the molecule, which provides hydration and swelling pressure to the tissue enabling it to withstand compressional forces."). Note, the definition of the 'compression resistance' GO term is currently "A constituent of the extracellular matrix that enables the matrix to resist compressive forces; often a proteoglycan." I would like to add similar comments and supporting references to the other ECM MF terms to make this connection between the terms and the groups of proteins clearer.

The matrisome tool provides the groupings of the proteins, based on domain, experimental etc. evidence, and then the curator makes the connection between the grouping and the function using published knowledge about the function of these protein groups. This is why I would like to use IC for the evidence for the MF terms.
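The curator step described above is essentially a lookup from protein group to MF term. A minimal sketch, assuming simplified group labels (these are not the Matrisome tool's own category names) and assuming a fallback to the parent term GO:0005201 when no more specific function is known; `mf_for_group` is a made-up helper name:

```python
# GO IDs are the ECM MF terms discussed in this thread; group labels are illustrative.
GROUP_TO_MF = {
    "collagen": "GO:0030020",      # ...conferring tensile strength
    "proteoglycan": "GO:0030021",  # ...conferring compression resistance
    "elastin": "GO:0030023",       # ...conferring elasticity
    "mucin": "GO:0030197",         # ...lubricant activity
}

# Assumption: fall back to the parent term when the group has no specific mapping.
PARENT_MF = "GO:0005201"  # extracellular matrix structural constituent


def mf_for_group(group: str) -> str:
    """Return the MF term a curator would infer for a protein group."""
    return GROUP_TO_MF.get(group.strip().lower(), PARENT_MF)


print(mf_for_group("Collagen"))
print(mf_for_group("ECM-affiliated protein"))
```

Each entry in the table would need its own supporting reference, which is the background knowledge the IC rests on.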

We don't have experimental evidence here that these proteins are structural, but you are comfortable annotating to the structural constituent term. Is it because of what is already known about ECM proteins? I see this as the same justification for the more specific MF terms but using more detailed background knowledge.

hattrill commented 5 years ago

Perhaps we should not imply an MF from these experiments at all - I am more comfortable with that than with anything else.

rachhuntley commented 5 years ago

Apart from the annotations I’ve made to these more detailed terms, there are only a handful of manually assigned annotations. The only ones that I can see that have convincing evidence for a structural role are those involved in elasticity. Curators have long had a problem with finding appropriate evidence for structural constituent terms, but for whatever reason, we kept these terms. I think the cumulative evidence I have for an IC is just as strong as the evidence from InterPro for an IEA for these terms, and other groups have made IC annotations to these terms based on cellular component annotations. I think we either have to remove these terms from the ontology or come up with some strong guidance as to when we can annotate to them, and with what evidence codes, as it’s not at all clear to me.

hattrill commented 5 years ago

I'd certainly advocate getting rid of the very specific terms:

GO:0030197 | extracellular matrix constituent, lubricant activity
GO:0030021 | extracellular matrix structural constituent conferring compression resistance
GO:0030023 | extracellular matrix constituent conferring elasticity
GO:0030020 | extracellular matrix structural constituent conferring tensile strength
GO:0150043 | structural constituent of synapse-associated extracellular matrix

as I don't see these as distinct molecular functions or molecular functions specifically confined to the ECM.

I really don't know about GO:0005201 extracellular matrix structural constituent - it's fairly vague. I am not sure whether users find this useful or, as GO-CAMs are function-centric, whether every protein should have an exciting (i.e. non-protein binding) MF. The structural molecule activity branch as a whole is quite vague.

bmeldal commented 5 years ago

@hattrill

> I really don't know about GO:0005201 extracellular matrix structural constituent - it's fairly vague. I am not sure whether users find this useful or, as GO-CAMs are function-centric, whether every protein should have an exciting (i.e. non-protein binding) MF. The structural molecule activity branch as a whole is quite vague.

Absolutely. These terms never sat well with me (I raised a ticket years ago but can't find it anymore as it was on SF). I have just added them to all ECM complexes as per Rachel's suggestion, but only GO:0005201 extracellular matrix structural constituent, not the child terms.

RLovering commented 5 years ago

Hi, in the GOC meeting Pascale pointed out that we should aim to provide an MF term for each gene product (not just a BP annotation). I think that the child terms are providing important information; it seems odd to me to group all these proteins within the same MF term, which in many ways isn't saying much more than the cellular component statement. I was thinking that potentially 'extracellular' could be removed from the MF terms, but most of these functions only occur in the extracellular region. I think these terms should be kept. I will try to discuss at GOC over the next couple of days.

Ruth

@pgaudet

rachhuntley commented 5 years ago

This is the ticket from @bmeldal https://github.com/geneontology/go-ontology/issues/10895

hattrill commented 5 years ago

We have: GO:0097493 structural molecule activity conferring elasticity

So we could have:

structural molecule activity conferring tensile strength
structural molecule activity conferring compression resistance
structural molecule activity with lubricant activity (not sure about phrasing - sounds silly whichever way I say it)

That seems quite nice - I think that most of our annotation pipelines keep us out of the biophysical arena, so it would be good to test these in the field.

mah11 commented 5 years ago

@hattrill ... I've thought exactly that for a decade and a half ("years" as of https://github.com/geneontology/go-ontology/issues/2284). Almost everything under "structural molecule activity" should go, and the elasticity etc. ones should lose the ECM specificity. But I also stopped trying to die on that particular hill a while ago; glad to see the torch passed along.

rachhuntley commented 5 years ago

Coming back to the question of the evidence code for the MF annotations (regardless of whether we keep the ECM-specific ones or not), I'd like to return to the idea of having them as RCA, which was originally suggested by Helen in this thread.

Just as a reminder, the way I’m annotating the MF for ECM proteins is by basing it on all of the following:

  1. experimentally verified location (ECM)
  2. validation/categorisation with the Matrisome Annotator tool, which combines in silico and experimental in vivo information about extracellular matrix proteins to put proteins into specific ECM categories (e.g. collagens, proteoglycans)
  3. background knowledge that, for example, collagens confer tensile strength (I'm hoping to add PMIDs to the GO terms which contain these statements).

This approach seems to fit with the definition of combinatorial evidence in ECO. We can easily review these periodically by running the protein lists through the tool every so often, but it may be that this restriction is removed from RCA anyway (see https://github.com/geneontology/go-annotation/issues/1581).

However, instead of using the ECO term that is directly mapped to RCA (ECO:0000245 computational combinatorial evidence used in manual assertion), I would like to request a more specific one that mentions the matrisome tool. This would still map up to the GO evidence RCA.
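The three criteria listed above can be written down as a simple check. This is only a sketch under stated assumptions: the `Candidate` fields and the `evidence_for_mf` helper are invented for illustration, the location test is simplified to an exact match on GO:0031012 with HDA evidence (child terms are ignored), and it returns the existing ECO:0000245 because the more specific ECO term has not been requested yet.

```python
from dataclasses import dataclass
from typing import Optional

ECM_CC = "GO:0031012"    # extracellular matrix
RCA_ECO = "ECO:0000245"  # computational combinatorial evidence used in manual assertion


@dataclass
class Candidate:
    protein: str
    cc_go_id: str                       # 1. experimentally verified location
    cc_evidence: str
    matrisome_category: Optional[str]   # 2. category from the Matrisome Annotator, if any
    function_reference: Optional[str]   # 3. reference for the group's known function


def evidence_for_mf(candidate: Candidate) -> Optional[str]:
    """Return the ECO ID to use if all three criteria hold, else None."""
    located = candidate.cc_go_id == ECM_CC and candidate.cc_evidence == "HDA"
    categorised = bool(candidate.matrisome_category)
    supported = bool(candidate.function_reference)
    return RCA_ECO if (located and categorised and supported) else None


# Hypothetical examples; the protein names and reference are placeholders.
ok = Candidate("EXAMPLE:col", ECM_CC, "HDA", "Collagens", "PMID:placeholder")
uncategorised = Candidate("EXAMPLE:x", ECM_CC, "HDA", None, "PMID:placeholder")

print(evidence_for_mf(ok))
print(evidence_for_mf(uncategorised))
```

Re-running the protein lists through the tool periodically would just mean refreshing `matrisome_category` and re-applying the check.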

How does this sound?

hattrill commented 5 years ago

I think that sounds like a good idea, it is certainly more transparent and describes the multi-tiered approach.

pgaudet commented 6 months ago

Out of date