geneontology / go-annotation

This repository hosts the tracker for issues pertaining to GO annotations.
BSD 3-Clause "New" or "Revised" License

Queries of GO graph store for helping assess new annotation qualifier usage #1542

Open vanaukenk opened 7 years ago

vanaukenk commented 7 years ago

From the 2017-03-14 annotation call: http://wiki.geneontology.org/index.php/Annotation_Conf._Call_2017-03-14#Minutes

It would be helpful for curators from the contributing groups to have a list of specific types of BP annotations for review:

  1. Annotations to a 'regulation' BP term when no MF annotation exists. Note that we may want to refine the MF annotation criteria to also include genes whose only MF terms are ones like GO:0046872 'metal ion binding', if MF terms like that are deemed uninformative for assessing whether a regulation annotation is appropriate.

  2. Annotations from the same gene product to both a process term and regulation of that process
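A minimal sketch of the first list, assuming the annotations have already been loaded as `(gene, aspect, term_id, term_label)` tuples (for example from a GAF file joined with term labels). The record layout, the label-prefix test for 'regulation' terms, the function name, and the gene names are assumptions for illustration, not the GO database schema.

```python
# Label prefixes treated as 'regulation' terms. This is an assumption; a real
# query would follow the ontology structure rather than match on labels.
REGULATION_PREFIXES = (
    "regulation of",
    "positive regulation of",
    "negative regulation of",
)


def regulation_without_mf(annotations, uninformative_mf=frozenset()):
    """Return genes with a 'regulation' BP annotation and no informative MF annotation.

    annotations: iterable of (gene, aspect, term_id, term_label),
                 with aspect "P" (process), "F" (function) or "C" (component).
    uninformative_mf: MF term IDs that should not count as an MF annotation,
                      e.g. {"GO:0046872"} for 'metal ion binding'.
    """
    has_regulation_bp = set()
    has_informative_mf = set()
    for gene, aspect, term_id, label in annotations:
        if aspect == "P" and label.startswith(REGULATION_PREFIXES):
            has_regulation_bp.add(gene)
        elif aspect == "F" and term_id not in uninformative_mf:
            has_informative_mf.add(gene)
    return sorted(has_regulation_bp - has_informative_mf)


# Toy data with hypothetical gene names.
annotations = [
    ("geneA", "P", "GO:0006275", "regulation of DNA replication"),
    ("geneB", "P", "GO:0006275", "regulation of DNA replication"),
    ("geneB", "F", "GO:0046872", "metal ion binding"),
    ("geneC", "P", "GO:0006275", "regulation of DNA replication"),
    ("geneC", "F", "GO:0004672", "protein kinase activity"),
]

strict = regulation_without_mf(annotations)
relaxed = regulation_without_mf(annotations, uninformative_mf={"GO:0046872"})
```

With the toy data, `strict` contains only `geneA`, while `relaxed` also picks up `geneB`, whose only MF annotation is the term treated as uninformative.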

On the call it was suggested that Mary and Eric could work on this together - don't see either of them on the assignee autocomplete, but maybe I'm missing them?

mdolanme commented 7 years ago

I checked the MGI annotations for genes with no MF annotation and with IMP annotations to 'regulation' type BP (I think that is what Pascale had done): 198 distinct genes.

pgaudet commented 7 years ago

Hello @mdolanme

That number seems about right. Would it be possible to extract the gene names/IDs for all species (I guess the 12 ref genomes + CGD + Xenbase)?

It would be interesting to do the same exercise, excluding 'binding' terms if they are the only MF terms.

Thanks, Pascale

ValWood commented 7 years ago

You would need to exclude binding terms for sure. This will inflate the numbers a lot.

However, the majority of the indirect IMP annotations are not annotated to the "regulation" term but to the term itself. If we do not know whether something is regulating or involved in, we should annotate to the term, so this is where the indirect annotations that are neither "regulates" nor "involved in" have collected.

That assumes the regulation terms have been used correctly, i.e. only when something is shown to be regulating, although I suspect people often use "regulation of" for any flavour of upstream.

pgaudet commented 7 years ago

Hi Val,

The issue with binding is that there are cases where it's the molecular mechanism of a process (as you can see when you do Noctua models).

ValWood commented 7 years ago

Maybe I misunderstood.

I thought you were trying to identify the IMP annotations which are likely to be indirect, and I'm not convinced this will identify most of the offenders.

Partly because these annotations often continue to exist after the mechanistic detail is added.

For sure the binding is sometimes part of the mechanism. But there is lots of binding annotation that isn't mechanistic.

I will wait and see....

ValWood commented 7 years ago

Term matrix should identify all of the "indirect" outliers for core cellular processes... It will be interesting to see if this identifies the same gene products.

It could be used for development too, but I don't plan to look at those...

pgaudet commented 7 years ago

Val you are right - I was just saying proteins annotated to binding may be false positives.

As a reminder: right now there is no relationship between proteins and processes (or any other GO annotations), and we have not been very consistent in how we annotated as a consortium. The goal is to have a formal way to link proteins and annotations. (Note that there is no suggestion to remove any annotations.) The most brutal option would be to link all annotations with the general relation 'upstream of or involved in', since we never paid attention to this, and annotate to 'involved in' as we go forward.

The goal of this issue is to see if we can be a bit smarter about this by identifying annotations we could rescue based on other information we have. For about 90% of the annotations [IMP + regulation + no function] I looked at, the role of the protein in the process could not be determined: a knockout had a lower level of a readout, an add-back experiment had a higher level of that readout. (The rest were just missing MF annotations, so they would anyway be a good group of proteins to prioritize to see if we can annotate an MF.) In this case I feel that we would serve the community better by admitting that we don't know how the protein is linked to a process rather than implicitly asserting 'involved in' as we are currently doing.

Ideally ( @mdolanme :D ) each group would look at

If over a certain fraction (50%? 60%? 80%? We can discuss the cut-off) of these groups has an unknown link to a process, we should treat that entire group as dubious and give it the qualifier.
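The cut-off rule could be sketched roughly as follows. How the groups are defined and how a curator records "unknown link" are not specified in this thread, so the input layout (one boolean per reviewed annotation), the function name, and the default cut-off are assumptions for illustration.

```python
def flag_dubious_groups(groups, cutoff=0.5):
    """Return groups whose fraction of 'unknown link' annotations exceeds the cut-off.

    groups: dict mapping a group name to a list of booleans, one per reviewed
            annotation, True when the protein's link to the process is unknown.
    cutoff: fraction above which the whole group is treated as dubious
            (0.5, 0.6, 0.8 are the values floated above; none is agreed).
    """
    flagged = {}
    for name, unknown_flags in groups.items():
        if not unknown_flags:
            continue  # nothing reviewed yet, so no decision for this group
        fraction = sum(unknown_flags) / len(unknown_flags)
        if fraction > cutoff:
            flagged[name] = fraction
    return flagged


# Hypothetical review results for two groups of annotations.
reviewed = {
    "group_1": [True, True, True, False],
    "group_2": [True, False, False, False],
}
dubious = flag_dubious_groups(reviewed, cutoff=0.6)
```

Here only `group_1` (3 of 4 unknown) passes the 60% cut-off; raising the cut-off to 0.8 would flag nothing.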

Note that this approach is way underestimating the potential indirect annotations since it's on a protein-by-protein basis: if a protein has any kind of function (say ubiquitin ligase activity), any other annotation to related or unrelated processes will not be flagged.

I hope this clarifies the proposal.

pgaudet commented 7 years ago

Val - I would be happy to leave out any 'organism-level' process such as development, behavior, etc. I cannot think of ways to directly test roles of proteins relative to these types of higher level processes (but if someone has suggestions, please comment!).

Anything you can do with the matrix would be really neat to look at. Let me know if I can help.

-Pascale

ValWood commented 7 years ago

I'm a little confused. Maybe because we annotate more conservatively.

You can use genetics/phenotypes to make very specific process annotations where you might not know the function. For example, the involvement of tel2 in the DNA replication checkpoint. At PomBase we would want to keep all of our IMP phenotype annotations as "involved in" because we don't use phenotypes to make annotations unless we have detailed enough phenotypes to know that the phenotype is not indirect.

I'm interested in the results, but I think you will miss a lot of the indirect annotations only looking at "regulates" (because most will be made directly to the term) http://geneontology.org/page/go-annotation-conventions#regulationTerms (guideline 2). I think most adhere to this?

I can share the matrix analysis with you, but so far I have only done amino acid metabolism, DNA replication, tRNA metabolism, ribosome biogenesis and cytoplasmic translation. There are not many indirect annotations remaining for these terms now. Those retained (likely to be upstream regulation) are easy to pull out.
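For readers unfamiliar with the idea, the co-annotation counting behind a matrix analysis like this can be sketched as below. This is not the Term Matrix tool itself; the input layout, function name, and gene names are assumptions for illustration, and a real analysis would also include annotations propagated up the ontology.

```python
from itertools import combinations


def co_annotated_genes(genes_by_term):
    """Return the genes shared by each pair of process terms.

    genes_by_term: dict mapping a term label to the set of genes annotated to it.
    Only pairs with at least one shared gene are returned; these intersections
    are the candidates a curator would inspect for indirect annotations.
    """
    shared = {}
    for term_a, term_b in combinations(sorted(genes_by_term), 2):
        overlap = genes_by_term[term_a] & genes_by_term[term_b]
        if overlap:
            shared[(term_a, term_b)] = overlap
    return shared


# Hypothetical gene sets for three of the processes mentioned above.
genes_by_term = {
    "DNA replication": {"gene1", "gene2"},
    "cytoplasmic translation": {"gene2", "gene3"},
    "tRNA metabolism": {"gene4"},
}
overlaps = co_annotated_genes(genes_by_term)
```

In the toy data, `gene2` is annotated to both DNA replication and cytoplasmic translation, so that pair is the only one reported for review.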

Seth has implemented the preliminary jenkins checks this week: https://build.berkeleybop.org/job/check-shared-annotations/lastBuild/console

vanaukenk commented 7 years ago

@mdolanme - Would you be able to get the same list you generated for MGI for other species?

Thx.

vanaukenk commented 7 years ago

Following on from 2017-04-11 annotation conference call, it would be helpful to have query templates to retrieve information from the GO database on:

Lists of genes/gene products annotated to:

  1. any regulation BP term and no MF annotation

  2. a BP term and regulation of that same BP
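As a starting point for the second list, here is a minimal sketch that works on direct annotations only. It assumes BP annotations are available as `(gene, term_id)` pairs and that a mapping from each regulation term to the process it regulates has been extracted from the ontology beforehand; both inputs, the function name, and the gene names are hypothetical, and this is not a query against the actual GO graph store.

```python
from collections import defaultdict


def process_and_its_regulation(annotations, regulates):
    """Find genes annotated to both a process and a term that regulates it.

    annotations: iterable of (gene, term_id) pairs for BP annotations.
    regulates: dict mapping a regulation term ID to the ID of the process it
               regulates (assumed to come from the ontology's regulates relations).
    Returns a sorted list of (gene, process_term, regulation_term) triples.
    """
    terms_by_gene = defaultdict(set)
    for gene, term_id in annotations:
        terms_by_gene[gene].add(term_id)

    hits = []
    for gene, terms in terms_by_gene.items():
        for term in terms:
            target = regulates.get(term)  # None when term is not a regulation term
            if target in terms:
                hits.append((gene, target, term))
    return sorted(hits)


# GO:0006275 'regulation of DNA replication' regulates GO:0006260 'DNA replication'.
regulates = {"GO:0006275": "GO:0006260"}
bp_annotations = [
    ("geneA", "GO:0006260"),
    ("geneA", "GO:0006275"),
    ("geneB", "GO:0006275"),
]
pairs = process_and_its_regulation(bp_annotations, regulates)
```

Only `geneA` is reported, since `geneB` is annotated to the regulation term but not to the process itself. Matching descendants of either term would need the is_a closure, which this sketch leaves out.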