geneontology / go-annotation

This repository hosts the tracker for issues pertaining to GO annotations.

regulation and causality #1532

Closed ValWood closed 7 months ago

ValWood commented 7 years ago

Last week I surveyed all of the PomBase annotations to "biological regulation". We have "biological regulation" annotations for 1161 gene products.

All of these were either:

  1. Within process regulation (i.e. regulating an activity, decision, or localization)
  2. Regulation via gene product degradation (this might be a type of within process regulation, I'm not sure).
  3. Regulation of a biological entity (we can ignore this for now)
  4. Upstream regulation via regulation of gene expression or signalling

The important point is that we don't have any examples of upstream regulation which are not via signalling or gene expression. Of course we might not have examples of everything...but I wonder if it is completely useful to annotate "causally upstream of" if we know an effect is indirect.

Do we intend to add this qualifier consistently long term, i.e. to everything which is causally upstream of some process but not regulating it? This would extend the curation burden a lot, from unmanageable to even more unmanageable.

I did a quick mock-up of just some of the dependencies between cellular processes:

[Figure: regulation and causality]

which might illustrate the problems somewhat.

In summary, the red lines are 'real' regulation:

  i) Signalling regulates most (all) processes
  ii) Regulation of gene expression regulates some processes (in higher eukaryotes this probably includes development)
  iii) Metabolic intermediates regulate transcription and metabolic processes
  iv) Some processes are regulated by proteolysis

The blue lines are causal relationships between processes that are not known/considered to be regulatory.

The black numbers are "involved in".
The red numbers are "involved in" + "regulates".
The blue numbers are "involved in" + "regulates" + "causally upstream".

Taking "regulation of mitotic cell cycle transition" as an example. 164 genes are involved in regulation of mitotic cell cycle transitions. This covers both "involved in" and "regulation" because this is a regulation term.

If we added in everything which was causally upstream, based on current phenotypes, the number of gene products annotated to this term would be inflated to 947, which may not be useful. You observe cell cycle defects by compromising gene products involved in transcription, ribosome biogenesis, splicing, translation, or DNA replication. Phenotypes are sometimes cryptic for core processes because the phenotypic hit is taken by the process with the most unstable protein products; in a normal cell these upstream processes do not have a regulatory effect.
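To put rough numbers on the inflation described above, here is a minimal Python sketch. The 164 and 947 figures come from the comment; the genome size, sample size, and hit count are invented purely for illustration, and the hypergeometric test only stands in for whatever enrichment method a user might actually run.

```python
from math import comb

def hypergeom_sf(k, N, K, n):
    """P(X >= k): chance of seeing k or more annotated genes in a sample of n,
    drawn from N genes of which K are annotated to the term."""
    total = comb(N, n)
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, min(K, n) + 1))
    return tail / total

DIRECT = 164     # "involved in" + "regulates" (figure from the comment)
INFLATED = 947   # after adding everything causally upstream (figure from the comment)
fold = INFLATED / DIRECT

# Hypothetical values, not PomBase data
GENOME = 5000    # assumed number of genes in the background set
SAMPLE = 50      # assumed size of a user's gene list
HITS = 10        # assumed overlap between that list and the term

p_direct = hypergeom_sf(HITS, GENOME, DIRECT, SAMPLE)
p_inflated = hypergeom_sf(HITS, GENOME, INFLATED, SAMPLE)

print(f"term size grows {fold:.2f}-fold")
print(f"p-value with direct annotations only: {p_direct:.2e}")
print(f"p-value with upstream annotations added: {p_inflated:.2e}")
```

Under these assumed numbers, the same overlap looks far less significant against the inflated term, which is one way of stating the concern about usefulness.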

I guess my concern is what is the purpose of causally upstream of? I can see the value of "causally upstream of or within" for situations where you can't make a judgement call based on the available data but I initially thought these would be removed as more information became available.

Is the purpose to eventually annotate everything which "causally affects" something (i.e. abnormal processes which were previously outside the scope of GO)? This would also increase the curation burden substantially, and might confuse users.

I can see a purpose for closer causal relationships (for example, a specific modification required for a process to occur), but maybe there is something more precise and currently undocumented about the types of causal relationship we aim to capture. Many of these "causally upstream"-type observations appear to be general biological observations/principles, and therefore they might be better captured in the ontology in some way rather than in ad hoc annotations and LEGO models.

Just some thoughts....I don't have any particular suggestions, because the plan/purpose is not very clear to me....

I think my primary concern(s) are a) Consistency within and between annotation groups ('within' may turn out to be a bigger issue) b) Additional curation burden and c) Obscurity of meaning to downstream users.

dosumis commented 7 years ago

Cool analysis!

If we added in everything which was causally upstream, based on current phenotypes, the number of gene products annotated to this term would be inflated to 947, which may not be useful.

Agreed. So that should not be the aim. The reason for the new qualifiers is that there are large numbers of existing IMP annotations coming from MODs covering multicellular organisms that are already in the ambiguous 'acts upstream of or within' category. It is useful to make the status of these annotations clearer. It is also possible to generate causally_upstream_of annotations from LEGO models. The plan right now, AFAIK, is to generate these but not to provide them in the default sets of annotations to download. @ukemi is keen on generating these. I'll leave it for him to comment on whether this will continue long term, given worries about inflation of the number of annotations.

ukemi commented 7 years ago

@vanaukenk what do you think? We have considered this, but this is a nice concrete example. At this point, I think we would only generate these if they are in models, which would mean that a curator thinks they are relevant, but I see your point. Right now, I think the people in our group want to see the annotations generated from their specific models. As we know more and more, the number of genes would increase dramatically. I certainly don't see curators trying to curate every single causal event for a downstream process. Are you coming to the GOC meeting?

ukemi commented 7 years ago

Thinking about this more, it does remind me of an older model that @vanaukenk built. The beautiful thing about the diagram above is that it models the biology in its entirety. The challenging thing is how to digest this information into the old annotation paradigm, which is far less complete. Our current annotations probably contain pieces of information similar to all of the nodes in the model above. Up until this point, we have cast a wide net with respect to uncertain phenotype-based information. But maybe in the future we want to slice annotations based on use. For enrichment analysis, we want to have only information in which gene products play a direct role, but in other more exploratory endeavors, we will want to be able to see causality. We need to break out of the old paradigm as we are able to represent more complex and complete biology. It's a good thing.

ValWood commented 7 years ago

Q: What is the exact meaning and purpose of the qualifiers?

At this point, I think we would only generate these if they are in models, which would mean that a curator thinks they are relevant.

I can see that it is satisfying for a curator to capture this information, and useful to do so. However, this will likely result in curators capturing the same indirect information over again many times, in multiple models, when that information is actually a "given" based on biological dependencies and is really more relevant to how the independently curated models are integrated/linked together to describe a cell or organism.

So, when and why does the curator think that "causally upstream" is relevant? It would be useful to pin down the rationale for the curatorial decision. We may have enough individual examples from Matrix analysis to come up with rules/suggestions. Relevance may depend on the number of targets and context...I have some suggestions but I am still thinking about this...

Maybe we need to separately assess the different goals of the use of qualifiers for a) within LEGO modelling and b) Pascale's needs to alert PAINTers to annotations which should not be transferred. Both of these use cases will promote ad hoc use of this qualifier, which may not be useful to users. The rationale for their use is slightly different.

Note: it is also useful to be able to enrich over "indirect effects" in addition to "involved in" and "regulates" to explain the results of high-throughput datasets. However, enrichment requires consistently annotated datasets (a large curation burden). So is this also a goal?

Q: What does the current set of curators mean? On the call yesterday it was very clear that the precise meaning and use were not completely clear to anybody..... We need biological examples of the use of each in addition to the abstract logical descriptions like

My brain cannot compute over this..... "p 'causally upstream or within' q if (1) the end of p is before the end of q and (2) the execution of p exerts some causal influence over the outputs of q; i.e. if p was abolished or the outputs of p were to be modified, this would necessarily affect q."

Plus it was not clear that all of the qualifiers were required; biological examples would help to determine this too.

Q: Are qualifiers the best approach? Based on the historical misuse/ignoring of qualifiers by downstream users, I'm not convinced that this is the best approach. At PomBase we have tried to restrict the use of the contributes_to and colocalizes_with qualifiers to only instances where the annotation is still completely valid if the qualifier is ignored (for example, we would only use contributes_to for an MF where the active site is distributed across a number of subunits). colocalizes_with could probably be completely deprecated with a little curation work.

As a case in point, the Ensembl inference pipeline is ignoring the contributes_to qualifier when transferring these annotations using pombe mcm6 (see http://www.ebi.ac.uk/QuickGO/GTerm?id=GO:1990518#term=annotation), which is OK for the reasons above. If the EBI are still ignoring these qualifiers we can't expect our average user to filter qualifiers.....

The new qualifiers for causally upstream are clearly more dangerous than existing qualifiers if ignored.

Unfortunately I won't be at the GO meeting, I am not travelling to the US at the moment. I will attend remotely...

dosumis commented 7 years ago

We need biological examples of the use of each in addition to the abstract logical descriptions like.

Plenty of cases coming out of LEGO inference. @ukemi @vanaukenk @balhoff - maybe we should focus on documenting some examples from cases we're looking at on the inference call?

As a case in point, the Ensembl inference pipeline is ignoring the contributes_to qualifier

I strongly believe that GO should be doing the filtering in the annotation sets we provide, with the default option the set with only involved_in and an option to download the full set.
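As a rough illustration of what such filtering could look like, here is a minimal Python sketch. It assumes a GAF-style tab-separated file in which the fourth column holds the qualifier and the ninth the aspect; the rows are truncated, invented examples (the `DEMO` database, gene IDs, and references are hypothetical), and `filter_gaf` is not an existing GO tool.

```python
QUALIFIER_COL = 3  # 4th column: qualifier (assumed GAF-style layout)
ASPECT_COL = 8     # 9th column: aspect, "P" for biological process

def filter_gaf(lines, keep=frozenset({"involved_in"})):
    """Yield header lines, non-process annotations, and process annotations
    whose qualifier (ignoring a leading NOT) is in `keep`."""
    for line in lines:
        if line.startswith("!"):
            yield line  # keep header/comment lines untouched
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) <= ASPECT_COL:
            continue  # skip malformed rows
        # A qualifier may be negated, e.g. "NOT|involved_in"
        relation = cols[QUALIFIER_COL].split("|")[-1]
        if cols[ASPECT_COL] != "P" or relation in keep:
            yield line

# Hypothetical, truncated rows for illustration only
sample = [
    "!gaf-version: 2.2\n",
    "DEMO\tG1\tgene1\tinvolved_in\tGO:0007049\tREF:1\tIMP\t\tP\n",
    "DEMO\tG2\tgene2\tacts_upstream_of_or_within\tGO:0007049\tREF:2\tIMP\t\tP\n",
    "DEMO\tG3\tgene3\tNOT|involved_in\tGO:0007049\tREF:3\tIMP\t\tP\n",
    "DEMO\tG4\tgene4\tenables\tGO:0003674\tREF:4\tND\t\tF\n",
]

default_set = list(filter_gaf(sample))
kept_ids = [l.split("\t")[1] for l in default_set if not l.startswith("!")]
print(kept_ids)
```

Passing a larger `keep` set would give the "full set" option; whether negated annotations belong in the default set is a separate decision that this sketch does not settle.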

The use of acts_upstream_of_or_within as a translation of existing IMP annotations from some MODS (probably not POMBASE) is to simply be clear about the current state of those annotations. If the MODS that generated these annotations want to carry on annotating as before, then it is better that we get them to acknowledge this by using the qualifier. BTW - I suspect that some of the clearest examples will be developmental. In these cases, causally upstream of can cross cells: GP-X required to make cell type Y which produces an inductive signal to specify cell-type Z; Increased rate of division => increased cell numbers in an imaginal disc => increased cell death in wing disc cells...

ukemi commented 7 years ago

So, when and why does the curator think that "causally upstream" is relevant? It would be useful to pin down the rationale for the curatorial decision. We may have enough individual examples from Matrix analysis to come up with rules/suggestions. Relevance may depend on the number of targets and context...I have some suggestions but I am still thinking about this...

One example. I am pretty sure that one strategy for designing drugs to interfere with the RAS pathway is to interfere with the modification of the protein so that it doesn't insert into the membrane. The function that is targeted is causally upstream of any RAS signaling. I think it would be very useful to be able to query GO for specific mechanisms that would have a downstream effect on a given process. I think this is a common case for drug design and probably will be for personalized genomic therapies as well. It would also give us really valuable information with respect to querying for off-target effects of drugs. Lately I have been thinking a lot about this and so I may be biased. But one of the things that I love about the LEGO models is that they can represent all this complexity.

But your point is well-taken in that this information probably does more harm than good in our conventional way of thinking about annotations, which are primarily used for enrichment analysis. I tend to agree that once we are certain that a gene product is involved in a process that is causally upstream of another process, maybe we want to get rid of the annotation to the downstream process. But what about all the ones where we still don't know? In the past we have made those phenotype-based annotations. I still tend to think we keep them for now.

ValWood commented 7 years ago

I strongly believe that GO should be doing the filtering in the annotation sets we provide, with the default option the set with only involved_in and an option to download the full set.

Yes, I was going to suggest something similar. We should point users to the correct datasets for their purpose, rather than require them to post-process.

ValWood commented 7 years ago

BTW - I suspect that some of the clearest examples will be developmental. In these cases, causally upstream of can cross cells: GP-X required to make cell type Y which produces an inductive signal to specify cell-type Z; Increased rate of division => increased cell numbers in an imaginal disc => increased cell death in wing disc cells...

So these cases, although upstream, sound like "real biological regulation", a signalling pathway is involved.

The examples I point out above are not at all regulatory in a normal cell (there are a couple of exceptions in the above graph which I have 'glossed over' here. I will pull these out as examples of "regulation" which could be obscured using the current proposal. I'm working through these slowly).

I think this boils down to a slightly different meaning of the phrases "indirectly upstream" and "regulating" within GO, and within the broader biological community. Regulation depends on some signal to change a downstream process. Above I'm referring to instances where there is no signal, but the upstream process got stuffed up by a defect which has lots of consequences.

dosumis commented 7 years ago

... instances where there is no signal, but the upstream process got stuffed up by a defect which has lots of consequences.

Like this:

GP-X required to make cell type Y which produces an inductive signal to specify cell-type Z

Stuffed upstream process = cell type Y differentiation. No GP-X => no cell type Y => no inductive signal => no cell type Z

If GP-X regulated the production of the inductive signal in cell-type Y then it seems reasonable to use regulates. But I suspect there is a big grey area in the middle here. Should regulatory chains cross gene-expression? (developmental biologists would say yes I think, but would cell biologists?) What if GP-X is involved in some quite generic process in the ER that is required for secretion of the inductive signal? My sense from having heard arguments about this since I started at GO is that this is very hard to codify in a way that satisfies everyone. Allowing LEGO curators to assert 'causally upstream of' in a model when they are unsure whether regulation applies seems fine to me. As this is a more general relation, the curation won't be wrong even if others would have used regulates in the same case.

ValWood commented 7 years ago

The function that is targeted is causally upstream of any RAS signaling.

This gets us closer to which "causally upstream" events are useful in LEGO: Localization events for specific pathway components.

In general, the upstream events we are interested in are the ones which have specific targets (almost a linear upstream extension of specific pathways), and not the ones which affect nearly every process in a cell (for example, the effect on homocysteine levels of reducing MTHFR affecting methylation, which will affect most cellular processes).

ukemi commented 7 years ago

Exactly! I think this is what @thomaspd is getting at when he talks about coordinated processes. We don't want RNA polymerase annotated to processes that are carried out by all of the gene products of every mRNA it transcribes.
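One way to picture the "specific targets" idea is to count how many downstream processes each gene would be placed upstream of. The Python sketch below is only an illustration: the gene names, process names, and the `max_targets` cutoff are all hypothetical, and no such rule is proposed anywhere in this thread.

```python
from collections import defaultdict

def split_by_specificity(upstream_pairs, max_targets=3):
    """Group (gene, downstream process) pairs by gene, then separate genes
    with few downstream targets from genes that touch many processes."""
    targets = defaultdict(set)
    for gene, process in upstream_pairs:
        targets[gene].add(process)
    specific = {g: sorted(p) for g, p in targets.items() if len(p) <= max_targets}
    generic = {g: sorted(p) for g, p in targets.items() if len(p) > max_targets}
    return specific, generic

# Hypothetical candidate "causally upstream" statements
pairs = [
    ("ras_modifying_enzyme", "Ras signalling"),
    ("wntless", "Wnt signalling"),
    ("rna_pol_subunit", "cell cycle"),
    ("rna_pol_subunit", "chromosome segregation"),
    ("rna_pol_subunit", "translation"),
    ("rna_pol_subunit", "splicing"),
    ("rna_pol_subunit", "DNA replication"),
]

specific, generic = split_by_specificity(pairs)
print("specific:", specific)
print("generic:", generic)
```

A count like this could at most flag candidates for a curator to review; it says nothing about whether an individual relationship is biologically meaningful.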

vanaukenk commented 7 years ago

Yes, I think what we're trying to capture in our annotations and LEGO models are those upstream events that have specific relationships to a downstream process, e.g. a transcription factor that specifies the fate of a neuron required for a particular behavior or the specific role of an ER protein like Wntless in the Wnt signaling pathway.

Yesterday after the call I reviewed some of my LEGO models where I've tried to capture that kind of information, and I do wonder if part of the issue is that these qualifiers are very general and thus, although an improvement over conventional annotation, still not specific enough to say what we want to say.

I also reviewed the BP annotations we have for the unc-13 gene, a priming factor for synaptic vesicle release. The existing annotations reflect the cellular level function of unc-13, but also the downstream consequences of losing that function which are mainly effects on different behaviors and development. I put that analysis, along with some comments in a spreadsheet in the GO folder on Google:

https://docs.google.com/spreadsheets/d/1HbvdyKmRI7Zhj6qPqVEsvyN6cMwahsf4aLa54fiROvs

I'd like to propose that curators split up into small working groups to look at different areas of biology (e.g. behavior, development), examine some of the existing annotations in their organism, and then come up with proposals for the June GO meeting about how we might capture information appropriately and consistently for these, and other areas, of biology.

ValWood commented 7 years ago

+100 actually....We should begin from the specific biological examples and classify according to precisely what we are trying to capture...It sounds more complicated, but it will be clearer for curators and users in the long term.

ValWood commented 7 years ago

Since this is clearly a "brainstorming" ticket

The use of acts_upstream_of_or_within as a translation of existing IMP annotations from some MODS (probably not POMBASE) is to simply be clear about the current state of those annotations.

I have more thoughts about how we use phenotype data at PomBase which might be useful. It is still on my todo list to document these, but we use additional information about the "specificity" of the phenotype to make a GO annotation.

For example for phenotypes like "chromosome segregation" alone we would not make a phenotype annotation. Defects like "chromosome loss" and "lagging chromosomes" are observed from very upstream processes.

However, specific phenotypes defining steps within the process (e.g. "abnormal mitotic metaphase chromosome recapture" or "abnormal mitotic sister chromatid arm separation"), aimed at figuring out precisely how the gene fits into the process, are fine. I ask the question "is this phenotype specific to this process?". We also use curator judgement and known biology when making GO annotations based on phenotypes (is this gene product involved in this process in other species? Is it in the right place at the right time?). The observation used to make the inference is a phenotype, but whether to make the assertion or not should be a considered judgement of the curator (this is, after all, why our job is not automatable).

So part of the assessment of phenotype annotation may only require assessing IMP to non-specific GO terms....

pfey03 commented 7 years ago

I looked over our list of IMP processes, and yes, it's a VERY manual thing to add qualifiers! I weeded out those that are specific and correct, like enzymatic processes or so, but still have close to 2000 (some might still be ok as directly involved).

Let me know where to post specific examples (here?) that would need a qualifier, either because there are no regulation terms in the ontology (-> involved_in_regulation_of) or because they act upstream (acts_upstream_of / acts_upstream_of_or_within).

Probably some gene/protein groups could be bulk qualifier annotated by sending a list to Tony.

Thanks! Petra

@rjdodson

vanaukenk commented 7 years ago

Thanks @pfey03 I started a Google spreadsheet on the GO's Google directory: https://docs.google.com/spreadsheets/d/1HbvdyKmRI7Zhj6qPqVEsvyN6cMwahsf4aLa54fiROvs

to start showing some examples from C. elegans.

If you'd like to add a dicty sheet to that spreadsheet, that'd be great.

pgaudet commented 5 years ago

Copied from #2041: I have a question and some comments.

Q. Why "upstream of or within" if you know it is upstream?

Comments

Note that PomBase don't want to receive ANY "causally upstream of" or "upstream or within" by annotation transfer pipelines. There are lots of reasons for this.

Early in GO annotation our users dissuaded us from making indirect GO annotations (here I'm using indirect to mean NOT directly involved in a process).

I have tried before to indicate the effects of making "causally upstream but external to a process" annotations. Look at the effect on "mitotic cell cycle regulation" and "chromosome segregation" in the mock-up figure in #1532. This type of "indirect annotation" would never be complete (and its annotation will be arbitrary). These annotations will dilute the usefulness of GO for enrichment analysis and slimming of the bona fide processes.

In our experience, biologists working on a process want the genes involved directly in the process. We annotate genes affecting a process using phenotype annotations and we would like to maintain this distinction. The reasons should be obvious if you look at the cell cycle process. The numbers in my figure are only the tip of the iceberg. Probably at some level nearly everything affects the cell cycle, so such causal annotation becomes rather meaningless.

I know we could ignore any "causally upstream" qualifier, but we will want to use this qualifier within a process or pathway when we have not got all of the information to complete a model (i.e. we don't know all of the connecting steps). I still don't know which causal relationship specifically refers to this type of "within process" causality.

For example, process x is causally upstream of process y, which is causally upstream of process z (i.e. ribosome biogenesis is upstream of translation, which is upstream of splicing (for the spliced subset of genes), etc.).
This could be captured in a more robust and complete way by connecting pathways than by making individual annotations that say:

gene x part_of ribosome biogenesis
gene y causally_upstream_of translation
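The alternative of connecting processes, rather than annotating each gene to each downstream process, can be sketched as follows. This is a minimal Python illustration under stated assumptions: the process graph and gene assignments are toy data based on the example above, and `downstream` and `infer_upstream` are hypothetical helpers, not part of any GO or Noctua tooling.

```python
# Process-level links stated once: process -> processes it is directly upstream of
PROCESS_EDGES = {
    "ribosome biogenesis": {"translation"},
    "translation": {"splicing"},
    "splicing": set(),
}

# Direct, within-process annotations only (toy data)
GENE_PART_OF = {
    "gene x": "ribosome biogenesis",
    "gene y": "translation",
}

def downstream(process, edges):
    """Return every process reachable from `process`, excluding itself."""
    seen = set()
    stack = list(edges.get(process, ()))
    while stack:
        current = stack.pop()
        if current in seen or current == process:
            continue
        seen.add(current)
        stack.extend(edges.get(current, ()))
    return seen

def infer_upstream(gene_part_of, edges):
    """Derive 'causally upstream of' statements from the process graph
    instead of asserting them gene by gene."""
    return {gene: sorted(downstream(proc, edges)) for gene, proc in gene_part_of.items()}

inferred = infer_upstream(GENE_PART_OF, PROCESS_EDGES)
for gene, processes in inferred.items():
    print(gene, "is causally upstream of", processes)
```

Because the dependencies live in one place, the derived statements are complete with respect to the graph by construction, and they can be generated or left out depending on the use case.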

You might say "But we would not make such annotations"

I refer you back to point 1. -> Arbitrary annotation. To be useful you would need to do it for everything or not at all.

So in the example above, we know that mammary gland development is required for lactation to occur, so why report this at annotation time? It's a given.

We could for example convert our PomBase 80,000 phenotype annotations which use GO logical definitions into GO with a causally upstream of qualifier to the affected downstream process. How would this be useful? All it would do is make it harder for users to interpret, consume and process GO annotation.

I think we are creating future problems unless we figure this out and are very precise about the different types, relevance and usefulness of a "causally upstream" annotation.


We need clear guidelines.

ValWood commented 5 years ago

Will we discuss this in Montreal?

pgaudet commented 5 years ago

It'd be nice. Let's first discuss it on Monday.

suzialeksander commented 7 months ago

Closing as these exist in Noctua now.