geneontology / go-ontology

Source ontology files for the Gene Ontology
http://geneontology.org/page/download-ontology
Creative Commons Attribution 4.0 International

make intracellular not for non-manual direct annotation ? #16257

Closed ValWood closed 6 years ago

ValWood commented 6 years ago

I don't know if manual annotations for "intracellular" are useful or not (we don't use this term).

But we should not get annotations to this term from any inference pipeline. It's just clutter. For example, this one to a well-annotated gene product:

PomBase SPAC1296.02 cox4 GO:0005622 PMID:21873635 IBA PANTHER:PTN000012880|WB:WBGene00000371 C cytochrome c oxidase subunit IV (predicted) protein taxon:4896 20170228 GOC
PomBase SPAC1296.02 cox4 GO:0005623 PMID:21873635 IBA PANTHER:PTN000012880|WB:WBGene00000371 C cytochrome c oxidase subunit IV (predicted) protein taxon:4896 20170228 GOC
PomBase SPAC1296.02 cox4 GO:0005739 PMID:21873635 IBA PANTHER:PTN000012880|WB:WBGene00000371 C cytochrome c oxidase subunit IV (predicted) protein taxon:4896 20170228 GOC
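
To illustrate the kind of filter being asked for, here is a minimal sketch that splits the three lines above on whitespace and flags the ones pointing at very general terms. It is not an existing GO tool: the `TOO_GENERAL` set and the `flag_uninformative` helper are names made up for this example, and real GAF files are tab-delimited with more columns than shown here.

```python
# Hypothetical sketch: flag annotations whose GO term is in a "too general" set.
# The lines are shortened copies of the whitespace-flattened examples above;
# real GAF files are tab-delimited, so production code should split on "\t".

TOO_GENERAL = {"GO:0005622", "GO:0005623"}  # intracellular, cell (assumed block list)

LINES = [
    "PomBase SPAC1296.02 cox4 GO:0005622 PMID:21873635 IBA",
    "PomBase SPAC1296.02 cox4 GO:0005623 PMID:21873635 IBA",
    "PomBase SPAC1296.02 cox4 GO:0005739 PMID:21873635 IBA",
]


def flag_uninformative(lines, blocked=TOO_GENERAL):
    """Return (gene_id, go_id) pairs whose GO term is in the blocked set."""
    flagged = []
    for line in lines:
        fields = line.split()
        # Positions follow the flattened example: db, id, symbol, GO id, ...
        gene_id, go_id = fields[1], fields[3]
        if go_id in blocked:
            flagged.append((gene_id, go_id))
    return flagged


print(flag_uninformative(LINES))
```

Only the GO:0005739 (mitochondrion) line passes the filter, which is the annotation that actually says something about cox4.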

pgaudet commented 6 years ago

It's true that I did this quite a bit. I don't know whether the other curators did. @marcfeuermann @krchristie

The reasoning is that it is useful to know that the protein is NOT secreted, even if the actual localization is not quite clear (or varies too much among paralogs to be sure).

We can discuss removing them if people find it's not useful.

Thanks, Pascale

ValWood commented 6 years ago

cox4 though?

I can see it for a GP with no other annotation, although I don't think it is very meaningful. If you can establish with certainty that a gene product is intracellular, you should be able to say something else?

pgaudet commented 6 years ago

Like what? In particular, the mitochondrial proteins are a pain to localize; in plants many are found in the mitochondrion, the chloroplast, or both. I know I have also used 'membrane-bound organelle' for those.

Where do you see that for cox4 ? It's not here: http://amigo.geneontology.org/amigo/gene_product/PomBase:SPAC1296.02

ValWood commented 6 years ago

Because we filter it... but is it a useful annotation more generally?

Cox4 could be respiratory chain (GO:0070469) (true for everything?)

pgaudet commented 6 years ago

What I mean is I don't see it in the PAINT annotations - see http://www.pantree.org/node/annotationNode.jsp?id=PTN000012880

ValWood commented 6 years ago

Oh sorry, yes, this wasn't made by PAINT. It was made by the function process inference pipeline (I'm not sure which term-to-term relationship, but it is because of some relation hardcoded in the ontology).

I think this term should be made so that it is not possible to make the annotation via an automated pipeline. This would also prevent Ensembl annotations etc, but could be applied judiciously if required?

marcfeuermann commented 6 years ago

In general I try not to use "intracellular" when doing PAINT annotation because I'm not convinced that it really adds relevant information. I can understand Pascale's argument about discriminating from secreted proteins, but I consider it at the same level as "protein binding" for biological function or "growth" for biological process. Regards, Marc.

pgaudet commented 6 years ago

But it avoids a 'ND'

ValWood commented 6 years ago

True but in this case I wonder if ND is better.

For instance, internally at PomBase we block many "high level" process terms for annotation, because the annotation is so minimal we prefer to say the process is unknown.

This way, we can provide our users with well-curated lists of "unknowns" https://www.pombase.org/status/priority-unstudied-genes

This is one of our most popular lists.

We submit these to GO with BP ND and woe betide anyone who tries to stick a non-informative high-level process term, or some random annotation from a phenotype onto them via a pipeline.

So this to me is a bit similar. I know we annotate to inferred things with GO, but there should be some level of information required.

marcfeuermann commented 6 years ago

I agree. It is sometimes better to put "unknown" instead of minimal/vague/non-relevant/confusing information.

krchristie commented 6 years ago

I have never annotated to 'intracellular', either manually or when doing PAINT. It has never seemed particularly useful to me.

pgaudet commented 6 years ago

OK; for PAINT there is no problem for me to remove.

For others: There are >1000 manual annotations + >200 manual ISS, distributed as follows:

Group             Number of annotations
AspGD             519
UniProt           208
GOC               141
MGI               115
TIGR              92
LIFEdb            70
GeneDB            43
dictyBase         33
SGD               27
BHF-UCL           24
ARUK-UCL          23
AgBase            18
UniProtKB         18
GR                15
TAIR              15
CGD               14
WB                14
CAFA              10
PINC              9
RGD               9
ParkinsonsUK-UCL  8
FlyBase           7
ZFIN              7
HGNC              5

ValWood commented 6 years ago

Although personally I don't think it is a useful term (and I have never had a situation where I have seen that something is intracellular but have not been able to be more specific), it might be useful for other resources. It might be worth asking why people needed it. I would rather be able to obtain the list of "unknown localization" than know that something was intracellular (but surely cytoplasmic, cytosolic, cell cortex, or something else can be said?).

We should not transfer this annotation. However, maybe we should allow direct annotation manually if people have a reason for making these?

BTW: Most of the Aspergillus ones (IDA) come from: https://www.ncbi.nlm.nih.gov/pubmed/20797444. I could not see any localization data in this paper. It's about translational changes. Of course, these gene products are intracellular when they are translated... Preventing this is a good reason to block the term!

I suspect that most people inspecting these annotations would find a better annotation to use (if not already existing from another source!).

ValWood commented 6 years ago

For example, the first one in the SGD list is RTP1.

RTP1 is clearly a nucleocytoplasmic shuttling protein. It should be annotated to "nucleus" and "cytoplasm"

To check whether Rtp1p shuttles between the nucleus and the cytoplasm in an Xpo1p-dependent manner, we examined the localization of Rtp1-GFP in an XPO1T539C mutant strain. In this strain, leptomycin B (LMB) addition inhibits Xpo1p-mediated transport (36). The cellular distribution of Rtp1-GFP was not affected by LMB addition, even with long incubations times (Fig. 2C).

ValWood commented 6 years ago

The second SGD one is GSH1. I can't see any localization data in this paper. This seems to be an "inference". I'm not checking any others, but I'm sure they should all either be more specific or be removed.

marcfeuermann commented 6 years ago

BTW, I've just found an annotation to the CC term GO:0005623: "cell" Def: The basic structural and functional unit of all organisms. Includes the plasma membrane and any external encapsulating structures such as the cell wall and cell envelope. This seems even worse than "intracellular", don't you think so ? Regards, Marc.

cmungall commented 6 years ago

Would we not want to reserve use of cell for use in GO-CAMs? E.g. occurs_in some (cell and part_of some S)

pgaudet commented 6 years ago

Do we have instances of this ? I also find 'cell' is not very informative. I thought it had been created to distinguish from non-cellular organisms.

pgaudet commented 6 years ago

  1. I just noticed there are a lot of P-C predictions to intracellular and cell, for example:

    • GO:0005622 intracellular AnnotationPropagation C GO:0045047 protein targeting to ER IBA
    • GO:0005623 cell AnnotationPropagation C GO:0001678 cellular glucose homeostasis IBA
  2. If we make a term 'too high level' for annotation, aren't its parents automatically excluded? (As would be the case for 'cell' if 'intracellular' is too high level for annotation.)

Thanks, Pascale

krchristie commented 6 years ago

I would be delighted if we quit generating annotations to 'cell' and to 'intracellular' due to the P-C links, so if marking these terms as 'too high level' would block these P-C link generated annotations, I think that's a step forwards.
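
A rough sketch of what blocking P-C generated annotations could look like, assuming the pipeline can be modelled as a lookup from process term to component term. The two links are the ones quoted earlier in this thread; `PC_LINKS`, `DO_NOT_ANNOTATE`, `propagate`, and the gene names are hypothetical and do not describe the real pipeline.

```python
# Hypothetical sketch of process-to-component (P-C) propagation with a block list.
# The two links are the examples quoted in this thread; the data structures are assumptions.

PC_LINKS = {
    "GO:0045047": "GO:0005622",  # protein targeting to ER -> intracellular
    "GO:0001678": "GO:0005623",  # cellular glucose homeostasis -> cell
}

DO_NOT_ANNOTATE = {"GO:0005622", "GO:0005623"}


def propagate(process_annotations, links=PC_LINKS, blocked=frozenset()):
    """Derive component annotations from process annotations, skipping blocked terms."""
    derived = []
    for gene, process_term in process_annotations:
        component = links.get(process_term)
        if component is not None and component not in blocked:
            derived.append((gene, component))
    return derived


annotations = [("geneA", "GO:0045047"), ("geneB", "GO:0001678")]
print(propagate(annotations))                           # both inferred annotations
print(propagate(annotations, blocked=DO_NOT_ANNOTATE))  # nothing survives the filter
```

With the block list applied, neither inferred annotation is emitted, while the underlying process annotations are left untouched.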

ValWood commented 6 years ago

I would like to see them blocked too (which would not prevent their use in extensions via GO-CAM).

If we make a term 'too high level' for annotation, aren't its parents automatically excluded

No, because sometimes the parent is OK. For example, we blocked "transport" but we would not want to block "localization". I already made a task a while ago to make a list of the parents of blocked terms which could be blocked. I'll submit a ticket soon-ish.
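
A small sketch of that point, assuming blocking is recorded per term and is not inherited upward. The two-step hierarchy is simplified for illustration, term names stand in for GO IDs, and `PARENTS`, `BLOCKED`, `is_blocked`, and `unblocked_ancestors` are hypothetical names.

```python
# Hypothetical sketch: blocking is per-term, not inherited by parents.
# Term names stand in for GO IDs; the tiny hierarchy is simplified for illustration.

PARENTS = {
    "transport": ["localization"],
    "localization": ["biological_process"],
}

BLOCKED = {"transport"}  # explicitly marked as not for direct annotation


def is_blocked(term, blocked=BLOCKED):
    """A term is blocked only if it is listed itself."""
    return term in blocked


def unblocked_ancestors(term, parents=PARENTS, blocked=BLOCKED):
    """List ancestors of a term that are still allowed: candidates for manual review."""
    seen, stack, result = set(), list(parents.get(term, [])), []
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        if current not in blocked:
            result.append(current)
        stack.extend(parents.get(current, []))
    return result


print(is_blocked("transport"), is_blocked("localization"))
print(unblocked_ancestors("transport"))
```

The second function produces the kind of review list mentioned above: parents of a blocked term that someone still has to decide on individually.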

pgaudet commented 6 years ago

which would not prevent their use in extensions via GO-Cam

Those rules seem complicated to implement.

ValWood commented 6 years ago

It isn't really any different from what we do with biological phase terms right now. We can use them in extensions, but we can't annotate to them directly: https://www.ebi.ac.uk/QuickGO/term/GO:0044848
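
As a sketch of how simple the rule could be, assuming an annotation is modelled as a direct term plus a list of extension pairs: only the direct term is compared against the block list. The `Annotation` class, the `check` function, and the example gene name are hypothetical, not part of any GO validation code.

```python
# Hypothetical sketch: reject blocked terms as direct annotations but allow them in extensions.
from dataclasses import dataclass, field

DO_NOT_ANNOTATE = {"GO:0005622", "GO:0005623"}  # intracellular, cell


@dataclass
class Annotation:
    gene: str
    term: str                                       # directly annotated GO term
    extensions: list = field(default_factory=list)  # (relation, GO term) pairs


def check(annotation, blocked=DO_NOT_ANNOTATE):
    """Return a list of problems; only the direct term is checked against the block list."""
    if annotation.term in blocked:
        return [f"{annotation.term} is not allowed for direct annotation"]
    return []


direct = Annotation("geneA", "GO:0005622")
in_extension = Annotation("geneA", "GO:0045047", [("occurs_in", "GO:0005622")])

print(check(direct))        # one problem reported
print(check(in_extension))  # no problems: the blocked term appears only in an extension
```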

RLovering commented 6 years ago

At UCL we have looked at the majority of our annotations and are in the process of removing our 2 cell and 60 or so intracellular annotations, although I haven't tried to look at our use of intracellular in the AE field.

ValWood commented 6 years ago

Hi @RLovering as far as I know there isn't any issue with "intracellular" in the AE field. Although, at present I think the inference pipeline would unfold them to instantiate the annotation. Personally, I don't think we really want to do that? @cmungall

My preference would be to allow these terms in extensions (for GO-CAM etc) but not allow for direct annotation (which would prevent direct annotations from being created by any pipeline).

I would add that so far, I never came across a situation for yeast where we could not be more specific than "intracellular", I'd be interested if such an annotation situation exists more generally. I suspect blocking for direct annotation will enforce more informative annotation.

Examples I saw included cases like "nucleus and cytoplasm". Maybe "intracellular" was selected because the "location of activity" was not known. However, with new relationships for location this would not be a problem. Biological end users definitely want to know these multiple locations. For most of the recent gene characterizations relating to the pombe cytokinetic/spindle pole body/centrosome, the location was known and annotated first. Processes followed, based on biologists following the lead from GO CC data of proteins with a location of interest. If we are supporting bench biologists we really need to capture these assayed locations in the absence of functional data.

Later we have been able to add extensions to describe which locations occurred_during which phases of the cell cycle. For many, we are confident about the locations (for example medial ring during interphase, spindle pole body during mitosis, spindle midzone during mitotic metaphase etc). Although we sometimes know little about the processes and even less about the functions.

pgaudet commented 6 years ago

name: intracellular
+subset: gocheck_do_not_annotate
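
For anyone consuming this downstream, a minimal sketch of reading that subset tag back out of OBO-format text. The stanzas below are a cut-down stand-in for the ontology file, and `terms_in_subset` is a hypothetical helper written for this example rather than a function from an OBO library.

```python
# Hypothetical sketch: collect term IDs tagged with a given subset from OBO-format text.
# The stanzas mirror the change above; a real pipeline would read the full ontology file.

OBO_TEXT = """\
[Term]
id: GO:0005622
name: intracellular
subset: gocheck_do_not_annotate

[Term]
id: GO:0005739
name: mitochondrion
"""


def terms_in_subset(obo_text, subset="gocheck_do_not_annotate"):
    """Return the set of term IDs whose [Term] stanza carries the given subset tag."""
    found, current_id, in_term = set(), None, False
    for raw in obo_text.splitlines():
        line = raw.strip()
        if line.startswith("["):
            # A new stanza starts; only [Term] stanzas are of interest here.
            in_term = line == "[Term]"
            current_id = None
        elif in_term and line.startswith("id: "):
            current_id = line[len("id: "):]
        elif in_term and line.startswith("subset: ") and current_id:
            # Drop any trailing "! comment" before comparing.
            value = line[len("subset: "):].split("!")[0].strip()
            if value == subset:
                found.add(current_id)
    return found


print(terms_in_subset(OBO_TEXT))
```

The resulting set can then serve as the block list for any annotation or inference check.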

pgaudet commented 6 years ago

https://github.com/geneontology/go-ontology/pull/16445