geneontology / go-ontology

Source ontology files for the Gene Ontology
http://geneontology.org/page/download-ontology
Creative Commons Attribution 4.0 International

general terms not for direct annotation #16101

Open ValWood opened 6 years ago

ValWood commented 6 years ago

I said I would start to submit terms to be flagged "not for direct annotations" as it should always be possible to be more specific.

Here is the first batch. We plan to do this in stages. These are fairly no-brainer cases. I think it should probably be a soft check (warnings) in the first instance, then become a hard check later. If possible, it would be good to enforce it as a hard check for new annotations but give people a chance to improve legacy annotations.

For most of these the block should probably propagate to most parents except for the root node.

~GO:0032502 developmental process (what type?)
GO:0051704 multi-organism process (what type)
GO:0022402 cell cycle process (what type?)
GO:0022411 cellular component disassembly (which component?)
GO:0022414 reproductive process (what type?)
GO:0044703 multi-organism reproductive process (what type?) (1 direct annotation)
GO:0044764 multi-organism cellular process (what type?)
GO:0044765 single-organism transport
GO:0019953 sexual reproduction (what type?)
GO:0030281 structural constituent of cutaneous appendage (what type?)
GO:0022403 cell cycle phase (this entire should be blocked anyway?)
GO:0016043 cellular component organization
GO:0006464 cellular protein modification process
GO:0009058 biosynthetic process
GO:0009059 macromolecule biosynthetic process
GO:0009109 coenzyme catabolic process
GO:0010638 positive regulation of organelle organization
GO:0010639 negative regulation of organelle organization
GO:0070925 organelle assembly
GO:0006996 organelle organization
GO:0005215 transporter activity
GO:0048518 positive regulation of biological process
GO:0048519 negative regulation of biological process
GO:0048522 positive regulation of cellular process
GO:0048523 negative regulation of cellular process (2 direct annotations)
GO:2000241 regulation of reproductive process
GO:1903506 regulation of nucleic acid-templated transcription
GO:1903507 negative regulation of nucleic acid-templated transcription
GO:1903508 positive regulation of nucleic acid-templated transcription
GO:1903649 regulation of cytoplasmic transport
GO:1903651 positive regulation of cytoplasmic transport
GO:0043900 regulation of multi-organism process
GO:0043902 positive regulation of multi-organism process~

reviewed below
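The propagation idea mentioned above (a block on a term also applying to its parents, stopping short of the root) can be sketched roughly as follows. This is only an illustration: the `PARENTS` map is a simplified toy hierarchy rather than the real ontology graph, and `ancestors` and `propagate_block` are hypothetical helper names. The comment says "most parents", so a real check would presumably need an exception list, which is omitted here.

```python
from typing import Dict, Iterable, Set

# Toy is_a edges (child -> parents). Simplified for illustration only;
# the real ontology has more intermediate terms and multiple parentage.
PARENTS: Dict[str, Set[str]] = {
    "GO:0008150": set(),            # biological_process (root)
    "GO:0009987": {"GO:0008150"},   # cellular process
    "GO:0016043": {"GO:0009987"},   # cellular component organization
    "GO:0006996": {"GO:0016043"},   # organelle organization
    "GO:0070925": {"GO:0006996"},   # organelle assembly
}
ROOTS: Set[str] = {"GO:0008150"}


def ancestors(term: str, parents: Dict[str, Set[str]]) -> Set[str]:
    """Return every term reachable by following parent edges from `term`."""
    seen: Set[str] = set()
    stack = [term]
    while stack:
        for parent in parents.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def propagate_block(blocked: Iterable[str],
                    parents: Dict[str, Set[str]],
                    roots: Set[str]) -> Set[str]:
    """Extend a block list to all ancestors of each blocked term, minus roots."""
    result = set(blocked)
    for term in list(result):
        result |= ancestors(term, parents)
    return result - roots


blocked = propagate_block({"GO:0006996"}, PARENTS, ROOTS)
print(sorted(blocked))
```

In this toy graph, blocking "organelle organization" also blocks the two more general terms above it, while the more specific child "organelle assembly" stays available for annotation.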

ValWood commented 6 years ago

Some more. When I looked at these a few weeks ago I checked some direct annotation numbers; they were very low, so these should not require much annotation deepening. However, I have >1000 terms, and some are quite specific, so I'm not sure they can all be included as general rules for all systems.

~GO:0033036 macromolecule localization
GO:0032879 regulation of localization
GO:0033043 regulation of organelle organization
GO:0051179 localization
GO:0033365 protein localization to organelle
GO:0051640 organelle localization
GO:0051641 cellular localization
GO:0008152 metabolic process
GO:0009056 catabolic process
GO:0009057 macromolecule catabolic process
GO:0009894 regulation of catabolic process
GO:0009987 cellular process
GO:0009889 regulation of biosynthetic process
GO:0009890 negative regulation of biosynthetic process
GO:0009891 positive regulation of biosynthetic process
GO:0009892 negative regulation of metabolic process
GO:0009893 positive regulation of metabolic process
GO:0031323 regulation of cellular metabolic process
GO:0031324 negative regulation of cellular metabolic process
GO:0031325 positive regulation of cellular metabolic process
GO:0031326 regulation of cellular biosynthetic process
GO:0031327 negative regulation of cellular biosynthetic process
GO:0031328 positive regulation of cellular biosynthetic process
GO:0031329 regulation of cellular catabolic process
GO:0032386 regulation of intracellular transport
GO:0032388 positive regulation of intracellular transport
GO:0046907 intracellular transport~

reviewed below

ValWood commented 6 years ago

Note to self, the submitted terms are documented at the bottom of

GO_terms_excluded_from_pombase.txt

under "submitted to GO"

ukemi commented 6 years ago

Do we really want to do this in the era of GO-CAM models?

ValWood commented 6 years ago

Well, PomBase already annotates using the "GO-CAM" model philosophy; we just don't use Noctua. We have adopted a "top-down" approach to modelling rather than the Noctua "bottom-up" approach. We now block over 1000 terms for high level annotation, where it should be logically possible to be more precise (spanning all 3 ontology branches). This procedure isn't inconsistent with GO-CAM modelling, because you should still be using the most granular process available, even in GO-CAM.

It's a very quick and easy, low time investment/high return option for improving GO annotation consistency and removing redundancy globally, so I don't see a good reason not to apply it to other groups. The QC group included this "high level term blocking" as one strategy to improve annotation quality (in fact we have been implementing this in the transport and transcription branches).

In addition, many groups are moving toward more community curation input, and forcing annotation specificity is very helpful to get the most consistent and specific annotation from this increasingly important activity.

krchristie commented 6 years ago

@ukemi - I am wondering why use of GO-CAMs would affect whether or not we want to make direct annotation to these terms. I don't think I'm following your line of thought here.

RLovering commented 6 years ago

Please can we keep the organelle terms allowable for direct annotation? I am not sure that we have child terms for every organelle, so in some cases we will need to specify the organelle using the GO-CAM or AE options.

Thanks

Ruth

ValWood commented 6 years ago

Example: "organelle organization"
total annotations: 203,808
total direct annotations: 81
total direct EXP annotations: 17

Examples include:

HPS1, biogenesis of lysosomal organelles complex 3 subunit 1 MGI
HPS3, biogenesis of lysosomal organelles complex 3 subunit 1 MGI
HPS5, biogenesis of lysosomal organelles complex 3 subunit 1 MGI
HPS6, biogenesis of lysosomal organelles complex 3 subunit 1 MGI
GORASP2 | Golgi reassembly-stacking protein 2

How can there be an example of annotating to an organelle -type term where you would not know the organelle referred to?

As it is, these annotations are practically useless because they would include nuclear, mitochondrial, Golgi etc..... We should have terms for all organelles and if we don't we should add them?
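The difference between "total" and "direct" annotation counts quoted above can be illustrated with a small sketch. It assumes that an annotation to a term also counts towards every ancestor of that term, which is how the totals are read here. The hierarchy, the gene names, and the helper names (`direct_counts`, `total_counts`) are all made up for the example.

```python
from collections import Counter
from typing import Dict, List, Set, Tuple

# Toy is_a edges (child -> parents); simplified, not the real ontology.
PARENTS: Dict[str, Set[str]] = {
    "GO:0016043": set(),            # cellular component organization
    "GO:0006996": {"GO:0016043"},   # organelle organization
    "GO:0070925": {"GO:0006996"},   # organelle assembly
}

# Hypothetical (gene, term) pairs: each is one direct annotation.
ANNOTATIONS: List[Tuple[str, str]] = [
    ("geneA", "GO:0070925"),
    ("geneB", "GO:0070925"),
    ("geneC", "GO:0006996"),
]


def ancestors(term: str, parents: Dict[str, Set[str]]) -> Set[str]:
    """Return every term reachable by following parent edges from `term`."""
    seen: Set[str] = set()
    stack = [term]
    while stack:
        for parent in parents.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def direct_counts(annotations: List[Tuple[str, str]]) -> Counter:
    """Count annotations made to exactly this term."""
    return Counter(term for _, term in annotations)


def total_counts(annotations: List[Tuple[str, str]],
                 parents: Dict[str, Set[str]]) -> Counter:
    """Count annotations to a term or to any of its descendants."""
    counts: Counter = Counter()
    for _, term in annotations:
        for t in {term} | ancestors(term, parents):
            counts[t] += 1
    return counts


direct = direct_counts(ANNOTATIONS)
total = total_counts(ANNOTATIONS, PARENTS)
for term in PARENTS:
    print(term, "direct:", direct[term], "total:", total[term])
```

A term with a large total but a very small direct count, like "organelle organization" in the figures above, is the kind of term that can be blocked with little reannotation work.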

ValWood commented 6 years ago

In pombe, 25% of process-annotated proteins are annotated to "organelle organization"; we need to be more specific than this!

RLovering commented 6 years ago

So you want to create organelle specific terms for all of these terms:
GO:0070925 organelle assembly
GO:0006996 organelle organization
GO:0010638 positive regulation of organelle organization
GO:0010639 negative regulation of organelle organization
GO:1903649 regulation of cytoplasmic transport
GO:1903651 positive regulation of cytoplasmic transport

plus component specific terms for all of these (which includes complexes, but I guess in this case the complex general term can be used):
GO:0022411 cellular component disassembly
GO:0016043 cellular component organization

plus protein/complex specific terms for all of these:
GO:0009059 macromolecule biosynthetic process
GO:0009109 coenzyme catabolic process
GO:0005215 transporter activity
GO:0006464 cellular protein modification process

ValWood commented 6 years ago

@RLovering I think you have misunderstood what we are doing here.

The specific terms already exist. They should be used for annotation rather than the broad terms.

These terms are very rarely used, and that is good, but the point is that they should not be used at all, because they are not informative about the biology.

ValWood commented 6 years ago

~GO:0016043 cellular component organization Again 26% of pombe proteins have this annotation. NONE are direct. Why would anyone want to annotate directly to this term, it doesn't tell you anything?~

done

ukemi commented 6 years ago

Hi @krchristie. For exactly the reason Ruth points out above. Although some specific terms already exist, I'm not sure they are exhaustive. So for a term like 'cellular component organization', to be exhaustive we would need to create a child for every type of cellular component that can be organized.

ValWood commented 6 years ago

> we would need to create a child for every type of cellular component that can be organized.

Surely we need to go beneath the granularity of "GO:0016043 cellular component organization".....see above, this term is rarely used, and rightly so.

Eventually, you will need a precise process. Exactly what is being organized? The mitochondrial membrane? The crista? The respiratory chain? The genome? How is it being organized (by assembly? disassembly? by fusion?). Organization alone is never going to be informative enough to describe a biological process.

I selected these terms very carefully because they should not require much, if any reannotation. However, they will force curators to think about how to annotate precisely and improve annotation consistency.

RLovering commented 6 years ago

The other problem is that many components are dynamic, and it can be difficult to know if proteins are regulating the assembly or disassembly, i.e. positive regulation of assembly may look the same as negative regulation of disassembly.

ValWood commented 6 years ago

I agree that you might not know these specific details. But this isn't the point: I guarantee that you can annotate to a more specific term than GO:0016043 cellular component organization, in a way that makes the annotation more useful.

I had much the same response to this suggestion initially when I wanted to implement at PomBase, and it also very much reminds me of the initial objection to taxon ~distribution~ restriction back in St. Croix. In this case, the objection is probably largely a misunderstanding, from mis-explanation of the procedure by me, so bear with me...

For starters, these checks should initially be a soft check (warning), at least for existing annotation. Already, we (@pgaudet and I) have been making similar annotation blocks surreptitiously (in GO) and nobody appears to have noticed. Did you know, for example, that you can no longer annotate directly to transport? Would you have complained about this if we had announced it? I can see no existing direct annotations.

This method really works as a QC/QA procedure. If groups object to specific rules once implemented, we can then discuss and refine, exactly as we already do with taxon constraints, and as we will do shortly with "matrix intersection rules". Nothing here is set in stone.

The way that this QC process would work is for groups to object to specific rules once implemented, and then we can refine, or even drop the rule if necessary (as with taxon constraints), rather than objecting to the overall concept before any specific rules are implemented, and before you can see if you even have any violations of the rule to object to.
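As a rough sketch of how such a check could behave: a direct annotation to a blocked term gives a warning if it is a legacy annotation and an error if it is new, and a rule that has been dropped after an objection is simply skipped. Everything here is hypothetical (the `Annotation` record, the cutoff date, the choice of which rule is dropped); it shows the logic only, not any existing GO pipeline.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional, Set


@dataclass(frozen=True)
class Annotation:
    gene: str
    term: str       # GO ID the gene product is directly annotated to
    created: date   # date the annotation was made


# Hypothetical rule set: terms blocked for direct annotation.
BLOCKED: Set[str] = {"GO:0016043", "GO:0006996"}
# Hypothetical: a rule withdrawn after a group objected to it.
DROPPED: Set[str] = {"GO:0006996"}
# Hypothetical cutoff: annotations made on or after this date are "new".
HARD_FROM = date(2019, 1, 1)


def check(annotation: Annotation,
          blocked: Set[str],
          dropped: Set[str],
          hard_from: date) -> Optional[str]:
    """Return None, "warning" (soft check) or "error" (hard check)."""
    if annotation.term not in blocked or annotation.term in dropped:
        return None
    return "error" if annotation.created >= hard_from else "warning"


legacy = Annotation("geneA", "GO:0016043", date(2010, 5, 1))
new = Annotation("geneB", "GO:0016043", date(2020, 5, 1))
dropped_rule = Annotation("geneC", "GO:0006996", date(2020, 5, 1))

for a in (legacy, new, dropped_rule):
    print(a.gene, a.term, check(a, BLOCKED, DROPPED, HARD_FROM))
```

Keeping the dropped rules in a separate set means a rule can be withdrawn or reinstated without touching the main block list.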

ValWood commented 6 years ago

~cellular component organisation and biogenesis (0 direct annotations) cellular component organization (2 direct annotations) cellular component maintenance (3 direct annotations)~

reviewed below

ValWood commented 6 years ago

~regulation of cellular component organization (18) positive regulation of cellular component biogenesis (8) regulation of cellular component organization (2) cellular component assembly (7) cellular process 393 localization (0)~

complete

pgaudet commented 6 years ago

Hi @ValWood

This is great - but just to make it easier to follow, can we create a Google spreadsheet? With GO ID- term label - # annotations - # mappings (IPR / SP toGO etc)

Those with 0 annotations we can do rapidly. Thanks, Pascale

ValWood commented 5 years ago

~GO:0008219 cell death (type (need to know it is programmed death… necrosis, apoptosis etc))
GO:0007050 cell cycle arrest (mitotic? meiotic? phase?)
GO:0009987 cellular process (which process?) (393 direct annotations all evidence codes)
GO:0032501 multicellular organismal process (which process?) (2 direct annotations all evidence codes)
GO:0044764 multi organism cellular process (which process?) (0 direct annotations, any evidence code)
GO:0022607 cellular component assembly (of which component?) (7 direct annotations, any evidence code)
GO:0071840 cellular component organisation or biogenesis (of which component?) (0 direct annotations, any evidence code)
GO:0016043 cellular component organization (of which component?) (68 direct annotations, any evidence code)
GO:0044085 cellular component biogenesis (of which component?) (0)
GO:0051179 localization (of what, to where?) (0 direct annotations, any evidence code)
PLUS ALL CORRESPONDING REGULATION TERMS (same rationale)~

reviewed below

ValWood commented 5 years ago

now in a spreadsheet in the QC directory called High_level_terms_not_for_direct_annotation

https://docs.google.com/spreadsheets/d/1B_ItIn5bX_4gj41MQIQoBlpaxLT1co14ALBUYhEQahk/edit#gid=0

ValWood commented 5 years ago

@pgaudet re the recent discussion about high level terms- I suggest we flag the ones in the spreadsheet above which already have this attribute. Then add restrictions to the ones without. After this, we can move on to mine and Christophe's longer lists and add them to this spreadsheet?

Is that a realistic plan?
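The first step of that plan (separating candidate terms that already carry the flag from those that still need it) could be sketched like this. It assumes the flag is recorded as a `subset:` tag on the term stanza in an OBO-format file and that the relevant subset is named `gocheck_do_not_manually_annotate`; both are assumptions for the example. The stanzas in `OBO_TEXT` are invented sample data, not copied from a release, and `flagged_terms` and `split_candidates` are hypothetical helper names.

```python
from typing import Dict, Iterable, List, Set, Tuple

# Invented sample stanzas for illustration; not taken from a real release.
OBO_TEXT = """\
[Term]
id: GO:0016043
name: cellular component organization
subset: gocheck_do_not_manually_annotate

[Term]
id: GO:0070925
name: organelle assembly
"""

# Assumed name of the subset that marks "not for direct annotation".
FLAG_SUBSETS: Set[str] = {"gocheck_do_not_manually_annotate"}


def flagged_terms(obo_text: str, subsets: Set[str]) -> Dict[str, Set[str]]:
    """Map term ID -> matching subset tags found on its [Term] stanza."""
    flagged: Dict[str, Set[str]] = {}
    in_term = False
    term_id = None
    for raw in obo_text.splitlines():
        line = raw.strip()
        if line.startswith("["):
            # A new stanza starts; only [Term] stanzas are of interest.
            in_term = line == "[Term]"
            term_id = None
        elif in_term and line.startswith("id: "):
            term_id = line[len("id: "):]
        elif in_term and term_id and line.startswith("subset: "):
            parts = line[len("subset: "):].split()
            if parts and parts[0] in subsets:
                flagged.setdefault(term_id, set()).add(parts[0])
    return flagged


def split_candidates(candidates: Iterable[str],
                     flagged: Dict[str, Set[str]]) -> Tuple[List[str], List[str]]:
    """Split candidate IDs into (already flagged, still to restrict)."""
    already: List[str] = []
    todo: List[str] = []
    for term in candidates:
        (already if term in flagged else todo).append(term)
    return already, todo


flags = flagged_terms(OBO_TEXT, FLAG_SUBSETS)
already, todo = split_candidates(["GO:0016043", "GO:0070925"], flags)
print("already flagged:", already)
print("still to restrict:", todo)
```

The candidate list would come from the spreadsheet; here it is hard-coded to keep the sketch self-contained.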

pgaudet commented 2 years ago

Other terms in the attached doc High_level_terms_not_for_direct_annotation.xlsx

RLovering commented 2 years ago

OK, I have removed/changed some of the UCL annotations:
cell death: 3 annotations
developmental process: 1 annotation
We have no annotations to regulation of developmental process (+/-), which I assume are not acceptable.

But we have 51 annotations to the regulation of cell death terms - which I assume are not acceptable https://www.ebi.ac.uk/QuickGO/annotations?goUsage=exact&goUsageRelationships=is_a,part_of,occurs_in&goId=GO:0010942,GO:0060548,GO:0010941&assignedBy=ARUK-UCL,Alzheimers_University_of_Toronto,BHF-UCL,HGNC-UCL,ParkinsonsUK-UCL,SynGO-UCL

However, these will take longer to assess and update. Are you planning to create a doc for us to record the edits?

Thanks

Ruth

pgaudet commented 2 years ago

Yes yes, this is a placeholder for now, no immediate action needed (lower priority than other items). Thanks for checking!

Pascale

ValWood commented 1 month ago

Most of the above are now addressed. Revised list below:

GO:0009057 macromolecule catabolic process
GO:0009056 catabolic process
GO:0009058 biosynthetic process
GO:0031326 regulation of cellular biosynthetic process +/-
GO:0070925 organelle assembly
GO:0051640 organelle localization
GO:0033365 protein localization to organelle
GO:2000241 regulation of reproductive process +/-
GO:0008152 metabolic process
GO:0033043 regulation of organelle organization +/-
GO:0033036 macromolecule localization
GO:0044703 multi-organism reproductive process
GO:0019953 sexual reproduction
GO:0030281 structural constituent of cutaneous appendage
GO:0033043 regulation of organelle organization +/-
GO:0010639 negative regulation of organelle organization
GO:0010638 positive regulation of organelle organization
GO:0009889 regulation of biosynthetic process +/-
GO:0022414 reproductive process
GO:0032502 developmental process (we probably need to allow for extensions)
GO:0046907 intracellular transport
GO:0019222 regulation of metabolic process +/-
GO:0009889 regulation of biosynthetic process +/-
GO:0009894 regulation of catabolic process +/-
GO:0032386 regulation of intracellular transport +/-
GO:0036211 protein modification process (we probably need to keep this)
GO:0022607 cellular component assembly
GO:0051128 regulation of cellular component organization +/-
GO:0044087 regulation of cellular component biogenesis +/-
GO:0043954 cellular component maintenance (3 direct annotations)

Check that these are suitable and in the appropriate spreadsheet.