geneontology / go-ontology

Source ontology files for the Gene Ontology
http://geneontology.org/page/download-ontology
Creative Commons Attribution 4.0 International

Modification to AGR slim #13791

Closed hdrabkin closed 6 years ago

hdrabkin commented 7 years ago

This is for Judy

After analysis of experimental annotations from AGR MODs
This is a list of the number of gene products assigned to each category for the AGR organisms plus human:

only experimental evidence
includes inferring over regulates relation
NOT excluded
dual taxon excluded

merged_slim.txt

Attached is the file

judy

ukemi commented 7 years ago

What is the actionable item on this ticket?

vanaukenk commented 7 years ago

Can we discuss this on the manager's call today?

hdrabkin commented 7 years ago

Mary Dolan might be able to enlighten us; I can't find her git id

hdrabkin commented 7 years ago

@judyblake @mdolanme But I can't assign either. Could also do Mary Dolan but can't find an assignable git

pgaudet commented 7 years ago

Hello,

Does this supersede #13013? If so, can we close that other issue?

Thanks, Pascale

pgaudet commented 7 years ago

The action is to update the AGR-slim with the terms in the file merged_slim.txt

pgaudet commented 7 years ago

REMOVED
CC
GO:0016020 membrane
GO:0071944 cell periphery

MF
GO:0003824 catalysis

BP
GO:0071840 cellular organization/biogenesis
GO:0051179 cellular transport/localization
GO:0032502 development
GO:0006259 DNA metabolism

ADDED
CC
GO:0031410 cytoplasmic vesicle
GO:0005783 endoplasmic reticulum
GO:0005768 endosome
GO:0005794 Golgi apparatus
GO:0070013 intracellular organelle lumen
GO:0031967 organelle envelope
GO:0005886 plasma membrane
GO:0045202 synapse
GO:0005773 vacuole

MF
GO:0097367 carbohydrate derivative binding
GO:0030234 enzyme regulator activity
GO:0016787 hydrolase activity
GO:0016874 ligase activity
GO:0008289 lipid binding
GO:0016491 oxidoreductase activity
GO:0000988 transcription factor activity, protein binding
GO:0008134 transcription factor binding
GO:0016740 transferase activity

BP
GO:1901135 carbohydrate derivative metabolic process
GO:0008219 cell death
GO:0030154 cell differentiation
GO:0016043 cellular component organization
GO:0051234 establishment of localization
GO:0042592 homeostatic process
GO:0006629 lipid metabolic process
GO:0097659 nucleic acid-templated transcription

mdolanme commented 7 years ago

Additional changes to AGR slim:

TO BE REMOVED
CC
GO:0070013 intracellular organelle lumen
GO:0032991 macromolecular complex
GO:0031967 organelle envelope
BP
GO:0010467 gene expression
GO:0044281 small molecule metabolic process
GO:0048731 system development

TO BE ADDED
CC
GO:0043234 protein complex
BP
GO:0005975 carbohydrate metabolic process
GO:0009056 catabolic process
GO:0007275 multicellular organism development
GO:0065009 regulation of molecular function
GO:0007049 cell cycle

mdolanme commented 7 years ago

AGR-GOslim.pdf

pgaudet commented 7 years ago

TO BE REMOVED
CC
GO:0070013 intracellular organelle lumen OK
GO:0032991 macromolecular complex OK
GO:0031967 organelle envelope OK
BP
GO:0010467 gene expression OK
GO:0044281 small molecule metabolic process OK
GO:0048731 system development OK

TO BE ADDED
CC
GO:0043234 protein complex OK (however I thought @ukemi was merging this with 'macromolecular complex' - I cannot find the issue anymore though)
BP
GO:0005975 carbohydrate metabolic process OK
GO:0009056 catabolic process OK
GO:0007275 multicellular organism development OK
GO:0065009 regulation of molecular function OK
GO:0007049 cell cycle

vanaukenk commented 7 years ago

Hi,

I've reviewed the C. elegans annotations with respect to the AGR slim and have a few thoughts for additions to the BP and CC slims. Some of these may also warrant taking a look at the ontology to see if there's any missing parentage that could be added.

BP

  1. On the left side of the BP graph, C. elegans has a lot of genes annotated to developmental processes that don't actually map up to 'multicellular organism development' in the ontology. Some of these terms are related to morphogenesis, but this also includes genes annotated to 'aging', which is a huge part of C. elegans research. One possible solution might be to put back 'developmental process' which looks to have been replaced by 'multicellular organism development'.

  2. Locomotion is another term that seems to have quite a few annotations, but it is a direct child of biological_process and doesn't seem to get included otherwise.

  3. Where would 'RNA processing' and its child terms map in the current version of the slim? Does it make sense to include a nucleic acid metabolic process term? That would overlap a bit with the nucleic acid-templated transcription term, though.

CC

  1. Muscle organization is extensively studied in C. elegans (as in other organisms), so we have a lot of annotations to various parts of the contractile fiber. In the ontology, 'contractile fiber' is not connected to 'cytoskeleton', and it looks like many of the annotations we have in this branch of the CC ontology are thus not getting mapped up to any of the slim terms. Perhaps addition of 'contractile fiber' would help with this.

  2. It looks like 'macromolecular complex' has been in and out of the CC, and is currently back out? We have a number of annotations to various children of 'ribonucleoprotein complex' that are not getting mapped to anything because 'macromolecular complex' was replaced by 'protein complex'.

  3. We have annotations to three CC terms that I'm not quite sure how to fit in. Of these three, maybe cell cortex is a reasonable addition, but the other two are probably not warranted and perhaps should be refined more in either the ontology or annotations, if possible.

I should note, though, that as part of the Wnt signaling project, @hattrill and I were discussing annotation to 'cell cortex', and related terms that try to describe the internal region of the cell adjacent to the plasma membrane, and how it can be difficult to decide which of the GO terms we have for this region are appropriate to choose given experimental findings. Some ontology work and annotation guidelines for this area of the CC may also be needed.
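Editorial sketch: the "mapping up" problem described above can be illustrated with a small ancestor walk. This is only a minimal illustration; the parent links in `TOY_PARENTS` are a hand-made toy graph, not the real GO structure, and the function names are hypothetical.

```python
from collections import deque

# Toy parent links (child -> parents). NOT the real GO graph; it only mirrors
# the situation described above, where 'contractile fiber' has no path to
# 'cytoskeleton'.
TOY_PARENTS = {
    "sarcomere": ["contractile fiber"],
    "contractile fiber": ["cellular_component"],
    "microtubule": ["cytoskeleton"],
    "cytoskeleton": ["cellular_component"],
}


def ancestors(term, parents):
    """Return the term plus everything reachable by following parent links."""
    seen = {term}
    queue = deque([term])
    while queue:
        for parent in parents.get(queue.popleft(), []):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return seen


def slim_hits(term, slim, parents):
    """Slim terms that a direct annotation to `term` maps up to."""
    return ancestors(term, parents) & slim


slim = {"cytoskeleton"}
print(slim_hits("microtubule", slim, TOY_PARENTS))  # {'cytoskeleton'}
print(slim_hits("sarcomere", slim, TOY_PARENTS))    # set()
```

In this toy setup, adding 'contractile fiber' to the slim (or adding a parent link in the graph) is what makes the 'sarcomere' annotation visible.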

mdolanme commented 7 years ago

Kimberly's comments on BP along with the MGI-GO group action in response (in bold):

Kimberly's comments on CC along with the MGI-GO group action in response (in bold):

mdolanme commented 7 years ago

As a result of Kimberly's review as described in the last comment, we suggest the following changes:

TO BE REMOVED
BP
GO:0007275 multicellular organism development
GO:0097659 nucleic acid-templated transcription

TO BE ADDED
BP
GO:0032502 developmental process
GO:0006259 DNA metabolic process
GO:0016070 RNA metabolic process

pgaudet commented 7 years ago

TO BE REMOVED
BP
GO:0007275 multicellular organism development OK
GO:0097659 nucleic acid-templated transcription OK

TO BE ADDED
BP
GO:0032502 developmental process OK
GO:0006259 DNA metabolic process OK
GO:0016070 RNA metabolic process OK
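Editorial sketch: because the slim was revised in several rounds, it can help to replay the rounds in order and look at the net result. The two BP rounds below are copied from this thread; the starting set is only a partial, hypothetical stub (the full slim at that point is not listed here), and `apply_changes` is an illustrative helper name.

```python
def apply_changes(slim, rounds):
    """Apply (removed, added) rounds in order and return the resulting set."""
    result = set(slim)
    for removed, added in rounds:
        result -= set(removed)
        result |= set(added)
    return result


# BP changes proposed in this thread, as (removed, added) pairs.
round_1 = (
    ["GO:0010467", "GO:0044281", "GO:0048731"],
    ["GO:0005975", "GO:0009056", "GO:0007275", "GO:0065009", "GO:0007049"],
)
round_2 = (
    ["GO:0007275", "GO:0097659"],
    ["GO:0032502", "GO:0006259", "GO:0016070"],
)

# Hypothetical partial starting point, used only to make the example run.
start = {"GO:0010467", "GO:0044281", "GO:0048731", "GO:0097659"}

final = apply_changes(start, [round_1, round_2])
print(sorted(final))
```

Replaying the rounds shows that GO:0007275 (multicellular organism development) is added in the first round and removed again in the second, so it does not appear in the net result.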

doughowe commented 7 years ago

Here's an idea (I'm not going to claim it's a good one)... Include one block in the ribbon for each GO aspect for an "other" term, which would basically catch all genes that don't map up to another term in the slim. Terms could be like:

Other cellular component
Other biological process
Other molecular function

This way every gene maps up to something, so it can at least be found through the ribbon.

hattrill commented 7 years ago

We have implemented "other xx" ribbon boxes on the FlyBase 2.0 beta release and I quite like it - not only does it show the difference between no annotations vs not slimable annotations, it also gives a good indication that not everything is captured by the slim.

ValWood commented 7 years ago

I agree that it is a good idea. IMHO any slim should also show 2 additional bins:
i) genes which map to terms which are not included in the slim (for that aspect)
ii) genes which do not map to any slim term (for the aspect)

We say this in https://www.ncbi.nlm.nih.gov/pubmed/18475267

Because slim bins are not mutually exclusive, you can't fully evaluate a slim's coverage (and hence meaning) if you do not have these bins (although this does not strictly apply in the case of the ribbon diagrams, which are primarily used as a display tool).

We always display the fission yeast slim with :

Total slimmed gene products (protein and ncRNA): 4576
Gene products with biological process annotation, but not in any of the categories above: 27
Gene products with no biological process annotation: 748

like so: http://preview.pombase.org/browse-curation/fission-yeast-go-slim-terms

I asked the Princeton GO term mapper to include these 2 categories, and because you can upload any slim, you can assess these two bins for your organism with the current AGR slim: http://go.princeton.edu/cgi-bin/GOTermMapper

The matrix is also a useful tool for assessing slims: It allows you to evaluate the overlap between slim categories based on your annotation set (so basically you can assess the "resolution" of the slim for your organism): http://amigo.geneontology.org/matrix
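Editorial sketch: the two extra bins described in this comment can be computed alongside the per-term counts. This is a minimal illustration, not the GO Term Mapper's implementation; it assumes each gene's terms for one aspect have already been propagated to ancestors with the aspect root removed, and the gene ids are made up.

```python
def slim_bins(gene_terms, slim):
    """Split genes into slimmed / annotated-but-unslimmed / unannotated bins.

    gene_terms maps gene id -> set of terms for ONE aspect, already
    propagated to ancestors and with the aspect root removed.
    """
    per_term = {term: set() for term in slim}
    not_in_slim, no_annotation = set(), set()
    for gene, terms in gene_terms.items():
        hits = terms & slim
        if hits:
            # A gene can land in several slim bins; bins are not exclusive.
            for term in hits:
                per_term[term].add(gene)
        elif terms:
            not_in_slim.add(gene)
        else:
            no_annotation.add(gene)
    return per_term, not_in_slim, no_annotation


# Hypothetical genes; GO ids are ones mentioned in this thread.
gene_terms = {
    "geneA": {"GO:0007049"},
    "geneB": {"GO:0010467"},                 # annotated, but term not in slim
    "geneC": set(),                          # no annotation in this aspect
    "geneD": {"GO:0007049", "GO:0006259"},
}
slim = {"GO:0007049", "GO:0006259"}

per_term, not_in_slim, no_annotation = slim_bins(gene_terms, slim)
print({term: len(genes) for term, genes in per_term.items()})
print(len(not_in_slim), len(no_annotation))  # 1 1
```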

pgaudet commented 7 years ago

For information: This is the current AGR slim:

Biological process
behavior
carbohydrate derivative metabolic process
carbohydrate metabolic process
catabolic process
cell cycle
cell death
cell differentiation
cell proliferation
cellular component organization
developmental process
DNA metabolic process
establishment of localization
homeostatic process
immune system process
lipid metabolic process
nervous system process
protein metabolic process
regulation of biological process
regulation of molecular function
reproduction
response to stimulus

Molecular Function
carbohydrate binding
carbohydrate derivative binding
cytoskeletal protein binding
DNA binding
enzyme regulator activity
hydrolase activity
ligase activity
lipid binding
metal ion binding
nucleic acid binding transcription factor activity
oxidoreductase activity
receptor activity
receptor binding
RNA binding

Cellular Component
cell junction
cell projection
chromosome
cytoplasmic vesicle
cytoskeleton
cytosol
endoplasmic reticulum
endosome
extracellular region
Golgi apparatus
mitochondrion
nucleus
plasma membrane
protein complex

ValWood commented 7 years ago

I ran the process terms on pombe proteins

These 701 identifiers were not annotated in the slim, but they had non-root annotations that were not in the slim:

Probably you can get all genes to slim, but the aim is really to get complete coverage by aspect, not only by gene products

ValWood commented 7 years ago

It doesn't produce a very biologically informative process slim for fission yeast

701 identifiers were not annotated in the slim, but they had non-root annotations that were not in the slim:

867 identifiers had no non-root annotations:

GO:0016043 cellular component organization 1209
GO:0050789 regulation of biological process 1069
GO:0019538 protein metabolic process 1005
GO:0051234 establishment of localization 929
GO:0050896 response to stimulus 667
GO:0007049 cell cycle 636
GO:0009056 catabolic process 485
GO:0000003 reproduction 330
GO:0006259 DNA metabolic process 301
GO:1901135 carbohydrate derivative metabolic process 271
GO:0006629 lipid metabolic process 222
GO:0042592 homeostatic process 153
GO:0005975 carbohydrate metabolic process 144
GO:0065009 regulation of molecular function 103
GO:0030154 cell differentiation 84
GO:0032502 developmental process 77
GO:0008219 cell death 4

ValWood commented 7 years ago

For a biologically useful generic slim I would recommend

  1. Exclude "biological regulation" (it isn't informative about specific processes)
  2. Swap protein metabolic process. Protein metabolic process includes "protein modification", which isn't informative about the process (biological role); it's more about the function, i.e. protein modifications participate in ALL processes.
  3. Exclude "response to stimulus". Probably everything is a response to some stimulus (signalling growth and division, transport, gene expression), at least in the way these terms are used in annotation, so the annotations are really an artefact of the curation process.
  4. Exclude "homeostatic process" (see 3).

I don't think it's useful to lump the slim terms for every aspect into a single slim. It encourages bad practice.....

Also I think sometimes we include uninformative terms in slims just to improve coverage, when in fact they are not very biologically meaningful if you are a biologist analysing biological processes.

ValWood commented 7 years ago

Have we got all of the current terms there? I would expect "gene expression" to be included?

ValWood commented 7 years ago

If gene expression is included, much better coverage:

346 identifiers were not annotated in the slim, but they had non-root annotations that were not in the slim

GO:0010467 gene expression 1366

ValWood commented 7 years ago

I thought I saw "small molecule metabolic process" in a previous version too. That would make sense. Otherwise other metabolism terms are required for complete coverage of cellular processes (sulfur compound metabolic process, nucleobase-containing small molecule metabolic process, vitamin metabolic process, cofactor metabolic process, amino acid metabolic process).

pgaudet commented 7 years ago

I think gene expression is covered by 'nucleic acid binding transcription factor activity' and 'DNA metabolic process'. I tried to do all that @mdolanme asked; hopefully it's right. @mdolanme can you please confirm?

Thanks, Pascale

ValWood commented 7 years ago

DNA metabolic process and gene expression should be largely orthogonal....

pgaudet commented 7 years ago

Ah ok yes. I thought it was a child, but it's a child of 'GO:0043170 macromolecule metabolic process'.

Humm.

Pascale

ValWood commented 7 years ago

DNA metabolic process and gene expression should be largely orthogonal.... proof

gene_expression

only 19/1286 gene expression annotated to DNA metabolism
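Editorial sketch: the orthogonality check above is a set intersection between the genes slimmed to two terms. The ids below are synthetic and only sized to reproduce the quoted 19/1286 figure; the 282 extra ids are arbitrary padding, and `overlap` is an illustrative helper name.

```python
def overlap(genes_a, genes_b):
    """Return (shared count, size of first set, fraction of first set shared)."""
    shared = len(genes_a & genes_b)
    fraction = shared / len(genes_a) if genes_a else 0.0
    return shared, len(genes_a), fraction


# Synthetic ids sized to match the counts quoted above (19 shared, 1286 total).
gene_expression = {f"g{i}" for i in range(1286)}
dna_metabolism = {f"g{i}" for i in range(19)} | {f"d{i}" for i in range(282)}

shared, total, fraction = overlap(gene_expression, dna_metabolism)
print(f"{shared}/{total} = {fraction:.1%}")  # 19/1286 = 1.5%
```

Running the same function over every pair of slim terms would give a table comparable in spirit to the AmiGO matrix mentioned earlier in the thread.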

srengel commented 7 years ago

from discussion on today's GO Annotation call, an action item was to have new and/or junior curators review the terms to make sure none are difficult to understand. i asked one new curator and two curation assistants from SGD to have a look. there were no terms that didn't immediately make sense to them. this was not an official analysis by any means, just a quick gut-check. @thomaspd

hattrill commented 7 years ago

I have a couple of suggestions to make the ribbons easier on the eye.

  1. Order the terms in the ribbon such that similar themes are grouped (not alphabetically or by GO id).

  2. The cells can have names that differ from the term in the ribbon slim. We had a PI and some non-GO curators look over cell names and make them more "snappy" when we made ours. May gloss over some details, but makes it more digestible.

I've had a bit of a poke at the slim provided by Pascale, re-ordered the terms and suggested some alternative short cell names below:

MF Term name = short name
ligase activity = ligase
hydrolase activity = hydrolase
oxidoreductase activity = oxidoreductase
enzyme regulator activity = enzyme regulator
receptor activity = receptor
receptor binding
cytoskeletal protein binding
nucleic acid binding transcription factor activity
DNA binding
RNA binding
lipid binding
carbohydrate binding
carbohydrate derivative binding

BP Term name = short name
cell cycle
cell proliferation
cellular component organization = cellular organization
establishment of localization
homeostatic process = homeostasis
developmental process = development
cell death
cell differentiation
reproduction
immune system process = immune system
nervous system process = neurological
behavior
response to stimulus
catabolic process = catabolism
protein metabolic process = protein metabolism
lipid metabolic process = lipid metabolism
DNA metabolic process = DNA metabolism
carbohydrate metabolic process = carbohydrate metabolism
carbohydrate derivative metabolic process = carbohydrate derivative metabolism
regulation of biological process = process regulation
regulation of molecular function = molecular function regulation

CC Term name = short name
extracellular region = extracellular
cytosol
cytoskeleton
mitochondrion
nucleus
chromosome
plasma membrane
cell junction
cell projection
Golgi apparatus
endoplasmic reticulum
endosome
cytoplasmic vesicle
protein complex
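Editorial sketch: the short names suggested above can live purely at the display end as a lookup that falls back to the GO term name. The dictionary holds only a few of the suggested pairs, and `display_label` is a hypothetical helper, not part of any existing ribbon code.

```python
# A few of the "term name = short name" pairs suggested above.
DISPLAY_LABELS = {
    "ligase activity": "ligase",
    "hydrolase activity": "hydrolase",
    "homeostatic process": "homeostasis",
    "developmental process": "development",
    "catabolic process": "catabolism",
    "extracellular region": "extracellular",
}


def display_label(term_name, labels=DISPLAY_LABELS):
    """Return the short display label, or the term name if none is defined."""
    return labels.get(term_name, term_name)


print(display_label("homeostatic process"))  # homeostasis
print(display_label("cell cycle"))           # cell cycle
```

Keeping the mapping outside the ontology leaves the primary term names untouched for curators.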

pgaudet commented 7 years ago

Hi @hattrill and others,

What do you suggest we do about Helen's recommended names? I don't think we should rename the terms, since they help curators ensure consistency. It would be nice to have a synonym type 'display label'; @cmungall is this something we could do?

cmungall commented 7 years ago

Yes, certainly possible.

I think some of these are requests for primary label changes in GO, no need to diverge.

hattrill commented 7 years ago

Just to note: this is more a suggestion of what can be done at the display end rather than something to be embedded within the ontology itself. (It just seemed to fit with the point about reviewing term names in the discussion.)

selewis commented 7 years ago

When the dust has settled on this (hopefully soon) please provide a simple 2-column tab-delimited text file of the ultimate AGR slim - where each row is GOID preferredlabel. Like this: GO:0005975 carbohydrate metabolic process

Should be a complete final list (not subject to change) Be nice if there were spacers between the 3 aspects, but not essential.

NO PDFs (what use are they??!)

And labels alone means I have to individually locate the ID to use in the query. That I could do (might need Pascale's help in some cases), but be fastest if the final list were comprehensive and not a set of changes (have these been made??) and IDs were included.

If someone really wanted to help they could issue a pull request on this file https://github.com/geneontology/ribbon/blob/master/src/data/agr.js after everyone is satisfied with it. Otherwise I'll do the formatting.
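Editorial sketch: the requested file (GOID, tab, preferred label, with a spacer between aspects) could be produced from 3-column records along these lines. The four rows are terms mentioned in this thread and are not the full slim; `write_slim_tsv` is an illustrative name.

```python
import csv
import io


def write_slim_tsv(rows, handle):
    """Write (aspect, go_id, label) rows as 'GOID<TAB>label' lines.

    Rows are expected to be grouped by aspect; a blank line is written
    whenever the aspect changes.
    """
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    previous = None
    for aspect, go_id, label in rows:
        if previous is not None and aspect != previous:
            handle.write("\n")
        writer.writerow([go_id, label])
        previous = aspect


rows = [
    ("BP", "GO:0005975", "carbohydrate metabolic process"),
    ("BP", "GO:0007049", "cell cycle"),
    ("CC", "GO:0005886", "plasma membrane"),
    ("MF", "GO:0016787", "hydrolase activity"),
]

buffer = io.StringIO()
write_slim_tsv(rows, buffer)
print(buffer.getvalue(), end="")
```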

selewis commented 7 years ago

p.s. @ValWood I'll also see about adding those 2 extra bins you mentioned, but at a somewhat later stage...

judyblake commented 7 years ago

Good progress here. I think that the AGR ribbon and GO ribbon 'can' be different since they slim up different sets of annotations. The visualization remains the same. On the agenda for the Oct meeting in Cambridge.

-- Judy

cmungall commented 7 years ago

On 24 Aug 2017, at 18:48, Suzanna Lewis wrote:

please provide a simple 2-column tab-delimited text file of the ultimate AGR slim

Should be done generically for all slims and part of main release, Eric can work on this

mdolanme commented 6 years ago

NO PDFs (what use are they??!)

I agree with Suzi -- but I believe the only PDF added here is mine containing the slides describing the slim construction process that I presented to the July 25th GO annotation call. I (or someone else) can easily provide the formatted list as Suzi suggests. The first list I submitted (via Harold on July 3rd) was a 3-column tab-delimited text file: aspect, go_id, go_term.

selewis commented 6 years ago

Thanks Mary - will that be the modified list per all the discussion and feedback received?

Actually the code is organized so that a 'slim' parameter can be passed in via the URL, so we can perhaps use this to let people directly compare what the ribbon looks like with different slims.

-S

p.s. Any chance you could provide JSON?
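Editorial sketch: a 3-column tab-delimited list (aspect, go_id, go_term) converts to JSON with the standard library. The key names `aspect`, `id` and `label` are assumptions for illustration; they are not taken from the ribbon's agr.js, whose actual structure is not shown in this thread.

```python
import csv
import io
import json


def slim_tsv_to_json(text):
    """Convert 'aspect<TAB>go_id<TAB>go_term' lines to a JSON array of objects.

    Blank lines (for example spacers between aspects) are skipped.
    """
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    records = [
        {"aspect": aspect, "id": go_id, "label": label}
        for aspect, go_id, label in (row for row in reader if row)
    ]
    return json.dumps(records, indent=2)


sample = "BP\tGO:0007049\tcell cycle\n\nCC\tGO:0005886\tplasma membrane\n"
print(slim_tsv_to_json(sample))
```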

mdolanme commented 6 years ago

The up-to-date AGR slim including the modifications from discussion and feedback is in the Gene Ontology subset 'goslim_agr'

selewis commented 6 years ago

Attachment? (I do that often)

mdolanme commented 6 years ago

I did not upload it when I posted my comment but here it is. AGRslim-from-GO.txt https://github.com/geneontology/go-ontology/files/1257209/AGRslim-from-GO.txt

selewis commented 6 years ago

Thanks Mary, that's perfect

ValWood commented 6 years ago

Is the text file linked above still the current AGR slim?

ValWood commented 6 years ago

Can it be in an OBO file/ updated with the other slims?

mdolanme commented 6 years ago

It is currently in the OBO file as subsetdef: goslim_agr "AGR slim". I believe it will be maintained and updated as are other slims.
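Editorial sketch: terms belonging to a subset can be pulled out of OBO text by reading the `subset:` tags of each `[Term]` stanza. This is a minimal line-based reader that assumes stanzas are separated by blank lines; the sample text is a hand-written fragment for illustration, not an excerpt from the released GO file.

```python
def terms_in_subset(obo_text, subset):
    """Return (id, name, namespace) for each [Term] stanza tagged with `subset`."""
    found = []
    for stanza in obo_text.split("\n\n"):
        lines = stanza.strip().splitlines()
        if not lines or lines[0] != "[Term]":
            continue
        tags = {}
        subsets = set()
        for line in lines[1:]:
            key, _, value = line.partition(": ")
            if key == "subset" and value:
                # Keep only the subset id, dropping any trailing comment.
                subsets.add(value.split()[0])
            else:
                tags.setdefault(key, value)
        if subset in subsets:
            found.append((tags.get("id"), tags.get("name"), tags.get("namespace")))
    return found


# Hand-written illustrative fragment (subset membership here is made up).
SAMPLE_OBO = """subsetdef: goslim_agr "AGR slim"

[Term]
id: GO:0007049
name: cell cycle
namespace: biological_process
subset: goslim_agr

[Term]
id: GO:0010467
name: gene expression
namespace: biological_process
"""

print(terms_in_subset(SAMPLE_OBO, "goslim_agr"))
```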

ValWood commented 6 years ago

brilliant, thanks

ValWood commented 6 years ago

@mdolanme So should it be in here? http://www.geneontology.org/ontology/subsets/ I don't see it?

ValWood commented 6 years ago

only experimental evidence

I noticed this in a comment at the top, posted by @hdrabkin but possibly from @judyblake. So do you only use EXP annotation? Why is that? For a slim you would usually include everything to get the best coverage; otherwise many well-studied things will not slim (even ribosomal proteins etc. are often not experimentally annotated).
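Editorial sketch: the effect of restricting to experimental evidence can be checked by counting annotated genes with and without the filter. The sketch reads GAF-style tab-separated lines (column 2 is the object id, column 4 the qualifier, column 7 the evidence code). The set of codes treated as experimental is an assumption about what the analysis used, and the sample lines, gene ids and references are made up.

```python
# Assumed set of experimental evidence codes (high-throughput codes omitted).
EXPERIMENTAL = {"EXP", "IDA", "IPI", "IMP", "IGI", "IEP"}


def annotated_genes(gaf_lines, evidence=None):
    """Gene ids with at least one non-NOT annotation.

    If `evidence` is given, only annotations with those codes count.
    """
    genes = set()
    for line in gaf_lines:
        if line.startswith("!"):
            continue  # header or comment line
        cols = line.rstrip("\n").split("\t")
        gene, qualifier, code = cols[1], cols[3], cols[6]
        if "NOT" in qualifier.split("|"):
            continue
        if evidence is None or code in evidence:
            genes.add(gene)
    return genes


# Made-up lines, trimmed to the first seven GAF columns.
sample = [
    "!gaf-version: 2.1",
    "DB\tgene1\tsym1\t\tGO:0007049\tREF:x\tIDA",
    "DB\tgene2\tsym2\t\tGO:0006259\tREF:x\tIEA",
    "DB\tgene3\tsym3\tNOT\tGO:0007049\tREF:x\tIMP",
]

print(sorted(annotated_genes(sample)))                # ['gene1', 'gene2']
print(sorted(annotated_genes(sample, EXPERIMENTAL)))  # ['gene1']
```

Comparing the two counts per organism would show how many gene products drop out of the slim when only experimental annotations are used.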

mdolanme commented 6 years ago

I'm not familiar with the workflow for the slims. Pascale added the AGR slim terms and revised according to comments listed above following discussion of additions and removals. But, I agree, it should be handled in the same way as other slims -- at least, when the dust settles ;)