geneontology / go-ontology

Source ontology files for the Gene Ontology
http://geneontology.org/page/download-ontology
Creative Commons Attribution 4.0 International

Group "complex terms" under nucleus #19952

Closed ValWood closed 3 years ago

ValWood commented 4 years ago

Would it be possible to group all of the complex terms directly under "nucleus" under a grouping term "nuclear complex"? (There is already a similar grouping term for "nuclear membrane complex".) I ask because there are so many complexes here that it is difficult to locate the nuclear parts hidden among them, and our users are not finding them.

This is only a few:

[Screenshot: 2020-09-06 at 09 33 06]

pgaudet commented 4 years ago

Some of them are also prokaryotic; for example, the MutL and MutS complexes are, I think, also found in bacteria.

@AndreaAuchincloss

Would you please have a look at this list to see if you spot any bacterial complexes?

Thanks, Pascale

pgaudet commented 4 years ago

@keseler if you also want to comment, that'd be much appreciated!

AndreaAuchincloss commented 4 years ago

MutL and MutS proteins exist in bacteria and archaea, and are similar enough to their eukaryotic counterparts that they share InterPro (etc.) domains. In E. coli the complex is different from that in eukaryotes: there is a MutHLS complex annotated in EcoCyc (there are other subunits; according to Paul Modrich's Nobel lecture, UvrD and some exo- and endonucleases are also involved). MutH does not exist in eukaryotes. None of the 5 Mut terms Val proposes to group under "nuclear complex" would be appropriate for this bacterial complex anyway, because they're too eukaryotic just by definition.

The day someone wants to annotate the bacterial mismatch repair complex in GO, they could update GO:1990710 (MutS complex; it has E. coli in the comments), so grouping the above 5 MutL and MutS terms under nuclear complex doesn't pose a problem for me. You might want to alter the definitions so they are explicitly eukaryotic (although that should be evident from the ancestor chart).

I looked at the other child terms of GO:0005634 (nucleus) and checked a few of them for bacterial annotation. None of them ring a bell as being bacterial; my knowledge is NOT encyclopedic, so I may have missed something.

In summary as far as I can tell this grouping is fine.

pgaudet commented 3 years ago

I suppose we'll want ER complex, mitochondria complex, chloroplast complex ?

ValWood commented 3 years ago

This would be useful for curators. It's really hard to see cellular locations when drilling down because of all the complex terms.

keseler commented 3 years ago

Sorry this took me a while. I agree with Andrea, the complex terms in the list seem fine.

If you are going to create additional higher-level complex terms for the locations of the complexes, how about cytoplasmic complex, membrane complex, periplasmic complex, extracellular complex? These would be useful for prokaryotes.

krchristie commented 3 years ago

We should think about this idea very carefully. I seem to recall that we used to try to classify complexes by location, and then when a complex moves around and can be in more than one place, we ended up with multiple terms for a complex depending on its localization. I'm just not sure that this info really belongs in the ontology.

deustp01 commented 3 years ago

I'm just not sure that this info really belongs in the ontology.

Isn't that the situation that now can be handled as an annotation in GO-CAM, so there's no longer a need for composed terms like nuclear_ribosomal subunit and also cytosolic_ribosomal subunit and ER-associated_ribosomal subunit?

ValWood commented 3 years ago

I find the groupings quite useful for ontology browsing. Without some grouping, it is difficult to locate the non-complex terms buried in the mass of complexes. In addition, these complexes have the parent "nucleus", so by this logic we should move all complexes out from under any location, which would bump them all up to the root node; I'm not sure this was ever the plan?

I thought the plan was that if a complex had multiple locations, then it would not get a location specific term, but if the only location of function for a complex was a particular compartment it could be added.

IIRC the exact problem was that we have "duplicate" terms for complexes at different locations. For example, we have "nuclear proteasome" and "cytosolic proteasome". The discussion I remember was to merge these into a single proteasome term with no location-specific version (which I now see is the issue that @deustp01 refers to).

Also, we already have location-specific groupings, for example:

GO:0106083    nuclear membrane protein complex
GO:0072546    ER membrane protein complex
GO:0098799    outer mitochondrial membrane protein complex

It seems reasonable that if a complex has a single 'resident' location, then an ancestor to that location fills in annotation gaps and helps annotation consistency? If these groupings are removed, I will need to make sure all of the complexes also have their annotated location, because oftentimes it is implicit from the complex's position in the hierarchy.

In summary, if these complexes can't be housed under a "nuclear complex" grouping term, they shouldn't be under "nucleus" in the first place...

ValWood commented 3 years ago

@krchristie do you remember when the discussion about merging nuclear and cytoplasmic versions of the same complex occurred? I know we have discussed a few times and agreed to implement (over the years). However, I cannot find any relevant GO ticket for this....

@vanaukenk maybe you remember this?

krchristie commented 3 years ago

While I understand that it can be helpful when browsing to have things grouped under a given term, it became apparent to me when examining ChEBI roles that we have to think about these groupings from a couple different perspectives. So, if we place complexes that are only ever found in the nucleus under the term nucleus, but we do not place terms that have multiple locations under that term, then the list of complexes under the term nucleus is only a partial list. This is a problem for browsing if someone is thinking that the complex they are looking for is located in the nucleus without knowing that it is also found in another location as well.

More problematically, when someone does an enrichment analysis and looks at the term nucleus, they might expect they would get everything that is present in the nucleus, but they will not get the gene products annotated to complexes that are in the nucleus only some of the time and somewhere else at other times.

I am wondering if it would be best if we did not assign complexes to any cellular structures within the ontology.
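To make the enrichment concern above concrete, here is a minimal Python sketch (not GO tooling). The tiny graph, the gene names, and the merged "proteasome complex" term with no location parent are all assumptions chosen to mirror the shape of the discussion. It shows that a gene product annotated only to a complex that sits outside "nucleus" is not picked up when annotations are propagated to "nucleus", unless a curator adds the location separately.

```python
# Toy child -> parents edges; is_a and part_of are both followed.
# The graph is hypothetical and only mirrors the shape of the discussion.
PARENTS = {
    "nuclear cohesin complex": {"nucleus", "protein-containing complex"},
    "proteasome complex": {"protein-containing complex"},  # no location parent
    "nucleus": {"cellular_component"},
    "protein-containing complex": {"cellular_component"},
    "cellular_component": set(),
}

# Hypothetical annotations: gene product -> directly annotated terms.
ANNOTATIONS = {
    "geneA": {"nuclear cohesin complex"},
    "geneB": {"proteasome complex"},             # location annotation omitted
    "geneC": {"proteasome complex", "nucleus"},  # curator added the location
}


def ancestors(term):
    """Return the term plus every term reachable through parent edges."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(PARENTS.get(current, ()))
    return seen


def genes_under(term):
    """Gene products whose annotations propagate up to `term`."""
    return {
        gene
        for gene, terms in ANNOTATIONS.items()
        if any(term in ancestors(t) for t in terms)
    }


print(sorted(genes_under("nucleus")))  # ['geneA', 'geneC']
```

In this toy setup geneB is missed for "nucleus" even though its complex may be nuclear some of the time, which is the gap an explicit location annotation would have to fill.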

krchristie commented 3 years ago

do you remember when the discussion about merging nuclear and cytoplasmic versions of the same complex occurred? I know we have discussed a few times and agreed to implement (over the years). However, I cannot find any relevant GO ticket for this....

@vanaukenk maybe you remember this?

I think this was quite a long time ago, probably when I was at SGD, so more than 8 years ago. I remember pretty clearly that the discussion of a related issue occurred when I was at SGD: terms like these, for the TFIIH core complex when it is present in either of the two different complexes that the TFIIH core can be part of. Note that these specific subportion terms account for only a fraction of the total annotations to a TFIIH core complex term, with the "portion of NEF3 complex" term not used at all.

-- transcription factor TFIIH core complex (721 annotations, 27 exp)
--- core TFIIH complex portion of holo TFIIH complex (17 annots, 3 exp)
--- core TFIIH complex portion of NEF3 complex (0 annotations)

krchristie commented 3 years ago

@ValWood - when you want to tag me, the appropriate handle is @krchristie. You tagged a different K Christie above.

ValWood commented 3 years ago

So, if we place complexes that are only ever found in the nucleus under the term nucleus, but we do not place terms that have multiple locations under that term, then the list of complexes under the term nucleus is only a partial list.

but this is true for every term in every ontology?

This is a problem for browsing if someone is thinking that the complex they are looking for is located in the nucleus without knowing that it is also found in another location as well.

I see your point, but if you were looking for a specific complex you would find it by searching. This could be mitigated by defining location-specific grouping terms as "A complex for which the only location of action is the blah" (or similar).

More problematically, when someone does an enrichment analysis and looks at the term nucleus, they might expect they would get everything that is present in the nucleus, but they will not get the gene products annotated to complexes that are in the nucleus only some of the time and somewhere else at other times.

but here the onus is on the curator to annotate the correct location (which usually comes from different experiments anyway). You can't depend on a complex to provide the location, but it is a nice 'backstop' if the location annotation is omitted. If we removed the 'location specific complexes' I'm sure we would lose a lot of valuable annotation.

I am wondering if it would be best if we did not assign complexes to any cellular structures within the ontology.

Pascale raised this option yesterday and it seems to be something under discussion (i.e. putting complexes in their own ontology branch, which is essentially what you are suggesting).

I would support this change but there would be a lot of "gap filling annotation" required. The other main problem I envisage is whether all complexes (kinetochore, ribosome, spliceosome?) would be included

ValWood commented 3 years ago

I remember the related discussion about TFIIH too, but I also specifically remember a discussion about multi-location versions of the same complex

ValWood commented 3 years ago

Whilst looking for tickets with the label "GOC_meeting" I stumbled across the ticket about merging nuclear and cytoplasmic versions of the same complex!!! It was opened in Nov 2016

https://github.com/geneontology/go-ontology/issues/12833

ValWood commented 3 years ago

I will try to explain the problem for curators. You want to see if your specific nuclear location has already been described. However, you don't know the terminology that might be used, so you look at the descendants of nucleus. This is what you see. It would be really nice if you could just see the locations without needing to browse the complexes.

It did not seem so controversial here to add a complex grouping term because a) they are already under nucleus and under complex, and b) these location-specific grouping terms exist elsewhere.

GO:0002111    BRCA2-BRAF35 complex part_of
GO:0031601    nuclear proteasome core complex part_of
GO:0043073    germ cell nucleus is_a
GO:0031519    PcG protein complex part_of
GO:0070532    BRCA1-B complex part_of
GO:0034981    FHL3-CREB complex part_of
GO:0071664    catenin-TCF7L2 complex part_of
GO:0097572    right nucleus is_a
GO:0000790    nuclear chromatin part_of
GO:0031613    nuclear proteasome regulatory particle, lid subcomplex part_of
GO:0000794    condensed nuclear chromosome part_of
GO:0071204    histone pre-mRNA 3'end processing complex part_of
GO:0005958    DNA-dependent protein kinase-DNA ligase 4 complex part_of
GO:1990590    ATF1-ATF4 transcription factor complex part_of
GO:0000798    nuclear cohesin complex part_of
GO:0031380    nuclear RNA-directed RNA polymerase complex part_of
GO:0000214    tRNA-intron endonuclease complex part_of
GO:0033063    Rad51B-Rad51C-Rad51D-XRCC2 complex part_of
GO:0043599    nuclear DNA replication factor C complex part_of
GO:0031039    macronucleus is_a
GO:1990477    NURS complex part_of
GO:0000109    nucleotide-excision repair complex part_of
GO:0034692    E.F.G complex part_of
GO:0071144    heteromeric SMAD protein complex part_of
GO:0035145    exon-exon junction complex part_of
GO:0070313    RGS6-DNMT1-DMAP1 complex part_of
GO:0031533    mRNA cap methyltransferase complex part_of
GO:0031598    nuclear proteasome regulatory particle part_of
GO:0031510    SUMO activating enzyme complex part_of
GO:0098537    lobed nucleus is_a
GO:0034978    PDX1-PBX1b-MRG1 complex part_of
GO:0062128    MutSgamma complex part_of
GO:0033597    mitotic checkpoint complex part_of
GO:0005635    nuclear envelope part_of
GO:0000228    nuclear chromosome part_of
GO:0043564    Ku70:Ku80 complex part_of
GO:0110092    nucleus leading edge part_of
GO:0070516    CAK-ERCC2 complex part_of
GO:0070767    BRCA1-Rad51 complex part_of
GO:0070418    DNA-dependent protein kinase complex part_of
GO:0070354    GATA2-TAL1-TCF3-Lmo2 complex part_of
GO:0005677    chromatin silencing complex part_of
GO:0043076    megasporocyte nucleus is_a
GO:0070531    BRCA1-A complex part_of
GO:0034980    FHL2-CREB complex part_of
GO:1990513    CLOCK-BMAL transcription complex part_of
GO:0097571    left nucleus is_a
GO:0033620    Mei2 nuclear dot complex part_of
GO:0030870    Mre11 complex part_of
GO:0032116    SMC loading complex part_of
GO:0031618    nuclear pericentric heterochromatin part_of
GO:0031981    nuclear lumen part_of
GO:0048353    primary endosperm nucleus is_a
GO:0035059    RCAF complex part_of
GO:0000943    retrotransposon nucleocapsid part_of
GO:0033064    XRCC2-RAD51D complex part_of
GO:0030689    Noc complex part_of
GO:1990453    nucleosome disassembly/reassembly complex part_of
GO:0051457    maintenance of protein location in nucleus occurs_in
GO:1990378    upstream stimulatory factor complex part_of
GO:0005666    RNA polymerase III complex part_of
GO:0005681    spliceosomal complex part_of
GO:0000346    transcription export complex part_of
GO:0000418    RNA polymerase IV complex part_of
GO:0071027    nuclear RNA surveillance occurs_in
GO:0031595    nuclear proteasome complex part_of
GO:0043601    nuclear replisome part_of
GO:0000176    nuclear exosome (RNase complex) part_of
GO:0070876    SOSS complex part_of
GO:0046818    dense nuclear body part_of
GO:0005697    telomerase holoenzyme complex part_of
GO:0030532    small nuclear ribonucleoprotein complex part_of
GO:0070353    GATA1-TAL1-TCF3-Lmo2 complex part_of
GO:0033203    DNA helicase A complex part_of
GO:0089701    U2AF complex part_of
GO:0070847    core mediator complex part_of
GO:0070557    PCNA-p21 complex part_of
GO:0034064    Tor2-Mei2-Ste11 complex part_of
GO:0033260    nuclear DNA replication occurs_in
GO:0042405    nuclear inclusion body part_of
GO:1990512    Cry-Per complex part_of
GO:0032301    MutSalpha complex part_of
GO:0110093    nucleus lagging edge part_of
GO:0048189    Lid2 complex part_of
GO:0031040    micronucleus is_a
GO:0070274    RES complex part_of
GO:0032389    MutLalpha complex part_of
GO:0034753    nuclear aryl hydrocarbon receptor complex part_of
GO:0140510    mitotic nuclear bridge part_of
GO:0070421    DNA ligase III-XRCC1 complex part_of
GO:0070467    RC-1 DNA recombination complex part_of
GO:0031610    nuclear proteasome regulatory particle, base subcomplex part_of
GO:0033065    Rad51C-XRCC3 complex part_of
GO:0046536    dosage compensation complex part_of
GO:1990433    CSL-Notch-Mastermind transcription factor complex part_of
GO:0033167    ARC complex part_of
GO:0032039    integrator complex part_of
GO:0005640    nuclear outer membrane part_of
GO:1990354    activated SUMO-E1 ligase complex part_of
GO:0043224    nuclear SCF ubiquitin ligase complex part_of
GO:0000818    nuclear MIS12/MIND complex part_of
GO:0032807    DNA ligase IV complex part_of
GO:1990589    ATF4-CREB1 transcription factor complex part_of
GO:0000347    THO complex part_of
GO:0000780    condensed nuclear chromosome, centromeric region part_of
GO:0000419    RNA polymerase V complex part_of
GO:0055029    nuclear DNA-directed RNA polymerase complex part_of
GO:0000152    nuclear ubiquitin ligase complex part_of
GO:0070877    microprocessor complex part_of
GO:0000784    nuclear chromosome, telomeric region part_of
GO:0031607    nuclear proteasome core complex, beta-subunit complex part_of
GO:0048555    generative cell nucleus is_a
GO:0000788    nuclear nucleosome part_of
GO:0070390    transcription export complex 2 part_of
GO:1905754    ascospore-type prospore nucleus is_a
GO:0019908    nuclear cyclin-dependent protein kinase holoenzyme complex part_of
GO:1902375    nuclear tRNA 3'-trailer cleavage, endonucleolytic occurs_in
GO:0005880    nuclear microtubule part_of
GO:0070310    ATR-ATRIP complex part_of
GO:1902377    nuclear rDNA heterochromatin part_of
GO:0045120    pronucleus is_a
GO:0070533    BRCA1-C complex part_of
GO:0071033    nuclear retention of pre-mRNA at the site of transcription occurs_in
GO:0030895    apolipoprotein B mRNA editing enzyme complex part_of
GO:0062119    LinE complex part_of
GO:0070693    P-TEFb-cap methyltransferase complex part_of
GO:0032302    MutSbeta complex part_of
GO:0033062    Rhp55-Rhp57 complex part_of
GO:0033066    Rad51B-Rad51C complex part_of
GO:0043596    nuclear replication fork part_of
GO:0046808    assemblon part_of
GO:0070552    BRISC complex part_of
GO:0032806    carboxy-terminal domain protein kinase complex part_of
GO:0097165    nuclear stress granule part_of
GO:0008180    COP9 signalosome part_of
GO:0032390    MutLbeta complex part_of
GO:0000439    transcription factor TFIIH core complex part_of
GO:0000783    nuclear telomere cap complex part_of
GO:0031604    nuclear proteasome core complex, alpha-subunit complex part_of
GO:0032545    CURI complex part_of
GO:0031499    TRAMP complex part_of
GO:0048556    microsporocyte nucleus is_a
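As a rough illustration of the browsing problem, a listing in the format above can be split into buckets so the non-complex locations stand out. This is only a sketch: the bucket names and the rule "the name contains the word complex" are assumptions, and that rule would miss complexes whose names lack the word (for example "nuclear proteasome regulatory particle").

```python
import re

# A few lines copied from the listing above.
SAMPLE = """\
GO:0002111    BRCA2-BRAF35 complex part_of
GO:0043073    germ cell nucleus is_a
GO:0005635    nuclear envelope part_of
GO:0031981    nuclear lumen part_of
GO:0051457    maintenance of protein location in nucleus occurs_in
GO:0032301    MutSalpha complex part_of
"""


def parse_line(line):
    """Split 'GO:ID  name relation' into (id, name, relation)."""
    term_id, rest = line.split(None, 1)
    name, relation = rest.rsplit(None, 1)
    return term_id, name.strip(), relation


def bucket(name, relation):
    """Heuristic grouping; the bucket labels are illustrative only."""
    if relation == "occurs_in":
        return "process"
    if relation == "is_a":
        return "nucleus subtype"
    if re.search(r"\bcomplex\b", name):
        return "complex"
    return "other part"


def group_terms(text):
    """Map each bucket to the term names that fall into it, in input order."""
    groups = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        _, name, relation = parse_line(line)
        groups.setdefault(bucket(name, relation), []).append(name)
    return groups


for label, names in group_terms(SAMPLE).items():
    print(f"{label}: {', '.join(names)}")
```

Run on the full listing, the "other part" bucket would approximate the view of nuclear locations without the complexes, subject to the limits of the name-based heuristic.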
ValWood commented 3 years ago

if we place complexes that are only ever found in the nucleus under the term nucleus, but we do not place terms that have multiple locations under that term, then the list of complexes under the term nucleus is only a partial list. This is a problem for browsing if someone is thinking that the complex they are looking for is located in the nucleus without knowing that it is also found in another location as well.

This is what normally happens, for example for all DNA replication complexes. These cannot be placed under nucleus, so the onus is on the curator to make the appropriate nuclear annotation

[Screenshot: 2020-09-23 at 08 40 25]

krchristie commented 3 years ago

I will try to explain the problem for curators. You want to see if your specific nuclear location has already been described. However, you don't know the terminology that might be used, so you look at the descendants of nucleus. This is what you see. It would be really nice if you could just see the locations without needing to browse the complexes.

@ValWood - I totally get this problem. However, I think you also understand the ontological reasons why trying to code locations of complexes in the ontology is not a solution: complexes like the DNA replication complexes you mentioned cannot be placed under the term nucleus, because not all DNA replication complexes are present in a nucleus. Then, when we try to solve that issue by creating terms for things like nuclear DNA replication complex and cytoplasmic DNA replication complex, curators rarely use these terms, because they get to the phrase that matches what they see in the paper, i.e. DNA replication complex, and never even notice the more granular terms that have locations coded into them.

Personally, I think that we have already seen that trying to make these location coded complex terms is not an effective solution to the problem of helping curators find a complex by location. The constraints of making the ontology always true really make it problematic to use as a browsing tool. I think we need to come up with a different solution to help curators find this kind of information. I think that the Protein2GO suggestions of co-annotations might be a good direction to think about. It would be cool if curators/users could easily browse/search for complex terms cross-referenced to existing annotations, e.g. complexes known to be found in the nucleus. We probably also need to change the paradigm so that instead of trying to create one term that does it all, we will use a set of terms. GO-CAMs allow us to do this.

ValWood commented 3 years ago

I don't think we should have location-specific complexes though? But if a complex is in the ontology under a location, it should be under a location that is always true. This is a slightly orthogonal discussion to this request, which is to group the existing terms (there is no change in any meaning by doing this).

If complexes remain in the location branch they should be in the correct place. An alternative is to move the complex terms outside of the CC aspect (i.e separate complexes and locations).

deustp01 commented 3 years ago

I don't think we should have location specific complexes though

Isn't this the point (as Karen said two comments up): the issue should be handled by annotation (complex (GO:cell_component) is part_of GO:cell_component), not by creation of new compound ontology terms?

And to get fussy, will those hypothetical always-in-the-nucleus terms behave properly for complexes that occur in taxa with open mitosis?

ValWood commented 3 years ago

I am confused. We are trying to remove the compound ontology terms.

The new ticket is here: https://github.com/geneontology/go-ontology/issues/20000

replacing the ticket that was opened in 2016 https://github.com/geneontology/go-ontology/issues/12833

But the existence of "location specific complex terms" is a different issue from the suggestion in this ticket, which is to group the complexes that are already under nucleus and still belong here (if they don't belong here, the nuclear parentage should be removed).

An alternative suggestion is to move all of the complexes out of CC into their own GO aspect. This is unlikely to happen immediately.

Whatever happens in the long term it would be useful not to see a long list of complexes directly under nucleus, and the timescales for the other issues will likely be much longer.

Note that all of these complexes are already under nucleus, so grouping them only follows patterns used elsewhere.

Merging complexes that are not always nuclear (see https://github.com/geneontology/go-ontology/issues/20000) will result in their removal from under nucleus. This should not be surprising to curators. If a protein complex has multiple locations of action it cannot exist in the ontology as a descendant of one of those locations.
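The rule in the previous paragraph can be written as a small consistency check. The sketch below is an assumption-laden illustration, not an existing GO QC rule: the two dictionaries, the helper name `misplaced`, and the example locations are hypothetical.

```python
# Hypothetical data: where each complex is known to act.
OBSERVED_LOCATIONS = {
    "spliceosomal complex": {"nucleus"},
    "proteasome complex": {"nucleus", "cytosol"},
}

# Hypothetical ontology placement: the location each complex sits under
# (None would mean the complex has no location parent).
LOCATION_PARENT = {
    "spliceosomal complex": "nucleus",
    "proteasome complex": "nucleus",  # breaks the rule: also acts in the cytosol
}


def misplaced(observed, parents):
    """Complexes placed under a location that is not their only location."""
    return sorted(
        name
        for name, parent in parents.items()
        if parent is not None and observed.get(name, set()) != {parent}
    )


print(misplaced(OBSERVED_LOCATIONS, LOCATION_PARENT))  # ['proteasome complex']
```

A complex flagged this way would lose its location parent after merging, and its location would then need to come from annotation instead.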

This issue is about something else...