geneontology / go-annotation

This repository hosts the tracker for issues pertaining to GO annotations.
BSD 3-Clause "New" or "Revised" License

Review annotations using extension 'occurs_at' #2584

Closed pgaudet closed 1 year ago

pgaudet commented 5 years ago

Hello,

Occurs_at is not in the Relations Ontology, and its intended usage is the same as 'occurs_in'; we will therefore deprecate occurs_at. Annotations should be reviewed and moved as suggested, if possible:

https://docs.google.com/spreadsheets/d/1LMsw75fwfKBi5H-BFt3opKVV1VuEcT7HUE34bHtOSq4/edit#gid=0

Please write a comment if these suggestions don't work for some of your annotations.

Impacts:

AgBase Alzheimers_University_of_Toronto ARUK-UCL BHF-UCL CAFA dictyBase GO_Central MGI NTNU_SB ParkinsonsUK-UCL PomBase SGD UniProt

colinlog commented 2 years ago

image

RLovering commented 2 years ago

as I wrote above: Annotations to GO:0000976 transcription cis-regulatory region binding should use has_input of the SO ID for the motif target.

other DNA binding terms can also include has_input SO ID for the motif target.

pgaudet commented 2 years ago

RLovering commented 2 years ago

370 UCL occurs_at annotations to revise: https://www.ebi.ac.uk/QuickGO/annotations?extension=occurs_at&assignedBy=BHF-UCL,ParkinsonsUK-UCL

266 UCL occurs_at(SO: annotations to revise: https://www.ebi.ac.uk/QuickGO/annotations?extension=occurs_at(SO:&assignedBy=BHF-UCL,ParkinsonsUK-UCL

of which 105 UCL annotations to revise: TF regulator activity or child terms with occurs at SO ID; the occurs_at(SO ID) AE needs to be removed from all of these: https://www.ebi.ac.uk/QuickGO/annotations?extension=occurs_at&assignedBy=BHF-UCL,ParkinsonsUK-UCL&goId=GO:0140110&goUsageRelationships=is_a,part_of,occurs_in&goUsage=descendants

of which 135 UCL annotations to revise: DNA binding or child terms with occurs at SO ID; the occurs_at(SO ID) AE needs to be changed to has_input if these are motifs: https://www.ebi.ac.uk/QuickGO/annotations?extension=occurs_at&assignedBy=BHF-UCL,ParkinsonsUK-UCL&goId=GO:0003677&goUsageRelationships=is_a,part_of,occurs_in&goUsage=descendants

of which 53 UCL annotations include E-box binding SO - and often to E-box binding GO annotation: https://www.ebi.ac.uk/QuickGO/annotations?extension=occurs_at(SO:0001158&goId=GO:0003677&goUsageRelationships=is_a,part_of,occurs_in&goUsage=descendants

change these to has_input(SO:0001158)

of which 12 UCL annotations include telomeric_D_loop: https://www.ebi.ac.uk/QuickGO/annotations?extension=occurs_at(SO:0002171

change these to has_input(SO:0002171)

DONE: 4 UCL annotations manual delete SO:0100017 polypeptide_conserved_motif https://www.ebi.ac.uk/QuickGO/annotations?extension=occurs_at(SO:0100017

DONE: 1 UCL annotation manual delete SO:0100021 polypeptide_conserved_motif https://www.ebi.ac.uk/QuickGO/annotations?extension=occurs_at(SO:0100021

DONE: of which 5 UCL TF activity annotations include E-box binding SO https://www.ebi.ac.uk/QuickGO/annotations?extension=occurs_at(SO:0001158&goId=GO:0001228&goUsageRelationships=is_a,part_of,occurs_in&goUsage=descendants

delete these AEs

118 UCL annotations to revise: occurs_at(GO: https://www.ebi.ac.uk/QuickGO/annotations?extension=occurs_at(GO:&assignedBy=BHF-UCL,ParkinsonsUK-UCL

NTNU: 398 occurs_at annotations to revise: https://www.ebi.ac.uk/QuickGO/annotations?extension=occurs_at&assignedBy=NTNU_SB

NTNU: remove the occurs_at(SO: AE from all 396 annotations to TF regulator activity or child terms: https://www.ebi.ac.uk/QuickGO/annotations?extension=occurs_at&assignedBy=NTNU_SB&goId=GO:0140110&goUsageRelationships=is_a,part_of,occurs_in&goUsage=descendants
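The QuickGO links in this comment all follow the same query-string pattern, so they could be generated rather than typed by hand. A minimal sketch, assuming only the parameters visible in the links above (`extension`, `assignedBy`, `goId`, `goUsageRelationships`, `goUsage`); the function name `quickgo_url` and its arguments are made up for illustration, and this builds the browser URL only, not a call to any QuickGO web service:

```python
from urllib.parse import urlencode

QUICKGO_ANNOTATIONS = "https://www.ebi.ac.uk/QuickGO/annotations"


def quickgo_url(extension, assigned_by=(), go_id=None, descendants=False):
    """Build a QuickGO annotation-browser link like the ones in this thread.

    Hypothetical helper: parameter names are copied from the links above.
    """
    params = {"extension": extension}
    if assigned_by:
        params["assignedBy"] = ",".join(assigned_by)
    if go_id:
        params["goId"] = go_id
        if descendants:
            # Same relationship set the links above use for "child terms".
            params["goUsageRelationships"] = "is_a,part_of,occurs_in"
            params["goUsage"] = "descendants"
    # Keep ':', ',', '(' and ')' unescaped so the result matches the links above.
    return QUICKGO_ANNOTATIONS + "?" + urlencode(params, safe=":,()")


UCL = ["BHF-UCL", "ParkinsonsUK-UCL"]
print(quickgo_url("occurs_at", UCL))
print(quickgo_url("occurs_at", UCL, go_id="GO:0140110", descendants=True))
```

Passing a partial extension such as `occurs_at(SO:` works the same way, since the parentheses and colon are left unescaped.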

RLovering commented 2 years ago

copy of email sent to Astrid and Martin 27 July 2022:

Dear Martin and Astrid

It has been agreed that all occurs_at annotations need to be removed or revised; see https://github.com/geneontology/go-annotation/issues/2584

I still have 370 annotations that need to be revised and NTNU have 398 https://www.ebi.ac.uk/QuickGO/annotations?extension=occurs_at&assignedBy=NTNU_SB

I am planning to ask Alex to help me with this and wondered if you would be happy for him to apply the same rules to your annotations to reduce the number of annotations that need to be edited manually.

396 of your annotation extensions specify the region the dbTF binds using a SO ID. However, GO guidelines are now that:

  1. DNA-binding transcription factor activity (and child terms) will have annotation extensions that specify the gene regulated (using has_input UniProt/MOD ID), NOT SO IDs.
  2. DNA binding (and child terms) can specify the target DNA bound motif using has_input (SO motif ID).

https://www.ebi.ac.uk/QuickGO/annotations?extension=occurs_at&assignedBy=NTNU_SB&goId=GO:0140110&goUsageRelationships=is_a,part_of,occurs_in&goUsage=descendants

The remaining 2 annotations are to DNA binding terms: https://www.ebi.ac.uk/QuickGO/annotations?extension=occurs_at&assignedBy=NTNU_SB&goId=GO:0000979,GO:0000978&goUsageRelationships=is_a,part_of,occurs_in&goUsage=descendants

These need to be edited by hand.

For: UniProtKB:Q9WUI0 Mixl1 enables GO:0000978 RNA polymerase II cis-regulatory region sequence-specific DNA binding IDA PMID:19038793 occurs_at (SO:0000167)

Change occurs_at (SO:0000167) promoter to the specific motif ID, although I think you might need to request this.

For: UniProtKB:P62380 TBPL1 enables GO:0000979 RNA polymerase II core promoter sequence-specific DNA binding IDA PMID:15767669 has_input (NCBI_Gene:4763) and occurs_at promoter_flanking_region (SO:0001952)

Change NCBI_Gene:4763 to UniProtKB:P21359.

Delete occurs_at promoter_flanking_region (SO:0001952), as core promoter is implied by the annotation.

It looks like the term GO:0000978 RNA polymerase II cis-regulatory region sequence-specific DNA binding is associated with all of the proteins you have annotated as dbTFs.

Attached is a download of your annotations. It looks like only 3 SO IDs have been used:

SO:0000165 enhancer
SO:0000167 promoter
SO:0001952 promoter_flanking_region

None of these need to be used, as the term GO:0000978 RNA polymerase II cis-regulatory region sequence-specific DNA binding covers these statements.

However, the attached file also confirms that you have a lot of NCBI Gene ID information which needs to be converted to UniProtKB IDs, see: https://www.ebi.ac.uk/QuickGO/annotations?extension=has_input(NCBI&assignedBy=NTNU_SB

Please confirm ASAP that you are happy for Alex (if he is willing to do this) to remove all of your occurs_at(SO ID) annotation extensions and then hopefully he can do mine at the same time that he does yours

Hope you are all well

Best Ruth
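The two rules in the email above (drop occurs_at(SO ID) from TF activity annotations; switch it to has_input(SO ID) on DNA binding annotations when the SO ID is a motif) could be applied to an annotation extension string roughly as follows. This is only a sketch under assumptions: the labels `"tf_activity"` and `"dna_binding"` are hypothetical stand-ins for "GO:0140110 or child term" and "GO:0003677 or child term", and the caller is assumed to have already decided which branch a GO term falls under and whether the SO ID really is a motif. It is not how Protein2GO or the actual bulk edit works.

```python
import re

# Matches a single occurs_at(SO:nnnnnnn) relation.
OCCURS_AT_SO = re.compile(r"occurs_at\((SO:\d+)\)")


def revise_extension(extension, term_kind):
    """Rewrite one annotation extension string.

    extension: relations joined by ',' within a group, groups joined by '|'.
    term_kind: 'tf_activity' or 'dna_binding' (hypothetical labels).
    """
    revised_groups = []
    for group in extension.split("|"):
        kept = []
        for relation in group.split(","):
            relation = relation.strip()
            match = OCCURS_AT_SO.fullmatch(relation)
            if match is None:
                kept.append(relation)  # not occurs_at(SO:...): leave untouched
            elif term_kind == "dna_binding":
                kept.append(f"has_input({match.group(1)})")  # motif target
            elif term_kind == "tf_activity":
                continue  # rule: SO IDs are not used on TF activity terms
            else:
                kept.append(relation)  # unknown kind: do nothing automatically
        if kept:
            revised_groups.append(",".join(kept))
    return "|".join(revised_groups)


print(revise_extension("occurs_at(SO:0001158)", "dna_binding"))
print(revise_extension("occurs_at(SO:0000165)", "tf_activity"))
```

An empty result means the extension disappears entirely, which is the case where the annotation may end up duplicating an existing one and needs a manual look.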

RLovering commented 2 years ago

https://docs.google.com/spreadsheets/d/1zKgpgGsNzMGiPrDd0IKlwWWCXboZR3gS4UaxP_zE4AE/edit#gid=0

RLovering commented 2 years ago

reply from Martin Kuiper 28 July 2022:

Dear Ruth,

If this can be done automatically then that is a good solution, Astrid and I agree with it. Marcio will do the manual annotations.

Best, Martin

RLovering commented 2 years ago

Note:

  1. Protein2GO does not have has_input SO ID as an option
  2. SO does not have the term: vitamin D response element whereas GO has GO:0070644 vitamin D response element binding
  3. SO does not have the term: serum response element whereas GO has GO:0010736 serum response element binding
  4. SO does not have the term: carbohydrate response element whereas GO has GO:0035538 carbohydrate response element binding

pgaudet commented 1 year ago

@alexsign

We would like to delete the following extensions:

Thanks, Pascale

pgaudet commented 1 year ago

There are 10 annotations by SGD left, as well as 8 BHF:

https://docs.google.com/spreadsheets/d/1LMsw75fwfKBi5H-BFt3opKVV1VuEcT7HUE34bHtOSq4/edit#gid=237931030

Please correct your annotations, changing to occurs_in if possible. If this is not possible, please explain what is needed.

Thanks, Pascale

RLovering commented 1 year ago

UCL complete. Note that 2 annotations were probably from PomBase but were no longer present in Protein2GO, so possibly Val had done these already, or QuickGO is out of sync with PomBase.

Ruth

ValWood commented 1 year ago

I don't see any PomBase ones in the spreadsheet. Is there still something for me to do here?

RLovering commented 1 year ago

These articles were curated by PomBase https://www.ebi.ac.uk/QuickGO/annotations?reference=PMID:9303310,PMID:28367989 but then GOC created ICs with occurs_at - very weird

Ruth

ValWood commented 1 year ago

I'm also confused about where this comes from. @pgaudet it's this annotation. Can you figure out the source?

Screenshot 2023-07-05 at 15 14 45

colinlog commented 1 year ago

I believe 'we' will need to help SO make the 'SO motif set'. Correct me if I am wrong, but those are continuants being abstract bioinformatic objects that represent consensus DNA sequences that are bound by dbTFs.

For annotation and modelling, there are currently two different dbTF motif sets in the field:

I would argue that we should allow both the first and second type of motifs to be used when annotating BPs using MFs, even though it may be that the second are complex combinations of the first and even though there may be a better collection of motifs available at some point in the future.

Then, what we need in 2024 is for field specialists to discuss what is most opportune to include in SO's motif collection that can be used in GO-CAMs and therefore for GO annotation of dbTFs and their cofactors.

My intuition is that the actual DNA sequence is its 'true name' as that describes it best, whatever type of cis-acting DNA element it may be. Genomic coordinates, when they have been determined, allow retrieval of the 'true name' for a cis-regulatory DNA sequence. Hence, I think it is important that GO-CAMs permit storage of machine-readable (bed format) genomic coordinates. Note that when reaching 20 or more nucleotides in length, the DNA sequences are likely to be absent, unique or under evolutionary selection to be present in larger numbers in a (human) genome.

To put these matters further in perspective: if introduced in the genome to replace an existing motif instance, the above first and second type of motifs would be 'too optimal' because they would bind their target transcription factor protein (complexes) too tightly to allow cellular homeostasis. Most observed functional genomic binding sites are sub-optimal and therefore deviate from the consensus motifs.

The modelling field still needs to establish standards to simulate cellular transcription programs.

Perhaps coordinating this is a task for the successor of GREEKC?


alexsign commented 1 year ago

@pgaudet @RLovering the requested extensions are mostly updated now. Below is the list of annotations whose extension was not updated, either because of duplication of data or because it is too complex. They need to be removed/updated manually.

AB F1SKN9 GO:0000981 ECO:0000250 occurs_at(SO:0000165)
NTNU Q04206 GO:0001228 ECO:0000314 occurs_at(SO:0000165)
BHFL Q08369 GO:0000981 ECO:0000314 occurs_at(SO:0000165)
NTNU Q62424 GO:0001228 ECO:0000305 occurs_at(SO:0000165)
NTNU Q8CFN5 GO:0001228 ECO:0000314 occurs_at(SO:0000165)
BHFL Q9H2W2 GO:0001228 ECO:0000250 occurs_at(SO:0000165)
BHFL Q9H2W2 GO:0001228 ECO:0000250 occurs_at(SO:0000165)
BHFL Q9WUI0 GO:0001228 ECO:0000314 occurs_at(SO:0000165)
BHFL Q9WUI0 GO:0001228 ECO:0000314 occurs_at(SO:0000165)
BHFL O14503 GO:0001227 ECO:0000314 occurs_at(SO:0001952)
BHFL O14503 GO:0001227 ECO:0000314 occurs_at(SO:0001952)
BHFL O35185 GO:0001227 ECO:0000314 occurs_at(SO:0001952)
BHFL Q99PV5 GO:0001227 ECO:0000314 occurs_at(SO:0001952)
BHFL Q99PV5 GO:0001227 ECO:0000314 occurs_at(SO:0001952)
BHFL Q9JM73 GO:0001228 ECO:0000314 occurs_at(SO:0001952)
PARL Q8TE12 GO:0000977 ECO:0000250 occurs_at(SO:0000167),has_input(UniProtKB:P04628),happens_during(GO:1990403),happens_during(GO:0030901)|occurs_at(SO:0000167),has_input(UniProtKB:P43354)|occurs_at(SO:0000167),has_input(UniProtKB:O75364)|occurs_at(SO:0000167),has_input(UniProtKB:O60663)|occurs_at(SO:0000167),has_input(UniProtKB:Q8TE12)|occurs_at(SO:0000167),has_input(UniProtKB:P28360)
BHFL Q63934 GO:0043565 ECO:0000314 has_input(UniProtKB:P14142),occurs_at(SO:0001055),occurs_in(UBERON:0001013)|has_input(UniProtKB:P14142),occurs_at(SO:0001055),occurs_in(UBERON:0001134)
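For anyone working through the list above by hand, it may help to see how the last two "too complex" rows break down. The sketch below assumes each row has five whitespace-separated fields (source code, UniProtKB accession, GO ID, ECO ID, extension) and that extension groups are separated by `|` with relations separated by `,`; that layout is inferred from the rows themselves, and the names `Row`, `parse_row` and `occurs_at_targets` are invented for illustration.

```python
import re
from typing import NamedTuple

# relation(target), e.g. occurs_at(SO:0001055)
RELATION = re.compile(r"(\w+)\(([^()]+)\)")


class Row(NamedTuple):
    source: str
    accession: str
    go_id: str
    eco: str
    extension: list  # list of groups; each group is a list of (relation, target)


def parse_row(line):
    """Split one row of the list above into its fields and extension groups."""
    source, accession, go_id, eco, ext = line.split()
    groups = [RELATION.findall(group) for group in ext.split("|")]
    return Row(source, accession, go_id, eco, groups)


def occurs_at_targets(row):
    """Return the distinct occurs_at targets used anywhere in the row."""
    return sorted({target for group in row.extension
                   for relation, target in group if relation == "occurs_at"})


row = parse_row(
    "BHFL Q63934 GO:0043565 ECO:0000314 "
    "has_input(UniProtKB:P14142),occurs_at(SO:0001055),occurs_in(UBERON:0001013)"
    "|has_input(UniProtKB:P14142),occurs_at(SO:0001055),occurs_in(UBERON:0001134)"
)
print(len(row.extension), occurs_at_targets(row))
```

Seen this way, the Q63934 row has two extension groups that differ only in the UBERON term, and both carry the same occurs_at target.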

RLovering commented 1 year ago

Thanks so much Alex for all your help with this.

UCL ones all done now

Just 1 AB and 3 NTNUs to go

Best Ruth

pgaudet commented 1 year ago

NTNU Q04206 GO:0001228 ECO:0000314 occurs_at(SO:0000165) >> redundant with another annotation; delete
NTNU Q62424 GO:0001228 ECO:0000305 occurs_at(SO:0000165) >> redundant with another annotation; delete
NTNU Q8CFN5 GO:0001228 ECO:0000314 occurs_at(SO:0000165) >> redundant with another annotation; delete
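One plausible reading of "redundant with another annotation" is that once the occurs_at extension is stripped, the row becomes identical to an annotation that already exists (same accession, GO ID, evidence code and remaining extension). A rough sketch of that check is below. It is an assumption about the criterion, not a description of what was actually run, and the `existing` entry in the usage lines is a made-up placeholder, not real data from QuickGO.

```python
import re

OCCURS_AT = re.compile(r"occurs_at\([^()]*\)")


def strip_occurs_at(extension):
    """Remove every occurs_at(...) relation; drop groups that become empty."""
    groups = []
    for group in extension.split("|"):
        kept = [rel for rel in group.split(",")
                if rel and not OCCURS_AT.fullmatch(rel)]
        if kept:
            groups.append(",".join(kept))
    return "|".join(groups)


def redundant_after_strip(candidates, existing):
    """Return candidates that duplicate an earlier annotation once stripped.

    Both arguments hold (accession, go_id, eco, extension) tuples.
    """
    seen = set(existing)
    redundant = []
    for accession, go_id, eco, extension in candidates:
        key = (accession, go_id, eco, strip_occurs_at(extension))
        if key in seen:
            redundant.append((accession, go_id, eco, extension))
        else:
            seen.add(key)
    return redundant


# Hypothetical: assume an extension-free annotation already exists for Q04206.
existing = [("Q04206", "GO:0001228", "ECO:0000314", "")]
candidates = [("Q04206", "GO:0001228", "ECO:0000314", "occurs_at(SO:0000165)")]
print(redundant_after_strip(candidates, existing))
```

The same function also flags rows that duplicate each other within the candidate list, since each stripped row is added to the set of seen annotations as it is processed.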

pgaudet commented 1 year ago

Alex will remove redundant annotations. All occurs_at annotations have been dealt with.

pfey03 commented 1 year ago

I have already deleted the Dicty ones.