Closed pgaudet closed 1 year ago
as I wrote above: Annotations to GO:0000976 transcription cis-regulatory region binding should use has_input of the SO ID for the motif target.
other DNA binding terms can also include has_input SO ID for the motif target.
370 UCL occurs_at annotations to revise: https://www.ebi.ac.uk/QuickGO/annotations?extension=occurs_at&assignedBy=BHF-UCL,ParkinsonsUK-UCL
266 UCL occurs_at(SO: annotations to revise: https://www.ebi.ac.uk/QuickGO/annotations?extension=occurs_at(SO:&assignedBy=BHF-UCL,ParkinsonsUK-UCL
of which 105 UCL annotations to revise: TF regulator activity or child terms with occurs at SO ID; the occurs_at(SO ID) AE needs to be removed from all of these: https://www.ebi.ac.uk/QuickGO/annotations?extension=occurs_at&assignedBy=BHF-UCL,ParkinsonsUK-UCL&goId=GO:0140110&goUsageRelationships=is_a,part_of,occurs_in&goUsage=descendants
of which 135 UCL annotations to revise: DNA binding or child terms with occurs_at SO ID; the occurs_at(SO ID) AE needs to be changed to has_input if these are motifs: https://www.ebi.ac.uk/QuickGO/annotations?extension=occurs_at&assignedBy=BHF-UCL,ParkinsonsUK-UCL&goId=GO:0003677&goUsageRelationships=is_a,part_of,occurs_in&goUsage=descendants
of which 53 UCL annotations include E-box binding SO - and often to E-box binding GO annotation: https://www.ebi.ac.uk/QuickGO/annotations?extension=occurs_at(SO:0001158&goId=GO:0003677&goUsageRelationships=is_a,part_of,occurs_in&goUsage=descendants
change these to has_input(SO:0001158)
of which 12 UCL annotations include telomeric_D_loop: https://www.ebi.ac.uk/QuickGO/annotations?extension=occurs_at(SO:0002171
change these to has_input(SO:0002171)
DONE: 4 UCL annotations manual delete SO:0100017 polypeptide_conserved_motif https://www.ebi.ac.uk/QuickGO/annotations?extension=occurs_at(SO:0100017
DONE: 1 UCL annotation manual delete SO:0100021 polypeptide_conserved_motif https://www.ebi.ac.uk/QuickGO/annotations?extension=occurs_at(SO:0100021
DONE: of which 5 UCL TF activity annotations include E-box binding SO https://www.ebi.ac.uk/QuickGO/annotations?extension=occurs_at(SO:0001158&goId=GO:0001228&goUsageRelationships=is_a,part_of,occurs_in&goUsage=descendants
delete these AEs
118 UCL annotations to revise: occurs_at(GO: https://www.ebi.ac.uk/QuickGO/annotations?extension=occurs_at(GO:&assignedBy=BHF-UCL,ParkinsonsUK-UCL
398 NTNU occurs_at annotations to revise: https://www.ebi.ac.uk/QuickGO/annotations?extension=occurs_at&assignedBy=NTNU_SB
of which 396 NTNU annotations are to TF regulator activity or child terms; the occurs_at(SO: AE needs to be removed from all of these: https://www.ebi.ac.uk/QuickGO/annotations?extension=occurs_at&assignedBy=NTNU_SB&goId=GO:0140110&goUsageRelationships=is_a,part_of,occurs_in&goUsage=descendants
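The rules listed above could be sketched roughly as follows. This is only an illustration, not the script that was actually run: the `TF_ACTIVITY` and `DNA_BINDING` sets are hand-picked stand-ins (IDs taken from this thread) for "GO:0140110 or child terms" and "GO:0003677 or child terms", `MOTIF_SO` holds only the two SO IDs this thread says to move to has_input, and the function name is made up. occurs_at(GO: extensions are not covered.

```python
import re

# Stand-ins for "term or any child term"; a real run would compute
# descendants from the ontology instead of hard-coding IDs (assumption).
TF_ACTIVITY = {"GO:0140110", "GO:0000981", "GO:0001227", "GO:0001228"}
DNA_BINDING = {"GO:0003677", "GO:0000976", "GO:0000978", "GO:0043565"}

# SO IDs that this thread says should become has_input on DNA binding terms.
MOTIF_SO = {"SO:0001158", "SO:0002171"}

OCCURS_AT_SO = re.compile(r"occurs_at\((SO:\d{7})\)")


def revise_extension(go_id: str, extension: str) -> str:
    """Apply the occurs_at(SO:...) rules to one comma-separated extension."""
    revised = []
    for part in (p for p in extension.split(",") if p):
        match = OCCURS_AT_SO.fullmatch(part)
        if match is None:
            revised.append(part)  # not an occurs_at(SO:...) relation: keep
        elif go_id in TF_ACTIVITY:
            continue  # TF regulator activity: remove the SO extension
        elif go_id in DNA_BINDING and match.group(1) in MOTIF_SO:
            revised.append(f"has_input({match.group(1)})")  # motif bound
        else:
            revised.append(part)  # anything else is left for manual review
    return ",".join(revised)
```

Anything the function does not recognise is returned unchanged, so that it ends up in the manual-review pile rather than being silently rewritten.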
Copy of email sent to Astrid and Martin, 27 July 2022:
Dear Martin and Astrid,
It has been agreed that all occurs_at annotations need to be removed or revised; see https://github.com/geneontology/go-annotation/issues/2584
I still have 370 annotations that need to be revised and NTNU have 398: https://www.ebi.ac.uk/QuickGO/annotations?extension=occurs_at&assignedBy=NTNU_SB
I am planning to ask Alex to help me with this and wondered if you would be happy for him to apply the same rules to your annotations to reduce the number of annotations that need to be edited manually.
396 of your annotation extensions specify the region the dbTF binds using a SO ID. However, the GO guidelines are now that:
- DNA-binding transcription factor activity (and child terms) will have annotation extensions that specify the gene regulated (using has_input UniProt/MOD ID), NOT SO IDs.
- DNA binding (and child terms) can specify the target DNA motif bound using has_input (SO motif ID).
The remaining 2 annotations are to DNA binding terms: https://www.ebi.ac.uk/QuickGO/annotations?extension=occurs_at&assignedBy=NTNU_SB&goId=GO:0000979,GO:0000978&goUsageRelationships=is_a,part_of,occurs_in&goUsage=descendants
These need to be edited by hand.
For: UniProtKB:Q9WUI0 Mixl1 enables GO:0000978 RNA polymerase II cis-regulatory region sequence-specific DNA binding IDA PMID:19038793 occurs_at (SO:0000167)
- Change occurs_at (SO:0000167) promoter to the specific motif ID, although I think you might need to request this.
For: UniProtKB:P62380 TBPL1 enables GO:0000979 RNA polymerase II core promoter sequence-specific DNA binding IDA PMID:15767669 has_input (NCBI_Gene:4763) and occurs_at promoter_flanking_region (SO:0001952)
- Change NCBI_Gene:4763 to UniProtKB:P21359
- Delete occurs_at promoter_flanking_region (SO:0001952), as core promoter is implied by the annotation
It looks like the term GO:0000978 RNA polymerase II cis-regulatory region sequence-specific DNA binding is associated with all of the proteins you have annotated as dbTFs.
Attached is a download of your annotations. It looks like only 3 SO IDs have been used:
- SO:0000165 enhancer
- SO:0000167 promoter
- SO:0001952 promoter_flanking_region
None of these need to be used, as the term GO:0000978 RNA polymerase II cis-regulatory region sequence-specific DNA binding covers these statements.
However, the attached file also confirms that you have a lot of NCBI Gene ID information which needs to be converted to UniProtKB IDs, see: https://www.ebi.ac.uk/QuickGO/annotations?extension=has_input(NCBI&assignedBy=NTNU_SB
Please confirm ASAP that you are happy for Alex (if he is willing to do this) to remove all of your occurs_at(SO ID) annotation extensions and then hopefully he can do mine at the same time that he does yours
Hope you are all well
Best Ruth
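For the NCBI Gene to UniProtKB conversion mentioned in the email, a minimal sketch might look like the following. The only mapping included is the one pair given in the email (NCBI_Gene:4763 to UniProtKB:P21359); where a full mapping table would come from is left open, and the function name is hypothetical.

```python
import re

# Only the pair stated in the email; a real table would have to be loaded
# from an ID-mapping resource (assumption, not shown here).
NCBI_TO_UNIPROT = {"4763": "P21359"}

NCBI_INPUT = re.compile(r"has_input\(NCBI_Gene:(\d+)\)")


def convert_ncbi_inputs(extension: str) -> tuple[str, list[str]]:
    """Swap has_input(NCBI_Gene:N) for has_input(UniProtKB:ACC) where mapped.

    Returns the rewritten extension and the NCBI Gene IDs with no mapping.
    """
    unmapped: list[str] = []

    def swap(match: re.Match) -> str:
        gene_id = match.group(1)
        accession = NCBI_TO_UNIPROT.get(gene_id)
        if accession is None:
            unmapped.append(gene_id)  # report it rather than guess
            return match.group(0)
        return f"has_input(UniProtKB:{accession})"

    return NCBI_INPUT.sub(swap, extension), unmapped
```

IDs without a mapping are returned untouched and listed separately, so they can be checked by hand.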
Reply from Martin Kuiper, 28 July 2022:
Dear Ruth,
If this can be done automatically then that is a good solution; Astrid and I agree with it. Marcio will do the manual annotations.
Best, Martin
Note:
@alexsign
We would like to delete the following extensions:
Thanks, Pascale
There are 10 annotations by SGD left, as well as 8 BHF:
Please correct your annotations, changing to occurs_in if possible. If this is not possible, please explain what is needed.
Thanks, Pascale
UCL complete, but note that 2 annotations were probably from PomBase and were no longer present in Protein2GO, so possibly Val had already done these, or QuickGO is out of sync with PomBase.
Ruth
I don't see any PomBase ones in the spreadsheet. Is there still something for me to do here?
These articles were curated by PomBase https://www.ebi.ac.uk/QuickGO/annotations?reference=PMID:9303310,PMID:28367989 but then GOC created ICs with occurs_at - very weird
Ruth
I'm also confused where this comes from. @pgaudet it's this annotation. Can you figure out the source?
I believe 'we' will need to help SO make the 'SO motif set'. Correct me if I am wrong, but those are continuants: abstract bioinformatic objects that represent consensus DNA sequences bound by dbTFs.
For annotation and modelling, there are currently two different dbTF motif sets in the field:
First, we have experimentally derived consensus sequences that are bound by recombinant MONOMERIC or HOMO-MULTIMERIC human dbTF proteins, (almost all) produced in E. coli bacteria. There are about 500 of these, which Ruth refers to and which were obtained by high-throughput SELEX-type experiments. These motifs are continuants.
Second, there are experimentally characterised instances of native genomic sequences that have been identified in cellular gene enhancers, promoters and chromatin loop anchors. Searching for consensus sequences at the centre of all binding sites observed for transcription factors within a genome also yields 'motifs' that are abstracted sequences - continuants. Many of these are compound instances of two or more consensus motifs of the first type that are bound in cellulo by HETERODIMERIC or HETEROMULTIMERIC dbTF complexes. Furthermore, some of these are cell-type-specific, in the sense that chromatin immunoprecipitation of the same dbTF with the same antibody can yield consensus DNA binding site sequences that are rather different, depending on the cell type that is studied. This reflects alternative transcription programs rooted in the interdependence of multiple transcription factors at given gene promoters, enhancers or chromatin loop anchors.
Third, we have the DNA sequences themselves, with actual genomic coordinates and the potential to be assigned target genes in the genome and binding proteins in the proteome (usually dbTFs and their cofactors). The DNA sequences themselves have particular value for modelling. It is therefore important to annotate them adequately using GO terms to instantiate models of the process of cellular transcription regulation. These are occurrents. These should be modelled as GO-CAMs using SO-equivalent terms. These need qualifiers such as species, cell type, cell state and temporal phase.
I would argue that we should allow both the first and second type of motifs to be used when annotating BPs using MFs, even though it may be that the second are complex combinations of the first and even though there may be a better collection of motifs available at some point in the future.
Then, what we need in 2024 is for field specialists to discuss what is most opportune to include in SO's motif collection that can be used in GO-CAMs and therefore for GO annotation of dbTFs and their cofactors.
My intuition is that the actual DNA sequence is its 'true name' as that describes it best, whatever type of cis-acting DNA element it may be. Genomic coordinates, when they have been determined, allow retrieval of the 'true name' for a cis-regulatory DNA sequence. Hence, I think it is important that GO-CAMs permit storage of machine-readable (BED format) genomic coordinates. Note that when reaching 20 or more nucleotides in length, the DNA sequences are likely to be absent, unique or under evolutionary selection to be present in larger numbers in a (human) genome.
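To make the point about machine-readable coordinates concrete, here is a minimal sketch of turning a binding site into a BED6 line. It assumes the site is recorded with 1-based inclusive coordinates (BED uses a 0-based start and an exclusive end); the `Site` class and its field names are hypothetical and nothing here is an existing GO-CAM feature.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Site:
    """A cis-regulatory site with 1-based, inclusive coordinates (assumed)."""

    chrom: str
    start: int
    end: int
    name: str = "."
    strand: str = "."

    def to_bed(self) -> str:
        """Render as a tab-separated BED6 line: 0-based start, exclusive end."""
        if self.start < 1 or self.end < self.start:
            raise ValueError("expected 1-based inclusive coordinates")
        fields = [
            self.chrom,
            str(self.start - 1),  # shift the start to 0-based
            str(self.end),        # inclusive 1-based end equals exclusive end
            self.name,
            "0",                  # score column, unused here
            self.strand,
        ]
        return "\t".join(fields)
```

Keeping the coordinate convention explicit in one place avoids off-by-one errors when the sequence is later retrieved from the genome.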
To put these matters further in perspective: if introduced in the genome to replace an existing motif instance, the above first and second type of motifs would be 'too optimal' because they would bind their target transcription factor protein(complexes) too tightly to allow cellular homeostasis. Most observed functional genomic binding sites are sub-optimal and therefore deviate from the consensus motifs.
The modelling field still needs to establish standards to simulate cellular transcription programs.
Perhaps coordinating this is a task for the successor of GREEKC?
On Wed, Aug 10, 2022 at 6:00 PM Ruth Lovering @.***> wrote:
Note:
- Protein2GO does not have has_input SO ID as an option
- SO does not have the term: vitamin D response element whereas GO has GO:0070644 vitamin D response element binding
@pgaudet @RLovering the requested extensions have mostly been updated now. Below is the list of annotations whose extensions were not updated, either because the change would duplicate existing data or because the extension is too complex. They need to be removed/updated manually.
AB F1SKN9 GO:0000981 ECO:0000250 occurs_at(SO:0000165)
NTNU Q04206 GO:0001228 ECO:0000314 occurs_at(SO:0000165)
BHFL Q08369 GO:0000981 ECO:0000314 occurs_at(SO:0000165)
NTNU Q62424 GO:0001228 ECO:0000305 occurs_at(SO:0000165)
NTNU Q8CFN5 GO:0001228 ECO:0000314 occurs_at(SO:0000165)
BHFL Q9H2W2 GO:0001228 ECO:0000250 occurs_at(SO:0000165)
BHFL Q9H2W2 GO:0001228 ECO:0000250 occurs_at(SO:0000165)
BHFL Q9WUI0 GO:0001228 ECO:0000314 occurs_at(SO:0000165)
BHFL Q9WUI0 GO:0001228 ECO:0000314 occurs_at(SO:0000165)
BHFL O14503 GO:0001227 ECO:0000314 occurs_at(SO:0001952)
BHFL O14503 GO:0001227 ECO:0000314 occurs_at(SO:0001952)
BHFL O35185 GO:0001227 ECO:0000314 occurs_at(SO:0001952)
BHFL Q99PV5 GO:0001227 ECO:0000314 occurs_at(SO:0001952)
BHFL Q99PV5 GO:0001227 ECO:0000314 occurs_at(SO:0001952)
BHFL Q9JM73 GO:0001228 ECO:0000314 occurs_at(SO:0001952)
PARL Q8TE12 GO:0000977 ECO:0000250 occurs_at(SO:0000167),has_input(UniProtKB:P04628),happens_during(GO:1990403),happens_during(GO:0030901)|occurs_at(SO:0000167),has_input(UniProtKB:P43354)|occurs_at(SO:0000167),has_input(UniProtKB:O75364)|occurs_at(SO:0000167),has_input(UniProtKB:O60663)|occurs_at(SO:0000167),has_input(UniProtKB:Q8TE12)|occurs_at(SO:0000167),has_input(UniProtKB:P28360)
BHFL Q63934 GO:0043565 ECO:0000314 has_input(UniProtKB:P14142),occurs_at(SO:0001055),occurs_in(UBERON:0001013)|has_input(UniProtKB:P14142),occurs_at(SO:0001055),occurs_in(UBERON:0001134)
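The last two entries above show why some extensions are awkward to update automatically: several relations are packed into one field. A small parser sketch, assuming the usual GAF convention that `|` separates independent extensions and `,` joins relations within one extension (the names `Relation`, `parse_extensions` and `has_occurs_at` are made up for illustration):

```python
import re
from typing import NamedTuple


class Relation(NamedTuple):
    name: str    # e.g. "occurs_at"
    target: str  # e.g. "SO:0000167"


RELATION = re.compile(r"(\w+)\(([^()]+)\)")


def parse_extensions(field: str) -> list[list[Relation]]:
    """Split a field into alternatives ('|'), each a list of relations (',')."""
    alternatives = []
    for alternative in field.split("|"):
        relations = []
        for item in alternative.split(","):
            match = RELATION.fullmatch(item.strip())
            if match is None:
                raise ValueError(f"unparseable extension: {item!r}")
            relations.append(Relation(match.group(1), match.group(2)))
        alternatives.append(relations)
    return alternatives


def has_occurs_at(field: str) -> bool:
    """True if any alternative in the field still uses occurs_at."""
    return any(
        relation.name == "occurs_at"
        for alternative in parse_extensions(field)
        for relation in alternative
    )
```

Working on the parsed structure rather than on the raw string makes it easier to drop or rename a single relation without disturbing the others in the same extension.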
Thanks so much Alex for all your help with this.
UCL ones all done now
Just 1 AB and 3 NTNUs to go
Best Ruth
NTNU Q04206 GO:0001228 ECO:0000314 occurs_at(SO:0000165) >> redundant with another annotation; delete
NTNU Q62424 GO:0001228 ECO:0000305 occurs_at(SO:0000165) >> redundant with another annotation; delete
NTNU Q8CFN5 GO:0001228 ECO:0000314 occurs_at(SO:0000165) >> redundant with another annotation; delete
Alex will remove redundant annotations. All occurs_at annotations have been dealt with.
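As an illustration of the redundancy problem: once occurs_at is removed, an annotation can become identical to one that already exists. A rough sketch of spotting such cases is below. It compares only the columns shown in this thread (accession, GO ID, ECO ID, extension); a real check would presumably also compare the reference and other fields. It handles a single comma-separated extension, not `|`-separated alternatives, and all names are hypothetical.

```python
from collections import defaultdict

# group, accession, GO ID, ECO ID, extension
Row = tuple[str, str, str, str, str]
Key = tuple[str, str, str, str]


def strip_occurs_at(extension: str) -> str:
    """Remove occurs_at(...) relations from one comma-separated extension."""
    kept = [
        part
        for part in extension.split(",")
        if part and not part.startswith("occurs_at(")
    ]
    return ",".join(kept)


def find_redundant(rows: list[Row]) -> dict[Key, list[Row]]:
    """Group rows that become identical once occurs_at is dropped."""
    groups: dict[Key, list[Row]] = defaultdict(list)
    for row in rows:
        _, accession, go_id, eco, extension = row
        key = (accession, go_id, eco, strip_occurs_at(extension))
        groups[key].append(row)
    # Only keys shared by more than one row indicate redundancy.
    return {key: members for key, members in groups.items() if len(members) > 1}
```

Rows that end up sharing a key are candidates for deletion rather than update.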
I have already deleted the Dicty ones
Hello,
Occurs_at is not in the Relations Ontology, and its intended usage is the same as 'occurs_in', therefore we will deprecate occurs_at. Annotations should be reviewed and moved as suggested, if possible:
https://docs.google.com/spreadsheets/d/1LMsw75fwfKBi5H-BFt3opKVV1VuEcT7HUE34bHtOSq4/edit#gid=0
Please write a comment if these suggestions don't work for some of your annotations.
Impacts:
AgBase, Alzheimers_University_of_Toronto, ARUK-UCL, BHF-UCL, CAFA, dictyBase, GO_Central, MGI, NTNU_SB, ParkinsonsUK-UCL, PomBase, SGD, UniProt