Closed pgaudet closed 1 year ago
@pgaudet in this linked table here there is nothing from Dicty. In the related issue, (https://github.com/geneontology/go-ontology/issues/17713) I looked yesterday, there are 3 Dicty, but one has extensions from SO that should be ok https://docs.google.com/spreadsheets/d/1LMsw75fwfKBi5H-BFt3opKVV1VuEcT7HUE34bHtOSq4/edit#gid=0 Thanks cc @rjdodson
@pfey03 dictyBase has
Those need to be changed to occurs_in.
Also
dictyBase | DDB_G0278741 | enables | GO:1990404 | protein ADP-ribosylase activity | PMID:28252050 | ECO:0000314 | | 20180907 | dictyBase | occurs_at(SO:0001454 amino_acid ),occurs_at(SO:0100014 n_terminal_region) | |
I don't think these are correct. What are you trying to express?
Sorry Pascale I have a few questions
Does this mean that the MF GO terms such as: GO:0000978 RNA polymerase II proximal promoter sequence-specific DNA binding are going to be kept? I thought that the point of improving the SO terms was to be able to use them in the annotation extension field?
You suggest: For annotations with pattern: SO+geneid= has_input, use enables xxx binding + SO:DNA motif = has_input; but I don't understand what this means, please confirm with this example:
GO:0001227 | DNA-binding transcription repressor activity, RNA polymerase II-specific | has_regulation_target(MGI:MGI:1202709),occurs_at(SO:0000165 enhancer)
Are you saying this should change to: GO:0001227 | DNA-binding transcription repressor activity, RNA polymerase II-specific | ?MGI:MGI:1202709),has_input(SO:0000165 enhancer) Because that doesn't seem right as the term is not a child of DNA binding. I know that there has been a discussion about these sort of terms having DNA binding parents but I don't think this has been done yet?
Or are you saying this should change to:
GO:0001227 | DNA-binding transcription repressor activity, RNA polymerase II-specific | (what relation should be included? is this has_input? MGI:MGI:1202709), with NO SO additions?
And therefore we should also create: GO:0000980 RNA polymerase II distal enhancer sequence-specific DNA binding AE part_of GO:0001227 | DNA-binding transcription repressor activity, RNA polymerase II-specific has_input MGI:MGI:1202709
Or using a similar example as above GO:0001227 | DNA-binding transcription repressor activity, RNA polymerase II-specific | occurs_at(SO:0000165 enhancer)
should we keep the annotation to GO:0001227 | DNA-binding transcription repressor activity, RNA polymerase II-specific | with NO SO additions
And should we also create: GO:0000980 RNA polymerase II distal enhancer sequence-specific DNA binding AE part_of GO:0001227 | DNA-binding transcription repressor activity, RNA polymerase II-specific
What about this annotation? GO:1905773 | 8-hydroxy-2'-deoxyguanosine DNA binding occurs at SO:0002171 telomeric_D_loop From what you have said should this be changed to: GO:1905773 | 8-hydroxy-2'-deoxyguanosine DNA binding has_input SO:0002171 telomeric_D_loop
Do you have a deadline by which you want these all changed?
Thanks
Ruth
- Does this mean that the MF GO terms such as: GO:0000978 RNA polymerase II proximal promoter sequence-specific DNA binding are going to be kept? I thought that the point of improving the SO terms was to be able to use them in the annotation extension field?
I think what we really cannot manage in GO (or what I would rather see managed elsewhere) is the individual DNA motifs. We can probably handle a few terms like 'promoter', 'proximal promoter' and 'enhancer', especially as these are not completely orthogonal to the DNA motifs (i.e., would SO create 'enhancer E box' and 'proximal promoter E box', or would we use 2 SO terms?).
I don't have strong feelings one way or the other for promoter and enhancer. Perhaps to discuss with Colin, Astrid and Marcio ?
- And therefore we should also create: GO:0000980 RNA polymerase II distal enhancer sequence-specific DNA binding AE part_of GO:0001227 | DNA-binding transcription repressor activity, RNA polymerase II-specific has_input MGI:MGI:1202709
I think this is the best way to go. Again to discuss with Colin, Astrid and Marcio; what do you think ?
(Edited): although I don't know yet if we would use 'part_of' or 'has_part'. Right now it looks like we'll be using 'has_part' for compound functions.
- Or are you saying this should change to: GO:0001227 | DNA-binding transcription repressor activity, RNA polymerase II-specific | (what relation should be included? is this has_input? MGI:MGI:1202709), with NO SO additions?
Yes (to be confirmed by Colin, Astrid and Marcio)
I also thought that we had long ago decided at various meetings that we would use SO to capture promoter details. We discussed it at Hinxton, and then in numerous calls?
I can see that capturing the promoter specificity might not be of great use to GO, where we are trying to describe gene functions. However, the MODs are doing much more than describing functions, and our users require information about which gene a transcription factor transcribes and which promoter it uses. We use GO/SO to capture this relationship between a TF, a gene and a regulatory region because the system to annotate exists, it's universal and it works well.
Even if GO discontinues the capture of the specific promoter for a TF, we will continue to capture this at PomBase, because our users require it, but we can just filter these extensions out of the GO submission if necessary.
I also agree with Ruth that the promoter extensions are specific to the DNA binding terms, the genes are extensions to the transcription regulator terms.
val
@ValWood what promoter details are you looking to capture ? proximal/core/enhancer ?
I think I am getting confused what you are suggesting.....
- occurs_at SO:telomeric_D_loop
I am not sure about that one. Could we create this as a GO:CC, as a child of 'nuclear chromosome, telomeric region'?
TO EVERYONE:
The transcription regulation annotations are not all ready yet; we need clearer guidelines (I also emailed the people from GREEKC for input). However, it'd be nice if everyone could have a look to see whether the changes proposed above would work for you.
I'll schedule a call about this for everyone with questions, as soon as I hear back from GREEKC.
Pascale
@ValWood what promoter details are you looking to capture ? proximal/core/enhancer ?
specific promoters
e.g
RNA polymerase II proximal promoter sequence-specific DNA binding
at Ace2_UAS
(although I am not bothered about proximal or enhancer specificity in the GO term, I only want to be able to capture the exact (SO) promoter onto some DNA-binding activity)
No Ace2_UAS is the name of the promoter consensus sequence CCAGCC
http://www.sequenceontology.org/browser/current_svn/term/SO:0001857
Ah, so this would also be OK with has_input SO:0001857
Yes, we are happy to change these to has_input. We don't think it is really rigorously correct.
The rationale is that, correctly, the bound substrate is the whole DNA molecule, and the SO extension indicates the region where binding takes place.
However, we are happy to lose this precision.
or 'prepared to' lose this precision ;)
I think @mah11 might have fixed these. I was confused. I thought you wanted to stop using SO promoters in extensions.
I meant specifically these terms:
(and maybe a couple more of this type). Do you need these ?
Thanks, Pascale
I have found another case where has_input(SO) will not work: we have annotations to histone methyltransferase activity terms, with extensions identifying where the activity is observed (usually chromosomal regions; some have GO CC entries but for others we're using SO terms). This line from the spreadsheet is an example:
Changed | PomBase | SPBC428.08c | enables | GO:0046974 | histone methyltransferase activity (H3-K9 specific) | PMID:11283354 | ECO:0000314 | 20190303 | PomBase | occurs_at(GO:0031934 CC: mating-type region heterochromatin),part_of(GO:0030466) |
In this sort of situation, the histone is the input. As a stopgap I'm changing occurs_at to occurs_in for both GO CC and SO.
Hi @mah11
Yes occurs_in is the correct relation in this case.
Thanks ! Pascale
I meant specifically these terms: SO:0001952 promoter_flanking_region, SO:0000165 enhancer
not for pombe...but higher euks might..
Yes occurs_in is the correct relation in this [histone modifying activity] case.
In that case, some are done and the rest in progress. I still don't see how anything can occur in a sequence feature (or a molecule consisting of residues in the given sequence), though. "In" is simply the wrong preposition there.
@pgaudet regarding this:
dictyBase | DDB_G0278741 | enables | GO:1990404 | protein ADP-ribosylase activity | PMID:28252050 | ECO:0000314 | | 20180907 | dictyBase | occurs_at(SO:0001454 amino_acid ),occurs_at(SO:0100014 n_terminal_region) | |
I don't think these are correct. What are you trying to express?
It was reported in PMID:28252050 that the Histone H2B is being ADP-ribosylated at the N-terminus at a specific glutamic acid and I annotated that in P2GO to SO terms.
Just read above: Do I need to change these to has_input?
updated the two others to occurs_in plasma membrane.
Hi @pfey03 If I understand correctly, I think you want to use 'has_input' Histone 2B. But the SO terms are not appropriate; it's not in scope to capture the general region on the target protein (N-term/C-term).
OK ?
Thanks, Pascale
@pgaudet Ah, so the detail of where on the target protein the modification happens is not wanted, is that right? So I have to delete those SO extensions? Why are those allowed then, if not for MF?
Thanks, Petra
ok, I now deleted the two SO extensions I had to 'protein ADP-ribosylase activity' for gene adprt1A.
Thanks @pfey03
Note there are 960 transcription regulator activity (or child terms) annotations with occurs_at SO https://www.ebi.ac.uk/QuickGO/annotations?extension=occurs_at(SO&goId=GO:0140110&goUsageRelationships=is_a,part_of,occurs_in&goUsage=descendants
Only 6 of these are to transcription coactivator/repressor activity GO:0003713 GO:0003714 https://www.ebi.ac.uk/QuickGO/annotations?extension=occurs_at(SO&goId=GO:0003714,GO:0003713&goUsageRelationships=is_a,part_of,occurs_in&goUsage=descendants. All of these are UCL annotations. Following the discussions above I will change these to occurs_in.
The remaining annotation are all to GO:0000981 DNA-binding transcription factor activity, RNA polymerase II-specific or child terms
There are no transcription regulator activity (or child terms) annotations that use the AE relation 'occurs_in' https://www.ebi.ac.uk/QuickGO/annotations?extension=occurs_in(SO&goId=GO:0140110&goUsageRelationships=is_a,part_of,occurs_in&goUsage=descendants
Please confirm that the following action should be taken:
For all GO:0000981 DNA-binding transcription factor activity, RNA polymerase II-specific (or child term) annotations: remove the AE occurs_at(SO:xxx) and keep has_input(ID of gene product regulated). What IDs should be listed here?
All GO:0000981 DNA-binding transcription factor activity, RNA polymerase II-specific annotations should also have an annotation to GO:1990837 sequence-specific double-stranded DNA binding (or child term), which can have the AE occurs_in(SO:xxx) and has_input(ID of gene product associated with the SO term) (what IDs should be listed here?) and part_of GO:0000981 DNA-binding transcription factor activity, RNA polymerase II-specific.
If this is what is expected, then I think some sort of computational approach should be taken to create this, i.e.:
a) check that all GO:0000981 DNA-binding transcription factor activity, RNA polymerase II-specific (and child term) annotations also have an annotation to GO:1990837 sequence-specific double-stranded DNA binding (or child terms)
b) as above, but check that the SO term captured in the dbTF annotation is also captured in the DNA binding annotation
c) for any dbTF annotations without a DNA binding annotation, create these
d) for any dbTF annotations with occurs_at(SO:xxx) information but without this information captured in the DNA binding annotation, create these
e) delete the occurs_at(SO:xxx) information from the dbTF annotations
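The checks in (a) to (e) could be sketched roughly as follows. This is only an illustration, not an existing GO tool: the `Annotation` record, the `is_dbtf` / `is_dna_binding` flags (standing in for a real ontology lookup of GO:0000981 and GO:1990837 descendants), the function names, and the choice to pair annotations by gene alone are all assumptions.

```python
import re
from dataclasses import dataclass, field

# Matches any relation(SO:nnnnnnn optional label) and captures relation and SO ID.
SO_EXT = re.compile(r"(\w+)\((SO:\d+)[^)]*\)")


@dataclass
class Annotation:
    gene: str
    go_id: str
    extensions: list = field(default_factory=list)
    is_dbtf: bool = False         # assumed flag: GO:0000981 or a child term
    is_dna_binding: bool = False  # assumed flag: GO:1990837 or a child term


def so_ids(ann):
    """Return the set of SO IDs mentioned in an annotation's extensions."""
    ids = set()
    for ext in ann.extensions:
        ids.update(so_id for _relation, so_id in SO_EXT.findall(ext))
    return ids


def review(annotations):
    """Return (dbTF annotations lacking a DNA binding annotation,
    dbTF annotations whose SO terms are missing on the DNA binding side)."""
    # Collect the SO IDs seen on DNA binding annotations, per gene.
    binding = {}
    for ann in annotations:
        if ann.is_dna_binding:
            binding.setdefault(ann.gene, set()).update(so_ids(ann))

    missing_binding, missing_so = [], []
    for ann in annotations:
        if not ann.is_dbtf:
            continue
        if ann.gene not in binding:
            missing_binding.append(ann)   # checks (a)/(c)
        elif so_ids(ann) - binding[ann.gene]:
            missing_so.append(ann)        # checks (b)/(d)
    return missing_binding, missing_so


def strip_occurs_at_so(ann):
    """Step (e): drop occurs_at(SO:...) extensions from a dbTF annotation."""
    ann.extensions = [e for e in ann.extensions
                      if not e.startswith("occurs_at(SO:")]
```

Steps (c) and (d) are left as lists for a curator or script to act on, since creating the new DNA binding annotations needs evidence and reference details that this sketch does not model.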
Also: For annotations with pattern: occurs_at GO:C = change to occurs_in; is this final decision, if so could we get Alex to change all 167 of ours? https://www.ebi.ac.uk/QuickGO/annotations?extension=occurs_at(GO:&assignedBy=ARUK-UCL,Alzheimers_University_of_Toronto,BHF-UCL,HGNC-UCL,ParkinsonsUK-UCL
Also: For annotations with pattern: occurs_at GO:C = change to occurs_in; is this final decision, if so could we get Alex to change all 167 of ours? https://www.ebi.ac.uk/QuickGO/annotations?extension=occurs_at(GO:&assignedBy=ARUK-UCL,Alzheimers_University_of_Toronto,BHF-UCL,HGNC-UCL,ParkinsonsUK-UCL
I've reviewed all ARUK-UCL while checking 'membrane raft' annotations. So ARUK are done. But I haven't done the BHF or Parkinson's annotations.
The only ones left at UniProt are these promoter element binding ones. I thought we were supposed to be able to use has_input with these? But we can't in P2GO, and it looks like PomBase hasn't updated them either (at least not in the one example I am looking at, 18059475 @ValWood)
F5HGI6 sequence-specific DNA binding occurs_at HSE
J9VHZ9 sequence-specific DNA binding occurs_at HSE
P10961 sequence-specific DNA binding occurs_at HSE
Q5KMX8 sequence-specific DNA binding occurs_at HSE
J9VE33 RNA polymerase II cis-regulatory region sequence-specific DNA binding occurs_at CDRE_motif
P03069 RNA polymerase II cis-regulatory region sequence-specific DNA binding occurs_at AP_1_binding_site
I don't understand what we are supposed to use
For annotations with pattern: occurs_at SO promoter, enhancer, etc: remove ? or change to the corresponding GO term
but we are referencing specific promoters here, we don't want to add SO terms for specific promoters?
I just checked: we use "occurs_at". We have tonnes of examples: https://www.pombase.org/term/GO:0000978 This is how we link binding motifs to transcription factors?
We wanted to create GO terms for the specific motifs. The estimate was that there were about 50-100-ish, and this was better than using extensions.
@colinlog @thomaspd
Really? This seems to be a great place to use extensions? The number will eventually be the number of transcription factors, so ~1500 for human?
I.e. this is something GO could outsource (and these are sequence features so they belong in SO?)
It's quite hard to tell from reading this issue what has been decided and what is still an open discussion. I think the initial statement in the first comment about it being the same as occurs_in is not right.
This is what I think we agree on:
I think that leaves only one option, continuing to post-compose with SO, using a distinct relation.
So is the decision to use a distinct relation from occurs_at? If there is not a specific proposal then maybe there is no action required here?
RE:
for TF activity, we need a separate relation from has-input, to distinguish from regulated gene
remember that DNA-binding TFs need 2 MF annotations a DNA binding activity, and a 'DNA-binding transcription factor activity (GO:0003700)' which describes 'regulation of specific gene sets'.
so the gene is an extension to the "DNA-binding transcription factor activity" term, and the sequence motif is an extension to the DNA binding term.
e.g. Atf1 https://www.pombase.org/gene/SPBC29B5.01 Ace 2 https://www.pombase.org/gene/SPAC6G10.12c Prz1 https://www.pombase.org/gene/SPAC4G8.13c Cuf1 https://www.pombase.org/gene/SPAC31A2.11c Pcr1 https://www.pombase.org/gene/SPAC21E11.03c Fkh2 https://www.pombase.org/gene/SPBC16G5.15c
so do we need a separate relationship to has_input?
I thought we all agreed long ago that GO is not the place to keep track of the specific gene targets of the TFs, because we'll end up with a crazy proliferation of annotations.
Yes, in P2GO I had SO extensions that I had to take out, and after discussions it was decided that this is too detailed for GO annotations.
I don't recall this discussion and we have always connected the promoter motif to the TF DNA binding site using SO IDs where they are known.
I am happy to filter these from the GO submission, but we would definitely continue to capture this information locally (In PomBase ) in this way, because we have no other mechanism to connect transcription factors to their target motifs and this information is useful for our users.
I have clearly been in parallel discussions though, because I remember being at a meeting where the GREEKC discussed this mechanism for connecting transcription factors to their target sequences, and so I wonder if this has been communicated to this group? Or maybe they have decided to do this another way. I am happy to follow whatever the consensus is on how to capture this connection in databases.
I concur with Val! The 'has_input' relation between molecular functions is sufficient when it comes to information flow. From gene product to gene product, and, when possible, from gene-product to cis-acting DNA or RNA sequences to the target gene and its encoded gene product(s).
GREEKC as I interpreted our task was to explore and devise efficient ways to make GO-CAM models that involve the cis-acting DNA and RNA sites, when they are described. This involves devising ontologically sound theoretical frameworks to do so. If the targets of miRNAs and dbTFs cannot be captured using GO vocabulary, there will be a stagnation of progress in the signaling pathways field as this (epigenetic regulation) is one of the current knowledge frontiers. We are perhaps slightly less in the dark as to why a cell is a macrophage, a liver cell or a neuron, but we are far from truly describing how they do this. GO has the advantage that it is not purely empirical, but aims to describe molecular functions assembled into biological processes that take place at given times in given bodyparts/cells. GO is logically built and stringent. There is a need to be able to truly break through the cis-acting part of the genome, to predict what activation of a dbTF or of a miRNA means beyond the bland statement 'And now gene expression patterns are changed and this is why the cell has a new behaviour'. What is different? How come it is different? What are the causal relations and what proteins embody the causal transitions?
Of course GO does not have the cataloguing of every miRNA target nor of every dbTF's targets in every cell as its main mission. However, GO can provide the theoretical elements, the building blocks and some of the tools, such as GO-CAMs, to empower much needed life sciences research, starting with permitting projection of existing 'experimentally demonstrated' knowledge which does include the target genes of dbTFs and miRNAs in the cell types where they are active, so as to join processes underlying cell differentiation. Let's flesh-out the paradigm of genetic information flow!
One last remark here: there will eventually be a move to high-throughput annotations that are performed computationally, starting with the many sophisticated versions of transcriptomic and proteomic analyses that are currently producing petabytes of data that lie dormant in databases after the researchers cherry-picked their data to make a human-readable publication. I am not proposing that we manually curate every high- or medium-throughput experiment that is out there. I am advocating that GO put down a logically sound and robust framework to eventually do this computationally.
I guess it becomes a problem for GO if we are capturing every binding site upstream of every gene (i.e position in the genome). I can see why GO would not want these as extensions because they aren't contributing to the gene-specific causal model, but could potentially add a lot of extensions to the DNA binding term. At PomBase I am only recording the actual binding site motif, not the position (at present).
I guess we could migrate these to a different data type in a different section of the gene page.
I'm not even sure if SO want to represent every TF binding motif, although they have added the ones PomBase have requested so far. (Our plan was to create sequence features for the ones that have been characterised.)
Useful queries in QuickGO:
785 annotations Occurs_at https://www.ebi.ac.uk/QuickGO/annotations?extension=occurs_at&evidenceCode=ECO:0000269&evidenceCodeUsage=descendants
680 annotations Occurs_at SO ID https://www.ebi.ac.uk/QuickGO/annotations?extension=occurs_at(SO:&evidenceCode=ECO:0000269&evidenceCodeUsage=descendants
535 annotations DNA-binding transcription factor activity Occurs_at https://www.ebi.ac.uk/QuickGO/annotations?extension=occurs_at&evidenceCode=ECO:0000269&evidenceCodeUsage=descendants&goId=GO:0003700&goUsageRelationships=is_a,part_of,occurs_in&goUsage=descendants
all of which use SO IDs
124 annotations DNA-binding Occurs_at https://www.ebi.ac.uk/QuickGO/annotations?extension=occurs_at&evidenceCode=ECO:0000269&evidenceCodeUsage=descendants&goId=GO:0003677&goUsageRelationships=is_a,part_of,occurs_in&goUsage=descendants
119 annotations DNA-binding Occurs_at SO https://www.ebi.ac.uk/QuickGO/annotations?extension=occurs_at(SO:&evidenceCode=ECO:0000269&evidenceCodeUsage=descendants&goId=GO:0003677&goUsageRelationships=is_a,part_of,occurs_in&goUsage=descendants
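For anyone who wants to regenerate links like these for other terms, a small helper along the following lines may be convenient. It is a sketch that only rebuilds the browser URLs shown above; the parameter names are copied from those links, while the function name and its arguments are hypothetical, and nothing here is meant to describe the QuickGO web service API.

```python
from urllib.parse import urlencode

QUICKGO_ANNOTATIONS = "https://www.ebi.ac.uk/QuickGO/annotations"


def quickgo_url(extension, go_id=None, evidence_code=None):
    """Build a QuickGO annotation-browser link filtered by extension text."""
    params = {"extension": extension}
    if evidence_code:
        params["evidenceCode"] = evidence_code
        params["evidenceCodeUsage"] = "descendants"
    if go_id:
        params["goId"] = go_id
        params["goUsageRelationships"] = "is_a,part_of,occurs_in"
        params["goUsage"] = "descendants"
    # Leave ":", "(" and "," unescaped so the result matches the hand-written links.
    return QUICKGO_ANNOTATIONS + "?" + urlencode(params, safe=":(,")


# Example: DNA binding (GO:0003677) annotations with occurs_at(SO: extensions.
print(quickgo_url("occurs_at(SO:", go_id="GO:0003677",
                  evidence_code="ECO:0000269"))
```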
Suggest that the following occurs_at information is not useful, so delete:
occurs_at(SO:0000165) enhancer
occurs_at(SO:0001952) promoter_flanking_region
occurs_at(SO:0005836) regulatory_region
GO convention is that DNA-binding transcription factor activity (and child terms) will have annotation extensions that specify the gene regulated (using has_input UniProt/MOD ID); DNA binding (and child terms) can specify the target motif using has_input (SO motif ID).
Thus annotations to DNA-binding transcription factor activity and child terms should have all the Occurs_at information deleted. Ideally these will have AE information (or GO-CAM) added that describes the gene targeted by the dbTF using has_input UniProt/MOD ID.
Annotations to GO:0000976 transcription cis-regulatory region binding should use has_input of the SO motif target.
All existing more specific GO terms, such as GO:0044323 retinoic acid-responsive element binding, will be obsoleted and replaced by GO:0000978 RNA polymerase II cis-regulatory region sequence-specific DNA binding, and people can add the has_input SO term to these annotations.
For the 100 annotations that have occurs_at(GO:CC), these should be changed to occurs_in(GO:CC).
For any remaining annotations that these suggestions do not cover, please provide information here if you are not sure what to do.
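To make the suggestions above concrete, here is a rough sketch of how they might be applied to a single annotation extension field. It assumes the caller already knows which kind of GO term the annotation is to (the `term_kind` argument is a hypothetical stand-in for an ontology lookup), and it reflects only the suggestions as listed in this comment, which are still open to discussion. Anything the suggestions do not cover is left unchanged for manual review.

```python
import re

# One extension looks like relation(ID optional label).
EXT = re.compile(r"(\w+)\(([^)]*)\)")

# SO terms suggested above as not useful in occurs_at extensions.
UNINFORMATIVE_SO = {"SO:0000165", "SO:0001952", "SO:0005836"}


def migrate(extension_field, term_kind):
    """Rewrite one comma-separated annotation extension field.

    term_kind (assumed input): "dbtf" for DNA-binding transcription factor
    activity terms, "dna_binding" for DNA binding terms, anything else otherwise.
    """
    kept = []
    for relation, target in EXT.findall(extension_field):
        target = target.strip()
        target_id = target.split()[0] if target else ""
        if relation == "occurs_at":
            if target_id.startswith("GO:"):
                relation = "occurs_in"      # occurs_at(GO:CC) -> occurs_in(GO:CC)
            elif target_id.startswith("SO:"):
                if term_kind == "dbtf" or target_id in UNINFORMATIVE_SO:
                    continue                # delete the occurs_at(SO) extension
                if term_kind == "dna_binding":
                    relation = "has_input"  # motif becomes the input of binding
        kept.append(f"{relation}({target})")
    return ",".join(kept)


# Example from this thread: a dbTF annotation loses its occurs_at(SO) extension.
print(migrate("has_regulation_target(MGI:MGI:1202709),occurs_at(SO:0000165 enhancer)",
              "dbtf"))
```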
So just to check, do we exclude all SO terms from extensions? i.e. all of the specific TF binding sites? (these are descendants of (SO:0005836) regulatory_region)
"So just to check, do we exclude all SO terms from extensions? i.e. all of the specific TF binding sites? (these are descendants of (SO:0005836) regulatory_region) "
A visual representation of the re-factored Sequence Ontology terms for GO (via GREEKC) can be found at PMID: 34425241. I am trying to add the figure just below this comment.
Hi, I do not believe it is smart to conflate promoter and enhancer. Can this be discussed live, rather than through announcements made in reply to questions? I ask because I imagine that some annotators invested careful work in these distinctions, and jettisoning them would be a pity for the future.
Furthermore, it is also not clear what happens in the GO-CAM, so that is not a panacea either. I understand the drive to only annotate the target gene, e.g. p53 drives p21, without knowing where the p53 binding sites in the p21 gene are and without knowing or stipulating whether those places are the promoter or an enhancer. However, that should not prohibit deeper annotations. I myself have been waiting for the possibility to clone GO-CAM models somewhere other than the Dev site to move on with this work.
Altogether, the future I see is that we have at least 3 SO terms we can use (one parent and 3 children) and that we do not close the door on these terms' descendants. Nuclear chromosome looping and protein-driven phase transitions will become more and more researched and therefore annotated. What do you think? Colin
On Tue, Jul 26, 2022 at 11:24 AM Val Wood @.***> wrote:
So just to check, do we exclude all SO terms from extensions? i.e. all of the specific TF binding sites? (these are descendants of (SO:0005836) regulatory_region)
Hello,
Occurs_at is not in the Relations Ontology, and its intended usage is the same as 'occurs_in', therefore we will deprecate occurs_at. Annotations should be reviewed and moved as suggested, if possible:
https://docs.google.com/spreadsheets/d/1LMsw75fwfKBi5H-BFt3opKVV1VuEcT7HUE34bHtOSq4/edit#gid=0
Please write a comment if these suggestions don't work for some of your annotations.
Impacts:
AgBase Alzheimers_University_of_Toronto ARUK-UCL BHF-UCL CAFA dictyBase GO_Central MGI NTNU_SB ParkinsonsUK-UCL PomBase SGD UniProt