geneontology / go-ontology

Source ontology files for the Gene Ontology
http://geneontology.org/page/download-ontology
Creative Commons Attribution 4.0 International

Availability of Sequence Ontology Terms in Noctua (go-lego) #20419

Open jesualdotomasfernandezbreis opened 5 years ago

jesualdotomasfernandezbreis commented 5 years ago

Hi.

Would it be possible to have the Sequence Ontology terms in NOCTUA, so we could use them in the models?

Thanks, Jesualdo

pgaudet commented 5 years ago

Can the priority of this ticket be increased? If we have the resources, of course.

cmungall commented 5 years ago

Hi @jesualdotomasfernandezbreis, were you there for Karen's presentation on MSO at the GREEKC meeting? We're currently discussing SO vs MSO (but don't want to block any progress or work you need to do). Do you and other GREEKC participants have any opinions or requirements for this decision?

jesualdotomasfernandezbreis commented 5 years ago

In GREEKC we are currently using SO, but we will use MSO. We will migrate once GO integrates MSO. Kimberly van Auken explained to us how to use terms not provided by Noctua in the models, so we can currently make progress in our work.

cmungall commented 5 years ago

See also: https://github.com/geneontology/noctua/issues/561

pgaudet commented 5 years ago

@jesualdotomasfernandezbreis Are the terms you need already in MSO? Perhaps we should load MSO directly?

Pascale

jesualdotomasfernandezbreis commented 5 years ago

We are currently using the SO terms, but they will have to be replaced by the corresponding MSO terms. For such a migration we agreed to wait until the SO terms are replaced by the MSO ones in the GO. But if the MSO terms become available in Noctua we could start using them.

Best, Jesualdo

vanaukenk commented 3 years ago

From discussion at 2020-11-17 GO-CAM Jamboree.

The GREEKC group would like more SO terms available to them in Noctua.

Is the MSO/SO distinction still important for GO-CAMs?

If not, can we add all of SO to neo?

If yes, does anyone know what the current status of the MSO vs SO work is?

@colinlog

colinlog commented 3 years ago

  1. As far as I understand, most if not all of the SO terms GREEKC wants to add should be considered continuants, so the RO relations applicable to continuants can be used. David Sant, Michael Sinclair, Chris Mungall, Ruth Lovering, Colin Logie and Karen Eilbeck are finalizing a manuscript that provides a minimal short list of the SO terms GREEKC wishes to use for logical reasoning about transcription regulation -- in case GO wishes to allow only a selection of SO terms in the Noctua tool.
  2. The protein parts of SO would also be cool to have, among other things to be able to (optionally) draw causal relations based on post-translational modification of specifically indicated residues of a protein (isoform). Having listened yesterday to @ValWood on her modelling of the involvement of MAP kinase cascades and TORC in the regulation of cell cycle progression, I believe that SO terms may be most useful for such relations, where "positively regulates" versus "negatively regulates" is a case-by-case, residue-by-residue matter that cannot simply be mapped onto the phosphorylated/unphosphorylated substrate dichotomy. Of course, for gene regulation too, and for the histone proteins in particular, being able to name residues and their modifications is paramount: H3K4me3 is radically different from H3K9me3, to name just two examples of more than 250 possible post-translational modifications of the histones.
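
To make the residue-by-residue point concrete, here is a minimal sketch of how a mark label such as H3K4me3 breaks down into histone, residue, position and modification. The regular expression and the `HistoneMark` type are hypothetical and cover only simple labels of this shape; they are not taken from SO, GO or Noctua.

```python
import re
from dataclasses import dataclass

# Hypothetical pattern: histone name, one-letter residue, position, modification.
# Covers only simple labels such as H3K4me3; histone variants are not handled.
MARK_RE = re.compile(r"^(H[1-4][AB]?)([A-Z])(\d+)(me[1-3]|ac|ph|ub)$")


@dataclass(frozen=True)
class HistoneMark:
    histone: str       # e.g. "H3"
    residue: str       # one-letter amino acid code, e.g. "K"
    position: int      # residue position, e.g. 4
    modification: str  # e.g. "me3"


def parse_mark(label: str) -> HistoneMark:
    """Split a label like 'H3K4me3' into its four components."""
    match = MARK_RE.match(label)
    if match is None:
        raise ValueError(f"unrecognised histone mark: {label!r}")
    histone, residue, position, modification = match.groups()
    return HistoneMark(histone, residue, int(position), modification)


if __name__ == "__main__":
    k4 = parse_mark("H3K4me3")
    k9 = parse_mark("H3K9me3")
    print(k4)
    print(k9)
    # Same histone and same modification, but a different residue position.
    print(k4.position, k9.position)
```

Once the residue is an explicit field, the two marks are distinct objects that differ only in position, which is the granularity the comment asks for.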
balhoff commented 3 years ago

I'm going to move this issue to the ontology repo, since that's where this change would be made (in production of "go-lego").

vanaukenk commented 3 years ago

Thank you @balhoff !

davidwsant commented 3 years ago

Hello all,

SO should be sufficient for the purposes of GO and will contain the most recent updates.

The member that created MSO has been in a new position for a couple of years, and MSO has not really been maintained by the SO branch. All updates have been added only to SO on the SO GitHub page, not to MSO.

Best,

Dave Sant

pgaudet commented 3 years ago

Hi,

I am working with Colin to add new SO terms as logical definitions for GO terms. Once this is done we'll re-evaluate the need to include SO in Noctua.

Thanks, Pascale

pgaudet commented 3 years ago

These are the terms of interest from SO; first I'll add them to the imports:

SO:0000727 CRM (cis-regulatory module) (is a)
SO:0000167 promoter (is a)
SO:0000170 RNApol II promoter (is a)
SO:0001669 RNApol_II_core_promoter (has part)
SO:0001240 TSS region (is a)
SO:0000315 TSS (has part) binds GO:0016591 who does GO:0001055 and GO:0003899 for GO:0006366
SO:0000165 enhancer (is a)
SO:0000625 silencer (is a)
SO:0000627 insulator (is a)
SO:0002307 DNA_loop (is a)
SO:0002308 DNA_loop_anchor (has part)
SO:0002304 topologically_associated_domain (is a)
SO:0002305 topologically_associated_domain_boundary (has part)
SO:0001720 epigenetically_modified_region (is a)
SO:0000305 modified_DNA_base
SO:0001700 histone_modification

SO:0000235 TF binding site (has part) binds GO:0003700 and GO:0000981, who bind GO:0003712
SO:0000713 DNA motif (has part)

Pull request for this list: https://github.com/geneontology/go-ontology/pull/20576
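
A term list like the one in this comment mixes SO and GO identifiers with free text. As a minimal sketch, assuming an import step only needs a flat list of identifiers (how go-ontology actually builds its imports is not shown in this thread), the IDs can be pulled out with a regular expression. `extract_ids` is a hypothetical helper, and `TERM_LIST` is a three-line excerpt of the list.

```python
import re
from typing import List

# Three-line excerpt of the term list from this comment.
TERM_LIST = """
SO:0000167 promoter (is a)
SO:0000315 TSS (has part) binds GO:0016591 who does GO:0001055 and GO:0003899 for GO:0006366
SO:0000235 TF binding site (has part) binds GO:0003700 and GO:0000981, who bind GO:0003712
"""


def extract_ids(text: str, prefix: str) -> List[str]:
    """Return PREFIX:NNNNNNN identifiers in order of first appearance, deduplicated."""
    pattern = re.compile(rf"\b{re.escape(prefix)}:\d{{7}}\b")
    # dict.fromkeys keeps insertion order while dropping repeats.
    return list(dict.fromkeys(pattern.findall(text)))


if __name__ == "__main__":
    for term_id in extract_ids(TERM_LIST, "SO"):
        print(term_id)
```

Calling `extract_ids(TERM_LIST, "GO")` on the same text gives the GO terms that the SO terms are said to relate to.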

ValWood commented 3 years ago

Note that PomBase uses a lot of SO IDs in extensions for "RNA polymerase II cis-regulatory region sequence-specific DNA binding". We have used all of these: https://www.pombase.org/browse-curation/dna-binding-sites and hope to continue to do so.

Since this is the only way we have to connect a transcription factor to a binding site on the transcription factor gene pages, and it is useful information for our users, we will still do this in PomBase and filter the extensions for submission to GO if it is disallowed.

But many of the terms in the list above seem likely to be redundant with GO terms.

pgaudet commented 3 years ago

Hi @ValWood

With @thomaspd we had proposed to instantiate all those in GO (as x motif binding); would you prefer to keep them as extensions?

colinlog commented 3 years ago

Hi @ValWood @davidwsant: Judging from PomBase's use of SO terms, it looks like SO could also host the currently known specificities of the monomeric or homomultimeric DNA-binding sequence motifs of human dbTFs. There are fewer than one thousand of these, mapping to more than one thousand human dbTFs, but they may be relevant to tens of thousands of dbTFs across the phylogenetic tree. GREEKC has contacts with the researchers who could feed such an annotation into SO, namely the authors of the Catalogue manuscript https://www.biorxiv.org/content/10.1101/2020.10.28.359232v2. Hence proteins and their annotation in GO terms would include SO:term entries that can be linked to DNA position weight matrices. One other GREEKC authority to consult, if SO wishes to do that, is perhaps Philipp Bucher?

In the nitty gritty spirit, a mapping of the existing SO:motif and SO:element entries to the available human motifs could be performed? Forkhead to name one example. But that is not strictly necessary if all the new entries are in first instance specified as human? @davidwsant: How does SO envisage species?

Just to be clear: the annotation exercise for GO-Noctua models would concern experimentally / biochemically determined chromosomal binding sites linking to one or multiple genes. For that, a placeholder for the DNA material entity is needed in the form genomebuild:chromosome:start-end. This is wholly independent of the DNA sequence specificity specification notion that the concepts ‘motif’ and ‘element’ encompass. Nevertheless, some researchers have elucidated both the genomic position for a transcription regulator and the local chromosomal DNA sequence corresponding to a motif instance and capturing this is an ambition for many GREEKC use cases.
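
As a minimal sketch of such a placeholder, the genomebuild:chromosome:start-end form can be parsed into a small record. The `GenomicInterval` type, the `parse_interval` helper and the example string are hypothetical; the thread does not specify the coordinate convention (0-based or 1-based), so none is assumed here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GenomicInterval:
    build: str       # genome build, e.g. "GRCh38"
    chromosome: str  # e.g. "chr17"
    start: int
    end: int

    def __str__(self) -> str:
        return f"{self.build}:{self.chromosome}:{self.start}-{self.end}"


def parse_interval(text: str) -> GenomicInterval:
    """Parse a 'genomebuild:chromosome:start-end' placeholder string."""
    try:
        build, chromosome, span = text.split(":")
        start_text, end_text = span.split("-")
        start, end = int(start_text), int(end_text)
    except ValueError as exc:
        raise ValueError(
            f"expected genomebuild:chromosome:start-end, got {text!r}"
        ) from exc
    if start > end:
        raise ValueError(f"start {start} is greater than end {end}")
    return GenomicInterval(build, chromosome, start, end)


if __name__ == "__main__":
    # Made-up coordinates, for illustration only.
    site = parse_interval("GRCh38:chr17:1000-2000")
    print(site)
    print(site.end - site.start)
```

Keeping the interval separate from any motif or element term mirrors the point above: the location of a binding site and its sequence specificity are independent pieces of information.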

What strikes me is that PomBase is using SO terms to provide a non-amino-acid, encyclopedic definition of the genomic binding sites involved, simply by linking gene entries to granular SO terms. Most of the human dbTFs could have something like this too, as the ‘motif’ / ‘element’ information is often known for the simplest biochemical interaction: pure DNA and one pure recombinant dbTF protein. Alongside this, other motif types, derived from ChIP-seq experiments, are available for many human dbTFs (heteromeric complexes), and databases for these exist too.

On the epigenetic side, accessibility, the DNA base methylation status of the sequences and nucleosome modification are also biologically important. These have been studied and documented too and could enter Noctua-based high-throughput annotation, which is why the small selection of SO terms above was requested.

Ultimately, SO:term instances of genomebuild:chromosome:start-end will enable computational input and reasoning across GO’s universe of annotations when transcription regulator activities are considered. If, additionally, the SO: description includes DNA motif / element information, as it currently does for pombe, that would be a nice windfall for GO, wouldn’t it?

ValWood commented 3 years ago

> With @thomaspd we had proposed to instantiate all those in GO (as x motif binding); would you prefer to keep them as extensions?

I would prefer to keep them as extensions. It is a big overlap with SO otherwise. @mah11

ValWood commented 3 years ago

and it's 1000 terms for human...

davidwsant commented 3 years ago

Hi guys,

As TF binding sites are DNA motifs, I think it makes sense to add the annotations to SO.

I think we would not separate terms by species, or at least I have not done this in the past. For example, bacterial rRNA terms and eukaryotic rRNA terms both fall under the same parent. I think this would hold true for the transcription factors as well. Currently I can find one yeast term, pheromone_response_element (http://sequenceontology.org/browser/current_svn/term/SO:0002045), which is_a TF_binding_site (http://sequenceontology.org/browser/current_svn/term/SO:0000235). A sister term of this is retinoic_acid_responsive_element (http://sequenceontology.org/browser/current_svn/term/SO:0001653), which is present in humans.

I looked at the link from the Catalogue manuscript (https://www.ebi.ac.uk/QuickGO/targetset/dbTF). It looks like the link has the names of all of the transcription factors, but I do not see the consensus sequences. It has 1,429 TFs listed, and it appears as though they are all human. I will need to get the consensus sequences for these to include in the definitions.

I also have a question about the names for the terms. How about something that is listed as ESR1? Would ESR1_binding_motif be a child of TF_binding_site, or would we include it in estrogen_response_element? I would like some input from others on that.
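
The two placements in this question can be sketched as small is_a maps. This is only an illustration: the two existing child terms are the ones named in this comment, while ESR1_binding_motif and an estrogen response element term are the hypothetical additions under discussion, and the real parents of TF_binding_site are left out.

```python
from typing import Dict, List, Optional

# is_a edges as described in this comment; TF_binding_site is treated as the
# root here, although it has parents of its own in SO.
BASE: Dict[str, Optional[str]] = {
    "TF_binding_site": None,
    "pheromone_response_element": "TF_binding_site",
    "retinoic_acid_responsive_element": "TF_binding_site",
}

# Option A (hypothetical): the new term sits directly under TF_binding_site.
OPTION_A = {**BASE, "ESR1_binding_motif": "TF_binding_site"}

# Option B (hypothetical): the new term is grouped under a response element term.
OPTION_B = {
    **BASE,
    "estrogen_response_element": "TF_binding_site",
    "ESR1_binding_motif": "estrogen_response_element",
}


def children(is_a: Dict[str, Optional[str]], parent: str) -> List[str]:
    """Direct children of a term, sorted by name."""
    return sorted(term for term, p in is_a.items() if p == parent)


def ancestors(is_a: Dict[str, Optional[str]], term: str) -> List[str]:
    """Chain of parents from a term up to the root."""
    chain: List[str] = []
    parent = is_a[term]
    while parent is not None:
        chain.append(parent)
        parent = is_a[parent]
    return chain


if __name__ == "__main__":
    print(children(OPTION_A, "TF_binding_site"))
    print(children(OPTION_B, "TF_binding_site"))
    print(ancestors(OPTION_B, "ESR1_binding_motif"))
```

Under option A every new motif term becomes another direct child of TF_binding_site; under option B the number of direct children grows only with the number of grouping terms.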

Colin, what is your take?

Thanks, Dave

RLovering commented 3 years ago

Hi Dave, this does sound like a major undertaking, and it is likely to lead to a very flat ontology under TF_binding_site. Unfortunately my understanding of binding sites/motifs is quite limited, so any grouping that could be done would be preferable. The UniProt record for ESR2 (https://www.uniprot.org/uniprot/Q92731) states that ESR2 activates expression of reporter genes containing estrogen response elements (ERE) in an estrogen-dependent manner. This suggests that ESR1 and ESR2 both bind the same element, so specific SO terms for each of ESR1 and ESR2 would not be required.

I think that Arttu Jolma (https://www.sciencedirect.com/science/article/pii/S0092867412014961) would be a good person to discuss this with.

Best

Ruth

colinlog commented 3 years ago

This is no small enterprise. It would perhaps take a couple of months to get everyone to agree (and to disagree in part). However, if there is SO commitment we can explore this with the GREEKC experts. The particular data set I had in mind is the set of specificities for human dbTFs exposed individually to DNA, which is not as such a response element like, for example, the estrogen response element. It may be better described as the 'ESR1 motif', and is a biochemically defined object, namely 'the DNA sequences this individual dbTF binds well'. In humans, however, many dbTFs form heterodimers, and the natural chromosomal response elements therefore tend to include two binding sites, one for each subunit of the dimeric protein complex.

davidwsant commented 3 years ago

I agree with what Ruth said about it being a very flat ontology under TF_binding_site. I would prefer that they not all be listed under a single term.

I agree, this sounds like a very large undertaking. I do think getting some help from some experts would be a good idea. Colin, you mentioned getting help from the GREEKC experts. Do you think they would be willing to help even though GREEKC is ending?

ValWood commented 3 years ago

How many binding sites are currently known? I'm presuming that this info is currently available only for a subset of transcription factors (so even though it is a large undertaking, additions will be sporadic once the known sites are included).

davidwsant commented 3 years ago

I believe the latest version of the ENCODE project includes information from ChIP-seq experiments with hundreds of different transcription factors across several cell types. I have been looking through the data, and it looks like K562 cells have 628 different transcription factor experiments, but some of them are replicates (like POLR2A, MYC and JUN). Here is the link where you can see the different experiments studying DNA-binding proteins through ChIP-seq. I am not particularly familiar with the data here or how to access it, but I believe they have a way to download the locations of called peaks for each transcription factor for each cell type studied. I know the pipeline I have previously used for analyzing ChIP-seq was the pipeline they developed in this consortium (called the irreproducible discovery rate, IDR). While this is definitely still a subset, if it has even 500 transcription factors that would be a great deal.

colinlog commented 3 years ago

I believe one thing should be done at a time. Providing the dbTF intrinsic DNA binding motif is feasible. But, the very many ChIP-seq datasets are a whole different matter altogether because the chromosomal binding sites for dbTF vary in their occupancy within one cell type as a function of environmental/cell culture conditions and between cell lineages. Those are very much a matter of study. Can SO host the coordinates for the 1500 human dbTF genomic binding sites in all the different human cell types? Is such a thing desirable when ENCODE already has all this information available? I think not. What SO can provide is a controlled vocabulary in the form of terms that make precise operations possible.

However, the motifs that are bound by dbTFs are protein-specific, and they can be stored as position weight matrices that are equivalent to a consensus DNA binding site. While for ChIP motifs there is still much disagreement about what exactly should be rendered and how, for the in vitro motifs (pure DNA + pure individual dbTF protein) this is not contentious: the motif is 'absolute' and is available for more than 1000 dbTFs (I counted 1007 dbTFs in the current human dbTF Catalogue with such an associated motif).

The group of GREEKC experts that can help are Philipp Bucher and the authors of the dbTF Catalogue paper. As for their concrete contribution, I would let them (Ivan, Oriol, Arttu + Philipp and other GREEKC experts they see fit to include) discuss whether they want to do this (in January 2021?) and how to go about it. What GO and SO can do at this point is tell these experts that SO is committed to storing the resulting product, so that their efforts will not result in an ephemeral product.

@ValWood @davidwsant @RLovering The instances of dbTF DNA binding to the genome can ultimately be captured by GO-CAM-type annotations, which is why GREEKC needs SO terms, be they adopted as GO terms or as SO stand-alone terms. One major conceptual hurdle is that by itself, the ChIP-seq experiment does not provide proof that there is causality for gene regulation. Hence, not every ChIP-seq site can be labeled as a 'response element' while they can all be labelled as genomic binding sites. In my mind, the annotation of genomic dbTF binding sites and response elements on the genome is therefore complementary but orthogonal to the creation of generic SO motifs for each human dbTF in SO. It is the latter that can be done in the short term and the product is independent of heteromeric dbTF-dbTF and dbTF-cofactor interactions at the genomic binding sites. Can everybody see the distinction?

davidwsant commented 3 years ago

Hi Colin,

I agree, trying to add the individual locations of binding motifs would not be consistent with SO. That is actually not what I had in mind. Adding the motifs would be a possibility, however. I don't know if we can use the position weight matrices because the definitions in SO can't hold multiple dimensions. For RARE, for example, we just put the consensus sequence as PuGGTCA. Do you think this is fine, or do you think it would be better to do something like this?

A [12 0 1 0 0 22 4]
C [0 0 0 0 18 0 7]
G [11 23 13 1 3 1 5]
T [0 0 9 22 2 0 7]
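
The two representations in this question are related: a one-dimensional consensus can be derived from the count matrix. The sketch below collapses the matrix quoted in this comment into an IUPAC string, where R is the IUPAC code for a purine ("Pu"). The rule used, keeping every base that makes up at least 40% of a column, is an arbitrary illustrative choice and not an SO or ENCODE convention.

```python
from typing import Dict, List

# Count matrix quoted in this comment (rows: base, columns: motif position).
COUNTS: Dict[str, List[int]] = {
    "A": [12, 0, 1, 0, 0, 22, 4],
    "C": [0, 0, 0, 0, 18, 0, 7],
    "G": [11, 23, 13, 1, 3, 1, 5],
    "T": [0, 0, 9, 22, 2, 0, 7],
}

# IUPAC nucleotide codes, keyed by the bases they stand for (in ACGT order).
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "AG": "R", "CT": "Y", "CG": "S", "AT": "W", "GT": "K", "AC": "M",
    "CGT": "B", "AGT": "D", "ACT": "H", "ACG": "V", "ACGT": "N",
}


def consensus(counts: Dict[str, List[int]], min_fraction: float = 0.4) -> str:
    """Collapse a count matrix into an IUPAC consensus string.

    A base is kept at a position if it accounts for at least `min_fraction`
    of the counts in that column; if no base qualifies, the position is N.
    The 0.4 default is an assumption made for this example.
    """
    letters = []
    for i in range(len(counts["A"])):
        column = {base: counts[base][i] for base in "ACGT"}
        total = sum(column.values())
        kept = "".join(
            base for base in "ACGT"
            if total and column[base] / total >= min_fraction
        )
        letters.append(IUPAC.get(kept, "N"))
    return "".join(letters)


if __name__ == "__main__":
    print(consensus(COUNTS))
```

With this cutoff the first six positions come out as RGGTCA, matching the PuGGTCA consensus, and the seventh becomes N because no base dominates that column. The information dropped in the collapse, such as the T counts at the third position, is what only the matrix form would preserve.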

I agree that binding does not necessarily mean that it is regulating any gene. I think that labeling them as binding sites is probably a good call.

Dave