biolink / biolink-model

Schema and generated objects for biolink data model and upper ontology
https://biolink.github.io/biolink-model/

support a variety of non-gene sequence features using 'type' and a generic parent class #1210

Closed sierra-moxon closed 1 year ago

sierra-moxon commented 1 year ago

From @hitz:

What are your thoughts on objects like "a putative cis-regulatory element defined by a consensus chromatin accessibility peak, with or without supporting histone mark data", or "a segment of human genome DNA tested in an MPRA experiment to determine its effect on gene expression", or "a region targeted by guide RNAs in a CRISPR activation screen"? I suspect we would have to make some extension for biolink:GenomicEntity to cover non-Gene features.

And in addition, if we want to use biolink:GenomicEntity directly, we need to figure out how to move it out of the mixin hierarchy.

hitz commented 1 year ago

@sierra-moxon What about http://www.sequenceontology.org/browser/current_release/term/SO:0005836 "regulatory_region"? Is there a way (or even a need) to qualify this with "proposed" or "putative"? I suppose you can just assign evidentiary properties (e.g. "determined by chromatin accessibility / ATAC-seq / in a cell line like GM12878").

This would be a Class that uses biolink:GenomicEntity as a mixin.

hitz commented 1 year ago

There is also: http://www.sequenceontology.org/browser/current_release/term/SO:0002331 (accessible_dna_region) and http://www.sequenceontology.org/browser/current_release/term/SO:0000235 (tf_binding_site)

tf_binding_site ISA http://www.sequenceontology.org/browser/current_release/term/SO:0000235 which ISA regulatory_region (SO:0005836)

accessible_dna_region ISA epigenetically_modified_region (this seems wrong btw) which ISA regulatory_region (SO:0005836)

Do you think that when modeling a KG like this it's better to use the more general parent (so "all" items are findable without closure), or to use the most specific term and rely on the ontology graph to connect them?
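To make the trade-off concrete, here is a minimal Python sketch, not anything from Biolink or BioCypher. It assumes a toy, simplified is_a map built from the terms mentioned above (intermediate terms are omitted) and hypothetical node ids such as `mydb:1`. It compares an exact match on the type with a match over the is_a closure.

```python
from typing import Dict, List

# Toy, simplified is_a edges based on the terms named in this thread.
# Assumption: a real graph would load these from the Sequence Ontology.
IS_A: Dict[str, str] = {
    "tf_binding_site": "regulatory_region",
    "accessible_dna_region": "epigenetically_modified_region",
    "epigenetically_modified_region": "regulatory_region",
}


def ancestors(term: str) -> List[str]:
    """Return the term followed by its is_a ancestors, nearest first."""
    chain = [term]
    while chain[-1] in IS_A:
        chain.append(IS_A[chain[-1]])
    return chain


# Hypothetical nodes, each typed with the most specific term available.
NODES = [
    {"id": "mydb:1", "type": "tf_binding_site"},
    {"id": "mydb:2", "type": "accessible_dna_region"},
    {"id": "mydb:3", "type": "regulatory_region"},
]


def find_by_type(nodes: List[Dict[str, str]], wanted: str, use_closure: bool) -> List[str]:
    """Find node ids by exact type, or by type plus ancestors when use_closure is set."""
    if use_closure:
        return [n["id"] for n in nodes if wanted in ancestors(n["type"])]
    return [n["id"] for n in nodes if n["type"] == wanted]


exact = find_by_type(NODES, "regulatory_region", use_closure=False)
closed = find_by_type(NODES, "regulatory_region", use_closure=True)
print(exact)   # only the node typed directly as regulatory_region
print(closed)  # all three nodes, found through the is_a closure
```

With specific terms, a query for "regulatory_region" finds everything only if the consumer computes the closure. Typing every node with the general parent avoids that step but drops the finer distinction.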

hitz commented 1 year ago

What if we just use BioCypher to implicitly subclass BiologicalEntity? Then I wouldn't have to actually submit a PR for this ticket...

hitz commented 1 year ago

@sierra-moxon does my comment make any sense?

sierra-moxon commented 1 year ago

It does, but if you extend in BioCypher, it's a precursor to a PR in Biolink, right?

What if you just used the Biolink class 'biolink:NucleicAcidEntity' and added the node property 'biolink:type' to hold a more specific SO term of your choosing for each sequence feature that you need (this was @cmungall's original idea)?

I took another look at 'biolink:GenomicEntity', and I hesitate to change it from a mixin to a class because it is how we currently bridge the biology and chemistry views of a gene as a biological entity vs. a chemical entity (it's both). But I'm willing to explore options here if 'biolink:NucleicAcidEntity' does not make sense.

> @sierra-moxon What about http://www.sequenceontology.org/browser/current_release/term/SO:0005836 "regulatory_region"? Is there a way (or even a need) to qualify this with "proposed" or "putative"? I suppose you can just assign evidentiary properties (e.g. "determined by chromatin accessibility / ATAC-seq / in a cell line like GM12878").

Right, I think you could handle the predictive nature of this with evidence and provenance properties.

sierra-moxon commented 1 year ago

Your subject and object nodes might look something like this:

category: biolink:NucleicAcidEntity
type: SO:0005836
id: mydb:12345

category: biolink:NucleicAcidEntity
type: SO:soterm_for_chromosome
id: NC_007112.7

Your edge might look like this (these would all be properties of the edge between the NucleicAcidEntity and the chromosome, or whatever reference sequence you want to locate it on):

subject: mydb:12345
predicate: biolink:has_sequence_location
object: NC_007112.7
category: biolink:GenomicSequenceLocation
start_interbase_coordinate: 123
end_interbase_coordinate: 456
genome_build: xyz
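As a rough illustration, the records above could be assembled in Python as plain dictionaries. This is only a sketch: the helper names `make_node` and `make_location_edge` are hypothetical, the field names are copied from the example above, and the coordinate check (start not negative, end not before start) is my assumption about how interbase coordinates behave, not a rule taken from the Biolink schema.

```python
from typing import Any, Dict


def make_node(node_id: str, so_term: str) -> Dict[str, str]:
    """Build a node that keeps the generic category and puts the SO term in 'type'."""
    return {
        "id": node_id,
        "category": "biolink:NucleicAcidEntity",
        "type": so_term,
    }


def make_location_edge(
    subject: str, obj: str, start: int, end: int, genome_build: str
) -> Dict[str, Any]:
    """Build a sequence-location edge; the coordinate check is an assumption."""
    if start < 0 or end < start:
        raise ValueError(f"invalid interbase interval: {start}..{end}")
    return {
        "subject": subject,
        "predicate": "biolink:has_sequence_location",
        "object": obj,
        "category": "biolink:GenomicSequenceLocation",
        "start_interbase_coordinate": start,
        "end_interbase_coordinate": end,
        "genome_build": genome_build,
    }


# Values copied from the example above; "SO:soterm_for_chromosome" and "xyz"
# remain placeholders, exactly as in the thread.
feature = make_node("mydb:12345", "SO:0005836")
chromosome = make_node("NC_007112.7", "SO:soterm_for_chromosome")
edge = make_location_edge(feature["id"], chromosome["id"], 123, 456, "xyz")
print(edge)
```

Keeping the category fixed and varying only `type` means new kinds of sequence feature need no new Biolink class, which is the point of the suggestion above.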

I think we could add an edge property to represent the predictive nature of some of these locations - we've been discussing adding "prediction" or "statistical correlation" or "hypothesis" keywords as evidence types to further qualify an edge.

hitz commented 1 year ago

regulatory region:
  description: >-
    A region (or regions) of the genome that contains known or putative regulatory elements that act in cis- or trans- to affect the transcription of gene
  is_a: biological entity
  mixins:

Not sure if I should make a PR to biocypher instead?

sierra-moxon commented 1 year ago

@hitz - would you consider these new classes to be children of NucleicAcidEntity as well?

hitz commented 1 year ago

@sierra-moxon I think I would, but the Gene class isn't a child of NucleicAcidEntity? That seems like a mistake.