Modeling of 'Sequence Feature' (as a Domain Entity)

ga4gh / va-spec

An information model for representing variant annotations.


Modeling of 'Sequence Feature' (as a Domain Entity) #19

Open mbrush opened 5 years ago

mbrush commented 5 years ago

The ability to represent a specific sequence feature or region is required for several annotation types (e.g. Affected Feature annotations). I consider 'sequence feature' here as defined in the Sequence Ontology - a continuous extent of biological sequence that is defined by the ordering of residues that comprise it, and its position relative to some defined reference sequence. Features have an extent of 0 or more. A 'region' is a subtype of 'feature' defined as a feature with an extent greater than zero.

There are several ways we have seen sequence features be described in our data use cases, which our model may need to accommodate:

1. 'Identified' features: Most commonly we see defined/named features, such as genes, transcripts, chromosomal bands, or functional motifs within a gene, that have a proper name and are referenced by a proper identifier (e.g. BRCA1 gene = NCBIGene:672). A proper id is typically sufficient to uniquely identify a feature:

id: NCBIGene:672

. . . but we also need a model to identify and describe genomic features that may not come with unique identifiers.

2. Positionally-defined features: Features can be arbitrary unnamed regions that are defined by describing their start and end positions relative to some reference, e.g. the ClinGen data examples include objects representing things like "the region between 704 and 1016 on NP_005219.2" (link):

type: exon
featureLocation:
- start: 44905791
- end: 44908944
- reference: ncbi:CM000681.2

3. Post-composed feature descriptions: Feature objects in a data set/message may represent a feature that lacks an identifier or positional information by building a description that uniquely defines the feature, e.g. the data below resolves to a specific feature instance, namely exon 4 in the APOE ENST00000252486 transcript:

type: "exon",
exonNumber: "4/4",
gene: "APOE",
transcript: "ENST00000252486".

An ideal Sequence Feature model would support representations of all these types.
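A minimal sketch of how a consumer might tell the three description styles apart. The function name `describe_mode` is hypothetical, and the sketch assumes a record carries `id` only when a community identifier exists and carries `featureLocation` only when coordinates are given, as in the snippets above:

```python
from typing import Any, Mapping

# Keys used by the post-composed example above (type, exonNumber, gene, transcript).
COMPOSED_KEYS = ("exonNumber", "gene", "transcript")


def describe_mode(feature: Mapping[str, Any]) -> str:
    """Return which of the three description styles a feature record uses."""
    if "id" in feature:
        return "identified"
    if "featureLocation" in feature:
        return "positional"
    if "type" in feature and any(key in feature for key in COMPOSED_KEYS):
        return "post-composed"
    raise ValueError("record does not identify a feature in any supported way")


identified = {"id": "NCBIGene:672"}
positional = {
    "type": "exon",
    "featureLocation": {
        "start": 44905791,
        "end": 44908944,
        "reference": "ncbi:CM000681.2",
    },
}
composed = {
    "type": "exon",
    "exonNumber": "4/4",
    "gene": "APOE",
    "transcript": "ENST00000252486",
}

modes = [describe_mode(f) for f in (identified, positional, composed)]
```

The order of the checks is a design choice, not a requirement: an identifier is treated as the strongest form of identification, coordinates next, and a composed description last.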

mbrush commented 5 years ago

Throwing up a naïve straw man below for a schema that may accommodate representation of features described in source data in any of the three ways above. This is based on limited requirements and thinking, and is meant only as something concrete to critique and build on.

Schema

id: can be a standard/community identifier for a named/defined feature if it exists (e.g. an NCBI gene id)
type: the primary type of the object - can just be 'SequenceFeature', or possibly something more specific if we want to create subtypes
featureType: use this to record a more specific feature type (if we limit the primary 'type' above to just 'Sequence Feature')
parentFeature: a larger named/identified feature that the feature of interest is a part of (e.g. the transcript id that an exon feature is within)
featureNumber: useful to specify a specific instance of a feature type in some cases (e.g. ordered features within some parent, such as exons)
featureLocation: an object representing a location on a reference - to support representing features defined using specific coordinates
- start: int
- end: int
- reference: the reference sequence used

Examples

The following examples present ways to represent exon 4 of the APOE transcript 201 in Ensembl using the proposed schema.

Example 1: there is no standard/community identifier for exons, so here the specific exon of interest is specified using a set of attributes in the schema above. This is less precise than defining the feature by specific coordinates, but it should allow resolution to a specific location using databases like Ensembl.

id: ex:feature001 (in this example there is no standard/community identifier)
- type: ga4gh:SequenceFeature
- featureType: SO:0000147 (exon)
- parentFeature: ensembl:ENST00000252486 (APOE transcript 201)
- featureNumber: "4/4"

Example 2: here the specific exon of interest is specified using its positional coordinates

id: ex:feature001 (in this example there is no standard/community identifier)
- type: ga4gh:SequenceFeature
- featureType: SO:0000147 (exon)
- featureLocation:
- start: 44905791
- end: 44908944
- reference: ncbi:CM000681.2
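The relationship between the two examples can be sketched in code: Example 1 is resolved to Example 2 by looking up the composed description. This is only an illustration under stated assumptions. The class names, the `resolve` function, and the in-memory `EXON_TABLE` are hypothetical stand-ins for a real database query (e.g. against Ensembl); the single table entry reuses the coordinates from Example 2.

```python
from dataclasses import dataclass, replace
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class FeatureLocation:
    start: int
    end: int
    reference: str


@dataclass(frozen=True)
class SequenceFeature:
    """Straw-man schema from this comment; every field except id is optional."""
    id: str
    type: str = "ga4gh:SequenceFeature"
    featureType: Optional[str] = None
    parentFeature: Optional[str] = None
    featureNumber: Optional[str] = None
    featureLocation: Optional[FeatureLocation] = None


# Hypothetical lookup table standing in for an external database.
# Key: (parentFeature, featureType, featureNumber).
EXON_TABLE: Dict[Tuple[str, str, str], FeatureLocation] = {
    ("ensembl:ENST00000252486", "SO:0000147", "4/4"): FeatureLocation(
        start=44905791, end=44908944, reference="ncbi:CM000681.2"
    ),
}


def resolve(feature: SequenceFeature) -> SequenceFeature:
    """Return the feature with coordinates attached, looking them up if absent."""
    if feature.featureLocation is not None:
        return feature  # already positionally defined
    key = (feature.parentFeature, feature.featureType, feature.featureNumber)
    location = EXON_TABLE.get(key)
    if location is None:
        raise LookupError(f"cannot resolve {feature.id} to a location")
    return replace(feature, featureLocation=location)


example_1 = SequenceFeature(
    id="ex:feature001",
    featureType="SO:0000147",
    parentFeature="ensembl:ENST00000252486",
    featureNumber="4/4",
)
example_2 = resolve(example_1)
```

After resolution, `example_2` carries both the composed description and the coordinates, so either example can be treated as a view of the same feature.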

@rrfreimuth had some thoughts about how a 'feature set' object might be useful here, based on HL7 modeling work. Their initial modeling proposals are summarized here.

Relevant example data in our catalog:

ClinGen Example 2

javild commented 5 years ago

Comments:

I'd keep just one type attribute, the featureType above but named simply type if we cannot find any use case in which this is required
parentFeature : I wouldn't include this in the generic Feature model.
- It'll not be used for many Feature types
- It's not as explicit as it could be for entities that really need it, e.g Exon, Transcript, Gene
- Instead, I'd make entities such as Exon, Transcript or Gene to extend the generic Feature model and to complement the list of generic attributes with the corresponding explicit attribute, e.g. transcriptId, geneId, respectively.
featureNumber: same thing as with parentFeature
featureLocation: I'd reuse VMC location object to keep consistency among modelling works
strand? Does the VMC location object allow specifying a strand?

Thus I'd propose the following attributes:

id: can be a standard/community identifier for a named/defined feature if it exists (e.g. an NCBI gene id)
type: a term describing the type of feature. This term shall be taken from a systematic, maintained vocabulary such as the Sequence Ontology e.g. chromosome_band (SO:0000341) (can be a complex {id, name} object itself)
location: position within a reference sequence, e.g. chromosome (Example 8), a protein (Example 2), etc. The Variant Representation ‘location’ model can be used for modeling this attribute. Strand: when appropriate, one of {“+”, “-”} indicating either the positive or negative strand.

See, for example, what the Transcript attributes would look like here: https://docs.google.com/document/d/1Ezq_gbzqEuZHvGMJqBQqkS9WfUf0Z5hHQIWHNOZ5N_8/edit#heading=h.1wyxw8wfmyhv
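A rough sketch of what this "generic Feature plus specializations" proposal could look like. All class and attribute names are illustrative only. Placing `strand` directly on the feature is an assumption, since it is still an open question whether the VMC location object can carry it, and `exonNumber` on `Exon` is likewise an assumption carried over from the earlier `featureNumber` attribute.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Location:
    """Placeholder for a VMC-style location; the real object may differ."""
    reference: str
    start: int
    end: int


@dataclass
class Feature:
    """Generic feature: only id, type, location, and (assumed) strand."""
    id: Optional[str]
    type: str  # term from a maintained vocabulary, e.g. a Sequence Ontology id
    location: Optional[Location] = None
    strand: Optional[str] = None  # "+" or "-" when appropriate

    def __post_init__(self) -> None:
        if self.strand not in (None, "+", "-"):
            raise ValueError(f"strand must be '+' or '-', got {self.strand!r}")


@dataclass
class Transcript(Feature):
    """Specialization with an explicit link to its gene."""
    geneId: Optional[str] = None


@dataclass
class Exon(Feature):
    """Specialization with an explicit link to its transcript."""
    transcriptId: Optional[str] = None
    exonNumber: Optional[str] = None  # assumed attribute, mirrors featureNumber


band = Feature(id=None, type="SO:0000341")  # chromosome_band, generic model only
exon_4 = Exon(
    id=None,
    type="SO:0000147",  # exon
    transcriptId="ensembl:ENST00000252486",
    exonNumber="4/4",
)
```

The generic `Feature` stays small, and only the subtypes that need a parent reference declare one under an explicit name.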

mbrush commented 5 years ago

TO DO: enumerate types of Sequence Features we find in examples of this VA type, and diversity of representational structures/schema used for each type.

mbrush commented 5 years ago

See VICC requirements for modeling Sequence Features here: https://github.com/cancervariants/metakb/issues/6#issuecomment-443317966

larrybabb commented 5 years ago

@javild from above

I'd keep just one type attribute, the featureType above but named simply type if we cannot find any use case in which this is required

I think the "type" attribute is a given in every entity we model. It always is fixed to the entity type. This is useful in message structures when putting entities into generic lists or elements that allow for a composite set of types to be used. So I think it is technically useful but not necessarily useful when looking at it in isolation. I think @mbrush and the team should identify the core attributes that are "locked" down for all entities so that we don't have to deal with them each time. (id and type for starters).
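To illustrate why a fixed `type` helps in generic lists, here is a small sketch of dispatching on it. The entity type names and handler functions are hypothetical; only the idea of `id` and `type` as locked-down core attributes comes from the comment above.

```python
from typing import Any, Callable, Dict, List

Entity = Dict[str, Any]


def describe_feature(entity: Entity) -> str:
    return f"feature {entity['id']}"


def describe_variation(entity: Entity) -> str:
    return f"variation {entity['id']}"


# One handler per entity type; "Variation" is just an example of a second type.
HANDLERS: Dict[str, Callable[[Entity], str]] = {
    "SequenceFeature": describe_feature,
    "Variation": describe_variation,
}


def describe_all(entities: List[Entity]) -> List[str]:
    """Process a mixed list by reading each entity's fixed `type` attribute."""
    results = []
    for entity in entities:
        handler = HANDLERS.get(entity["type"])
        if handler is None:
            raise ValueError(f"unknown entity type: {entity['type']!r}")
        results.append(handler(entity))
    return results


mixed = [
    {"id": "NCBIGene:672", "type": "SequenceFeature"},
    {"id": "ex:variation001", "type": "Variation"},
]
descriptions = describe_all(mixed)
```

Without the `type` attribute, a consumer of `mixed` would have to guess each entity's kind from which other attributes happen to be present.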

larrybabb commented 5 years ago

@javild from above

parentFeature : I wouldn't include this in the generic Feature model.

It'll not be used for many Feature types

It's not as explicit as it could be for entities that really need it, e.g Exon, Transcript, Gene

Instead, I'd make entities such as Exon, Transcript or Gene to extend the generic Feature model and to complement the list of generic attributes with the corresponding explicit attribute, e.g. transcriptId, geneId, respectively.

I am not refuting or agreeing with the need for parentFeature here. I'm focused on your last bullet point from above

Instead, I'd make entities such as Exon, Transcript or Gene to extend the generic Feature model and to complement the list of generic attributes with the corresponding explicit attribute, e.g. transcriptId, geneId, respectively.

I do think this is important in that some features will have a direct association to entities that will be explicitly modeled (possibly as subclasses). I think we should think about this and determine whether or not Gene, Transcript, Exon should/will get their own "entity definitions" or not and if so how they should relate to the SequenceFeature entity (assuming we complete it). I think there will always be the need to define SequenceFeatures as a generic thing and if we do need Gene, Transcript, Exon or other types of specializations we should consider them separate entities that have a relationship to the SequenceFeature (possibly).

There's a good amount to discuss and decide just on this point alone. @mbrush ?

mbrush commented 5 years ago

Some updated thoughts based on considerations of bucket variant representation driven by somatic KB use cases.

We need a model for sequence features that allows identification of the feature using reference to an existing identifier, when one exists, or composing a description that points to an unnamed feature. The model shouldn’t capture assertions about the feature that would be modeled as annotations (e.g. association with disease, a functional impact, a molecular consequence). These are not defining elements of the feature. Rather, we can use the Variation Definition object to capture these as 'criteria' that specify the set of precise variants that fit this definition.

One question (raised by Javi above) is whether we have specializations for key feature types such as gene, transcript, exon - that could allow the base feature model to be a bit simpler. It might be worth listing these key feature subtypes that we would want to create specialized models for (e.g. to support the variation bucket requirements).

Regardless, there are cases where the existence of a feature is implied by some description, but this doesn’t have a proper identifier - so we must capture the description that 'identifies' it, e.g. exon 3 in the EGFR gene, or the feature between positions g.1000 and g.50000 on human chromosome 8. The model should support this.

The model would have to include:

generic object metadata (e.g. identifier, label, type, description)
basic metadata about the type of the Affected Feature (e.g. sequence type, biotype)
for an “identified” Affected Feature, a reference to an existing identifier.
for a “composed” Affected Feature, elements to capture its defining characteristics (e.g. its absolute and/or relative location, related/parent features, etc.)

TO DO: make a straw man for the Generic and Specialization approaches.

sailakss commented 4 years ago

I've used the following fields in the transcript model in the ClinGen Linked Data Hub resource (https://ldh.genome.network). LDH is in an active development phase. An example: https://ldh.genome.network/ldh/AlleleMolecularConsequenceStatement/id/NC_000001.11:g.40783789_40783795del

"preferredTranscripts": [
  {
    "biotype": "protein_coding",
    "canonical": true,
    "id": "ENST00000347132",
    "molecularConsequence": "5_prime_UTR_variant",
    "source": "Ensembl",
    "tsl": "1"
  },
  {
    "biotype": "protein_coding",
    "canonical": true,
    "id": "NM_004700.4",
    "manePreferredRefSeq": true,
    "molecularConsequence": "5_prime_UTR_variant",
    "source": "RefSeq"
  }
]

javild commented 4 years ago

Hi @sailakshmi4, I've been going through the data example you provided above. I can see these attributes, which are specific to the transcript:

canonical
manePreferredRefSeq
tsl
source
biotype
id

Some comments on these:

canonical, manePreferredRefSeq: would it be OK if we consider these two as annotation flags in the proposed draft (https://drive.google.com/file/d/1xR895TIuBEWeQUZvK-2gVo-eTAff4QH6/view?usp=sharing)? These are booleans in the model above, which seems aligned with, for example, how ENSEMBL considers them (https://www.ensembl.org/info/genome/genebuild/transcript_quality_tags.html). @andrewyatz ?
tsl: would it be OK if this is also an annotation flag? That's how I was thinking of it at the beginning (also here https://www.ensembl.org/info/genome/genebuild/transcript_quality_tags.html), although in ENSEMBL's GTF it appears as a separate transcript attribute, i.e. "... transcript_biotype "processed_transcript"; tag "basic"; transcript_support_level "1";" @andrewyatz ?
source: I find this one particularly interesting. @mbrush, could we possibly re-use any bit of the provenance model here? I'm thinking of something like the RecordMetadata object
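To make the proposed split concrete, here is a small sketch that partitions one of the LDH records above into transcript attributes, annotation flags, and everything else. The grouping of keys and the function name `split_record` are assumptions for discussion. In particular, `source` is kept with the transcript here even though it might instead belong in a provenance object.

```python
from typing import Any, Dict, Tuple

Record = Dict[str, Any]

# Assumed grouping, following the comments above.
TRANSCRIPT_KEYS = {"id", "biotype", "source"}
FLAG_KEYS = {"canonical", "manePreferredRefSeq", "tsl"}


def split_record(record: Record) -> Tuple[Record, Record, Record]:
    """Split a record into (transcript attributes, annotation flags, other)."""
    transcript = {k: v for k, v in record.items() if k in TRANSCRIPT_KEYS}
    flags = {k: v for k, v in record.items() if k in FLAG_KEYS}
    other = {
        k: v
        for k, v in record.items()
        if k not in TRANSCRIPT_KEYS and k not in FLAG_KEYS
    }
    return transcript, flags, other


ldh_record = {
    "biotype": "protein_coding",
    "canonical": True,
    "id": "ENST00000347132",
    "molecularConsequence": "5_prime_UTR_variant",
    "source": "Ensembl",
    "tsl": "1",
}
transcript, flags, other = split_record(ldh_record)
```

Here `other` ends up holding `molecularConsequence`, which describes the variant's effect on the transcript rather than the transcript itself.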

andrewyatz commented 4 years ago

canonical & manePreferredRefSeq: having them as boolean flags makes more sense. FYI we in Ensembl will look to remove the concept of canonical in favour of Ensembl Select, which will be our preferred representative for a locus. Where available MANE Select will be the same as the Ensembl Select. Also note that MANE Plus will also appear in time (an expanded set of MANE transcripts where multiple transcripts can exist for a single locus) so these flags will grow. That might mean you want them as annotation flags.

tsl: like APPRIS this is also a multi-valued tag. I am sure there are others out there. We represent them as a separate transcript attribute but that's because we want them to be clearly linked to a transcript and GTF doesn't give us that many options in all fairness. If an annotation flag can handle this then great.

Ping me for anything else you need

mbrush commented 4 years ago

Some relevant comments were made by various VR Leads on the 3-16-20 VR Call:

I think of cytobands, genes, and other sequence features as conceptual locations. That is, entities that are primarily thought of as a thing rather than as a location, but which can be mapped to a coordinate-based location when a given set of mappings is available.

Therefore, conceptual locations can be mapped to a coordinate location when a map is defined, but that map should be out of scope for VRS. This approach also presents a way of linking conceptual locations to coordinate locations without conflating the two.

Conceptual locations are not the physical location against a reference sequence. Gene locations and cytoband locations are concepts that map to physical locations.

When someone says you have a duplication of a gene, you’re not duplicating a location, you’re duplicating a feature or function. It might be a different type of intent.

mbaudis commented 4 years ago

@mbrush Not participating in VR anymore, but very much in line with the notes in the last post. However:

cytobands are not conceptual/functional units but just identifiers for large genomic regions - i.e. "approximately from here to there" - nothing more. But IMO, as above, cytobands per se may not be of interest for a VR model; they are proxy locations which just need an (external) resolver service (alas, could also point to reference mapping files)
we usually do not primarily talk about a "duplicated gene", but about duplicated genomic locations when analyzing CNVs (i.e. in the scope of VR); so, as noted, the "duplicated gene" already has an attribution included - but this is different from "duplicated cytobands", which is just a location proxy

mbrush commented 4 years ago

I see that the cardinality on the location attribute of the proposed Sequence Feature (SF) class is 0..1. The implication here is that a SF instance represents a particular feature as defined in a single reference context (as opposed to a 'conceptual' feature, e.g. the notion of the EGFR gene generally, which may have different discrete/concrete instances as mapped to build 37 vs 38). Expanding the cardinality to 0..m to accommodate mapping a single SF instance to multiple builds/reference contexts would suggest that an SF object is more at the conceptual level.

This issue parallels recent conversations, led by @larrybabb in the VR group, about Feature-Based Locations and how they map to different builds; I'm not sure those were ever resolved. We should coordinate representational principles and approach across these modeling areas as best as possible, and even consider if/how we might use VR's feature-based location classes in our work.

Finally, resolution of these questions is relevant to modeling of Relative Location Statements, specifically the issue raised here.

(UPDATE, Oct 2020 - cardinality on the SF location field has since been broadened to 0..m).
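A minimal sketch of what the 0..m cardinality implies in practice: one conceptual feature carrying at most one location per reference context. The class names, the `location_on` helper, the `ex:` identifiers, and the coordinates are all made up for illustration and are not real EGFR data or part of any agreed model.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(frozen=True)
class Location:
    reference: str  # the reference context (e.g. a sequence in a given build)
    start: int
    end: int


@dataclass
class SequenceFeature:
    id: str
    type: str
    locations: List[Location] = field(default_factory=list)  # cardinality 0..m

    def location_on(self, reference: str) -> Optional[Location]:
        """Return this feature's location on one reference, if it has one."""
        matches = [loc for loc in self.locations if loc.reference == reference]
        if len(matches) > 1:
            raise ValueError(f"multiple locations on {reference} for {self.id}")
        return matches[0] if matches else None


# Placeholder identifiers and coordinates, NOT real genomic positions.
gene = SequenceFeature(
    id="ex:EGFR",
    type="ex:gene",
    locations=[
        Location(reference="ex:build37_chr7", start=100, end=200),
        Location(reference="ex:build38_chr7", start=150, end=250),
    ],
)
on_37 = gene.location_on("ex:build37_chr7")
on_other = gene.location_on("ex:unmapped_reference")
```

With a 0..1 cardinality the same gene would instead need two separate `SequenceFeature` instances, one per build, plus some way to assert that they represent the same conceptual feature.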