Relative Location (Affected Feature) Annotation Definition and Scope

mbrush commented 5 years ago

Our initial scope for this VA type was limited to assertions that a variant lay within or overlapped a specific feature in the genome (e.g. a specific gene, exon, intron, motif). However, examples have arisen (e.g. from CellBase data) that may expand the scope of what can be asserted using an Affected Feature statement/annotation.

Specifically, I propose that we model Affected Feature statements to enable descriptions of the following types of assertions:

Variant x is within / overlaps Feature y - the simplest use case, described above.
Variant x is within Feature y, specifically at Position z - extends (1) to include the position where the variant affects the feature.
Variant x is within / overlaps Feature y, and covers z-percent of Feature y - extends (1) to include the proportion of the affected feature that the variant overlaps.

The specific position (2) and percent overlap (3) can be modeled as optional qualifiers that refine/extend the core statement that the variant hits a feature to provide this extra information. We might call these something like positionAffectedQualifier and percentAffectedQualifier.
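As a rough illustration of that structure, here is a minimal Python sketch. The class name `AffectedFeatureStatement`, the field types, and the identifier strings are assumptions made for illustration only; just the two qualifier names come from the proposal above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AffectedFeatureStatement:
    """Hypothetical container; not part of any published schema."""
    variant: str  # identifier for Variant x
    feature: str  # identifier for Feature y
    # Optional qualifiers that refine the core "variant hits feature" statement.
    positionAffectedQualifier: Optional[int] = None    # Position z within the feature
    percentAffectedQualifier: Optional[float] = None   # percent of the feature covered


# (1) core statement only
s1 = AffectedFeatureStatement("variant-x", "feature-y")
# (2) core statement plus the affected position
s2 = AffectedFeatureStatement("variant-x", "feature-y", positionAffectedQualifier=42)
# (3) core statement plus the proportion of the feature covered
s3 = AffectedFeatureStatement("variant-x", "feature-y", percentAffectedQualifier=12.5)
```

Keeping both qualifiers optional means the simplest use case (1) stays valid without any extra fields.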

mbrush commented 5 years ago

A separate issue concerns how Affected Feature annotations are related to the subset of Molecular Consequence annotations that indicate an affected feature type (see #4) - and if there is overlap in scope here.

I would argue that there is not, as Affected Feature annotations describe how a variant hits a specific feature instance (e.g. the Shh gene, exon 2 of a specific Shh gene transcript). The 'affected feature type' subset of MC annotations state that a variant affects a certain type of feature (e.g. an interior intron), but typically don’t resolve to a specific affected feature instance.

mbrush commented 5 years ago

Proposed definition and comments:

Definition: an annotation that describes the location and/or extent of a variant relative to some other defined location in a genome, transcript, or protein (e.g. a chromosomal band, gene, exon, functional region or motif, mutation hotspot).

Comments:

Affected feature statements minimally assert that the variant overlaps with a particular feature, but may also describe the location within the feature that is affected, and/or the proportion of the feature that is affected.
Assertions that a variant falls within a certain type of feature (as opposed to a specific feature instance) are typically made using a Molecular Consequence annotation that annotates the variant with a Sequence Ontology term (e.g. SO:0000203 three prime UTR, SO:0000507 pseudogenic exon, SO:0000191 interior intron, SO:0001566 regulatory region variant).

larrybabb commented 5 years ago

I think we should forego the "Affected" term and go with "Relative" or "Related". This is about the relative proximity of the variant to the feature. It may have additional qualifiers that let the annotator specify other relative-ness of the variant to the feature. But, "affected" could be confused with "having an impact on" the feature, which is not the case. At least, I don't see that in the examples. Please clarify if I am mistaken.

mbrush commented 5 years ago

Given questions/concerns about the name for this VA type, starting a list of possible names (open for anyone to extend with additional suggestions).

Affected Feature Annotation: this is the current name, but concern about assumptions that the feature is impacted by the variant in the molecular consequence or functional impact sense.
Relative Location Annotation: original name for this annotation type. Thinking this is perhaps more suitable now, as it is clear that these annotations are only about the location and not the effect/impact at that location.
Variant Proximity Annotation: May be more appropriate if we consider statements that a variant is adjacent or close to some feature as in scope . . . but I like 'Relative Location' better in this case.
Sequence Neighborhood Annotation: UniProt groups these types of annotations in a 'Sequence Neighborhood' section, e.g. https://web.expasy.org/variant_pages/VAR_070505.html. I don't really love this, but adding to the list as it is used by a popular knowledgebase.

mbrush commented 5 years ago

Remaining issues to sort out to wrap initial pass at the MC statement model:

Final vetting of definition, scope, statement structure for Affected Feature annotations (see proposal above and modeling spreadsheet)
Pick better name? (see comment above)
Consider whether, if we use the VMC, the statement here might technically be modeled as a relationship between vmc:locations (i.e. the location associated with a variant and a region instead of the variant and region themselves)
Record considerations/requirements for the Topological Relationship predicate value set
- One use case from the requirements doc was to assert that a deletion completely removes a particular exon. This is a case where one of the more precise RO topological relationships could be used to assert that the variant completely spans the exon - e.g. 'bounds sequence of'.
Clarify distinction between MC affected feature type annotations, and Affected Feature/Relative Location annotations . . . does comment on the definition of these annotation types make this clear?
Vet model against example data/requirements notes
- Consider how data fits with proposed name, definition, scope, and model.
- Consider range/diversity of feature types, and varied representations of each type, that a Sequence Feature model will have to cover.
- Consider contexts/use cases in which these annotations are created and used. e.g. 'overlap operations' Reece/VR team proposed.

ahwagner commented 5 years ago

We should clarify scope to include adjacent / nearby regions, if that is intended for this annotation type. I think there is great value in making this a "containing feature" or "overlapping feature" annotation type, with adjacent/nearby features being a separate annotation type, though that may just be due to a lack of motivating examples.

larrybabb commented 5 years ago

Regarding the naming concern (2 above): we (as a standards-setting group) should pick the term that best represents the defined concept, which in my opinion is No. 1 relative location. We should clarify that our model will not prohibit users from "relabeling" these annotation types, assuming they can use a JSON-LD implementation.

And, it makes sense to capture "synonym" or "alternate" labels to deal with this type of problem which will occur on many if not all of these modeled statement types.

We can move faster if we vote or select the fundamental label and then provide a place to capture alternatives - assuming they have a reasonable motivation that is stated.

mbrush commented 5 years ago

Remaining issues following December 5 call:

Finalize name: Relative Location? Affected Feature? Relative Feature Location? other . . .
Review RO set of topological sequence relationships as predicate value set - see here
Discuss whether to include adjacency relationships (variation is next to but not overlapping with a feature).
Clarify difference with molecular consequence annotations about type of feature affected.

mbrush commented 5 years ago

Outcomes from 12-19-18 Call:

Use 'Relative Location' as a label for these for now
Adjacent features are in scope - but would like to see examples/use cases for this.
General agreement to use RO set of topological sequence relationships as a recommended predicate value set - see here.

mbrush commented 5 years ago

SO ticket here provides some interesting considerations/requirements: https://github.com/The-Sequence-Ontology/SO-Ontologies/issues/85.

And Chris Mungall's paper about applying Allen Relational algebra to sequence intervals: https://www.biorxiv.org/content/biorxiv/early/2014/06/27/006650.full.pdf

mbrush commented 5 years ago

Another question to consider is if we want to allow Relative Location statements to describe how a variation defined on one molecule type may be located relative to a feature on a different molecule type. e.g. that a genomic-level SNV falls within a protein-level transmembrane domain. The implication is that the protein level variation projected from the genomic variation falls within the protein-level feature.

I think it is useful to be able to make these kinds of statements, and thus they should be allowed. But we will have to consider modeling implications / features to support this. It may also impact recommendations we make w.r.t. variation expansion sets (and whether/how expansion across a projection function is used).

mbrush commented 5 years ago

Consider also if we capture the descriptor here as a vmc Location, instead of a SequenceFeature - and use Feature-Based Locations when the affected location refers to some gene, exon, etc.

mbrush commented 4 years ago

Recent decisions from VA calls related to relative location are below:

1 - Define value set for the Relative Location predicate. RESOLVED.

We will recommend use of the Relation Ontology sequentially_related_to hierarchy here, e.g. overlaps, does_not_overlap, contained_by, adjacent_to, etc. For v0 we will not constrain to a subset of these - but informally recommend use and see what feedback we get once it is in use.
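To make the predicate choices concrete, the following Python sketch picks one of the four labels named above for a pair of intervals. It is only an illustration under stated assumptions: the `relate` function is hypothetical, intervals are non-empty and half-open `[start, end)` on the same sequence, and the rules are a simplification rather than the formal RO definitions (it returns the most specific label that applies).

```python
from typing import Tuple

Interval = Tuple[int, int]  # half-open [start, end); an assumption of this sketch


def relate(variant: Interval, feature: Interval) -> str:
    """Return a coarse, RO-style label for how `variant` sits relative to `feature`."""
    v_start, v_end = variant
    f_start, f_end = feature
    # Intervals that touch end-to-start share no positions but are neighbours.
    if v_end == f_start or f_end == v_start:
        return "adjacent_to"
    # A gap between the intervals means no shared positions.
    if v_end < f_start or f_end < v_start:
        return "does_not_overlap"
    # The variant lies entirely inside the feature.
    if f_start <= v_start and v_end <= f_end:
        return "contained_by"
    # Any other case shares at least one position.
    return "overlaps"
```

For example, `relate((22, 25), (20, 30))` yields `"contained_by"`, while a deletion such as `(10, 40)` that spans the whole feature `(20, 30)` falls back to the generic `"overlaps"` here; a more precise predicate would need its own rule.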

2 - Do we want to allow Relative Location statements to describe how a variation defined on one molecule type may be located relative to a feature on a different molecule type? RESOLVED

For example, we would allow assertions that a genomic-level SNV falls within a transcript level feature such as 3'UTR , or a protein-level feature such as a transmembrane domain (the implication being that the protein level variation projected from the genomic variation falls within the protein-level feature).
Decision: Allow for now, see if we get pushback.

3 - How to represent the affected feature? T.B.D.

Here we need to represent the notion of a ‘canonical feature’ at the class level, as opposed to specific instances/alleles of the feature. e.g. the EGFR gene generally, not a particular allele/version of it.
It was proposed to consider using VR Feature-Based Location models here (e.g. 'gene location', 'exon location', 'promoter location', etc), instead of defining a Sequence Feature class (and specializations) in our model.
With this approach, we are in essence representing the location associated with a feature/region instead of the feature/region itself. A subtle ontological distinction, but one that could have practical implications.
Pros:
- makes use of existing work (or at least work that will exist), and aligns us more tightly with VR
- modeling these as locations/loci reduces the conceptual confusion about these representing canonical features, as opposed to specific instances of a feature with a defined sequence/state.
Cons:
- VR locations are developed for a different use case, and may not support the kinds of things we want to say about them
- models for VR feature based locations are not mature, and it is unclear if they would define types of all the specializations/biotypes we would need (e.g. genes, exons, promoters, functional motifs/domains, or ad hoc unnamed features), and the things we would want to capture about them.
Proposal: model these ourselves as 'Sequence Features' initially, but treat this as an exercise for defining requirements that can be communicated to VR so they can consider them in how they model their feature-based locations. If our needs are compatible with theirs, we can move toward adopting VR feature-based locations as they mature.

AmandaSpurdle commented 4 years ago

How to represent the affected feature? - If HGVS notation against a specific reference sequence is included elsewhere, then knowing the reference and variant allele should be covered. But the specific position within a motif could be important, e.g. a variant at +_12 position of a splicing motif (or last base of the exon) is much more likely to alter splicing than one at, say, +5 position. So if the intention is to say that a variant falls into a motif within an exon, there would be value in being more specific.

mbrush commented 4 years ago

A couple sources have provided potential new requirements to support additional detail/precision around characterizing the nature/extent of how a variant overlaps with a feature (in addition to the percentOverlapQualifier we have right now).

One is Amanda's comment above - which suggests that we should include a qualifier to capture where within the affected feature the variant hits. I added a featureSubloctionQualifier attribute as an exploratory element in the Relative Location spec here, which takes a VR Interval object as its value.

Another is the Beaconv2 variant annotation use case here to provide additional detail about the nature of a variant's overlap with a given feature:

"variantGeneRelationship: Categorical value classifying the variant according to the broadness of the variant effect in terms of genes: intergenic, 5UTR, 3UTR, single-gene (exonic, intronic), in overlapping genes (exonic, intronic), spanning multiple genes, multiple genes"

My gut feeling is that this use case should be handled by finding/adding the right term to the value set bound to the descriptor (e.g. SO).

Finally, in the absence of (or in addition to) these solutions, users can always add any information that adds detail or nuance to the core, structured relative location statement into a free text description field.

mbaudis commented 4 years ago

@mbrush We have general interest in genome feature affection and physical overlap concepts (e.g. deletion overlapping a gene locus - CDR? 3'? Amount of overlap?). And then also: services translating this to genome coordinates, so that one can run respective queries on un-annotated variant stores...

mbrush commented 4 years ago

@mbaudis we can support the first two interests/use cases (overlap and how much). But services translating into genome coordinates are beyond our remit.

mbrush commented 4 years ago

On the April 1 VA call the question arose about if/how the model supports description of the relative location of a variant w.r.t. a feature as defined in different reference contexts (e.g. the promoter region of the EGFR gene as defined on build 37 vs 38). The answer may depend on the precision at which Sequence Features are defined in our model (do they represent a conceptual feature that may map to different locations in different reference contexts, or do they represent a specific feature in a specific context).

This issue was noted in the Sequence Feature ticket #19. An answer to the question about Relative Location Statements here may depend on how this question about Sequence Feature precision is resolved.

mbrush commented 3 years ago

Decisions/Outcomes of 10-14-20 VA Call about the Relative Location Statement Model:

Permissible subject variations: allow genomic, transcript, and protein level variations
Allow 'cross-molecule level' annotations: for v0 we will allow statements that cross genomic-transcript-protein levels, e.g. that a genomic variation overlaps / is within a protein, or a specific protein functional domain/motif. We will recommend keeping things at one molecule level in our documentation/ implementation guidance, but there will be a place (in a DTO) to capture other representations of the subject variation at different molecule levels (e.g. to capture the protein level variant in the example above)
Sequence Feature model: We confirmed that we want the Sequence Feature domain entity in the object/descriptor slot of the Statement to represent an abstract/conceptual feature, not a specific feature instance defined in a particular reference context. This is reflected in the 0..m cardinality on the Location field in the Sequence Feature Class, which defines a specific reference context for locating the feature.
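A minimal Python sketch of what that 0..m cardinality could look like follows. The class and field names (`SequenceFeature`, `Location`, `reference`, `locations`) are assumptions for illustration and are not taken from the spec; the coordinates and reference labels are made-up placeholders.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Location:
    """Hypothetical placement of a feature in one reference context."""
    reference: str  # label for the reference context (placeholder values below)
    start: int
    end: int


@dataclass
class SequenceFeature:
    """Abstract/conceptual feature; locations are optional (0..m)."""
    label: str
    locations: List[Location] = field(default_factory=list)


# A conceptual feature with no location at all is still valid.
egfr = SequenceFeature("EGFR gene")

# The same conceptual feature located in two reference contexts.
# Coordinates are invented placeholders, not real EGFR promoter positions.
egfr_promoter = SequenceFeature(
    "EGFR promoter",
    [Location("build-37", 1000, 1500), Location("build-38", 1040, 1540)],
)
```

Because the feature itself carries no single coordinate set, anything that depends on coordinates has to say which `Location` it refers to.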

To Resolve:

The model we defined for the two qualifiers in this statement (percentOverlap and subLocation) does not work well with this perspective on SFs as abstract/conceptual entities. Both qualifiers apply to a Feature in a specific reference context, as the values of these fields may differ between reference contexts; e.g. the percent overlap of a variant with a feature may be different depending on which particular instance of the feature you are talking about. So if >1 Location is described for the Feature, the current model will not support linking these qualifiers to their relevant reference context.
- Proposal: Consider the approach of modeling the values of the two qualifiers as objects (Study Results?) , where we can capture the specific instance of the Feature for which the data captured here is relevant (the 'focus' of a StudyResult)
RL statement predicates support the ability to say a variation is adjacent to some SF. In this case the subLocation qualifier is not relevant, as it is used to describe where within a related feature the variant is, not how far away it is. But a qualifier indicating how close an ‘adjacent’ variation is to a SF of interest would be useful to support real data use cases we have encountered. e.g. UniProt has annotations describing how far away a protein-level variant is from key functional domains in the protein. At present, there is no way to capture this.
- Proposal: Consider generalizing the subLocation qualifier to allow for capturing ‘distance from’ a SF of interest (and if upstream or downstream), not just ‘location within’ one? Or defining a new qualifier (e.g. ‘distanceAwayQualifier’ or some such) to capture this type of info.
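Both open issues come down to simple interval arithmetic that only makes sense once a reference context is fixed. The Python sketch below illustrates this under stated assumptions: the function names `percent_overlap` and `distance_from` are hypothetical, intervals are non-empty and half-open `[start, end)`, all coordinates are invented, and "upstream"/"downstream" are decided by coordinate order alone, ignoring strand.

```python
from typing import Dict, Tuple

Interval = Tuple[int, int]  # half-open [start, end); an assumption of this sketch


def percent_overlap(variant: Interval, feature: Interval) -> float:
    """Percent of `feature` covered by `variant` (0.0 when they share nothing)."""
    shared = max(0, min(variant[1], feature[1]) - max(variant[0], feature[0]))
    return 100.0 * shared / (feature[1] - feature[0])


def distance_from(variant: Interval, feature: Interval) -> Tuple[int, str]:
    """Gap between the intervals and the side of the feature the variant is on."""
    if variant[1] <= feature[0]:
        return feature[0] - variant[1], "upstream"
    if feature[1] <= variant[0]:
        return variant[0] - feature[1], "downstream"
    return 0, "overlapping"


# The same variant/feature pair placed in two reference contexts (placeholder data).
contexts: Dict[str, Tuple[Interval, Interval]] = {
    "context-A": ((100, 150), (120, 220)),
    "context-B": ((110, 160), (130, 180)),
}
per_context = {name: percent_overlap(v, f) for name, (v, f) in contexts.items()}
# per_context == {"context-A": 30.0, "context-B": 60.0}
```

The two contexts give different percentages for what is conceptually the same variant and feature, which is why each qualifier value needs to be tied to the specific Location it was computed against.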

ga4gh / va-spec

Relative Location (Affected Feature) Annotation Definition and Scope #18