ga4gh / va-spec

An information model for representing variant annotations.

Relative Location (Affected Feature) Annotation Definition and Scope #18

Open mbrush opened 5 years ago

mbrush commented 5 years ago

Our initial scope for this VA type was limited to assertions that a variant lay within or overlapped a specific feature in the genome (e.g. a specific gene, exon, intron, motif). However, examples have arisen (e.g. from CellBase data) that may expand the scope of what can be asserted using an Affected Feature statement/annotation.

Specifically, I propose that we model Affected Feature statements to enable descriptions of the following types of assertions:

  1. Variant x is within / overlaps Feature y - the simplest use case, described above.
  2. Variant x is within Feature y, specifically at Position z - extends (1) to include the position where the variant affects the feature.
  3. Variant x is within / overlaps Feature y, and covers z-percent of Feature y - extends (1) to include the proportion of the affected feature that the variant overlaps.

The specific position (2) and percent overlap (3) can be modeled as optional qualifiers that refine/extend the core statement that the variant hits a feature to provide this extra information. We might call these something like positionAffectedQualifier and percentAffectedQualifier.
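A minimal Python sketch of how the three assertion types could share one statement shape, with the two qualifiers left optional. The class names, snake_case field names, and the half-open coordinate convention are assumptions made for illustration; only the qualifier names echo the proposal above, and none of this is part of the spec.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Interval:
    """Half-open interval [start, end) on one reference sequence (assumed convention)."""
    start: int
    end: int


@dataclass
class AffectedFeatureStatement:
    """Hypothetical container for 'Variant x is within / overlaps Feature y'."""
    variant_id: str
    feature_id: str
    # Case (2): where in the feature the variant falls (optional).
    position_affected_qualifier: Optional[Interval] = None
    # Case (3): percentage of the feature covered by the variant (optional).
    percent_affected_qualifier: Optional[float] = None


def percent_of_feature_covered(variant: Interval, feature: Interval) -> float:
    """Return the percentage of the feature's length that the variant overlaps."""
    overlap = min(variant.end, feature.end) - max(variant.start, feature.start)
    overlap = max(overlap, 0)  # disjoint intervals overlap by zero bases
    return 100.0 * overlap / (feature.end - feature.start)


# Example: a 100-base deletion covering half of a 200-base exon.
deletion = Interval(150, 250)
exon = Interval(100, 300)
statement = AffectedFeatureStatement(
    variant_id="variant:x",
    feature_id="feature:y",
    percent_affected_qualifier=percent_of_feature_covered(deletion, exon),
)
```

Case (1) is simply a statement with both qualifiers left unset, which is why they are modeled as optional here.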

mbrush commented 5 years ago

A separate issue concerns how Affected Feature annotations are related to the subset of Molecular Consequence annotations that indicate an affected feature type (see #4) - and if there is overlap in scope here.

I would argue that there is not, as Affected Feature annotations describe how a variant hits a specific feature instance (e.g. the Shh gene, exon 2 of a specific Shh gene transcript). The 'affected feature type' subset of MC annotations state that a variant affects a certain type of feature (e.g. an interior intron), but typically don’t resolve to a specific affected feature instance.

mbrush commented 5 years ago

Proposed definition and comments:

Definition: an annotation that describes the location and/or extent of a variant relative to some other defined location in a genome, transcript, or protein (e.g. a chromosomal band, gene, exon, functional region or motif, mutation hotspots).

Comments:

larrybabb commented 5 years ago

I think we should forego the "Affected" term and go with "Relative" or "Related". This is about the relative proximity of the variant to the feature. It may have additional qualifiers that let the annotator specify other relative-ness of the variant to the feature. But, "affected" could be confused with "having an impact on" the feature, which is not the case. At least, I don't see that in the examples. Please clarify if I am mistaken.

mbrush commented 5 years ago

Given questions/concerns about the name for this VA type, starting a list of possible names (open for anyone to extend with additional suggestions).

  1. Affected Feature Annotation: this is the current name, but concern about assumptions that the feature is impacted by the variant in the molecular consequence or functional impact sense.

  2. Relative Location Annotation: original name for this annotation type. Thinking this is perhaps more suitable now, as it is clear that these annotations are only about the location and not the effect/impact at that location.

  3. Variant Proximity Annotation: May be more appropriate if we consider statements that a variant is adjacent or close to some feature as in scope . . . but I like 'Relative Location' better in this case.

  4. Sequence Neighborhood Annotation: UniProt groups these types of annotations in a 'Sequence Neighborhood' section, e.g. https://web.expasy.org/variant_pages/VAR_070505.html. I don't really love this, but adding to the list as it is used by a popular knowledgebase.

mbrush commented 5 years ago

Remaining issues to sort out to wrap initial pass at the MC statement model:

ahwagner commented 5 years ago

We should clarify scope to include adjacent / nearby regions, if that is intended for this annotation type. I think there is great value in making this a "containing feature" or "overlapping feature" annotation type, with adjacent/nearby features being a separate annotation type, though that may just be due to a lack of motivating examples.

larrybabb commented 5 years ago

Regarding the naming concern (2 above): we, as a standards-setting group, should pick the term that best represents the defined concept, which in my opinion is No. 1 relative location. We should clarify that our model will not prohibit users from "relabeling" these annotation types, assuming they can use a JSON-LD implementation.

It also makes sense to capture "synonym" or "alternate" labels to deal with this type of problem, which will occur on many if not all of these modeled statement types.

We can move faster if we vote on or select the fundamental label and then provide a place to capture alternatives - assuming they have a reasonable motivation that is stated.

mbrush commented 5 years ago

Remaining issues following December 5 call:

  1. Finalize name: Relative Location? Affected Feature? Relative Feature Location? other . . .
  2. Review RO set of topological sequence relationships as predicate value set - see here
  3. Discuss whether to include adjacency relationships (variation is next to but not overlapping with a feature).
  4. Clarify difference with molecular consequence annotations about type of feature affected.

mbrush commented 5 years ago

Outcomes from 12-19-18 Call:

  1. Use 'Relative Location' as a label for these for now
  2. Adjacent features are in scope - but would like to see examples/use cases for this.
  3. General agreement to use RO set of topological sequence relationships as a recommended predicate value set - see here.

mbrush commented 5 years ago

SO ticket here provides some interesting considerations/requirements: https://github.com/The-Sequence-Ontology/SO-Ontologies/issues/85.

And Chris Mungall's paper about applying Allen Relational algebra to sequence intervals: https://www.biorxiv.org/content/biorxiv/early/2014/06/27/006650.full.pdf
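To make the interval-algebra idea concrete, here is a rough Python sketch that classifies a variant interval against a feature interval using the thirteen Allen relations. It assumes interbase (half-open) coordinates and non-empty intervals, and the relation names are the generic Allen ones; how they would map onto specific RO terms is not shown and would need to be checked against RO itself.

```python
def allen_relation(a_start: int, a_end: int, b_start: int, b_end: int) -> str:
    """Name the Allen relation of interval A (e.g. a variant) to interval B (e.g. a feature).

    Assumes interbase coordinates with start < end for both intervals.
    """
    if a_start >= a_end or b_start >= b_end:
        raise ValueError("intervals must be non-empty (start < end)")

    # Disjoint or touching cases.
    if a_end < b_start:
        return "before"
    if a_end == b_start:
        return "meets"       # A is directly adjacent, upstream of B
    if b_end < a_start:
        return "after"
    if b_end == a_start:
        return "met_by"      # A is directly adjacent, downstream of B

    # From here on the two intervals share at least one base.
    if a_start == b_start and a_end == b_end:
        return "equals"
    if a_start == b_start:
        return "starts" if a_end < b_end else "started_by"
    if a_end == b_end:
        return "finishes" if a_start > b_start else "finished_by"
    if b_start < a_start and a_end < b_end:
        return "during"      # A lies strictly inside B
    if a_start < b_start and b_end < a_end:
        return "contains"    # B lies strictly inside A
    return "overlaps" if a_start < b_start else "overlapped_by"
```

Under these assumptions the adjacency question from the December 5 list corresponds to the "meets" / "met_by" cases, while "before" / "after" would cover nearby but non-touching features.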

mbrush commented 5 years ago

Another question to consider is if we want to allow Relative Location statements to describe how a variation defined on one molecule type may be located relative to a feature on a different molecule type. e.g. that a genomic-level SNV falls within a protein-level transmembrane domain. The implication is that the protein level variation projected from the genomic variation falls within the protein-level feature.

I think it is useful to be able to make these kinds of statements, and thus they should be allowed. But we will have to consider modeling implications / features to support this. It may also impact recommendations we make w.r.t. variation expansion sets (and whether/how expansion across a projection function is used).

mbrush commented 5 years ago

Consider also if we capture the descriptor here as a vmc Location, instead of a SequenceFeature - and use Feature-Based Locations when the affected location refers to some gene, exon, etc.

mbrush commented 4 years ago

Recent decisions from VA calls related to relative location are below:

1 - Define value set for the Relative Location predicate. RESOLVED.


2 - Do we want to allow Relative Location statements to describe how a variation defined on one molecule type may be located relative to a feature on a different molecule type? RESOLVED


3 - How to represent the affected feature? T.B.D.

AmandaSpurdle commented 4 years ago

How to represent the affected feature? If HGVS notation against a specific reference sequence is included elsewhere, then the reference and variant allele should already be covered. But the specific position within a motif could be important: e.g. a variant at the +_12 position of a splicing motif (or the last base of the exon) is much more likely to alter splicing than one at, say, the +5 position. So if the intention is to say that a variant falls into a motif within an exon, there would be value in being more specific.

mbrush commented 4 years ago

A couple sources have provided potential new requirements to support additional detail/precision around characterizing the nature/extent of how a variant overlaps with a feature (in addition to the percentOverlapQualifier we have right now).

One is Amanda's comment above - which suggests that we should include a qualifier to capture where within the affected feature the variant hits. I added a featureSubloctionQualifier attribute as an exploratory element in the Relative Location spec here, which takes a VR Interval object as its value.

Another is the Beaconv2 variant annotation use case here to provide additional detail about the nature of a variant's overlap with a given feature:

"variantGeneRelationship: Categorical value classifying the variant according to the broadness of the variant effect in terms of genes: intergenic, 5UTR, 3UTR, single-gene (exonic, intronic), in overlapping genes (exonic, intronic), spanning multiple genes, multiple genes"

My gut feeling is that this use case should be handled by finding/adding the right term to the value set bound to the descriptor (e.g. SO).

Finally, in the absence of (or in addition to) these solutions, users can always add any information that adds detail or nuance to the core, structured relative location statement into a free text description field.
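As a rough illustration of what a sublocation qualifier value could hold, the Python sketch below converts the overlap between a variant and a feature into feature-relative interbase coordinates. The function name, the tuple representation, and the strand handling are all assumptions for illustration; the actual attribute takes a VR Interval object, which is not modeled here.

```python
from typing import Optional, Tuple

Span = Tuple[int, int]  # (start, end) in interbase coordinates, start < end


def feature_sublocation(variant: Span, feature: Span, strand: str = "+") -> Optional[Span]:
    """Return the overlapping region in feature-relative coordinates, or None if disjoint.

    Offsets count from the feature's 5' end, so minus-strand features are flipped.
    Zero-length variants (e.g. insertions) are not handled in this sketch.
    """
    v_start, v_end = variant
    f_start, f_end = feature
    start = max(v_start, f_start)
    end = min(v_end, f_end)
    if start >= end:
        return None  # no shared bases
    if strand == "+":
        return (start - f_start, end - f_start)
    return (f_end - end, f_end - start)


# Example: an SNV at the last base of a 20-base plus-strand exon.
exon = (100, 120)
snv = (119, 120)
sublocation = feature_sublocation(snv, exon)  # (19, 20)
is_last_base = sublocation is not None and sublocation[1] == exon[1] - exon[0]
```

A value like this would let a consumer distinguish the "last base of the exon" case raised above from a hit elsewhere in the same feature without re-deriving coordinates.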

mbaudis commented 4 years ago

@mbrush We have general interest in genome feature affection and physical overlap concepts (e.g. deletion overlapping a gene locus - CDR? 3'? Amount of overlap?). And then also: services translating this to genome coordinates, so that one can run respective queries on un-annotated variant stores...

mbrush commented 4 years ago

@mbaudis we can support the first two interests/use cases (overlap and how much). But services translating into genome coordinates are beyond our remit.

mbrush commented 4 years ago

On the April 1 VA call the question arose about if/how the model supports description of the relative location of a variant w.r.t. a feature as defined in different reference contexts (e.g. the promoter region of the EGFR gene as defined on build 37 vs 38). The answer may depend on the precision at which Sequence Features are defined in our model (do they represent a conceptual feature that may map to different locations in different reference contexts, or do they represent a specific feature in a specific context).

This issue was noted in the Sequence Feature ticket #19. An answer to the question about Relative Location Statements here may depend on how this question about Sequence Feature precision is resolved.

mbrush commented 3 years ago

Decisions/Outcomes of 10-14-20 VA Call about the Relative Location Statement Model:

To Resolve: