Modeling metadata for 'static' Variation Sets

mbrush commented 4 years ago

Background

Following Variation Set discussions at the Boston Plenary, it was decided that VR will only provide a model for static Variation Sets - as simple objects holding just an enumerated list of members. A computed hash-based id will be generated based on this content.

_id: curie [1..1] (computed from hash of members)
type: curie {VariationSet}
members: Variation [0..m]
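
To make the hash-based id concrete, here is a minimal Python sketch of a static set whose id is derived only from its members. The digest scheme (truncated SHA-512, base64url-encoded), the `ga4gh:VS.` prefix, the way members are serialized, and the example member ids are all assumptions for illustration, not something decided in this ticket.

```python
import base64
import hashlib


def truncated_digest(blob: bytes) -> str:
    """Base64url encoding of the first 24 bytes of SHA-512 (assumed scheme)."""
    return base64.urlsafe_b64encode(hashlib.sha512(blob).digest()[:24]).decode("ascii")


def make_variation_set(member_ids):
    """Build a static set; the id depends only on the enumerated members."""
    # Deduplicate and sort so the same members always give the same id.
    members = sorted(set(member_ids))
    # Hypothetical serialization: member ids joined with ';'.
    digest = truncated_digest(";".join(members).encode("utf-8"))
    return {
        "_id": "ga4gh:VS." + digest,  # prefix is an assumption
        "type": "VariationSet",
        "members": members,
    }


# Placeholder member ids, not real computed identifiers.
set_a = make_variation_set(["ga4gh:VA.example1", "ga4gh:VA.example2"])
set_b = make_variation_set(["ga4gh:VA.example2", "ga4gh:VA.example1"])
set_c = make_variation_set(["ga4gh:VA.example1", "ga4gh:VA.example3"])
```

In this sketch `set_a` and `set_b` share an id because they enumerate the same members, while `set_c` gets a different one. Nothing other than the member list feeds the digest.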

VR will NOT model virtual/computed Variation Sets as a separate type of Variation object that can be directly annotated, as initially proposed. This means that for sets of variation that are computed (rather than manually enumerated), we will represent and identify the materialized set that results, but not the functional definition of the set as a separate type of Variation.

Variation set creators will, however, have the option of capturing info about the functional definition and source of a computed variation set, as provenance metadata about the static set. VA and VR will work together to define elements and structure for this. We will also have to provide a model to describe how the simple Variation Set objects provided by VR should be interpreted as the subject of a VA Statement. More on this to come.

Importantly, we also agreed that these Variation Sets can be direct annotation subjects (i.e. fill the subject slot of a VA Statement). At present, this is the only way to annotate a Variation Set (there is no 'To the Side' option). This means that third parties/aggregators wishing to apply a different Variation Set as the subject of an existing VA Statement would create a separate VA Statement with their set in the subject slot. They could then reference the original VA Statement to indicate the source from which the new statement was derived. (This is in contrast to the 'To the Side' model that allowed for attaching multiple Expansion Sets to the same VA Statement, allowing alternate interpretations of the subject to be represented together in a single annotation.)
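
As a rough illustration of the "separate statement that references the original" pattern, the Python sketch below copies a statement, swaps in a different set as subject, and records where it came from. The `Statement` class, its field names (including `derived_from`), and all ids and values are hypothetical; VA has not defined these elements.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class Statement:
    # Hypothetical, simplified stand-in for a VA Statement.
    id: str
    subject: str                        # id of the Variation Set being annotated
    predicate: str
    object: str
    derived_from: Optional[str] = None  # id of the source statement, if any


def restate_with_subject(original: Statement, new_id: str, new_subject: str) -> Statement:
    """Create a new statement about a different set, pointing back at the original."""
    return replace(original, id=new_id, subject=new_subject, derived_from=original.id)


# Placeholder ids and values for illustration only.
original = Statement(
    id="stmt:1",
    subject="ga4gh:VS.setA",
    predicate="example-predicate",
    object="example-object",
)
derived = restate_with_subject(original, new_id="stmt:2", new_subject="ga4gh:VS.setB")
```

The original statement is left untouched, so the two interpretations of the subject live in two statements linked by `derived_from` rather than side by side in one annotation.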

As an aside, the static Variation Set VR will provide can be used (once decorated as described above) to cover two of the three VA use cases for sets (Expansion Sets, Variation Profiles). It will not cover the Categorical Variation use case.

mbrush commented 4 years ago

The idea of a Variation Set Metadata object that would live outside and reference a Variation Set was proposed at the Boston Plenary. It would provide a way to capture features of a Variation Set that cannot be included in the set itself because they are not "identifying" features used in computation of its identifier.
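
A minimal Python sketch of what such an external object might look like follows. Every field name here (`describes`, `functional_definition`, `source`) and every value is hypothetical and only meant to show the separation between the set's identifier and its non-identifying metadata.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class VariationSetMetadata:
    # Hypothetical fields; the actual elements are still to be defined.
    describes: str                              # id of the static Variation Set
    functional_definition: Optional[str] = None  # how a computed set was derived
    source: Optional[str] = None                 # where the set came from


# Placeholder id, not a real computed digest.
set_id = "ga4gh:VS.placeholder"

meta = VariationSetMetadata(
    describes=set_id,
    functional_definition="hypothetical rule used to compute the members",
    source="hypothetical pipeline v1",
)

# Metadata can be revised freely; the set id it points at is not recomputed.
meta.source = "hypothetical pipeline v2"
```

Because the metadata object only references the set by id, edits to it never touch the members, and so never change the hash-based identifier.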

I started a gdoc here to begin organizing notes/ideas on this topic. I think the initial work here may be too big to do in this ticket, and the gdoc format enables collaborative editing and commenting on specific components of the proposal. Rather, we can use this ticket to capture key questions or outcomes arising from work in this Google document.

larrybabb commented 4 years ago

@mbrush At the Oct 30 '19 VA meeting I suggested modeling a few real ClinVar examples to help get down to the concrete issues that the metadata model would need to support for variation sets. You requested I identify a few. Here are some examples.

481015 this variation set has at least one "functional assessment" (SCV) associated with the aggregate representation, which requires a specific transcript context, whereas the clinical assessments don't.
689385 a text variant from OMIM with no reference sequence basis.
617461 a protein-only variant. There are only 103 of these in ClinVar, and it is unclear whether they are still accepted.

There are a bunch of types of ClinVar variations, which can be found by searching on the "type of variation" property in the advanced search. Here's the current list of variation types.

complex
compoundheterozygote
copy number gain
copy number loss
deletion
diplotype
distinct chromosomes
duplication
fusion
indel
insertion
inversion
microsatellite
protein only
single nucleotide variant
tandem duplication
translocation
variation

Interesting question... Are all of these types considered variation sets the way that ClinVar normalizes them? If not, then is there something in the variation set metadata that will define the "type" of variant that the set represents, so that it can play well with the other non-set types?

mbrush commented 4 years ago

@larrybabb, regarding:

481015 this variation set has at least one "functional assessment" (SCV) associated to the aggregate representation, which requires a specific transcript context, whereas the clinical assessments don't.

. . . which of the three SCVs aggregated by this VCV are you referring to, and where in the VCV record might I find this information about a transcript-specific functional assessment?

Also, I would love to hear more about why you chose these specific examples and whether they pose specific challenges for our modeling efforts, in particular how common these challenges may be and whether the lessons they provide are generalizable. I just don't want to get too caught up on challenging edge cases if they represent very rare and unique outliers. Perhaps we can discuss on the next VA call.

larrybabb commented 4 years ago

SCV000925744.1 was @dsonkin's functional, transcript-based submission, which was aggregated with the other clinical and research pathogenicity statements.

Nothing in the data indicates whether the submitter (lab, researcher, clinic, etc.) was doing a functional study that was transcript-specific or not. This is one that @dsonkin pointed out for the submissions and research he is providing to ClinVar. This is the beauty of the flexibility of ClinVar and the challenge for how far we can actually go with re-representing data in ClinVar. In many, many cases, we cannot assume we know what the data specifically is. We must generalize and recognize that we don't always know the quality, precision and scope of the data and evidence submitted.

I didn't put too much thought into the choices. I don't believe these are edge cases as I mostly picked some random items based on variant types.

If the aim is to be able to represent any and all data in ClinVar, then these are good cases. If we are trying to identify only the main cases (like ACMG-based pathogenicity calls), then I think it will be challenging to figure out precisely how to map these without presuming too much on behalf of the data.

dsonkin commented 4 years ago

In the 481015 submission I was reporting functional evidence based specifically on transcript NM_000546.5. As you can see on the ClinVar page, in the "Functional consequence:" section I directly specified "NM_000546.5:c.375+5G>A" to make sure there is no confusion about which transcript was used for variant interpretation. Because unfortunate technical caveats can cause problems when reporting splice variants in ClinVar with the transcript notation used above, I had to specify the variants using genomic locations. (In the ClinVar submission file I also provided the transcript notation in the "Alternate designations" column.) The exon numbers specified in this submission would make no sense without knowing which transcript was used for functional interpretation of the variant.
There are clearly cases in which transcript information is irrelevant, for example a non-coding change at an enhancer. However, we definitely need submitters to be able to clearly specify which transcript was used for a variant interpretation. High-quality variant interpretations are based on ACMG/AMP guidelines and are of course done with a particular transcript in mind, since a variant may have different molecular consequences on other transcripts. For example, in the TP53 guidelines, transcript NM_000546.4 is listed right at the top of the specifications. If HGVS cDNA notation is provided in a ClinVar submission, the submitter's HGVS cDNA notation is preserved in the ClinVar DB. Two examples for FGFR1, 522553 and 373103, are reported using different transcripts.

ga4gh / va-spec

Modeling metadata for 'static' Variation Sets #51
