ga4gh / vrs

Extensible specification for representing and uniquely identifying biological sequence variation
https://vrs.ga4gh.org
Apache License 2.0

Explain the problem caused by 'intersecting' expansion sets #49

Closed mbrush closed 5 years ago

mbrush commented 5 years ago

Hi all. On the May 6 VR call (and in Hinxton the week before), concerns were raised about scenarios in which a single discrete variant instance (e.g. NP_478102.2:p.Ser73Arg) might appear in more than one variation expansion set in a given corpus of VA data (e.g. ClinVar).

Can someone explain why this creates a "conundrum" - perhaps with examples of specific tasks/use cases where this is problematic, and what the problems are? @larrybabb calling on you first here as you raised this concern most recently - I think from the perspective of mapping ClinVar variation ids to ClinGen allele registry ids.

Thanks, and apologies if this is clear to others and I am just being dense. But I suspect that I (and others) may be thinking about this from different angles/perspectives, or imagining a different workflow for creating, indexing, and querying over variation sets. From where I sit it is not clear how 'intersecting' expansion sets cause problems.

larrybabb commented 5 years ago

My reference to this concern was coming from the notion of how to "correctly" handle aggregation of statements that are related to sets that have intersecting members.

In ClinVar the variationID expansion sets include lift over between genomic builds as well as both projection from genomic to transcript and transcript to protein sequences. All of these lift-over and projected forms are provided for the purpose of aggregation.

One of the primary use cases for variant knowledgebases like ClinVar is to aggregate knowledge about a variant. In the case of ClinVar this knowledge is being submitted by heterogeneous sources into an archive. Identifying agreement and conflict between the same or similar statements about the same variant is a major feature and value proposition of the ClinVar archive. ClinGen works with ClinVar submitters and ClinVar to identify, verify and help with expert resolution of aggregate variant pathogenicity statements in ClinVar. ClinVar has automated "review status" policies informed by input from ClinGen and other submitters. This review status is often referred to as the "star levels" for a given variant or variant-disease combination.

So when submissions are made to ClinVar it is very important to be able to gather knowledge about the same variant, reliably.

If you search ClinVar for the BRCA2 variant p.Ser326Arg[Variant name] you will see that there are 3 separate records, each with separate submissions of pathogenicity for similar diseases. While this may seem harmless, it isn't clear (to me) whether these three sets of statements can be aggregated and compared without altering the intended meaning of the statements themselves.

I think it is important that we clarify that if a set of variation is used as a subject of a statement then we presume that the entire set collectively represents what the statement is about. If we take the approach that any member of the set is a suitable substitute for the statement subject itself then we must be sure that we are not changing the intended meaning of the original statement.

I suppose therefore that maybe the concern isn't so much about whether members of the set intersect as it is about whether any member of the set can be substituted for the statement's subject without altering the meaning of the statement.
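To make the substitution concern concrete, here is a minimal Python sketch. It is only an illustration: the record names, the non-protein member names, and the claim labels are hypothetical placeholders (only `p.Ser326Arg` is taken from the search above), and the functions are not part of any ClinVar or VRS API. It contrasts aggregating claims under each set member with aggregating them under the whole set.

```python
from collections import defaultdict

# Hypothetical records: each statement's subject is a whole set of variants.
# Only "p.Ser326Arg" comes from the thread; every other name is a placeholder.
statements = [
    {"record": "record_A",
     "subject_set": {"genomic_1", "transcript_1", "p.Ser326Arg"},
     "claim": "claim_A"},
    {"record": "record_B",
     "subject_set": {"genomic_2", "transcript_2", "p.Ser326Arg"},
     "claim": "claim_B"},
    {"record": "record_C",
     "subject_set": {"genomic_3", "p.Ser326Arg"},
     "claim": "claim_C"},
]


def aggregate_by_member(statements):
    """Attach each claim to every member of its subject set (member substitution)."""
    by_member = defaultdict(list)
    for st in statements:
        for member in st["subject_set"]:
            by_member[member].append(st["claim"])
    return {member: sorted(claims) for member, claims in by_member.items()}


def aggregate_by_set(statements):
    """Attach each claim to its whole subject set, treated as a single unit."""
    by_set = defaultdict(list)
    for st in statements:
        by_set[frozenset(st["subject_set"])].append(st["claim"])
    return {subject: sorted(claims) for subject, claims in by_set.items()}


by_member = aggregate_by_member(statements)
by_set = aggregate_by_set(statements)

# Under member substitution the shared protein change collects all three claims,
# while each genomic placeholder still carries only its own record's claim.
print(by_member["p.Ser326Arg"])
print(by_member["genomic_1"])
print(len(by_set))
```

Under the member-substitution reading, the three records collapse into one group at the protein level even though they stay separate at the genomic level, which is exactly the point where the intended meaning of each submission could be altered.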

larrybabb commented 5 years ago

I just had a conversation with a clinical geneticist about this issue in some detail. They agree that the subject of a pathogenicity statement is often the set of variants (genomic->transcript->protein). All statements of pathogenicity should start with the DNA change that was tested and observed, but should also capture the prediction evidence, literature, and other evidence related to a particular transcript and its predicted protein change that led to a specific disease pathogenicity assertion. These combinations (genomic, genomic+transcript, or genomic+transcript+protein) are all important to clearly defining the statement being made by the submitter and should be grouped (or associated) as such.

However, it would not be reasonable to presume that ClinVar's generalized variation set for each variant in ClinVar is precisely reflective of what each submitter intended for the statements/submissions associated with a given variant.

reece commented 5 years ago

A variant in two sets is a hallmark of a potential problem, but not the problem itself. Furthermore, it depends on intended use.

If a variant is in two distinct sets, the sets are not "complete" (i.e., not fully connected). The implication, then, is that variants in that intersection are in some way like the disjoint members of the sets, but that the disjoint members are dissimilar. This should be the first hint that annotating sets is going to be fraught with problems: an annotation might apply to only some members of these sets, but which ones is unspecified.

See this spreadsheet:

Study of 3 ClinVar "alleles" hgvs sets in gene CDKN2A that contain "NP_478102.2:p.Ser73Arg". Allele 133657 (①) occurs at one location. Alleles 484124 (②) and 9423 (③) occur at another location. All result in Ser73Arg in one transcript, but have different consequences in other transcripts.

Okay, now imagine using a ClinVar set as an annotation subject. The most immediate problem is that it's unclear which set should be the annotation subject if you want to make a statement about an intersection member.
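The ambiguity can be sketched with a small inverted index in Python. This is an illustration under stated assumptions: the three allele IDs and the shared protein expression come from the study described above, but the other set members are hypothetical placeholders (the spreadsheet contents are not reproduced here), and the function names are invented for this sketch.

```python
from collections import defaultdict

# Allele IDs and the shared protein expression are taken from the thread.
# The remaining members are hypothetical placeholders, not real HGVS strings.
SHARED = "NP_478102.2:p.Ser73Arg"
expansion_sets = {
    "133657": {SHARED, "placeholder_member_at_location_1"},
    "484124": {SHARED, "placeholder_member_at_location_2a"},
    "9423": {SHARED, "placeholder_member_at_location_2b"},
}


def index_members(sets_by_id):
    """Build an inverted index: member -> IDs of every set that contains it."""
    index = defaultdict(set)
    for set_id, members in sets_by_id.items():
        for member in members:
            index[member].add(set_id)
    return index


def candidate_subjects(member, index):
    """Return every set that could serve as the annotation subject for a member."""
    return sorted(index.get(member, ()))


index = index_members(expansion_sets)

# Members that appear in more than one set are the "intersection members".
intersection_members = sorted(m for m, ids in index.items() if len(ids) > 1)

print(intersection_members)
print(candidate_subjects(SHARED, index))
```

A member with more than one candidate set has no single obvious subject, so any tool that annotates by set has to pick one arbitrarily or ask the annotator.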

The other problems are correlated with the existence of overlapping sets, but not caused by them. For example, overlapping sets tend to be the kinds that are heterogeneous bags of stuff. Such sets have very little value (IMO) because the subject is ill-defined (or rather, precisely defined as a bunch of things), which dilutes the utility of any statement. And, my personal peeve is that this actually encourages people to make sloppy annotations on the wrong subjects. For example, a genomic variant is NEVER a stop codon: it might have that consequence in a particular transcript, but it is not itself a stop codon. And it's certainly never a catalytic residue in a protein.

reece commented 5 years ago

@mbrush Would you please close this issue if you think it's resolved? Thanks

mbrush commented 5 years ago

Thanks Reece - I now understand a bit more about the perspective from which you see intersecting sets as problematic (or indicative of potential problems). I want to dig into one statement in particular, as it highlights a potentially different view of the workflow/context in which expansion sets will be created and applied:

"Imagine using a ClinVar set as an annotation subject. The most immediate problem is that it's unclear which set should be the annotation subject if you want to make a statement about an intersection member."

This suggests a workflow in which annotation creators would 'shop' for and aim to re-use existing expansion sets initially created for some other annotation, or perhaps created independently of a particular annotation altogether. In my view 'expansion sets' are always created de novo for a particular annotation, to include all discrete variations which an agent believes are representative of the 'biological variant' to which the knowledge applies. So the scenario described above where it is "unclear which set should be the annotation subject if you want to make a statement about an intersection member", would never exist . . . because expansion sets are not 're-used' in this way (or if they are, there is always the option to create a new one if no existing set suffices).

If this is the case, then the specific issue above of 'selecting the right set for re-use' isn't necessarily a problem. But there are plenty of other reasons why we might prefer not to allow variation expansion sets as annotation subjects (which I will summarize elsewhere). I think that we agree about this in principle . . . based on what I heard in Hinxton, on calls, and in tickets, we all believe that it is best to always annotate to a specific, discrete variation. And that expansion sets, where desired, can be captured to the side.
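As a rough sketch of "annotate to a discrete variation, keep the expansion set to the side", the Python below models an annotation whose subject is a single variation and whose expansion set is created for that annotation alone. The class, field, and method names are hypothetical and are not drawn from the VRS schema; the non-subject set member and the statement text are placeholders.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Annotation:
    """Hypothetical annotation: one discrete subject, plus an optional side set."""
    subject: str                      # the single variation the statement is about
    statement: str                    # placeholder for the knowledge being asserted
    expansion_set: frozenset = field(default_factory=frozenset)

    def is_about(self, variation: str) -> bool:
        """True only for the discrete subject itself."""
        return variation == self.subject

    def is_related_to(self, variation: str) -> bool:
        """True for the subject or any variation the annotator listed to the side."""
        return variation == self.subject or variation in self.expansion_set


# The set is built de novo for this annotation, so there is no set to "shop" for.
annotation = Annotation(
    subject="NP_478102.2:p.Ser73Arg",
    statement="placeholder statement",
    expansion_set=frozenset({"placeholder_related_variation"}),
)

print(annotation.is_about("NP_478102.2:p.Ser73Arg"))
print(annotation.is_about("placeholder_related_variation"))
print(annotation.is_related_to("placeholder_related_variation"))
```

Keeping `is_about` and `is_related_to` separate means a query can still find the annotation through a related variation without that variation being mistaken for the statement's subject.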

As highlighted above however, it seems that the reasons why we feel this way may not be entirely overlapping, as a result of different views on the workflow/context in which expansion sets will be created and applied. I also think there are differing opinions about allowing expansion sets as annotation subjects in practice - that despite our philosophical objection to this, it may be impractical to completely ban it. But these are topics for another ticket/document, and a future call.

I will leave this ticket open a couple days in case there are responses to this comment - then close.