ga4gh / va-spec

An information model for representing variant annotations.
Apache License 2.0

Clarify the semantics of Oncogenicity vs Pathogenicity #29

Open mbrush opened 5 years ago

mbrush commented 5 years ago

This ticket aims to get a better handle on how oncogenicity and pathogenicity are related, and the constraints we should model around them with respect to variant origin, the disease object, and the semantics of the predicate in these annotation types.

At the heart of this is clearly distinguishing what is being stated in an interpretation record that says a BRCA variant is oncogenic for breast/ovarian cancer (e.g. here), vs pathogenic for breast/ovarian cancer (e.g. here), vs pathogenic for a cancer predisposing syndrome (e.g. here). There are lots of issues to unpack here, which I attempt to do below. Hoping we can discuss in this ticket and/or on upcoming VA calls.


1) What is the difference between a proper cancer and a 'familial cancer' or 'cancer predisposition syndrome' (as defined in terminologies like MedGen)?

e.g. what is the difference between the following terms in the MedGen hierarchy of diseases:

2) How should we define the semantics of 'oncogenic for' vs 'pathogenic for'? The commitment we made initially in defining Variant Pathogenicity Interpretations (VPIs) and Variant Oncogenicity Interpretations (VOIs), holds that:

By these definitions we can say that a germline variant is pathogenic for a Mendelian condition or a cancer predisposition syndrome, and that a somatic variant is oncogenic for a Cancer. But it would not make sense to allow the statement that a BRCA variant is pathogenic for a Cancer - because a single variant cannot be pathogenic for a multi-gene condition like Cancer. But such assertions are made in many ClinVar records (see examples below).
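To make the constraints in the paragraph above concrete, here is a minimal Python sketch of how they might be checked. All names (`Statement`, `Origin`, `ConditionKind`, `ALLOWED`, `check`) are hypothetical and not part of va-spec; the rule table only encodes the initial commitment as described in this ticket, and is an assumption rather than an agreed definition.

```python
from dataclasses import dataclass
from enum import Enum


class Origin(Enum):
    GERMLINE = "germline"
    SOMATIC = "somatic"


class ConditionKind(Enum):
    MENDELIAN = "Mendelian condition"
    PREDISPOSITION_SYNDROME = "cancer predisposition syndrome"
    CANCER = "cancer"


@dataclass(frozen=True)
class Statement:
    variant: str
    origin: Origin
    predicate: str  # "pathogenic for" or "oncogenic for"
    condition: str
    condition_kind: ConditionKind


# Hypothetical rule table: predicate -> (required origin, allowed condition kinds).
ALLOWED = {
    "pathogenic for": (
        Origin.GERMLINE,
        {ConditionKind.MENDELIAN, ConditionKind.PREDISPOSITION_SYNDROME},
    ),
    "oncogenic for": (Origin.SOMATIC, {ConditionKind.CANCER}),
}


def check(stmt: Statement) -> list[str]:
    """Return a list of rule violations; an empty list means the statement conforms."""
    if stmt.predicate not in ALLOWED:
        return [f"unknown predicate: {stmt.predicate!r}"]
    required_origin, allowed_kinds = ALLOWED[stmt.predicate]
    problems = []
    if stmt.origin is not required_origin:
        problems.append(
            f"'{stmt.predicate}' expects a {required_origin.value} variant, "
            f"got {stmt.origin.value}"
        )
    if stmt.condition_kind not in allowed_kinds:
        problems.append(
            f"'{stmt.predicate}' does not allow a condition of kind "
            f"'{stmt.condition_kind.value}'"
        )
    return problems


# The 'rule-breaking' ClinVar pattern: germline variant, pathogenic for a cancer.
rule_breaker = Statement(
    "BRCA2 variant", Origin.GERMLINE, "pathogenic for",
    "breast/ovarian cancer", ConditionKind.CANCER,
)
print(check(rule_breaker))
```

Under these assumed rules, the ClinVar-style statement is flagged only on its condition kind, which is exactly the case questions (2) through (4) ask how to handle.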

We should discuss/hear from experts - but ultimately may need to loosen/refine the definitions above to account for the realities presented by the data (e.g. perhaps we can allow for 'pathogenic for' to mean 'cause or contribute to', instead of 'cause on its own' . . . this would allow its use with Cancers, but also perhaps lose some precision, such that we aren't able to differentiate between variants that are causal vs contributory). Alternatively, we stick to our guns and provide guidance for 'correcting' data that break the rules of our model.

3) What types of conditions should be allowed in the descriptor slot for VPI vs VOI statements? Consider what is meant when the conditions in point (1) above are used in variant pathogenicity/oncogenicity interpretations. For example, in ClinVar BRCA2 variants are asserted to be pathogenic for all of them (see here). Under what semantics of the predicate and condition terms are these statements valid?

4) Should we modify the semantics of our model to allow less precise representation of these data as it is provided to us? Or do we continue to aspire to modeling more rigorous/precise semantics, and help submitters transform their data accordingly to fit our model?

For example, perhaps it is the case that in the ClinVar examples above, the submitters that annotated germline variants to 'breast/ovarian cancer' itself really meant that the variant is oncogenic for this cancer, or that it is pathogenic for one of the related cancer predisposition syndromes described by 'Breast-ovarian cancer, familial 2' or 'Hereditary breast and ovarian cancer syndrome'. It may simply be that careless/imprecise selection and/or definition of the disease led to the statements asserting that BRCA mutations are pathogenic for the cancer itself.


The bottom line here is that we really need to understand all this so we can determine how to define and distinguish VPI from VOI w.r.t. constraints on variant origin, constraints on disease object, and semantics of the predicate. And then clearly document and provide guidance to data creators and consumers.

Hoping that the domain experts in this space can help clarify what is being asserted in the different ClinVar examples above, and if there are issues related to the rigor and precision of how the data is captured, as I suspect. And if there are issues, how big a deal are they . . . how important is it that we overcome this by creating a more precise data model?

DavidTamborero commented 5 years ago

Referring to the confusion you mention in your point 2: a pathogenic variant indeed causes a condition (e.g. cystic fibrosis), but --in the cancer realm-- this condition is not a cancer but a cancer-predisposing syndrome.

Regarding the other points, the same variant can be reported in repositories such as ClinVar as pathogenic regardless of its somatic/germline origin (see e.g. here). So according to our scheme, in these cases the variant should be classified as pathogenic (with the supporting evidence coming from the germline findings and relating to cancer predisposing syndrome(s)) and oncogenic (as supported by the evidence from the somatic findings and relating to cancer type(s)).

Note that, in the case of LoF events in tumor suppressors, a variant is important regardless of whether it is found germline or somatic. In other words, the TP53 c.916C>T missense variant of the previous example is going to be functional (i.e. LoF) regardless of whether it occurs germline or somatic. The point of the label (i.e. pathogenic or oncogenic) is to indicate the context in which this effect has been demonstrated (e.g. it comes from germline data of a case-control study or it comes from the identification of a hotspot of somatic mutations).

hope it helps!

mbrush commented 5 years ago

Thanks David. re:

a pathogenic variant indeed causes a condition (e.g. cystic fibrosis), but --in the cancer realm-- this condition is not a cancer but a cancer-predisposing syndrome.

This is how we defined things and set up our model semantics initially as well:

But in ClinVar we see records such as this, asserting a germline variant to be pathogenic for a cancer - which breaks the rules we set out in our model. So the question is how to handle this. Curious if others agree with the characterization above, and/or have thoughts on how to handle 'rule-breaking' cases in ClinVar or other sources?


Types of 'Rule-breaking' assertions:

larrybabb commented 5 years ago

ClinVar shows both individual assertions about variants (SCVs) and aggregated (derived) assertions (RCVs, which are variant/condition aggregates, and VCVs, which are variant aggregates). These last two can be quite confusing to users as they seem like assertions in themselves, but they are merely aggregated assertions and can change with any new or updated submission to the baseline assertions made by individual labs and contributors.

In the case that David points out above, the Somatic and Germline assertions are segregated (at the bottom). However, ClinVar has not really devised a clear presentation and still aggregates an overall "Interpretation" at the top - which is at best messy and more likely not useful for any practical use case. We would like to fix this confusion with ClinVar going forward and I believe there may be plans to help deal with that issue. So please do not confuse the top level Variant (VCV) or Variant/Disease (RCV) derived ClinVar aggregate assertions with actual assertions made by a lab or contributor that applied a methodology similar to AMP or ACMG for somatic or germline interps of variants, respectively.
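As a small illustration of the SCV vs RCV/VCV distinction, the sketch below sorts ClinVar accessions by prefix so that only submitter-level assertions are treated as source statements. The function names and return labels are hypothetical, and the accessions in the usage lines are placeholders rather than real records.

```python
# Prefix -> record level, following the description in the comment above.
# The label strings are made up for this sketch.
LEVELS = {
    "SCV": "submitted assertion",
    "RCV": "variant/condition aggregate",
    "VCV": "variant aggregate",
}


def record_level(accession: str) -> str:
    """Classify a ClinVar accession by its three-letter prefix."""
    prefix = accession.strip().upper()[:3]
    try:
        return LEVELS[prefix]
    except KeyError:
        raise ValueError(f"not an SCV/RCV/VCV accession: {accession!r}") from None


def submitted_only(accessions):
    """Keep only accessions that denote an individual submitter's assertion."""
    return [a for a in accessions if record_level(a) == "submitted assertion"]


# Placeholder accessions, for illustration only.
mixed = ["SCV000000001", "RCV000000002", "VCV000000003"]
print(submitted_only(mixed))
```

Filtering this way keeps the derived top-level "Interpretation" out of any check against the model's rules, which should apply to what a lab or contributor actually asserted.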

larrybabb commented 5 years ago

Matt from previous post above...

But in ClinVar we see records such as this, asserting a germline variant to be pathogenic for a cancer - which breaks the rules we set out in our model. So the question is how to handle this. Curious if others agree with the characterization above, and/or have thoughts on how to handle 'rule-breaking' cases in ClinVar or other sources.

These germline direct associations to cancers could simply be confusion by the submitters. We can clarify by discussing with some domain experts. I can try to line up someone for a future VA call to clarify. But regardless of the answer, there will always be folks that will point to the wrong disease. The distinction to some of the asserters is too nuanced for their use. While we must be precise in our specification, we must also deal with the pragmatic applications and exceptions in a way that does not prevent these cases from being used going forward.

I may be wrong and we may be able to require "only predisposing representations of cancer" for germline var path interps, but it will take some time to get the field to adopt it. I'm not sure what the best answer is other than that we specify what it should be and mention that when the condition is not a predisposing form in a germline var path cancer interp, we will assume it was meant to be so.
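One way to picture this pragmatic approach is a lenient pass that accepts the record and attaches a note instead of rejecting it. This is only a sketch: the `Interpretation` class, its fields, and the note wording are all hypothetical, and whether a condition counts as a "predisposing form" is assumed to be decided upstream.

```python
from dataclasses import dataclass, field


@dataclass
class Interpretation:
    variant: str
    origin: str  # "germline" or "somatic"
    predicate: str  # "pathogenic for" or "oncogenic for"
    condition: str
    condition_is_predisposing: bool  # assumed to be determined upstream
    notes: list = field(default_factory=list)


def annotate(interp: Interpretation) -> Interpretation:
    """Accept the record as submitted, adding an advisory note where the
    condition is not a predisposing form in a germline pathogenicity interp."""
    if (
        interp.origin == "germline"
        and interp.predicate == "pathogenic for"
        and not interp.condition_is_predisposing
    ):
        interp.notes.append(
            f"condition '{interp.condition}' is not a predisposing form; "
            "assuming the predisposing form was meant"
        )
    return interp


record = annotate(
    Interpretation("BRCA2 variant", "germline", "pathogenic for",
                   "breast cancer", condition_is_predisposing=False)
)
print(record.notes)
```

The submitted condition is left untouched, so consumers can still see what the submitter wrote alongside the assumption that was applied.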

DavidTamborero commented 5 years ago

mmm, but the problem here is that this particular ClinVar record is maybe not well annotated (the condition should be the predisposing syndrome, although note that the breast cancer that appears there is the 'aggregated' ClinVar condition; I do not see the condition stated by the original source in the table, though).

The take-home message is that (I would say) any event in any database can be represented by our scheme, but I'm afraid this cannot be fully automated.

DavidTamborero commented 5 years ago

oops, sorry, I replied w/o seeing Larry's answer

mbrush commented 5 years ago

This all sounds reasonable. I think there is general agreement that the semantics we defined for our model are precise and generally correct, but our model may need to allow for data that does not follow these rules. Rather than encode these points in formal constraints, we should provide informal guidance to data providers, allow for messy data, and help users understand how to interpret it.

DavidTamborero commented 5 years ago

In case the follow-up to today's call is here: if I followed the discussions well (when I get to jump in I always have the feeling that you have already been discussing extensively what I'm just thinking, sorry if this is the case), it seems to me that there are two different issues:

hope it helps!

mbrush commented 5 years ago

See comment here on the Variant Oncogenicity ticket about a proposal to collapse pathogenicity and oncogenicity interpretations into a single VA type - which addresses many of the issues/questions raised above.