Clarify the semantics of Oncogenicity vs Pathogenicity

mbrush commented 5 years ago

This ticket aims to get a better handle on how oncogenicity and pathogenicity are related, and the constraints we should model around them with respect to variant origin, the disease object, and the semantics of the predicate in these annotation types.

At the heart of this is clearly distinguishing what is being stated in an interpretation record that says a BRCA variant is oncogenic for breast/ovarian cancer (e.g. here), vs pathogenic for breast/ovarian cancer (e.g. here), vs pathogenic for a cancer predisposing syndrome (e.g. here). There are lots of issues to unpack here, which I attempt to do below. Hoping we can discuss in this ticket and/or on upcoming VA calls.

1) What is the difference between a proper cancer and a 'familial cancer' or 'cancer predisposition syndrome' (as defined in terminologies like MedGen)?

e.g. what is the difference between the following terms in the MedGen hierarchy of diseases:

Hereditary breast and ovarian cancer syndrome
Breast-ovarian cancer, familial (seems to be a child term of the above term in MedGen)
Breast and/or ovarian cancer (no apparent relationship to others in MedGen hierarchy)

. . . in MedGen, the descriptions of the first two conditions sound exactly the same (see links above), describing an increased chance of developing the cancer (which could be linked to a single pathogenic variant). This is consistent with the fact that the second is a child of the first in the MedGen hierarchy. But the third describes the cancer itself (which would not be linked to a single pathogenic variant, but perhaps to several oncogenic variants).

2) How should we define the semantics of 'oncogenic for' vs 'pathogenic for'? The commitment we made initially in defining Variant Pathogenicity Interpretations (VPIs) and Variant Oncogenicity Interpretations (VOIs), holds that:

pathogenic = the presence of a (germline) variant on its own (at the appropriate allelic requirement) can cause the disease . . . which seemed reasonable given that we constrained the scope of VPI annotations to Mendelian conditions.
oncogenic = the presence of a (somatic) variant may be required for or contribute to the development of a Cancer, but alone is not sufficient to cause the disease . . . this is a bit weaker w.r.t. the causal implications.

By these definitions we can say that a germline variant is pathogenic for a Mendelian condition or a cancer predisposition syndrome, and that a somatic variant is oncogenic for a Cancer. But it would not make sense to allow the statement that a BRCA variant is pathogenic for a Cancer - because a single variant cannot be pathogenic for a multi-gene condition like Cancer. But such assertions are made in many ClinVar records (see examples below).
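To make the constraints in point (2) concrete, here is a minimal sketch of how they could be checked. All class, enum, and function names are hypothetical illustrations and not part of va-spec; the rules encoded are only the initial commitments described above (germline + Mendelian/predisposition condition for 'pathogenic for', somatic + Cancer for 'oncogenic for').

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class Origin(Enum):
    GERMLINE = "germline"
    SOMATIC = "somatic"


class ConditionKind(Enum):
    MENDELIAN = "mendelian"
    CANCER_PREDISPOSITION = "cancer_predisposition_syndrome"
    CANCER = "cancer"


class Predicate(Enum):
    PATHOGENIC_FOR = "pathogenic_for"
    ONCOGENIC_FOR = "oncogenic_for"


@dataclass(frozen=True)
class Interpretation:
    """Hypothetical, simplified stand-in for a VPI/VOI statement."""
    variant: str
    origin: Origin
    predicate: Predicate
    condition: str
    condition_kind: ConditionKind


def violations(interp: Interpretation) -> List[str]:
    """List the ways a statement departs from the initial definitions."""
    problems = []
    if interp.predicate is Predicate.PATHOGENIC_FOR:
        if interp.origin is not Origin.GERMLINE:
            problems.append("pathogenic_for expects a germline variant")
        if interp.condition_kind is ConditionKind.CANCER:
            problems.append("pathogenic_for expects a Mendelian condition "
                            "or cancer predisposition syndrome, not a cancer")
    else:  # ONCOGENIC_FOR
        if interp.origin is not Origin.SOMATIC:
            problems.append("oncogenic_for expects a somatic variant")
        if interp.condition_kind is not ConditionKind.CANCER:
            problems.append("oncogenic_for expects a cancer")
    return problems
```

Under this sketch, a germline BRCA2 variant asserted as pathogenic for 'Breast and/or ovarian cancer' yields one violation, which is exactly the kind of ClinVar record in question. Returning a list of problems rather than raising an error leaves room for the looser "allow messy data" option.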

We should discuss/hear from experts - but ultimately may need to loosen/refine the definitions above to account for the realities presented by the data (e.g. perhaps we can allow for 'pathogenic for' to mean 'cause or contribute to', instead of 'cause on its own' . . . this would allow its use with Cancers, but also perhaps lose some precision, such that we aren't able to differentiate between variants that are causal vs contributory). Alternatively, we stick to our guns and provide guidance for 'correcting' data that break the rules of our model.

3) What types of conditions should be allowed in the descriptor slot for VPI vs VOI statements? Consider what is meant when the conditions in point (1) above are used in variant pathogenicity/oncogenicity interpretations. For example, in ClinVar, BRCA2 variants are asserted to be pathogenic for all of them (see here). Under what semantics of the predicate and condition terms are these statements valid?

4) Should we modify the semantics of our model to allow less precise representation of these data as it is provided to us? Or do we continue to aspire to modeling more rigorous/precise semantics, and help submitters transform their data accordingly to fit our model?

For example, perhaps it is the case that in the ClinVar examples above, the submitters that annotated germline variants to 'breast/ovarian cancer' itself really meant that the variant is oncogenic for this cancer, or that it is pathogenic for one of the related cancer predisposition syndromes described by 'Breast-ovarian cancer, familial 2' or 'Hereditary breast and ovarian cancer syndrome'. It may simply be that careless/imprecise selection and/or definition of the disease led to the statements asserting that BRCA mutations are pathogenic for the cancer itself.

The bottom line here is that we really need to understand all this so we can determine how to define and distinguish VPI from VOI w.r.t. constraints on variant origin, constraints on disease object, and semantics of the predicate. And then clearly document and provide guidance to data creators and consumers.

Hoping that the domain experts in this space can help clarify what is being asserted in the different ClinVar examples above, and if there are issues related to the rigor and precision of how the data is captured, as I suspect. And if there are issues, how big a deal are they . . . how important is it that we overcome this by creating a more precise data model?

DavidTamborero commented 5 years ago

Referring to the confusion you mention in your point 2: a pathogenic variant indeed causes a condition (e.g. cystic fibrosis), but --in the cancer realm-- this condition is not a cancer but a cancer-predisposing syndrome.

Regarding the other points, the same variant can be reported in repositories such as ClinVar as pathogenic regardless of its somatic/germline origin (see e.g. here). So according to our scheme, in these cases the variant should be classified as pathogenic (with the supporting evidence related to the germline findings and related to a cancer predisposing syndrome(s)) and oncogenic (as supported by the evidence found in the somatic findings and related to a cancer type(s)).

Note that, in the case of LoF events in tumor suppressors, a variant is important regardless of whether it is found germline or somatic. In other words, the TP53 c.916C>T missense variant of the previous example is going to be functional (i.e. LoF) regardless of whether it occurs germline or somatic. The point of the label (i.e. pathogenic or oncogenic) is to distinguish in the label itself in which context this effect has been demonstrated (e.g. it comes from germline data of a case-control study or it comes from the identification of a hotspot of somatic mutations).

hope it helps!

mbrush commented 5 years ago

Thanks David. re:

a pathogenic variant indeed causes a condition (e.g. cystic fibrosis), but --in the cancer realm-- this condition is not a cancer but a cancer-predisposing syndrome.

This is how we defined things and set up our model semantics initially as well:

We can say a single germline variant is causal for a Mendelian condition like cystic fibrosis, or that a single germline variant is causal for a cancer predisposing syndrome. In these cases we use 'pathogenic' to indicate causality - i.e. that this is the only variant required to get the disease.
Likewise, we can say that a somatic variant contributes to / drives development of a cancer. Here we use 'oncogenic' to indicate this contributory/driving relationship.
But we would NOT say that a variant, be it germline or somatic, is causal/pathogenic for a cancer - because cancer requires more than one mutation to develop.

But in ClinVar we see records such as this, asserting a germline variant to be pathogenic for a cancer - which breaks the rules we set out in our model. So the question is how to handle this. Curious if others agree with the characterization above, and/or have thoughts on how to handle 'rule-breaking' cases in ClinVar or other sources?

Types of 'Rule-breaking' assertions:

somatic variant pathogenic_for Cancer -> oncogenic_for
germline variant pathogenic_for Cancer -> ???
somatic variant pathogenic_for cancer predisposing / familial cancer syndrome -> ???
. . .
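The list of rule-breaking patterns above could be kept as a small lookup table, sketched below. The string labels, the `SUGGESTED_PREDICATE` table, and the `triage` function are hypothetical names chosen for illustration; the entries marked `None` correspond to the '???' cases, which remain unresolved here.

```python
from typing import Dict, Optional, Tuple

# (variant origin, asserted predicate, condition kind) -> suggested predicate.
# None means the thread has not settled on a correction for this pattern.
SUGGESTED_PREDICATE: Dict[Tuple[str, str, str], Optional[str]] = {
    ("somatic", "pathogenic_for", "cancer"): "oncogenic_for",
    ("germline", "pathogenic_for", "cancer"): None,
    ("somatic", "pathogenic_for", "cancer_predisposition_syndrome"): None,
}


def triage(origin: str, predicate: str, condition_kind: str) -> str:
    """Describe how an assertion relates to the listed rule-breaking patterns."""
    key = (origin, predicate, condition_kind)
    if key not in SUGGESTED_PREDICATE:
        return "not a listed rule-breaking pattern"
    suggestion = SUGGESTED_PREDICATE[key]
    if suggestion is None:
        return "rule-breaking: no agreed correction, needs review"
    return f"rule-breaking: consider {suggestion}"
```

Keeping unresolved cases explicit in the table, rather than omitting them, makes it easy to see which patterns still need a decision.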

larrybabb commented 5 years ago

ClinVar shows both individual assertions about variants (SCVs) and aggregated (derived) assertions (RCVs-var/cond aggs and VCVs-var aggs). These last two can be quite confusing to users as they seem like assertions in themselves, but they are merely aggregated assertions and can change with any new or updated change to the baseline submitted assertions made by individual labs and contributors.

In the case that David points out above, the Somatic and Germline assertions are segregated (at the bottom). However, ClinVar has not really devised a super clear presentation and still aggregates an overall "Interpretation" at the top - which is at best messy and more likely not useful for any practical use case. We would like to fix this confusion with ClinVar going forward and I believe there may be plans to help deal with that issue. So please do not confuse the top level Variant (VCV) or Variant/Disease (RCV) derived ClinVar aggregate assertions with actual assertions made by a lab or contributor that applied a methodology similar to AMP or ACMG for somatic or germline interps of variants, respectively.
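As a rough illustration of keeping submitted assertions apart from derived aggregates, the sketch below sorts records by ClinVar accession prefix (SCV, RCV, VCV). The function names and level labels are hypothetical, and the accessions in the usage note are made-up placeholders, not real records.

```python
from typing import Iterable, List

# Accession prefix -> record level, following the SCV/RCV/VCV distinction above.
_LEVELS = {
    "SCV": "submitted",                    # assertion from a lab or contributor
    "RCV": "aggregate_variant_condition",  # derived per variant/condition
    "VCV": "aggregate_variant",            # derived per variant
}


def clinvar_record_level(accession: str) -> str:
    """Classify a ClinVar accession by its three-letter prefix."""
    prefix = accession.strip().upper()[:3]
    if prefix not in _LEVELS:
        raise ValueError(f"unrecognized ClinVar accession prefix: {accession!r}")
    return _LEVELS[prefix]


def submitted_only(accessions: Iterable[str]) -> List[str]:
    """Keep only accessions that identify directly submitted assertions."""
    return [a for a in accessions if clinvar_record_level(a) == "submitted"]
```

For example, `submitted_only(["VCV000000001", "RCV000000001", "SCV000000001"])` keeps only the SCV placeholder, which is the level at which a germline or somatic methodology was actually applied.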

larrybabb commented 5 years ago

Matt from previous post above...

But in ClinVar we see records such as this, asserting a germline variant to be pathogenic for a cancer - which breaks the rules we set out in our model. So the question is how to handle this. Curious if others agree with the characterization above, and/or have thoughts on how to handle 'rule-breaking' cases in ClinVar or other sources.

These germline direct associations to cancers could simply be confusion by the submitters. We can clarify by discussing with some domain experts. I can try to line up someone for a future VA call to clarify. But regardless of the answer, there will always be folks that will point to the wrong disease. The distinction to some of the asserters is too nuanced for their use. While we must be precise in our specification, we must also deal with the pragmatic applications and exceptions in a way that does not prevent these cases from being used going forward.

I may be wrong and we may be able to require "only predisposing representations of cancer" for germline var path interps, but it will take some time to get the field to adopt it. I'm not sure what the best answer is other than we specify what it should be and mention that when it is not a predisposing form in a germline var path cancer interp, that we will assume it was meant to be so.
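A minimal sketch of that fallback ("assume the predisposing form was meant") might look like the following. The `PREDISPOSING_FORM` table and the function name are hypothetical, and the single pairing shown is only an illustrative assumption based on the MedGen terms discussed earlier in this ticket; which syndrome term a given cancer should map to would need input from domain experts.

```python
from typing import Optional, Tuple

# Hypothetical lookup: cancer term -> assumed predisposing form.
# The pairing below is illustrative only, not a curated mapping.
PREDISPOSING_FORM = {
    "Breast and/or ovarian cancer": "Hereditary breast and ovarian cancer syndrome",
}


def normalize_germline_condition(condition: str, origin: str) -> Tuple[str, Optional[str]]:
    """Return (condition to use, note); the note records any assumption made."""
    if origin != "germline" or condition not in PREDISPOSING_FORM:
        # Nothing to assume: leave the submitted condition untouched.
        return condition, None
    assumed = PREDISPOSING_FORM[condition]
    note = f"submitted condition '{condition}' assumed to mean its predisposing form"
    return assumed, note
```

Returning the note alongside the substituted term keeps the original submission recoverable, so consumers can see that an assumption was applied rather than silently rewritten.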

DavidTamborero commented 5 years ago

mmm but the problem here is that this particular ClinVar record is maybe not well annotated (the condition should be the predisposing syndrome, although note that the breast cancer that appears there is the 'aggregated' ClinVar condition, I do not see the condition stated by the original source in the table though )

Take home message is that (I would say) any event in any database can be represented by our scheme, but I'm afraid this cannot be fully automated.

DavidTamborero commented 5 years ago

ops, sorry i replied w/o seeing Larry's answer

mbrush commented 5 years ago

This all sounds reasonable. I think there is general agreement that the semantics we defined for our model are precise and generally correct, but our model may need to allow for data that does not follow these rules. Rather than encode these points in formal constraints, we should provide informal guidance to data providers, allow for messy data, and help users understand how to interpret it.

DavidTamborero commented 5 years ago

In case the follow-up to today's call is here: if I followed the discussions well (when I get to jump in I always have the feeling that you have already been discussing extensively what I'm just thinking, sorry if this is the case), it seems to me that there are two different issues:

first, the number of 'variant contexts' --or whatever you call them-- that deserve different data models. I hope that we all agree that variants involving germline diseases and somatic cancer events should have different data models (i.e. 'fields to fill') to capture the details of the reported effect (if not, I'm missing a large part of the work of the group!). If so, I guess that the problem here is whether the current definitions for each of these contexts (the ones you described before) hold in the very specific case of germline cancer-predisposing variants, which somehow fall in the middle of these two contexts. And in my opinion, the definition indeed holds as long as we consider that the cancer-predisposing syndrome qualifies as the condition itself.
(related to the previous): regarding the case in which a variant is described as both associated with cancer-predisposing syndromes (germline) but also observed in the somatic cells as a 'functional' event (as in the discussed ClinVar example); I think that the neatest thing would be to fill the 'pathogenic data model' with the details describing the former and also fill the 'oncogenic data model' with the details of the latter, so this variant will have entries for both. This would be equivalent to having a variant associated with a multi-factorial (multi-genic) disease which also causes a Mendelian disease. In this case (which I have no idea whether it may exist), I assume that the variant will need to have an entry in the data model of both contexts (since you have a data model for multi-genic conditions, right?)
(also regarding the previous): if a germline variant is called pathogenic for e.g. breast tumor (and not breast cancer-predisposing, as in the other ClinVar example) I would say that this is not the most accurate statement, although not entirely wrong (since the final phenotype is the breast cancer). But note that this minor lack of accuracy has nothing to do with the ClinVar data model, but with the way that the curator filled this particular entry. This also applies to our data model: if somebody fills the info of this variant as an entry of the pathogenic data model (since it indeed contains the fields that are relevant for this variant), and he/she fills the condition entry as 'cancer x' instead of 'cancer x-predisposition', it would be a certain inaccuracy that has nothing to do with our data model being wrong.
and second, another distinct issue (which I felt was a bit mixed up during the call with all of the above) is the terms to be used as 'final classification terms' of the variants in these different data model contexts. In other words, whether we want to apply the classification of pathogenic/likely pathogenic etc to e.g. variants associated with Mendelian diseases (and cancer predisposing variants) as well as to oncogenic somatic variants, or we want to use something like 'oncogenic/likely oncogenic' etc for the latter. This is kind of a minor issue, but my vote is still to use the separate terms option, since in my experience with users (mainly clinicians) having labels that are self-explanatory of the different contexts is useful when reporting findings. But again, this is kind of a minor issue, so maybe I'm missing some point of the debate.

hope it helps!

mbrush commented 5 years ago

See comment here on the Variant Oncogenicity ticket about a proposal to collapse pathogenicity and oncogenicity interpretations into a single VA type - which addresses many of the issues/questions raised above.

ga4gh / va-spec

Clarify the semantics of Oncogenicity vs Pathogenicity #29