Variant Oncogenicity Interpretation definition and scope

Initial notes on the proposed scope and definition of this VA type, based on requirements and considerations documented here.

Definition: A statement about the contribution made (or lack thereof) by a somatic variant to a specific type of cancer, wherein the variant is described along a spectrum from benign to pathogenic.

Scope Notes:

These statements are constrained to be about somatic variants and cancers to which they contribute.
Assertions about the pathogenicity of germline variants for heritable conditions are described by a different VA type (Variant Pathogenicity Interpretation). This includes pathogenicity of germline variants for various cancer predisposition syndromes (e.g. Hereditary cancer-predisposing syndrome).

Comments:

Somatic variants can contribute to the pathogenesis of a cancer in different ways, e.g. by initiating its onset, or enabling or modifying its progression.

Issues/Questions:

Subject: somatic variants - represent with qualifier as in Variant Pathogenicity? Any nuance to consider here besides single 'somatic' value?
Descriptor: limit to the 'Cancer' subset of disease/genetic condition. Any nuance to consider beyond this?
a. As for VPI, will need to consider modeling here - is a single ontology term sufficient, or do we need a more complex object model to build up / post-compose a Cancer description?
Predicate: What is the set of relationships we want to make here? how granular?
Qualifiers: will use this to specify allele origin (somatic), and possibly to capture mechanism of pathogenesis (e.g. oncogene activation vs TSG inactivation)
Evidence/Provenance: likely as complex as for germline VPI. Arpad/Dimitry to present CIViC models and planned ACMG-like guidelines to inform requirements here.

Regarding the question of if/how to capture mechanism of pathogenicity (e.g. oncogene activation vs TSG inactivation) as part of this VA type, we first need to consider whether this is even in scope for the primary statement here. It may be that this mechanistic aspect represents a completely different statement, for which we should create a separate VA type.

If it is in scope here, we could do this using a qualifier with values like 'driver', 'modifier' . . . or 'oncogene activation' and 'TSG inactivation'. Alternatively, we could model this into the predicate, by defining a more granular set of relationships extending the basic ACMG-like ones (e.g. is_oncogenic_driver_of).
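To make the two options concrete, here is a minimal Python sketch. It is only an illustration, not a proposed schema: the `Statement` class, the `BASIC_PREDICATES` names, and the placeholder identifiers are all hypothetical; only `is_oncogenic_driver_of` and the qualifier value come from the text above.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical basic predicate names; no value set has been agreed on.
BASIC_PREDICATES = {
    "oncogenic_for",
    "likely_oncogenic_for",
    "uncertain_significance_for",
    "likely_neutral_for",
    "neutral_for",
}


@dataclass(frozen=True)
class Statement:
    subject: str                               # placeholder variant identifier
    predicate: str
    descriptor: str                            # placeholder cancer term
    mechanism_qualifier: Optional[str] = None  # only used by option 1


# Option 1: basic predicate, mechanism carried by a qualifier.
with_qualifier = Statement(
    "variant-X", "oncogenic_for", "cancer-Y",
    mechanism_qualifier="oncogene activation",
)

# Option 2: mechanism folded into a more granular predicate.
with_granular_predicate = Statement(
    "variant-X", "is_oncogenic_driver_of", "cancer-Y",
)


def readable_with_basic_predicates(statement: Statement) -> bool:
    """True if a consumer that only knows the basic predicates can read it."""
    return statement.predicate in BASIC_PREDICATES
```

One trade-off this makes visible: with the qualifier option, a consumer that ignores qualifiers still understands the core claim, whereas the granular-predicate option requires every consumer to know the extended predicate set.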

I put some thoughts here in case I cannot discuss with you online. As a disclaimer, remember that I have no expertise in developing data models; what I have is good experience in constructing genomic interpretation tools and in interacting with users with different profiles/needs in both research and clinical settings. Putting my comments in that context, please see the following (and apologies for any content that may be irrelevant at this point of your discussions):

I would keep the ‘high level’ interpretation terms simple, so that the ‘main’ classification can be understood at first sight by everyone. Therefore, I would define the main variant effect term along the lines of oncogenic/likely oncogenic/vus/likely neutral/neutral (see next point). The more elaborate terms (LoF, switch of function, gain of function, truncating variant, disrupting event, etc.) can sometimes be tricky to understand, especially in the context of certain genes, and I would leave these as more detailed info in an additional field (but I would indeed have such an additional field; see one of the points below).
Regarding these main effect terms, I would like to keep the 5 terms (e.g. oncogenic/likely oncogenic/vus/likely neutral/neutral) for two reasons: (a) it makes sense to have two tiers for how sure you are of the reported effect (‘it is kind of certain’ and ‘it is likely certain’); and (b) it is nice to keep it consistent with the pathogenic/likely pathogenic etc. model.
Regarding the specific labels, I vote for not using ‘pathogenic’ and ‘benign’ for somatic variants, so that they can be distinguished from the terms used for germline predisposing/causing effects. As for which labels to use, 10 people would likely have 10 favourites. I usually use ‘oncogenic’, ‘likely oncogenic’, ‘vus’, ‘likely neutral’ and ‘neutral’, as e.g. in OncoKB. But I acknowledge that this can be confusing if it is read as an effect that applies only to oncogenes. Other options are driver and passenger, but these may be too technical. Tumorigenic and non-tumorigenic?
I think that the cancer type should be part of the info, at least as an optional field. Note that many people, when talking about somatic oncogenic events, do not see the need to include the cancer type, since they somehow believe that the ‘oncogenic’ definition is universal. I would advise against that, since (a) some oncogenic variants (although a minority) are likely to be context-dependent (a variant that is oncogenic in one tumor tissue can be neutral in another, and vice versa); (b) the cancer type in which the reported effect has been tested is in any case useful info (and it is up to the user whether this can be extrapolated to other cancer types).
Regarding the last point, note that for some studies it is tricky to define the cancer type in which the particular effect has been evaluated (e.g. experimental models with a loosely defined cancer type, for different reasons that I will not enumerate here). Therefore, you need to allow a ‘not specific cancer type’ or similar term, meaning that this info cannot be specified.
Another fundamental question is the level of strength for stating a given effect. In our case, the effect can be reported e.g. as oncogenic or likely oncogenic, but an orthogonal question is the strength of the evidence that sustains it. For instance, a cancer cohort study can conclude that a variant is oncogenic, but that study may have some caveats (e.g. the sample size); whereas an experimental study can conclude that a variant is likely oncogenic (so it is not even certain that it is oncogenic, due to a reason ‘x’), yet the quality of the experimental data behind that conclusion is solid. Note that some knowledgebases only ‘accept’ data from studies with a certain level of quality, but others include both the level of relevance (which for oncogenic variants means the 5-level classification, and for biomarkers of drug response can range from a clinical guideline to a pre-clinical observation, etc.) and the level of strength (how good the clinical or pre-clinical study is that reports that level of relevance, in the drug biomarker example). Since we are developing a data model and not a database here, I would say that we need to include a field with the level of strength supporting the oncogenic/neutral effect.
As stated before, I'd like to see an optional field with the mechanism of action of the variant (when it is found to be oncogenic): loss-of-function, gain-of-function, etc.
I would also like to see a reference to the study(ies) in which the effect of the variant has been reported (e.g. a PubMed ID and, for the emerging ones, a conference abstract).
I do not know to what extent an ‘other comments’ field is technically acceptable, but I always think there is room for such a thing. For this variant model, it could include details of the level of evidence of the effect (e.g. if it is based on experimental data, some details about that experiment).
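As a rough illustration of the fields suggested in this comment, the following Python sketch collects them into one record. All class, field, and value names are hypothetical (the comment proposes the fields, not their names), and the 'weak'/'strong' strength values are placeholders rather than a proposed scale.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class Effect(Enum):
    """The five main effect terms suggested in the comment."""
    ONCOGENIC = "oncogenic"
    LIKELY_ONCOGENIC = "likely oncogenic"
    VUS = "vus"
    LIKELY_NEUTRAL = "likely neutral"
    NEUTRAL = "neutral"


# Placeholder for studies where the cancer type cannot be specified.
NOT_SPECIFIED = "not specified"


@dataclass
class OncogenicityRecord:
    variant: str                              # placeholder variant identifier
    effect: Effect                            # main, easy-to-read classification
    cancer_type: str = NOT_SPECIFIED          # optional; context of the effect
    evidence_strength: Optional[str] = None   # orthogonal to `effect`
    mechanism: Optional[str] = None           # e.g. loss-of-function
    references: List[str] = field(default_factory=list)  # e.g. PubMed IDs
    comments: str = ""


# Effect and evidence strength vary independently, as in the comment's example:
cohort_study = OncogenicityRecord(
    variant="variant-X",
    effect=Effect.ONCOGENIC,
    evidence_strength="weak",    # e.g. small sample size
)
experimental_study = OncogenicityRecord(
    variant="variant-X",
    effect=Effect.LIKELY_ONCOGENIC,
    evidence_strength="strong",  # solid experimental data
    mechanism="gain-of-function",
)
```

Keeping `effect` and `evidence_strength` as separate fields is what lets a stronger claim sit alongside weaker evidence, and vice versa.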

Regarding the last point, note that for some studies, to define the cancer type in which the particular effect has been evaluated is tricky (e.g. loose cancer type experimental models due to different reasons that I will not enumerate here). Therefore, you need to allow a ‘not specific cancer type’ or similar term meaning that this info can not be specified.

This discussion is equivalent to the one about leaving the "condition" field blank in Variant Pathogenicity type, am I right? https://github.com/ga4gh-gks/variant-annotation-model/issues/25

Regarding these other points:

another fundamental question is the level of strength for stating a given effect. [...] I would say that we need to include a field with the level of strength supporting the oncogenic/neutral effect.

I would like to see also a reference of the study(ies) in which the effect of the variant has been reported (e.g. pubmed id and –for the emerging ones-- a conference abstract)

a ‘other comments’ field is technically acceptable to be included, but I always think there is room for such a thing. For this variant model, this could include details of the level of evidence of the effect (e.g. if it is based in experimental data, to write some details about that experiment).

Sounds to me like they are all related to evidence and provenance. Definitely worth taking into account. We'll handle them when we get to modelling evidence/provenance.

Outcomes and Issues following 1-23-19 VA Call

Subject:

Scope: as a placeholder we define 'Variation' as the subject, which is broader than the notion of a 'Variant', in that it includes non-sequence variation concepts (e.g. inc/dec expression).
From David T: "any alteration can be oncogenic, including genomic variants such as mutations or CNAs, changes in expression/protein levels or even epigenomic events. And for the mutations, you can have specific genomic changes (a particular nucleotide change or indel), but also looser terms such as 'gene x, exons y-z loss', or 'inframe mutations in gene domain x', and an almost infinite number of combinations between gene domain or region and mutation type"
- AI: As we explore additional data examples, use cases, and somatic VA types, we will refine and formalize the notion of a 'Variation', consider its relation to the 'Molecular Profile' concept Alex introduced, and work with VR to define a suitable representation.
The variation here is necessarily somatic in origin. We will capture this using a variantOriginQualifier (as opposed to an attribute of a variant directly, or creating a 'Somatic Variant' subtype), as the somatic origin is relevant only in the context of a given annotation. See https://github.com/ga4gh-gks/variant-annotation-model/issues/22.
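A minimal Python sketch of this design choice follows. The class names, the `oncogenic_for` and `pathogenic_for` predicate strings, and the identifiers are hypothetical placeholders; only the idea of a `variantOriginQualifier` on the annotation comes from the notes above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variation:
    """The subject carries no origin attribute and has no 'Somatic' subtype."""
    id: str


@dataclass(frozen=True)
class Annotation:
    subject: Variation
    predicate: str                 # placeholder predicate name
    descriptor: str                # placeholder condition term
    variant_origin_qualifier: str  # origin is stated per annotation


shared_variation = Variation(id="variation-X")

# The same Variation object can be the subject of annotations that
# assert different origins, because origin lives on the annotation.
somatic_annotation = Annotation(
    shared_variation, "oncogenic_for", "cancer-Y",
    variant_origin_qualifier="somatic",
)
germline_annotation = Annotation(
    shared_variation, "pathogenic_for", "condition-Z",
    variant_origin_qualifier="germline",
)
```

Because the origin is attached to the annotation, the variation representation does not need to be duplicated or subtyped for each context.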

Descriptor:

Limit values to the 'Cancer' subset of disease/genetic condition. Any nuance to consider beyond this?
As for Pathogenicity Interpretations, we need to consider if this is a required field, and how to capture data when the condition/cancer is blank or 'not specified'. See #25.
As for the Genetic/Mendelian Condition descriptor in Pathogenicity Interpretations, will need to consider modeling here - is a single ontology term sufficient? Or will we need a more complex object model to build up / post-compose a Cancer description?
- AI: ClinGen and Monarch are both engaged in Condition/Disease modeling activities to meet needs of their respective applications. We should coordinate with these groups, and leverage their use cases to drive modeling in context of Schema Blocks effort.

Predicate:

Do we want to constrain with a value set? if yes, how granular should the semantics of the predicate be?
Several people have indicated a preference to keep this simple: 5 ACMG-like levels, and to keep 'mechanism' out of the predicate. The preference seems to be to use the term 'oncogenicity' or 'pathogenicity' as opposed to 'tumorigenicity'.

Qualifiers:

We will create qualifier properties to capture variant origin (somatic), and possibly to capture mechanism of pathogenesis (e.g. oncogene activation vs TSG inactivation)
variantOriginQualifier: same considerations as for pathogenicity interpretations, but value here is 'somatic'
pathogenicMechanismQualifier:
- If/how to incorporate the oncogenic mechanism info into this VA type (e.g. activation of oncogene vs inhibition of TSG) – as part of the core statement, or as separate supporting information.
- Initial thought is that this is relevant to include in this VA type.
- It could be captured as a qualifier . . . refining the meaning of the primary statement to something like "variant X is oncogenic for Cancer Y through mechanism Z" (e.g. gain of function mutation).
- We would need to identify/define terms to use as values for this qualifier - consider the set of values for CIViC Biological Assertions as a starting point (gain of function, loss of function, neomorphic, loss of function)
- Another possibility is to capture mechanism not as part of the primary statement (i.e. not using a qualifier), but instead as a separate 'supporting statement' that the variant has a certain functional impact, and is bundled into the oncogenicity annotation outside the primary statement (see Molecular Consequence annotation message examples here for what this may look like.)
- We could even use the existing Functional Impact statement type here - but would want to consider if/how this may be different (when we get to modeling the functional impact VA types)
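The 'supporting statement' alternative can be sketched as follows, purely as an assumption about what such bundling might look like. The class names, the `oncogenic_for` and `has_functional_impact` predicate strings, and the identifiers are hypothetical and are not taken from the Molecular Consequence examples or the Functional Impact statement type.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class Statement:
    subject: str
    predicate: str
    descriptor: str


@dataclass
class OncogenicityAnnotation:
    primary: Statement  # the core oncogenicity claim, with no mechanism in it
    supporting: List[Statement] = field(default_factory=list)


annotation = OncogenicityAnnotation(
    primary=Statement("variant-X", "oncogenic_for", "cancer-Y"),
    supporting=[
        # Mechanism expressed as its own functional-impact style statement.
        Statement("variant-X", "has_functional_impact", "gain of function"),
    ],
)


def mechanisms(a: OncogenicityAnnotation) -> List[str]:
    """Collect mechanism values from the bundled supporting statements."""
    return [
        s.descriptor
        for s in a.supporting
        if s.predicate == "has_functional_impact"
    ]
```

In this shape the primary statement stays identical whether or not a mechanism is known, and the mechanism can later be swapped for a reference to a separately modeled statement type.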

Evidence:

Will likely be rich and detailed, as for Variant Pathogenicity Interpretations.
I believe that the biological evidence objects that CIViC is starting to curate are largely used as evidence for oncogenicity interpretations. @arpaddanos perhaps you could list the types of data that are captured here - e.g. I suspect these may include population frequency data, functional impact data, computational predictions, and other data types that are specified as evidence in the ACMG germline interpretation framework. @larrybabb consider how these evidence types align with ClinGen VI model Statement types, and ACMG germline criteria.
We should also follow Dimitry and @gaberudy's work on oncogenicity interpretation frameworks that parallel the ACMG germline pathogenicity guidelines.

Given discussions and feedback on recent calls, we are exploring the idea of collapsing Variant Pathogenicity Interpretation (VPI) and Variant Oncogenicity Interpretations (VOI) into a single VA type (Variant Pathogenicity Interpretation). Motivations for collapsing are based on both semantic and pragmatic considerations:

Semantics (the similar meaning of these statements): Both make assertions about a variant causing or contributing to the development of a disease. While our original split between VPI and VOI allowed us to put a slightly finer point on whether a variant can cause a disease on its own or only contribute to it, the benefit of this may not outweigh the burden of having users understand this distinction and structure their data accordingly.
Pragmatics (the realities of actual/messy data): in our landscape analysis we came across cases that do not fit neatly into the VPI and VOI buckets we defined. For example, this ClinVar record, asserting a germline variant to be pathogenic for a cancer. A single, broader VA type that allows any assertion about a variant (somatic or germline) leading to the development of (causing or contributing to) a particular disease (Mendelian or cancer) accommodates the broadest range of annotations we find in real data, with minimal effort for data providers to find the right VA type and transform their data into a compliant structure.

A proposal for a collapsed model is defined in the spreadsheet here, and reflects the following decisions/considerations:

We recommend the predicate set {pathogenic_for, likely_pathogenic_for, benign_for, likely_benign_for, uncertain_significance_for} - where 'pathogenic' is defined broadly enough to cover causal or contributing variant-disease relationships, to accommodate interpretations on Mendelian conditions and cancer, respectively. The context in which the predicate is used can inform whether the variant is asserted to be causal vs contributing for the indicated condition: if the condition is Mendelian, the implication is that the variant is causal; if the condition is a cancer, the implication is that the variant is a contributing driver. One con here is that consumers in the cancer space might expect to see terms like 'oncogenic', but our documentation can make clear that this is covered by 'pathogenic'. This may also be more of a presentation-level issue that can be handled by a UI software layer, and not a concern at the lower level of a data exchange schema.
The collapsed model includes the 'qualifier' fields we created for both oncogenic and pathogenic assertions (specifically, variantOriginQualifier and pathogenicMechanismQualifier). Documentation will guide users on when to apply each.
Our evidence and provenance model will need to support very broad types of evidence and different granularity of detail - from rich representation of ACMG-based evidence interpretation, to sparser representations that might accommodate interpretations where no formal guidelines are used. This will be a challenge, but one I think a SEPIO-based approach is equipped to handle. Even though different evidence frameworks/criteria are typically used to evaluate a variant in cancer vs Mendelian disease, there is overlap in the types of info used as evidence. And, as seen in ClinVar records such as this and this, guidelines like the ACMG used for evaluation against Mendelian conditions are in practice used to evaluate germline and somatic variants for cancer. So I think that even if we separated VPI from VOI, we would have to provide the same type of flexible evidence/provenance model.
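The context-dependent reading of the collapsed predicate set, and the idea of leaving 'oncogenic' wording to a presentation layer, can be sketched in Python as below. The predicate names are the recommended set above; the function names, the `condition_kind` values, and the choice to apply the causal/contributing reading to both `pathogenic_for` and `likely_pathogenic_for` are assumptions made for illustration.

```python
from typing import Optional

PREDICATES = {
    "pathogenic_for",
    "likely_pathogenic_for",
    "benign_for",
    "likely_benign_for",
    "uncertain_significance_for",
}

# Assumption: only these predicates assert a causal or contributing role.
ASSERTS_ROLE = {"pathogenic_for", "likely_pathogenic_for"}


def implied_relationship(predicate: str, condition_kind: str) -> Optional[str]:
    """Read causal vs contributing off the kind of condition in the statement."""
    if predicate not in PREDICATES:
        raise ValueError(f"unknown predicate: {predicate}")
    if condition_kind not in ("mendelian", "cancer"):
        raise ValueError(f"unknown condition kind: {condition_kind}")
    if predicate not in ASSERTS_ROLE:
        return None
    return "causal" if condition_kind == "mendelian" else "contributing driver"


def display_label(predicate: str, condition_kind: str) -> str:
    """UI-layer wording: show 'oncogenic' for cancer without changing the data."""
    label = predicate.replace("_", " ")
    if condition_kind == "cancer":
        label = label.replace("pathogenic", "oncogenic")
    return label
```

The stored predicate never changes; only the label shown to cancer-space consumers does, which keeps the exchange schema to a single value set.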

A next step is to test the model against the diverse examples of ‘pathogenicity’ assertions, and decide if we are happy with how it handles things.

Below are some example records organized according to the different scenarios we encountered in our landscape review. Consider if our model supports/makes sense for each category, and we can dive deeper if needed to model out the actual examples.

I. Germline variant pathogenic for:

Mendelian Condition: (most germline ClinVar records are of this type)
- Cystic Fibrosis: https://www.ncbi.nlm.nih.gov/clinvar/RCV000007544/
Cancer Predisposition Syndrome: (which we interpret as a special case of a Mendelian disorder)
- Hereditary Cancer-Predisposing Syndrome: https://www.ncbi.nlm.nih.gov/clinvar/RCV000708772/
- Breast cancer familial 2: https://www.ncbi.nlm.nih.gov/clinvar/RCV000258282/
Cancer:
- Breast and/or ovarian cancer: https://www.ncbi.nlm.nih.gov/clinvar/RCV000735621/
- Neoplasm of the Breast: https://www.ncbi.nlm.nih.gov/clinvar/RCV000240780/ (a germline variant interpreted for a cancer using ACMG guidelines)

II. Somatic variant pathogenic for:

Cancer (most somatic ClinVar records are/should be of this type)
- Neoplasm of the Breast: https://www.ncbi.nlm.nih.gov/clinvar/RCV000240693/ (a somatic variant interpreted for a cancer using ACMG guidelines)
Cancer Predisposition Syndrome:
- Hereditary cancer-predisposing syndrome > Breast cancer familial 2: https://www.ncbi.nlm.nih.gov/clinvar/RCV000241127/ (this record has the same variant in somatic and germline contexts as pathogenic for this condition . . . doesn't seem right to say that the somatic variant is pathogenic for a 'mendelian' disorder)
Mendelian Condition:
- didn't find examples of this scenario (but didn't look very hard)

NOTE from Larry: The ClinVar RCVxxx examples referenced above are not truly reflective of the pathogenicity assertion. In ClinVar this would be more closely reflected in the SCVxxx records, but there is no URL to access these directly. The RCVs are aggregations of 1 or more SCVs for the same variant-disease match from multiple submitters. So, an RCV built from 1 SCV looks the same as that SCV. Once more than 1 SCV is aggregated in an RCV, you will note that the method used to resolve the "discrepancies" is really the means of making this higher-level aggregate assertion, which is not yet modeled precisely in the VA group.

Update: While we have tentatively decided to group Mendelian Disease and Cancer into a single VA type, it is not yet clear if we also want to lump Common disease in here as well.

We have not encountered variant interpretations for polygenic / common disease among our driving use cases, so we don’t have as deep an understanding of the semantics of these interpretations, and if/how they should be modeled here.

Proposal: For now, defer a decision on this issue. Define the Pathogenicity Interpretation VA type for Mendelian and Cancer. Note in our documentation that we do not yet explicitly support interpretations for common disease, but if the semantics of such an interpretation align with the model here, it can be used for this. If and when there is a demand for variant - common disease interpretations, we will do our due diligence and decide how to lump or split.

ga4gh / va-spec

Variant Oncogenicity Interpretation definition and scope #23