ga4gh / va-spec

An information model for representing variant annotations.
Apache License 2.0

Predicted Functional Impact Annotation Definition and Scope #21

Open mbrush opened 5 years ago

mbrush commented 5 years ago

For now we will proceed with our initial decision to split 'Predicted' (#21) from 'Experimental' (#34) functional impact annotations and model these as separate VA types. Our rationale was that:

The proposals/notes below are derived from the initial requirements work for this VA type here.


Definition: A statement generated by a computational algorithm that predicts the impact a variant has on the functionality or behavior of a gene product (e.g. 'deleterious', 'damaging', 'tolerated').

Scope/Comments:

Sources of more info:

mbrush commented 5 years ago

Questions for Discussion:

mbrush commented 5 years ago

Outcomes/Actions from March 20 VA Call:

mbrush commented 5 years ago

Elements to capture in a statement model: (based on notes from initial requirements work here)


This is the first VA type where I feel that aligning with the ACM-based approach (casting the elements above into subject, predicate, descriptor and qualifier slots to precisely represent statement semantics) is a bit complicated. The challenge is that an ACM-based model scopes an annotation to a single primary statement with a single descriptor, but two of the elements above are descriptors of the variation (the impact score and the categorical prediction), and sometimes only one or the other is provided. There is no compact way to capture this in a single annotation using the ACM slots (S, P, O, Q), and treating the score as evidence for a categorical prediction breaks down when only a score is given.

_We drafted an initial proposal for an ACM-based model in the Google doc here. Comments can be added to the doc, and we will move the final proposal to this ticket once it is hardened a bit._

mbrush commented 5 years ago

Listing some high-level evidence and provenance (E/P) modeling requirements that emerged from review of the competency questions here. For this particular VA type, I feel that E/P-related information may influence how we scope and structure the primary statement.

These requirements focus overwhelmingly on provenance, and minimally on evidence (the score underlying a prediction being the only evidence of import).
A separate ticket will be opened to discuss/document development of the E/P model for this VA type.

mbrush commented 4 years ago

Modeling here is essentially done, with the exception of a small issue uncovered in pre-testing with BRCA Exchange data, which revealed a requirement overlooked in our initial analysis.

Our proposed model uses a Computational Impact Data Set as the descriptor (see here). This object is used to bundle the two types of 'data' typically reported in a CFI statement - a categoricalImpact and an impactScore. But we have no structured way to represent what type of scores these are, to help users understand their meaning and significance. This is important, given that there are a myriad of computational impact algorithms that use different methods to derive scores describing different aspects of gene product function.
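To make the bundling idea concrete, here is a minimal Python sketch of such a data set object. The class name, the snake_case field names, and the rule that at least one of the two values must be present are assumptions made for illustration (the rule is inferred from the earlier note that sometimes only a score or only a category is provided); none of this is taken from the actual proposal document.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ComputationalImpactDataSet:
    """Hypothetical bundle of the two kinds of data a CFI statement may report."""

    id: str
    categorical_impact: Optional[str] = None  # e.g. 'deleterious', 'tolerated'
    impact_score: Optional[float] = None      # e.g. 0.99

    def __post_init__(self) -> None:
        # Assumption: a data set with neither value carries no prediction at all.
        if self.categorical_impact is None and self.impact_score is None:
            raise ValueError("provide a categorical impact, an impact score, or both")


# A score-only prediction is accepted; the category simply stays unset.
score_only = ComputationalImpactDataSet(id="ex:CFIData001", impact_score=0.99)
```

Note that nothing in this shape says what kind of score 0.99 is, which is the gap described above.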

There are different ways the model might capture this important aspect of CFI statements. The red 'impact type' attribute in the proposal here represents one approach (an additional attribute to capture the 'impact type'). Alternatively, we might represent the impact score itself as an object (as opposed to a literal) on which we could hang this type information. We need to evaluate the adequacy of the proposed and alternate approaches.
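The two approaches can be compared side by side in a rough Python sketch. All class and field names here are hypothetical and chosen only to illustrate the difference in shape; they are not the names used in the proposal.

```python
from dataclasses import dataclass
from typing import Optional


# Option A: an extra 'impact type' attribute sits alongside a literal score.
@dataclass
class DataSetWithImpactType:
    id: str
    impact_type: str
    impact_score: Optional[float] = None
    categorical_impact: Optional[str] = None


# Option B: the score is an object that carries its own type information.
@dataclass
class TypedImpactScore:
    value: float
    score_type: str


@dataclass
class DataSetWithScoreObject:
    id: str
    impact_score: Optional[TypedImpactScore] = None
    categorical_impact: Optional[str] = None


PRIOR = "in silico prior probability of pathogenicity"
option_a = DataSetWithImpactType(id="ex:CFIData001", impact_type=PRIOR, impact_score=0.99)
option_b = DataSetWithScoreObject(id="ex:CFIData001", impact_score=TypedImpactScore(0.99, PRIOR))
```

In this sketch, option B has no place to record the type when only a categorical value is reported, which may be one point to weigh when comparing the approaches.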

mbrush commented 4 years ago

Example of the proposed model used to represent a 'prior probability of pathogenicity' computational impact prediction reported on the BRCA Exchange website here: https://brcaexchange.org/variant/287750.

 - id: ex:Statement001
   type: va:ComputationalFunctionalImpactStatement
   subject: brcaexchange:287750 # BRCA1 NM_007294.3:c.2864C>G
   descriptor: 
      - id: ex:CFIData001
        type: va:ComputationalImpactStudyData
        impactType: 'in silico prior probability of pathogenicity (protein-level estimation)'
        impactScore: 0.99
   method: HCI Breast Cancer Genes Prior Probabilities Algorithm

Note here that we rely on the method being captured to allow users to find out more about the impact type...

mbrush commented 4 years ago

On the March 18 VA call, we discussed the possibility of changing the names for the attributes in the Computational Impact Data Set - essentially replacing 'impact' with 'prediction'.

An alternate naming scheme could be:

This was motivated by the fact that some CFI statements use categorical terms that don't describe impact on gene function directly (at least superficially). For example, the BRCA Exchange example above is called a 'prior probability of pathogenicity' prediction, but under the hood the prediction is about impact on gene function, expressed as a probability that this altered function will be pathogenic.

Using the more explicit/detailed of these labels in the BRCA Exchange example above (and adding a fake categorical value to see how all three look together), we would get the following:

 - id: ex:Statement001
   type: va:ComputationalFunctionalImpactStatement
   subject: brcaexchange:287750 # BRCA1 NM_007294.3:c.2864C>G
   descriptor: 
      - id: ex:CFIData001
        type: va:ComputationalImpactStudyData
        predictionType: 'in silico prior probability of pathogenicity'
        predictedImpactScore: 0.99
        predictedImpactCategory: 'Pathogenic'
   method: HCI Breast Cancer Genes Prior Probabilities Algorithm

larrybabb commented 4 years ago

Prediction: { impact score, impact category, description/interpretation }. I think it could get confusing to have attributes that end in “xxxType”, as it gets into the concept of “classifying” the predicted impact very specifically.


Isn’t it natural to conflate the “type” attribute with the “predictionType” attribute? Or is this more a result of the “predicted impact” concept being flattened, so that it needs to be qualified as “predictionType”?


The above example shows that the type of the "descriptor" is "va:ComputationalImpactDataSet". So is that ComputationalImpactDataSet descriptor holding a complex type called "prediction" that contains an impact category, an impact score and a text or coded interpretation of the prediction?

mbrush commented 4 years ago

Thanks @larrybabb, I agree it is still a bit confusing. Will brainstorm more on this, and record alternate proposals here.

mbrush commented 4 years ago

The purpose of the predictionType/impactType field proposed above is to describe the aspect of gene product function or significance that the score is about. Depending on the algorithm generating the prediction, this may be: (1) something very general - e.g. the variant's "generic impact on gene product function" (deleterious vs tolerated); (2) something more pointed - e.g. its "impact on transcription factor activity", "impact on gene product stability"; or (3) translated into other terms - e.g. impact on overall function of a disease-related gene is often framed as "probability of pathogenicity".

Below is a proposal to capture this as the type of algorithm that generated the impact study data - as I think this concept fits cleanly in the context of a ComputationalImpactStudyData object. I also propose new names for the attributes holding the data itself - 'impactScore' and 'impactClassification' (instead of impactCategory).

 - id: ex:Statement001
   type: va:ComputationalFunctionalImpactStatement
   subject: brcaexchange:287750 # BRCA1 NM_007294.3:c.2864C>G
   descriptor: 
      - id: ex:CFIData001
        type: va:ComputationalImpactStudyData
        impactScore: 0.99
        impactClassification: 'Pathogenic'
        algorithmType: 'in silico prior probability of pathogenicity'
   authoredBy: 'HCI Breast Cancer Genes Prior Probabilities Predictor'
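
To see how the attribute names in this proposal line up with the earlier 'prediction' naming scheme, the correspondence can be written as a small Python mapping. The pairing of predictionType with algorithmType is an assumption based on both fields holding the same value in the examples; the helper function name is hypothetical.

```python
# Assumed correspondence between the 'prediction' naming scheme and this proposal.
RENAMES = {
    "predictionType": "algorithmType",
    "predictedImpactScore": "impactScore",
    "predictedImpactCategory": "impactClassification",
}


def rename_descriptor_keys(descriptor: dict) -> dict:
    """Return a copy of a descriptor dict with the old attribute keys renamed."""
    return {RENAMES.get(key, key): value for key, value in descriptor.items()}


old_style = {
    "id": "ex:CFIData001",
    "type": "va:ComputationalImpactStudyData",
    "predictionType": "in silico prior probability of pathogenicity",
    "predictedImpactScore": 0.99,
    "predictedImpactCategory": "Pathogenic",
}
new_style = rename_descriptor_keys(old_style)
```

Keys without an entry in the mapping, such as id and type, pass through unchanged.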