Generalize the 'Case-Control' Annotation Type

The 'Case-Control' Annotation Type that we proposed as a Variant Annotation category in our initial list here was informed by the ClinGen CaseControl Annotation example here. It is defined as "an annotation about the relative frequency of an allele as in affected vs unaffected study groups of a case-control clinical study".

Proposing here to generalize this variant annotation category so it is inclusive of any study comparing variant frequency between study groups (i.e. not just case-control based studies).

We should consider what other types of studies we may want/need to include here (e.g. cohort studies? studies comparing freq of a variant in ER+ breast cancer and ER- breast cancer). Then define a name that fits/covers all of these, and specify the definition and scope of what is stated in this category of annotation.

Proposal:

Annotation Type: Relative Population Allele Frequency
Definition: an annotation about the relative frequency of an allele in two defined populations or cohorts (e.g. affected vs unaffected study groups of a case-control clinical study)
In scope for this annotation type: findings/data from a particular study, which may include variant frequencies calculated for each population/cohort, and optionally things like odds ratios (OR) or relative risk scores (RR) derived from these frequencies.
Out of scope for this annotation type: broader conclusions/clinical interpretations about things like the risk/predisposition of carriers of the variant for some disease, or the pathogenicity of the variant. These would be higher-order annotations that might be based on one or more relative population frequency annotations used as evidence to infer such a broader conclusion. See the proposal for a 'Predisposition Annotation' in issue #3.
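
As a rough illustration of the "derived from these frequencies" part of the scope, the sketch below computes a crude (unadjusted) odds ratio and a plain frequency ratio from two group frequencies. The function names are hypothetical and not part of any proposed model, and the input values are illustrative only. In a cohort design, where the two frequencies are outcome proportions among carriers and non-carriers, the plain ratio corresponds to a relative risk. A published OR may differ from a crude value like this one, for example if it was adjusted for covariates or computed from raw counts rather than rounded frequencies.

```python
def odds(freq: float) -> float:
    """Convert a proportion strictly between 0 and 1 into odds."""
    if not 0.0 < freq < 1.0:
        raise ValueError("frequency must be strictly between 0 and 1")
    return freq / (1.0 - freq)


def crude_odds_ratio(freq_group_a: float, freq_group_b: float) -> float:
    """Unadjusted odds ratio of group A relative to group B."""
    return odds(freq_group_a) / odds(freq_group_b)


def frequency_ratio(freq_group_a: float, freq_group_b: float) -> float:
    """Plain ratio of the two frequencies (group B must be non-zero)."""
    if freq_group_b <= 0.0:
        raise ValueError("reference-group frequency must be positive")
    return freq_group_a / freq_group_b


# Illustrative values only, not taken from any study.
example_or = crude_odds_ratio(0.02, 0.01)
example_ratio = frequency_ratio(0.02, 0.01)
print(f"crude OR = {example_or:.3f}, frequency ratio = {example_ratio:.3f}")
```

For rare variants the two numbers are close, which is one reason a frequency annotation might carry the raw group frequencies alongside whichever derived metric the study reports.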

See additional notes in the revised category proposal here.

Applying this proposal to the new example Steven Hart added to the Requirements doc here, we would get the following:

Annotation1: A "Predisposition Annotation" that includes case control data (odds ratio) as evidence.

Variant: NM_007194.3(CHEK2):c.1100delC (p.Thr367Metfs)
Predisposition: Breast Cancer
Risk Level: Moderate
Metric Type: Odds Ratio
Metric Value: 3.18 
Metric Confidence Intervals (95%): 2.01-4.92
Metric p-Value: 0.00000061
Study Type: Case/Control
Case Frequency: 0.0134
Control Frequency: 0.0040
PMCID: PMC5740532

According to the proposal, this would represent a 'Predisposition' or 'Risk Factor' type annotation, as its primary statement asserts that the variant correlates with a moderate risk for breast cancer. It includes a secondary statement reporting the outcome of a case-control study, which serves as evidence for the primary assertion. In the model we develop, I hope that the structure of the annotation will make the identities of these two statements, and the relationship between them, explicit (as it does in the ClinGen Variant Interpretation Model).

Note that this secondary case-control finding could exist as the primary statement in a separate annotation that we might categorize as a 'relative population frequency' annotation.

Annotation 2: A "Relative Population Frequency" annotation reporting freq in case vs control, and an odds ratio.

Variant: NM_007194.3(CHEK2):c.1100delC (p.Thr367Metfs)
Condition: Breast Cancer
Case Frequency: 0.0134
Control Frequency: 0.0040
Metric Type: Odds Ratio
Metric Value: 3.18 
Metric Confidence Intervals (95%): 2.01-4.92
Metric p-Value: 0.00000061
Study Type: Case/Control
PMCID: PMC5740532

NOTE: In the comment below, I provide an updated/extended version of these examples that includes data from two case-control studies modeled as Relative Population Frequency statements, and captured as evidence for a Predisposition/Condition Risk annotation.

I doubt that a "Relative Population Allele Frequency" type is necessary, given that:

Details about cohort numbers and frequencies within each cohort could be provided as generic "Cohort frequency" (rather than population frequency) type.
Particular scores related with predisposition (e.g. OR) would appear within the "Variant Predisposition Annotation" type

Re:

Particular scores related with predisposition (e.g. OR) would appear within the "Variant Predisposition Annotation" type

. . . this raises the question of how these scores would 'appear' within the "Variant Predisposition Annotation" type. If there is not an existing structure for capturing relative population frequency data/scores as a type of variant annotation that could be plugged in or referenced in the context of a variant predisposition annotation where this info is used as evidence, then we'd have to define some other standard structure for representing this info in this context.

I am writing up some more general thoughts that were raised by your comment, and will post here shortly.

It may be worth considering the requirement that drove the proposal for this type of variant annotation - which comes from ClinGen (and by extension any ACMG-based variant interpretation use case). This use case requires that our model be able to represent the results of 'case-control' studies - as these are a key data type defined in the ACMG Guidelines that is used as evidence for evaluating variants against the PS4 criteria.

This raises the more general point that in scoping our model and the types of annotations we represent, we should be guided in part by the types of information needed to evaluate a variant using VI guidelines like the ACMG. The ClinGen-SEPIO model has done a nice job of defining these types of information and linking them to the ACMG guideline criteria they are used to support. Case-Control data are one example of a key data type they define, which is relevant for the PS4 criteria. And of course this is why 'Case-Control Annotation' appeared on our initial list of VA categories.

The proposal in this ticket is that case-control is a specific form of a more general type of statement about a variant being more or less common in one type of 'group' vs another (be it cohorts of patients in a clinical trial, or sets of ER+ vs ER- tumor specimens). That is not to say that a more specific VA category for case-control manifestations of this idea won't be needed - only that the general idea of 'Relative Population Frequency' as defined above is useful and may take many forms.

So, it is clear that we need to be able to represent this type of info in a standard, structured way - because it is an important type of evidence for variant pathogenicity and predisposition statements. The question then is how?

Approach 1: One option is what is currently proposed - to represent this as a type of variant annotation that can exist independently of its use as evidence, and then be plugged into or referenced by Variant Predisposition or Pathogenicity annotations as evidence. The ClinGen-SEPIO model does exactly this, and I personally appreciate the modularity of this design/approach.
Approach 2: Another option is defining some other structure for capturing this info in the context of variant pathogenicity and predisposition annotations - but not modeling it as a type of annotation in its own right.
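
To make the modularity argument for Approach 1 concrete, here is a minimal sketch in which a frequency statement exists on its own and is referenced by id from two different higher-order annotations. The class names, the `evidence_ids` field, and all ids are hypothetical illustrations, not part of ClinGen-SEPIO or of any model proposed in this ticket.

```python
from dataclasses import dataclass, field


@dataclass
class Statement:
    id: str
    type: str
    subject: str
    # Ids of other statements used as evidence (hypothetical field name).
    evidence_ids: list[str] = field(default_factory=list)


class StatementRegistry:
    """Holds independently existing statements, keyed by id."""

    def __init__(self) -> None:
        self._by_id: dict[str, Statement] = {}

    def add(self, stmt: Statement) -> Statement:
        self._by_id[stmt.id] = stmt
        return stmt

    def evidence_for(self, stmt: Statement) -> list[Statement]:
        """Resolve a statement's evidence ids to the stored statements."""
        return [self._by_id[eid] for eid in stmt.evidence_ids]


VARIANT = "NM_007194.3(CHEK2):c.1100delC (p.Thr367Metfs)"
registry = StatementRegistry()

# The frequency statement stands alone; it has no evidence of its own here.
freq = registry.add(
    Statement("freq001", "Relative Population Frequency Statement", VARIANT)
)

# Two higher-order annotations reuse the same statement as evidence.
predisposition = registry.add(
    Statement("pred001", "Predisposition Statement", VARIANT, ["freq001"])
)
pathogenicity = registry.add(
    Statement("path001", "Pathogenicity Statement", VARIANT, ["freq001"])
)

for stmt in (predisposition, pathogenicity):
    print(stmt.id, "->", [e.id for e in registry.evidence_for(stmt)])
```

Under Approach 2, the frequency data would instead be copied into each higher-order annotation, so the two annotations above would carry separate, unlinked copies of the same study result.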

My gut favors approach 1 of splitting this out as another type of annotation - one that captures a statement about the variant that summarizes the findings of a particular study about its relative frequency in two or more groups of things. This statement could then be plugged into or referenced as evidence in the variant predisposition or pathogenicity annotations.

By contrast, the examples I made above represent something more aligned with approach 2 where the relative population frequency data is captured in a flat list, and not explicitly framed as evidence or organized as a statement/annotation in its own right.

The updated example below is refined to (a) follow the ACM framework in naming of fields, as per the proposals here and (b) capture the relative population frequency evidence as a nested 'Relative Population Frequency' statement, rather than a flat list of data values.

Annotation 1 (refactored): A Predisposition Annotation (aka Condition Risk) that includes, as supporting evidence, data from two case-control studies modeled as Relative Population Frequency statements.

id: statement001
type: Variant-Condition Predisposition Statement
subject: NM_007194.3(CHEK2):c.1100delC (p.Thr367Metfs)
predicate: moderate_risk_for
descriptor: Breast Cancer
evidence: [
  {
    id: statement002
    type: Relative Population Frequency Statement
    subject: NM_007194.3(CHEK2):c.1100delC (p.Thr367Metfs)
    . . . (whatever our model is for capturing the descriptor / data items in this type of statement)
  },
  {
    id: statement003
    type: Relative Population Frequency Statement
    subject: NM_007194.3(CHEK2):c.1100delC (p.Thr367Metfs)
    . . . (whatever our model is for capturing the descriptor / data items in this type of statement)
  }
]
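
For anyone who wants to experiment with this shape, the nested structure above can be sketched with Python dataclasses and serialized to JSON. The class names and the `data` field are assumptions made for illustration; `data` is left empty because the content of a Relative Population Frequency statement is still undecided in this thread.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Any


@dataclass
class RelativePopulationFrequencyStatement:
    id: str
    subject: str
    type: str = "Relative Population Frequency Statement"
    # Placeholder for the descriptor / data items, which are not yet modeled.
    data: dict[str, Any] = field(default_factory=dict)


@dataclass
class PredispositionStatement:
    id: str
    subject: str
    predicate: str
    descriptor: str
    type: str = "Variant-Condition Predisposition Statement"
    evidence: list[RelativePopulationFrequencyStatement] = field(
        default_factory=list
    )


VARIANT = "NM_007194.3(CHEK2):c.1100delC (p.Thr367Metfs)"

statement = PredispositionStatement(
    id="statement001",
    subject=VARIANT,
    predicate="moderate_risk_for",
    descriptor="Breast Cancer",
    evidence=[
        RelativePopulationFrequencyStatement(id="statement002", subject=VARIANT),
        RelativePopulationFrequencyStatement(id="statement003", subject=VARIANT),
    ],
)

# asdict() recurses into the nested evidence statements.
serialized = json.dumps(asdict(statement), indent=2)
print(serialized)
```

Each evidence entry is a complete statement with its own id and type, so it could equally be stored and exchanged on its own.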

Again, the point here is that including Relative Population Frequency as an annotation type in scope for our model supports this modular approach where we can structure the data needed to describe evidence for variant predisposition and pathogenicity assertions as pre-packaged relative population frequency annotations - that exist independently and can be plugged in or referenced as evidence in the context of these higher-order annotations of clinical significance.

UPDATE: An extended version of this example that traverses down through all three annotation types in a single nested structure is presented in ticket #10.

Good suggestion, Matt. I also favor approach 1. Just one clarification, though: what happens when a predicate differs between sources? For example, in paperA this CHEK2 variant was associated with a 'moderate_risk_for' breast cancer. However, paperB studies a different ethnic/risk population and finds that this variant is 'high_risk_for'.

Does this mean that there will be multiple statements for this variant, one for each predicate?

Yes, exactly. The assertions made in an annotation represent a claim made by a particular agent at a particular time. So if two sources/agents make the same exact claim about the same variant having high risk for a condition, we would represent two distinct statements in the data, where the evidence and provenance behind each can be separately traced. And the same goes for cases where two sources make different claims about a variant's predispositional association for a condition (e.g. one says high risk and one says low risk) - these are two separate statements that would be separately represented in the data, along with their respective evidence and provenance information.
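
A small sketch of what this looks like in data, using the paperA/paperB scenario from the question above. The `source` field and the statement ids are hypothetical names chosen for illustration; the real model would carry fuller evidence and provenance information.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PredispositionStatement:
    id: str
    subject: str
    predicate: str
    descriptor: str
    source: str  # hypothetical provenance field: who made the claim


VARIANT = "NM_007194.3(CHEK2):c.1100delC (p.Thr367Metfs)"

# One statement per claim; nothing is merged or overwritten.
statements = [
    PredispositionStatement(
        "statementA", VARIANT, "moderate_risk_for", "Breast Cancer", "paperA"
    ),
    PredispositionStatement(
        "statementB", VARIANT, "high_risk_for", "Breast Cancer", "paperB"
    ),
]


def statements_about(
    all_statements: list[PredispositionStatement],
    subject: str,
    descriptor: str,
) -> list[PredispositionStatement]:
    """Return every claim about the given variant and condition."""
    return [
        s
        for s in all_statements
        if s.subject == subject and s.descriptor == descriptor
    ]


claims = statements_about(statements, VARIANT, "Breast Cancer")
for claim in claims:
    print(f"{claim.source}: {claim.predicate} ({claim.id})")
```

A consumer that wants a single answer would then need its own policy for reconciling the claims; the statements themselves stay separate so that each can be traced back to its source.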
