Modeling 'Study Data' as containers for data items

We want to be able to structure the set of data items generated by a particular study, so that the data can collectively be linked to Statements about it, provenance information supporting it, and assertions that rely on it as evidence.

The initial impetus for this in the VA work is to collect the various data items related to the frequency of a variant in a population into a single object that can be described in a Population Frequency statement (see #38). It may be useful to specialize a generic 'Study Data' type to support the specific needs of data from different types of studies (e.g. 'Population Frequency Study Data').

Definition:

An object representing a set of data items that were generated by a particular study.

Considerations/Requirements:

This grouping adds one level of nesting between the variant and the frequency data (relative to the flatter CellBase model) - but I think this is acceptable.
In its simplest form, the model is simply a flat list of data-type specific attributes, and their values.
We should also consider how to capture the provenance of these data items. Here we might link out to an object representing the Study that generated them.
As we define the model here, we should review all VA types to consider where else the 'Study Data' model may be useful, and consider requirements from each use case so our model is generalizable to other VA types. At first glance, this may include Population Frequency, Relative Population Frequency, Condition Co-Segregation, and Experimental Functional Impact.

For the Population Frequency Study Data use case in particular, we would define a 'Population Frequency Study Data' type as a specialization. Then we need to define the model/constraints:

A key question here is deciding which of all possible freq-related data item types should be part of a Pop Freq Study Data object. Several candidates from the list here are arguably out of scope because they describe variation other than the subject variant (e.g. genotypes or ref allele), or they describe the frequency of the subject variation in a different population (e.g. the sex-specific frequency calculations). Rather than force these into an annotation that is about a different variant or population, a principled and normalized approach would capture them in separate PF Statements (and potentially consider if/how to link these as being related and useful to view together with the primary variation data in a message).
We also need to decide on an attribute-based vs. data-object-based approach to capturing the type of each data item. Do we create separate, data-type-specific attributes for each type of data item, or do we represent each data item as an object that we assign a type to, reflecting its specific data type (e.g. 'allele frequency data item'), so that we can avoid stuffing these semantics into the attribute names?
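To make the two options in point 2 concrete, here is a minimal sketch of the same data in both shapes. The attribute names, the 'dataItems' key, the data item type labels, and the values are all illustrative assumptions, not agreed VA terms.

```python
# Option A: attribute-based. The data type is carried by the attribute name.
attribute_style = {
    "type": "PopulationFrequencyStudyData",
    "alleleCount": 3,
    "alleleFrequency": 0.25,
}

# Option B: data-object-based. Each data item is an object with an explicit type.
object_style = {
    "type": "PopulationFrequencyStudyData",
    "dataItems": [
        {"type": "AlleleCountDataItem", "value": 3},
        {"type": "AlleleFrequencyDataItem", "value": 0.25},
    ],
}

# Hypothetical lookup from attribute name to data item type.
TYPE_FOR_ATTRIBUTE = {
    "alleleCount": "AlleleCountDataItem",
    "alleleFrequency": "AlleleFrequencyDataItem",
}


def to_object_style(flat, type_for_attribute):
    """Convert an attribute-style record into the data-object style."""
    items = [
        {"type": type_for_attribute[name], "value": value}
        for name, value in flat.items()
        if name in type_for_attribute
    ]
    return {"type": flat["type"], "dataItems": items}


converted = to_object_style(attribute_style, TYPE_FOR_ATTRIBUTE)
```

The conversion is mechanical in this direction, which suggests the choice is mostly about message size and whether individual data items ever need their own metadata.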

Re: Point 1 above - it really breaks down into deciding if three categories of freq-related data items should be allowed in a PF statement:

ref allele counts/freqs (e.g. refAlleleCount, refAlleleFreq attributes from the list linked above)
genotype counts/freqs (e.g. refHomGenotypeFreq, hetGenotypeFreq attributes)
freqs/counts for sex-specific sub-pops (e.g. maleAltAlleleFreq, femaleRefAlleleFreq attributes)

If we decide not to collapse these into the same statement as the freq data about the actual subject variant, they would instead be captured in separate statements with the relevant variation (i.e. ref allele or genotype) as the subject, or the relevant sub-population as the qualifier/descriptor.

Re: the 'collapsed approach': There is nothing specific in the notion of 'Study Data' that precludes collapsing these three types of info together with the freq data for the subject variation in a single Study Data object. Allowing for all of these data types/attributes makes it easier for sources that currently collapse all these into one annotation/statement to produce compliant data. And by placing all this related info in one place it makes for faster human consumption of the message (at least for the particular use case that this collapsed model was designed for).

That said, it violates some of our core principles related to normalization and the atomic nature of VA statements, and can lead to 'irregularity' in the data, where the same info could be captured in different ways. For example, data related to the frequency of a ref allele or genotype could exist as a proper statement with the ref allele or genotype as the subject, or this same data could be implicitly represented inside an annotation about a different variation (i.e. the alt for the ref, or an allele contained in the genotype). The collapsed approach also bakes into the model assumptions about how one set of users finds it convenient to group disparate data, which may reduce its utility for other use cases.

While I naturally favor a normalized approach that separates things out into more atomic statements, I think the pragmatic solution may be a compromise between full normalization and full collapse - both to reduce burden on creators of the data, and support consumers in seeing related data together in the message. Specifically, I might support a model that allows the genotype data items/attributes in the allele frequency statement. This is because these can be considered data about the subject allele that describes the genomic context in which they were observed in the population. And because I suspect creating de novo genotype objects to use as the subjects of separate statements describing the genotype frequency data places too high a burden on data creators. In contrast however, I would prefer to put the ref allele frequency data in a separate statement (because this is a different variation than the statement subject at the allele level, and should star in its own statement). Similarly, I would prefer to capture frequency data about the subject variation in sex-specific subpopulations in a separate PF statement as well.
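As a rough illustration of this compromise, the sketch below sorts CellBase-style attribute names into those that would stay in the allele frequency statement and those that would move to separate statements. The name-matching rules and the function name are assumptions made for illustration only; they are not part of any spec.

```python
def placement(attribute_name):
    """Return where an attribute would go under the compromise described above.

    The matching rules are illustrative and rely on CellBase-style names.
    """
    name = attribute_name.lower()
    if name.startswith(("male", "female")):
        # Sex-specific sub-population data: separate PF statement.
        return "separate: sub-population"
    if "genotype" in name:
        # Genotype data describes the context of the subject allele.
        return "same statement"
    if name.startswith("refallele"):
        # The ref allele is a different variation than the subject.
        return "separate: ref allele"
    return "same statement"


collapsed = [
    "altAlleleFreq",
    "hetGenotypeFreq",
    "altHomGenotypeFreq",
    "refAlleleFreq",
    "maleAltAlleleFreq",
    "femaleRefAlleleFreq",
]
plan = {name: placement(name) for name in collapsed}
```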

On point 2 above, I prefer data-type-specific attributes for each data item, since they are more comfortable to use. In particular, if we agree on a list of relevant data types that are clearly useful for most use cases and that we can expect to be relatively stable, then I'd clearly go for data-specific attributes.

@javild re: point 2 - I think I concur with you here. Creating data objects for each value would significantly bloat the message, and isn't worth it unless there is a use case for needing to describe features of each data item. But a couple of folks indicated a preference for data objects in the gdoc, so we should hear from them here (Bob and Steven H I think).

As a side note, point 1 above gets at a more general question of what group of consumers we should primarily be designing our schema/message structure for (as requirements for each are often in conflict):

Human Data Developers who will be creating and reading these messages in their native form.
Human Informaticians who will be accessing the data through APIs for use in their application or analyses, and need to understand what they mean so they can code against it.
Computational tools that will operate on the data (query, analysis, slicing/parsing into human friendly views for UIs)
Domain experts (less-technical clinicians, researchers) who want to apply the knowledge contained in the messages (but will likely not be reading the raw/native messages directly).

We each probably have our biases here, and it would be nice to be transparent about this, appreciate the relevance of each perspective, and agree on what our priorities are here. That doesn't mean always making the design choice best for that perspective, but may help focus our attention on the bigger picture when our biases creep in.

I think this all makes sense to me

Specifically, I might support a model that allows the genotype data items/attributes in the allele frequency statement. This is because these can be considered data about the subject allele that describes the genomic context in which they were observed... In contrast however, I would prefer to put the ref allele frequency data in a separate statement (because this is a different variation than the statement subject at the allele level, and should star in its own statement). Similarly, I would prefer to capture frequency data about the subject variation in sex-specific subpopulations in a separate PF statement as well.

what group of consumers we should primarily be designing our schema/message structure for

From my point of view I'd say 1 and 2, understanding it as more of an exchange model. That would not necessarily prevent it from being used for use case 3 on certain occasions. Definitely not 4. It would be interesting to hear opinions from people on the calls.

A proposed Population Frequency Study Data object model. This proposal assumes a relatively normalized approach wherein frequency data about the reference allele, genotypes, and sub-populations are captured in separate PF statements. It also explores a simple model for capturing MAF/FAF (see #42).

Generic attributes

id: string
type: Class
label: string
description: string

Core Frequency data types

totalIndividualCount: int (0..1)
totalVariationCount: int (0..1)
variationCount: int (0..1)
variationFrequency: float (0..1)
homozygousIndividualCount: int (0..1) # or name this homozygousVariationCount?
homozygousIndividualFrequency: float (0..1)
heterozygousIndividualCount: int (0..1)
heterozygousIndividualFrequency: float (0..1)
hemizygousIndividualCount: int (0..1)
hemizygousIndividualFrequency: float (0..1)
dosageSensitivitySamplingProbability: float (0..1)

Note: I updated attribute names after the June 12 call to ensure they accommodate all possible subject variation types (allele, haplotype, genotype, CNV) by using 'variation' instead of 'allele' in the labels. Another option is to provide variation-type-specific attribute names, if this is clearer for users (e.g. 'haplotypeCount', 'haplotypeFrequency', 'genotypeCount', etc.).

Exploratory MAF/FAF attributes

isMinorAllele: boolean (0..1)
isFounderAllele: boolean (0..1)
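The attribute lists above could be sketched as a Python dataclass along the following lines. This is only one possible reading of the proposal: treating `id` as required, mapping the 0..1 cardinality to `Optional`, and the range checks in `__post_init__` are all assumptions added here, and the example values are made up.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PopulationFrequencyStudyData:
    # Generic attributes (treating id as required is an assumption).
    id: str
    type: str = "PopulationFrequencyStudyData"
    label: Optional[str] = None
    description: Optional[str] = None
    # Core frequency data types; cardinality 0..1 is modeled as Optional.
    totalIndividualCount: Optional[int] = None
    totalVariationCount: Optional[int] = None
    variationCount: Optional[int] = None
    variationFrequency: Optional[float] = None
    homozygousIndividualCount: Optional[int] = None
    homozygousIndividualFrequency: Optional[float] = None
    heterozygousIndividualCount: Optional[int] = None
    heterozygousIndividualFrequency: Optional[float] = None
    hemizygousIndividualCount: Optional[int] = None
    hemizygousIndividualFrequency: Optional[float] = None
    dosageSensitivitySamplingProbability: Optional[float] = None
    # Exploratory MAF/FAF attributes.
    isMinorAllele: Optional[bool] = None
    isFounderAllele: Optional[bool] = None

    def __post_init__(self):
        # Range checks are an assumption, not part of the proposal.
        for name, value in vars(self).items():
            if value is None:
                continue
            if name.endswith("Count") and value < 0:
                raise ValueError(f"{name} must be non-negative, got {value}")
            if name.endswith(("Frequency", "Probability")) and not 0.0 <= value <= 1.0:
                raise ValueError(f"{name} must be in [0, 1], got {value}")


# Made-up values, for illustration only.
example = PopulationFrequencyStudyData(
    id="ex:PFSD001",
    variationCount=3,
    totalVariationCount=12,
    variationFrequency=0.25,
)
```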

. . .

If sex-specific sub-population frequency data were included, we would have to add some or all of the following attributes:

maleIndividualCount: int (0..1)
maleVariationCount: int (0..1)
maleTotalVariationCount: int (0..1)
maleVariationFrequency: float (0..1)
maleHomozygousVariationCount: int (0..1)
maleHomozygousVariationFrequency: float (0..1)

femaleIndividualCount: int (0..1)
femaleVariationCount: int (0..1)
femaleTotalVariationCount: int (0..1)
femaleVariationFrequency: float (0..1)
femaleHomozygousVariationCount: int (0..1)
femaleHomozygousVariationFrequency: float (0..1)

Next step is to discuss, and test this model against data examples and requirements:

1. ClinGen: data from the record here, showing the 'NC_000006.12 131851228 C' allele in a Non-Finnish European population from the GENIUS T2D dataset (and also the ref for this alt).

Source Data:

    {
      "id": "CGEX:AllFreq034",
      "type": "PopulationAlleleFrequencyStatement",
      "ascertainment": {
        "id": "CGEX:Ascrt0002",
        "label": "GENIUS T2D Cases"
      },
      "allele": {
        "id": "CAR:CA123287",
        "type": "CanonicalAllele",
        "relatedContextualAllele": "CGEX:CtxAll027"
      },
      "alleleCount": 1198,
      "alleleNumber": 7344,
      "individualCount": 3672,
      "homozygousAlleleIndividualCount": 113,
      "heterozygousAlleleIndividualCount": 972,
      "population": {
        "id": "GNOMAD:nfe",
        "label": "Non-Finnish European"
      },
      "@context": "http://dataexchange.clinicalgenome.org/interpretation/json/context"
    }

VA Data:

To Do
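Pending the actual VA representation, here is a hedged sketch of how the count fields in the ClinGen record above might map onto the proposed core attributes. The name correspondence in `CLINGEN_TO_VA` is my assumption, not an agreed mapping, and `variationFrequency` is derived here rather than taken from the source record.

```python
# Count fields copied from the ClinGen source record above.
source = {
    "alleleCount": 1198,
    "alleleNumber": 7344,
    "individualCount": 3672,
    "homozygousAlleleIndividualCount": 113,
    "heterozygousAlleleIndividualCount": 972,
}

# Assumed correspondence between ClinGen names and proposed VA names.
CLINGEN_TO_VA = {
    "alleleCount": "variationCount",
    "alleleNumber": "totalVariationCount",
    "individualCount": "totalIndividualCount",
    "homozygousAlleleIndividualCount": "homozygousIndividualCount",
    "heterozygousAlleleIndividualCount": "heterozygousIndividualCount",
}

va = {CLINGEN_TO_VA[name]: value for name, value in source.items()}

# Derived value, not present in the source record.
va["variationFrequency"] = va["variationCount"] / va["totalVariationCount"]

# Consistency checks that hold for a diploid autosomal locus.
alleles_from_individuals = 2 * va["totalIndividualCount"]
alleles_from_genotypes = (
    2 * va["homozygousIndividualCount"] + va["heterozygousIndividualCount"]
)
```

For this record both checks pass: 2 x 3672 = 7344 total alleles, and 2 x 113 + 972 = 1198 observed alleles, giving a frequency of about 0.163.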

2. CellBase: data from record here - for the CM000681.2 g.45411941T>C allele in one population from gnomAD (and also the ref for this alt).

Source Data:

{
    "study": "GNOMAD_EXOMES",
    "population": "FIN",
    "refAllele": "T",
    "altAllele": "C",
    "refAlleleFreq": 0.79147565,
    "altAlleleFreq": 0.20852435,
    "refHomGenotypeFreq": 0.6279271,
    "hetGenotypeFreq": 0.32709703,
    "altHomGenotypeFreq": 0.04497584
}

VA Data:

 To Do
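As a starting point for this To Do, the sketch below shows one way the CellBase record could be split under the normalized approach: alt allele and genotype frequencies go into the Study Data for the subject allele, and the ref allele frequency is held back for a separate statement. The target attribute names follow the proposal above, but the mapping itself is an assumption.

```python
# Values copied from the CellBase source record above.
cellbase = {
    "study": "GNOMAD_EXOMES",
    "population": "FIN",
    "refAllele": "T",
    "altAllele": "C",
    "refAlleleFreq": 0.79147565,
    "altAlleleFreq": 0.20852435,
    "refHomGenotypeFreq": 0.6279271,
    "hetGenotypeFreq": 0.32709703,
    "altHomGenotypeFreq": 0.04497584,
}

# Study Data for the subject (alt) allele; mapping is an assumption.
alt_study_data = {
    "type": "PopulationFrequencyStudyData",
    "variationFrequency": cellbase["altAlleleFreq"],
    "homozygousIndividualFrequency": cellbase["altHomGenotypeFreq"],
    "heterozygousIndividualFrequency": cellbase["hetGenotypeFreq"],
}

# Ref allele data, held for a separate statement with the ref as subject.
ref_study_data = {
    "type": "PopulationFrequencyStudyData",
    "variationFrequency": cellbase["refAlleleFreq"],
}

# Sanity check: allele frequency implied by the genotype frequencies.
implied_alt_freq = (
    alt_study_data["homozygousIndividualFrequency"]
    + alt_study_data["heterozygousIndividualFrequency"] / 2
)
```

The implied alt allele frequency agrees with the reported `altAlleleFreq` to within rounding, which is a cheap check that the genotype attributes were mapped to the right slots.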

Wondering if the "variation" and/or "individual" terms are actually redundant and could be removed, i.e.:

totalIndividualCount: int (0..1)
totalVariationCount: int (0..1)
count: int (0..1)
frequency: float (0..1)
homozygousCount: int (0..1)
homozygousFrequency: float (0..1)
heterozygousCount: int (0..1)
heterozygousFrequency: float (0..1)
hemizygousCount: int (0..1)
hemizygousFrequency: float (0..1)
dosageSensitivitySamplingProbability: float (0..1)

Also wondering if totalIndividualCount and totalVariationCount belong here or rather in the Population model.

Discussed on the July 10 VA call, and agreed to move ahead with the model specified here. A few items remain on which to get final feedback.

dosageSensitivitySamplingProbability: somebody please review the definition and comments on this attribute in the spreadsheet here, and make appropriate changes.
filterAlleleFrequency: this new attribute was added based on the suggestion that this threshold data item defined by gnomAD be included in the model. See proposed definition and model here. @AmandaSpurdle please review/edit/comment.
sourceMaterial: to be discussed on an ad hoc call about seq provenance and metadata/metrics, as it relates to provenance. Exome vs genome is the key thing to be sure is captured. This could be part of the Study Data object, or of a proper Study object if we decide to create one. It could also be included in a description of a proper biospecimen object if we decide to create one (here we could use biospecimen modeling from existing GA4GH models).

ga4gh / va-spec

Modeling 'Study Data' as containers for data items #40