Population Frequency Annotation Definition and Scope

mbrush commented 5 years ago

Definition: a statement about the observed frequency of a variant in a defined group of individual organisms (or derived biological specimens).

Comments/Considerations: See the document here for a more complete list of considerations and requirements. Below are only the most pressing issues and proposals related to defining an initial pass at the PF Statement model.

Statement Components

Variation:

The statement subject. Will include Discrete Variation only, specifically alleles (most common), haplotypes (e.g. here, and look at HLA haplotype resources), genotypes (no explicit examples, but implicit in many allele-frequency annotations), and CNVs (DECIPHER).

Population

Discussion of modeling Populations, and the ascertainment criteria that define them, will be done in ticket #39.
The general consensus here was to define the scope of this concept broadly, but focus initial efforts on ensuring we cover the primary need for populations based on ancestry/race/ethnicity.
General feeling was that we will likely need to define a Population object (as opposed to using codes to capture population, as is common in many PF data sources, e.g. gnomAD:fin)

Frequency-related data items

The primary purpose of this VA statement is to link the subject variation to a set of counts and calculations from a 'study' about the variant's frequency in a specific population.
The list of data types that could potentially be lumped into pop freq statements is fairly long (up to 20+ items - see here).
Modeling these in a Pop Freq statement will need to strike a balance between principle (which could require separate statements for different types of frequency-related data items) and pragmatism (which would allow for a less normalized model that groups data items from a single 'study' together when they describe aspects of the population frequency of a given variant)
Our initial proposal is to group these inside a 'Study Data' object, which can fill a single statement slot. Details of the structure and content of this object will be discussed in ticket #40.

mbrush commented 5 years ago

ACM-based semantics of the PF statement

There are a few ways we could approach this. The variant is of course the subject. The proposals below use the Population as the descriptor, and use a predicate indicating that a variant was or was not observed in the population. The data items that provide quantification of the core observation are captured in the qualifier slot (and/or as evidence metadata). But another valid approach could be to place the frequency data in the descriptor slot, the Population as the qualifier, and use a single predicate like 'is_described_by'. We initially propose the former approach, however, because it is a bit more flexible (e.g. it can capture a more foundational statement that a variant exists in a population, with no quantification), and because this semantic structure aligns with how the BioLink model we have talked about models population frequency.

Proposal 1:

subject: Discrete Variation (1..1)
predicate: code {was_observed_in, was_not_observed_in} (1..1)
descriptor: Population (1..1)
- a proper object rather than a simple code, to allow detailed descriptions of populations
qualifier: Study Data (0..1)
- a collection of all relevant freq-related data from the study, provided as a quantification of the statement made here

Here the study data qualifies (quantifies) the frequency at which the variant was observed in the indicated population. But this frequency data can also be considered evidence supporting the core statement that the variant was or was not observed in the indicated Population. So an alternative approach is to capture the Study Data object holding this data as evidence, instead of qualifying the primary statement itself:

Proposal 2:

subject: Discrete Variation (1..1)
predicate: code {was_observed_in, was_not_observed_in} (1..1)
descriptor: Population (1..1)
- a proper object rather than a simple code, to allow detailed descriptions of populations
evidence: Study Data (0..1)
- a collection of all relevant freq-related data from the study, provided as evidence supporting the statement here

Here, the frequency data is not 'semantically' part of the core statement, but the structure connecting the subject variant to this data is the same as in Proposal 1 (i.e. replace the attribute label "evidence" with "qualifier" and you essentially have Proposal 1).

Proposal 3:

Finally, a hybrid between these approaches would be to allow for only the core variant frequency value to qualify the statement (since this is primarily what the PF VA type is about), and capture the full set of Study Data as evidence.

subject: Discrete Variation (1..1)
predicate: code {was_observed_in, was_not_observed_in} (1..1)
descriptor: Population (1..1)
- a proper object rather than a simple code, to allow detailed descriptions of populations
frequencyQualifier: Allele Frequency Data Item (0..1)
- or this could take a float value, if we decide not to model each data value as an object
evidence: Study Data (0..m)
- a collection of all relevant freq-related data from the study, provided as a quantification of the statement here (fine to include the subject variant frequency as well)

I like this because it keeps the actual statement focused on the core piece of data advertised for this VA type - maintaining its purity in a sense, while also providing a full picture of all relevant data that support calculation and interpretation of this frequency statement, as evidence.
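To make the slot layout of Proposal 3 concrete, here is a minimal Python sketch. Only the slot names, cardinalities, and the two predicate codes come from the proposal; the class names, the fields on `Population` and `StudyData`, and the choice of a plain float for the frequency qualifier are assumptions made for illustration, not a settled model.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Predicate codes as listed in the proposal.
PREDICATES = {"was_observed_in", "was_not_observed_in"}


@dataclass
class Population:
    # A proper object rather than a bare code. These fields are hypothetical.
    label: str
    description: str = ""


@dataclass
class StudyData:
    # Placeholder for the full set of freq-related items from one study.
    # The 'source' and 'items' attributes are hypothetical.
    source: str
    items: dict = field(default_factory=dict)


@dataclass
class Proposal3Statement:
    subject: str                                   # Discrete Variation (1..1)
    predicate: str                                 # code (1..1)
    descriptor: Population                         # Population (1..1)
    frequency_qualifier: Optional[float] = None    # (0..1), float option
    evidence: List[StudyData] = field(default_factory=list)  # (0..m)

    def __post_init__(self):
        # Reject predicate codes outside the proposed value set.
        if self.predicate not in PREDICATES:
            raise ValueError(f"unknown predicate: {self.predicate}")
```

A statement with no `frequency_qualifier` and no `evidence` is still valid here, which matches the idea that the core statement can stand without quantification.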

Comments:

I perceive a few advantages of capturing the quantifying data as evidence:
- it makes it more acceptable to include 'supporting/secondary' data about ref allele and genotype frequencies as information that may have been evaluated to make the primary assertion of frequency for the subject variant, while keeping this secondary/supporting/contextualizing info out of the primary statement.
- it more clearly supports meta-statements that use several study-specific population frequency data sets to make an inferred assertion about the global frequency of a particular variant. Here, the Population object for the root meta-PF statement would represent an actual global population of individuals. And there may be multiple prior PF statements captured as evidence that describe calculated frequencies in specific study-populations (e.g. ExAC, ESP, 1000 Genomes) that provided the basis for the inferred frequency asserted in the meta PF statement.
- placing the Study Data outside the statement will let us more richly model the provenance of this data (should we want to provide detailed provenance about the Study that produced it)
It is important for the model to be clear that the statement holds only for the Population of individuals in which the variant was actually interrogated in the study - and not the global population of any person who is of the indicated ethnicity (unless such an inference is explicitly made). Consider the example here which presents various calculations about allele count/frequency from a study of Non-Finnish Europeans (NFE) ascertained in the ExAC dataset. The scope of the 'statement' made in this annotation is limited to the participants in said study, and makes no explicit inferences about the overall frequency of a variant in the global population of NFEs. The structure/semantics of the Population object in the descriptor slot will need to make this clear.

javild commented 5 years ago

From my point of view, what is really interesting here are the actual frequency values/counts, i.e. I'd go for an approach in which these are part of the main statement. In particular:

frequency data in the descriptor slot, the Population as the qualifier, and use a single predicate like 'has frequency'. In plain language, this structured statement would read something like "Variant X has frequency Y in Population Z"

As I see it, during an analysis one is interested in knowing the actual values rather than whether the variant has simply been observed or not. Placing the frequency values in the evidence kind of hides these. Evidence data is critical, but during an analysis I would expect to look in Evidence fields only very occasionally, and just for traceability purposes.

melissacline commented 5 years ago

Two issues:

Echoing the point above, it's important to know if the counts were sufficient for the frequency to be meaningful. The counts could be part of the evidence.
When the variant is not observed in a population, can you determine if this was due to technical artifacts (coverage / read depth / etc)? This could be expressed through the presence or absence of flags in the evidence.
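As a rough illustration of both issues, the sketch below separates "not observed" from "could not be assessed", and marks observations whose counts are too small for the frequency to be meaningful. The function name, the returned labels, the `low_coverage` flag, and the default threshold of 100 alleles are all hypothetical choices for this example, not values taken from any data source.

```python
def classify_observation(allele_count: int,
                         allele_number: int,
                         min_allele_number: int = 100,
                         low_coverage: bool = False) -> str:
    """Label an observation given counts and a coverage flag.

    allele_count: times the variant allele was seen.
    allele_number: total alleles interrogated at the site.
    min_allele_number: hypothetical threshold for a meaningful frequency.
    low_coverage: hypothetical flag carried in the evidence.
    """
    if allele_count < 0 or allele_number < 0 or allele_count > allele_number:
        raise ValueError("counts must satisfy 0 <= allele_count <= allele_number")

    # The counts are considered reliable only with enough alleles and no flag.
    reliable = allele_number >= min_allele_number and not low_coverage

    if allele_count > 0:
        return "observed" if reliable else "observed_frequency_unreliable"
    # Zero count: absence is only informative when the site was well assessed.
    return "not_observed" if reliable else "not_assessable"
```

With this shape, a consumer can tell a true absence (`not_observed`) from an absence that may be a technical artifact (`not_assessable`) without inspecting the raw evidence.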

mbrush commented 5 years ago

Decisions and issues from recent calls (see minutes for more):

1. PF Statement structure and semantics:

For core PF statement, we will capture the variation as the subject, frequency data as descriptor, and population as qualifier - as this more directly highlights the frequency data. Full proposal is here.
There were no major concerns with the alternate approach where the population is the descriptor and freq data is the qualifier (esp if we just use a single ‘was_measured_in’ predicate so we don’t have to worry about the match between observed/not_observed predicates and the frequency values in the Data item). So we will keep this in our back pocket if use cases arise that are best addressed by this model.
For now we will define a single/generic PopFreq statement type that covers any type of pop and any type of variation. T.b.d. if there is utility in specialization for specific types of variations and/or populations.
TO DO: if/how to capture the ref allele for annotation on an alt allele - see #44.
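A minimal Python sketch of the agreed slot assignment (variation as subject, frequency data as descriptor, population as qualifier) might look like the following. The class and attribute names are hypothetical, the population is reduced to a plain label for brevity, and the `has_frequency` predicate is borrowed from the suggestion earlier in this thread rather than from a final decision.

```python
from dataclasses import dataclass


@dataclass
class FrequencyData:
    # Stand-in for the frequency data in the descriptor slot.
    # A single float is an assumption made to keep the sketch short.
    allele_frequency: float


@dataclass
class PopFreqStatement:
    subject: str                      # the variation
    descriptor: FrequencyData         # frequency data
    qualifier: str                    # population, reduced to a label here
    predicate: str = "has_frequency"  # hypothetical single predicate

    def to_sentence(self) -> str:
        # Plain-language reading of the structured statement.
        return (f"Variant {self.subject} has frequency "
                f"{self.descriptor.allele_frequency} "
                f"in Population {self.qualifier}")


stmt = PopFreqStatement(subject="X", descriptor=FrequencyData(0.25), qualifier="Z")
print(stmt.to_sentence())
```

Because there is a single predicate, nothing in this layout needs to be kept in sync with the frequency value, which is the concern noted above for the observed/not_observed predicates.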

2. Frequency Study Data object:

attributes in this object will collect counts and frequencies of the subject variation, and counts/frequencies of the variation in its homo/hetero/hemizygous state.
ref allele frequencies and frequencies of the variation in sex-specific sub-populations will be captured in different PF statement instances (but potentially linked to related statements about the alt allele - t.b.d. how)
these decisions are reflected in the proposed PF Study Data object here.
no specific data type(s) will be absolutely required to include in a PF statement
data examples can be found here
TO DO: settle on attribute names, and whether total indiv and total variation counts belong here or in population object. And decide whether to recommend/enforce practice of separating freq statements based on exome vs genome sequencing data.
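One possible shape for such an object is sketched below in Python. Since attribute names are still to be settled, every name here is a placeholder. All fields are optional, reflecting the decision that no specific data type is absolutely required, and a small helper lists which items a given source actually reported.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class PFStudyData:
    # Counts and frequency of the subject variation (names are placeholders).
    allele_count: Optional[int] = None
    allele_frequency: Optional[float] = None
    # Counts of the variation by zygosity state.
    homozygote_count: Optional[int] = None
    heterozygote_count: Optional[int] = None
    hemizygote_count: Optional[int] = None
    # Frequencies of the variation by zygosity state.
    homozygote_frequency: Optional[float] = None
    heterozygote_frequency: Optional[float] = None
    hemizygote_frequency: Optional[float] = None

    def reported_items(self) -> List[str]:
        # Names of the attributes that were actually provided, in field order.
        return [f.name for f in fields(self) if getattr(self, f.name) is not None]
```

Total individual and total variation counts are left out on purpose, since it is still open whether they belong here or in the population object.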

3. Evidence and Provenance Modeling

work in progress - plan to define simple model for v0 prototype release. See #43.

4. Modeling Population themselves

work in progress - see #39

larrybabb commented 5 years ago

@mbrush I apologize that I may have missed some key background to this effort, but I was curious why we would model the storage of the frequency percentages (as floats) when that can be derived from the underlying count data that is critical for folks to have in determining when and if they can utilize the associated annotations. Can you clarify the need to model both the counts that are used to derive the frequency percentages as well as the derived frequency percentages themselves?

rrfreimuth commented 5 years ago

@larrybabb I assumed it was because there may be instances when the frequency (percentage) is known but the underlying counts are not recorded

mbrush commented 5 years ago

@larrybabb I think we want to allow folks to capture counts, derived frequencies, or both. This reflects the real world landscape of data we came across in our requirements work. There are indeed data providers that provide only the frequencies (e.g. GEL/CellBase data here). In this case, as Bob said, we don't know the counts because they are not reported. But I do agree that not knowing the underlying counts limits the utility of the data - and ideally we would always get them. Also, I assume efforts like CellBase provide the frequencies directly because they are more immediately useful in some applications. So there is benefit to making them available without the need to compute.

javild commented 5 years ago

@larrybabb sometimes it is not trivial to get the correct counts indeed: if you take into account PAR/non-PAR regions, missing genotypes, overlapped variants (e.g. deletion & SNV), and so on... We've had some problems with this in the past. In these cases, if the source is providing already calculated allele frequencies (gnomAD, for example) you might just want to stick to that and obviously refer to the source version
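The "counts, frequencies, or both" position discussed in the comments above can be sketched as a small resolver. This is only an illustration: the function name, parameter names, and the tolerance value are assumptions, and dividing count by total deliberately ignores the complications just mentioned (PAR/non-PAR regions, missing genotypes, overlapping variants).

```python
import math
from typing import Optional


def resolve_frequency(allele_count: Optional[int] = None,
                      allele_number: Optional[int] = None,
                      reported_frequency: Optional[float] = None,
                      abs_tol: float = 1e-4) -> Optional[float]:
    """Return a frequency from whatever the source provides.

    Prefers the source's own reported frequency when present, derives one
    from counts otherwise, and returns None when neither is available.
    """
    derived = None
    # Derive only when both counts exist and the denominator is non-zero.
    if allele_count is not None and allele_number:
        derived = allele_count / allele_number

    if reported_frequency is not None:
        # When both are available, check that they roughly agree.
        if derived is not None and not math.isclose(
                derived, reported_frequency, abs_tol=abs_tol):
            raise ValueError("reported frequency disagrees with counts")
        return reported_frequency

    return derived
```

Preferring the reported value follows the point about sticking with a source's pre-calculated frequencies. The consistency check is an extra safeguard added for this sketch and could just as well be a warning.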

mbrush commented 5 years ago

Current proposal is in modeling spreadsheet here. Reviewed on July 10 VA call - general agreement to move ahead with this model. We also had a nice conversation about whether the PF statement here asserts the freq in the study population, or an extrapolation to the global population. We agreed it was the former - i.e. these statements merely report data about the measured study populations/cohorts, and do not represent inferred conclusions beyond this w.r.t. frequency in the global population. This is left to the data consumer to infer if they see fit to do so. We just provide the study data and metadata to allow them to decide if this extrapolation/inference is warranted. Labels and definitions on the Statement and Population objects have been updated to reflect this.
