ga4gh / va-spec

An information model for representing variant annotations.

Population Frequency Annotation Definition and Scope #38

Open mbrush opened 5 years ago

mbrush commented 5 years ago

Definition: a statement about the observed frequency of a variant in a defined group of individual organisms (or derived biological specimens).

Comments/Considerations: See the document here for a more complete list of considerations and requirements. Below are only the most pressing issues and proposals related to defining an initial pass at the PF Statement model.


Statement Components

Variation:

Population

Frequency-related data items

mbrush commented 5 years ago

ACM-based semantics of the PF statement

There are a few ways we could approach this. The variant is of course the subject. The proposals below use the Population as the descriptor, and use a predicate indicating that a variant was or was not observed in the population. The data items that quantify the core observation are captured in the qualifier slot (and/or as evidence metadata). But another valid approach could be to place the frequency data in the descriptor slot, the Population as the qualifier, and use a single predicate like 'is_described_by'. We initially propose the former approach, however, because it is a bit more flexible (e.g. it can capture a more foundational statement that a variant exists in a population, with no quantification), and because this semantic structure aligns with how the BioLink model we have talked about models population frequency.


Proposal 1:

Here the study data qualifies (quantifies) the frequency at which the variant was observed in the indicated population. But this frequency data can also be considered evidence supporting the core statement that the variant was or was not observed in the indicated Population. So an alternative approach is to capture the Study Data object holding this data as evidence, instead of qualifying the primary statement itself:

Proposal 2:

Here, the frequency data is not 'semantically' part of the core statement, but the structure connecting the subject variant to this data is the same as in Proposal 1 (i.e. replace the attribute label "evidence" with "qualifier" and you essentially have Proposal 1).

Proposal 3:

Finally, a hybrid between these approaches would be to allow only the core variant frequency value to qualify the statement (since this is primarily what the PF VA type is about), and capture the full set of Study Data as evidence.

I like this because it keeps the actual statement focused on the core piece of data advertised for this VA type - maintaining its purity in a sense, while also providing a full picture of all relevant data that support calculation and interpretation of this frequency statement, as evidence.
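
To make the difference between the three proposals concrete, here is a minimal Python sketch. The slot names follow the discussion above, but the class names, the `observed_in` predicate string, the `StudyData` fields, and the example values are all illustrative assumptions, not the agreed VA model.

```python
from dataclasses import dataclass, field
from typing import Any, List, Optional


@dataclass
class StudyData:
    """Hypothetical container for the frequency-related data items."""
    allele_count: Optional[int] = None
    allele_number: Optional[int] = None
    allele_frequency: Optional[float] = None


@dataclass
class PFStatement:
    """Hypothetical PF statement: variant (subject) observed in a Population (descriptor)."""
    subject: str
    predicate: str
    descriptor: str
    qualifier: Any = None
    evidence: List[StudyData] = field(default_factory=list)


def build_statement(proposal: int, variant: str, population: str, data: StudyData) -> PFStatement:
    """Place the study data according to Proposal 1, 2, or 3."""
    stmt = PFStatement(subject=variant, predicate="observed_in", descriptor=population)
    if proposal == 1:
        # Proposal 1: the full Study Data object qualifies the statement.
        stmt.qualifier = data
    elif proposal == 2:
        # Proposal 2: the Study Data object is attached only as evidence.
        stmt.evidence.append(data)
    elif proposal == 3:
        # Proposal 3 (hybrid): only the core frequency value qualifies the
        # statement; the full Study Data object is kept as evidence.
        stmt.qualifier = data.allele_frequency
        stmt.evidence.append(data)
    else:
        raise ValueError(f"unknown proposal: {proposal}")
    return stmt


# Made-up example values, for illustration only.
example = StudyData(allele_count=12, allele_number=4000, allele_frequency=0.003)
hybrid = build_statement(3, "Variant X", "Population Z", example)
```

In this sketch the only difference between Proposals 1 and 2 is which attribute holds the `StudyData` object, which mirrors the point made above about the shared structure.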


Comments:

javild commented 5 years ago

From my point of view, what is really interesting here are the actual frequency values/counts, i.e. I'd go for an approach in which these are part of the main statement. In particular:

frequency data in the descriptor slot, the Population as the qualifier, and use a single predicate like 'has frequency'. In plain language, this structured statement would read something like "Variant X has frequency Y in Population Z"
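
As a rough illustration of this alternative layout, the sketch below puts the frequency in the descriptor slot and the Population in the qualifier slot, then renders the plain-language reading. The class name, the `has_frequency` predicate string, and the example values are hypothetical.

```python
from typing import NamedTuple


class FrequencyStatement(NamedTuple):
    """Hypothetical layout: frequency as descriptor, Population as qualifier."""
    subject: str       # the variant
    predicate: str     # a single predicate, e.g. "has_frequency"
    descriptor: float  # the frequency value
    qualifier: str     # the Population


def to_plain_language(stmt: FrequencyStatement) -> str:
    """Render the structured statement as a readable sentence."""
    predicate_text = stmt.predicate.replace("_", " ")
    return (
        f"Variant {stmt.subject} {predicate_text} "
        f"{stmt.descriptor} in Population {stmt.qualifier}"
    )


# Made-up example values, for illustration only.
stmt = FrequencyStatement("X", "has_frequency", 0.003, "Z")
sentence = to_plain_language(stmt)  # "Variant X has frequency 0.003 in Population Z"
```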

As I see it, during an analysis one is interested in knowing the actual values rather than whether the variant has simply been observed or not. Placing the frequency values in the evidence kind of hides these. Evidence data is critical, but during an analysis I would expect to look in Evidence fields only occasionally, and just for traceability purposes.

melissacline commented 5 years ago

Two issues:

  1. Echoing the point above, it's important to know if the counts were sufficient for the frequency to be meaningful. The counts could be part of the evidence.
  2. When the variant is not observed in a population, can you determine if this was due to technical artifacts (coverage / read depth / etc)? This could be expressed through the presence or absence of flags in the evidence.
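
Both points could be sketched as a small check over the evidence. This is only an illustration: the field names, the `low_coverage` flag, and the threshold of 2000 alleles are assumptions made for the example, not values proposed in this thread.

```python
from dataclasses import dataclass, field
from typing import Set


@dataclass
class FrequencyEvidence:
    """Hypothetical evidence record holding counts and quality flags."""
    allele_count: int
    allele_number: int
    flags: Set[str] = field(default_factory=set)


# Arbitrary illustrative threshold; a real cutoff depends on the use case.
MIN_ALLELE_NUMBER = 2000


def interpret(ev: FrequencyEvidence, min_allele_number: int = MIN_ALLELE_NUMBER) -> str:
    """Classify an observation using counts (issue 1) and flags (issue 2)."""
    if ev.allele_number < min_allele_number:
        # Too few alleles were assessed for the frequency to be meaningful.
        return "insufficient_counts"
    if ev.allele_count == 0:
        # Absence may reflect a technical artifact rather than true absence.
        if "low_coverage" in ev.flags:
            return "not_observed_possible_artifact"
        return "not_observed"
    return "observed"
```

For example, `interpret(FrequencyEvidence(0, 5000, {"low_coverage"}))` returns `"not_observed_possible_artifact"`, while the same counts without the flag return `"not_observed"`.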

mbrush commented 5 years ago

Decisions and issues from recent calls (see minutes for more):

1. PF Statement structure and semantics:

2. Frequency Study Data object:

3. Evidence and Provenance Modeling

4. Modeling Population themselves

larrybabb commented 5 years ago

@mbrush I apologize that I may have missed some key background to this effort, but I was curious why we would model the storage of the frequency percentages (as floats) when that can be derived from the underlying count data that is critical for folks to have in determining when and if they can utilize the associated annotations. Can you clarify the need to model both the counts that are used to derive the frequency percentages as well as the derived frequency percentages themselves?

rrfreimuth commented 5 years ago

@larrybabb I assumed it was because there may be instances when the frequency (percentage) is known but the underlying counts are not recorded

mbrush commented 5 years ago

@larrybabb I think we want to allow folks to capture counts, derived frequencies, or both. This reflects the real-world landscape of data we came across in our requirements work. There are indeed data providers that provide only the frequencies (e.g. GEL/CellBase data here). In this case, as Bob said, we don't know the counts because they are not reported. But I do agree that not knowing the underlying counts limits the utility of the data - and ideally we would always get them. Also, I assume efforts like CellBase provide the frequencies directly because they are more immediately useful in some applications. So there is benefit to making them available without the need to compute.
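
A minimal sketch of a data object that accepts counts, a reported frequency, or both might look like the following. The class and field names are hypothetical, and preferring the source-reported frequency over one derived from counts is a design choice made for this example only.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FrequencyData:
    """Hypothetical record: counts, a reported frequency, or both may be present."""
    allele_count: Optional[int] = None
    allele_number: Optional[int] = None
    reported_frequency: Optional[float] = None

    def has_counts(self) -> bool:
        """True when both underlying counts were provided by the source."""
        return self.allele_count is not None and self.allele_number is not None

    def frequency(self) -> Optional[float]:
        """Return the reported frequency if present, else derive it from counts."""
        if self.reported_frequency is not None:
            return self.reported_frequency
        if self.has_counts() and self.allele_number > 0:
            return self.allele_count / self.allele_number
        return None  # neither a frequency nor usable counts are available


# Made-up example values, for illustration only.
counts_only = FrequencyData(allele_count=5, allele_number=1000)
frequency_only = FrequencyData(reported_frequency=0.02)
```

Here `counts_only.frequency()` is computed on demand, while `frequency_only.has_counts()` is `False`, which lets a consumer see that the underlying counts were not reported.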

javild commented 5 years ago

@larrybabb indeed, sometimes it is not trivial to get the correct counts: you have to take into account PAR/non-PAR regions, missing genotypes, overlapping variants (e.g. deletion & SNV), and so on... We've had some problems with this in the past. In these cases, if the source already provides calculated allele frequencies (gnomAD, for example), you might just want to stick to those and obviously refer to the source version.
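
To illustrate why the counts are not trivial, the simplified sketch below counts alleles at a chromosome X site. It assumes genotypes are tuples of allele indices (0 = reference, 1 = alternate), that a missing genotype is `None`, and that males contribute one allele outside the PAR. The function name and sample data are made up, and real pipelines handle many more cases (half-calls, overlapping variants, and so on).

```python
from typing import Iterable, Optional, Tuple

Genotype = Optional[Tuple[int, ...]]


def count_alleles(samples: Iterable[Tuple[str, Genotype]], non_par_x: bool) -> Tuple[int, int]:
    """Return (allele_count, allele_number) for the alternate allele (index 1)."""
    allele_count = 0
    allele_number = 0
    for sex, genotype in samples:
        if genotype is None:
            # Missing genotypes contribute nothing to either count.
            continue
        if non_par_x and sex == "male":
            # Males are hemizygous outside the PAR: count a single allele,
            # even if the caller emitted a diploid genotype.
            alleles = genotype[:1]
        else:
            alleles = genotype
        allele_number += len(alleles)
        allele_count += sum(1 for allele in alleles if allele == 1)
    return allele_count, allele_number


# Made-up samples, for illustration only.
samples = [
    ("female", (0, 1)),
    ("male", (1, 1)),
    ("female", (0, 0)),
    ("female", None),  # missing genotype
]
non_par = count_alleles(samples, non_par_x=True)      # (2, 5)
naive_diploid = count_alleles(samples, non_par_x=False)  # (3, 6)
```

With these samples the frequency is 2/5 when hemizygosity is respected but 3/6 under a naive diploid count, so the same genotypes yield different frequencies depending on how the counts are derived.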

mbrush commented 5 years ago

The current proposal is in the modeling spreadsheet here. It was reviewed on the July 10 VA call, with general agreement to move ahead with this model. We also had a nice conversation about whether the PF statement here asserts the frequency in the study population, or an extrapolation to the global population. We agreed it was the former - i.e. these statements merely report data about the measured study populations/cohorts, and do not represent inferred conclusions beyond this w.r.t. frequency in the global population. This is left to the data consumer to infer if they see fit to do so. We just provide the study data and metadata to allow them to decide if this extrapolation/inference is warranted. Labels and definitions on the Statement and Population objects have been updated to reflect this.