jeromekelleher closed 9 years ago
i am working on this
hey @jeromekelleher are callSetIds for variants? or samples?
Joe, a `CallSet` has its own reference to a `sampleId`:
/**
A `CallSet` is a collection of variant calls for a particular sample.
It belongs to a `VariantSet`. This is equivalent to one column in VCF.
*/
record CallSet {
  /** The call set ID. */
  string id;
  /** The call set name. */
  union { null, string } name = null;
  /** The sample this call set's data was generated from. */
  union { null, string } sampleId;
Also:
`Variant` and `CallSet` both belong to a `VariantSet`.
`VariantSet` belongs to a `Dataset`.
The variant set is equivalent to a VCF file.
Hope it helps, Paul
cool! thanks. just wanna make sure
Sure, you're welcome :) It took me a while as well to clarify the definitions for myself. Here's a diagram accompanied by a few definitions that clarified things for me (which can be found on http://ga4gh.org/#/api):
A GAVariant represents a change in DNA sequence relative to some reference. For example, a variant could represent a SNP or an insertion. Variants belong to a GAVariantSet. This is equivalent to a row in VCF.
Hope it helps, Paul
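To make the row/column correspondence above concrete, here is a minimal sketch in plain Python. It is not the server's code: the `parse` helper, the dictionary keys, and the two-sample VCF fragment are all made up for illustration. Each data row becomes one variant, and each sample column in the `#CHROM` header becomes one call set.

```python
# Illustrative only: a hand-written VCF fragment with two sample columns.
VCF_TEXT = """\
##fileformat=VCFv4.1
#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tsampleA\tsampleB
1\t100\trs1\tA\tG\t.\tPASS\t.\tGT\t0/1\t1/1
1\t200\trs2\tC\tT\t.\tPASS\t.\tGT\t0/0\t0/1
"""


def parse(vcf_text):
    """Hypothetical helper: return (call set names, variants) for a VCF string."""
    call_set_names = []
    variants = []
    for line in vcf_text.splitlines():
        if line.startswith("##"):
            continue  # meta-information lines carry no variants or samples
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            # Columns after FORMAT are the samples: one call set per column.
            call_set_names = fields[9:]
            continue
        # Every remaining line is one variant: one row of the file.
        variants.append({
            "chrom": fields[0],
            "pos": int(fields[1]),
            "ref": fields[3],
            "alt": fields[4],
            # One call per call set, keyed by the sample column name.
            "calls": dict(zip(call_set_names, fields[9:])),
        })
    return call_set_names, variants


call_set_names, variants = parse(VCF_TEXT)
print(call_set_names)
for variant in variants:
    print(variant["pos"], variant["calls"])
```

So a callSetId selects a column (a sample's calls across all variants), not a row.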
Fixed in #264
We need to support searching by CallSetIds in the HtslibVariantSet. The implementation should roughly mirror the WormtableVariantSet version: maintain a map of callSetIds to sample indexes, and only retrieve the genotypes for these indexes.
See the pysam source for details and documentation: https://github.com/pysam-developers/pysam/blob/master/pysam/cbcf.pyx
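As a rough sketch of the approach described above (a map of callSetIds to sample indexes, then retrieving only the genotypes at those indexes), something like the following could work. It deliberately avoids the pysam API and models a record's genotypes as a plain list. The function names, the `variantSetId.sampleName` ID scheme, and the choice to silently skip unknown IDs are all assumptions made for illustration, not what HtslibVariantSet or WormtableVariantSet actually do.

```python
def build_call_set_id_map(variant_set_id, sample_names):
    """Map each callSetId to the index of its sample column.

    Assumption: a callSetId is "<variantSetId>.<sampleName>".
    """
    return {
        "{}.{}".format(variant_set_id, name): index
        for index, name in enumerate(sample_names)
    }


def select_sample_indexes(call_set_id_map, call_set_ids=None):
    """Return the sample indexes to read, in file order.

    With call_set_ids=None every sample is returned, which covers the case
    where the caller does not specify any callSetIds.
    Assumption: unknown IDs are skipped rather than raising an error.
    """
    if call_set_ids is None:
        return sorted(call_set_id_map.values())
    return sorted(
        call_set_id_map[call_set_id]
        for call_set_id in set(call_set_ids)
        if call_set_id in call_set_id_map
    )


def genotypes_for(record_genotypes, sample_indexes):
    """Pick out only the genotypes at the requested sample indexes."""
    return [record_genotypes[index] for index in sample_indexes]


samples = ["sampleA", "sampleB", "sampleC"]
id_map = build_call_set_id_map("vs1", samples)
row = ["0/1", "1/1", "0/0"]  # stand-in for one record's per-sample genotypes

subset = genotypes_for(
    row, select_sample_indexes(id_map, ["vs1.sampleC", "vs1.sampleA"]))
everything = genotypes_for(row, select_sample_indexes(id_map))
print(subset)
print(everything)
```

Building the map once per variant set keeps the per-record work down to a handful of list lookups, regardless of how many samples the file contains.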
We also require testing for this, as well as for the case in which we do not specify the callSetIds. We should test the results using PyVCF. See `testSeachByCallSetIds` in `tests/unit/test_backends.py` for similar tests.