ga4gh / ga4gh-server

Reference implementation of the APIs defined in ga4gh-schemas. RETIRED 2018-01-24
http://ga4gh.org
Apache License 2.0

Searching by CallSetIds not supported by HtslibVariantSet #221

Closed jeromekelleher closed 9 years ago

jeromekelleher commented 9 years ago

We need to support searching by CallSetIds in the HtslibVariantSet. The implementation should roughly mirror the WormtableVariantSet version: maintain a map of callSetIds to sample indexes, and only retrieve the genotypes for these indexes.
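
A minimal sketch of the approach described above, not the actual HtslibVariantSet code. It assumes each callSetId can be looked up directly from the sample order in the VCF header, that `None` means "no callSetIds specified", and that calls come back in the order requested. The function names `build_call_set_index` and `select_genotypes` are hypothetical.

```python
def build_call_set_index(sample_names):
    """Map each callSetId to its sample (column) index in the VCF header.

    Assumption: the callSetId is simply the sample name.
    """
    return {name: index for index, name in enumerate(sample_names)}


def select_genotypes(genotypes, call_set_index, call_set_ids=None):
    """Return (callSetId, genotype) pairs for the requested call sets only.

    `genotypes` holds one record's genotypes in header order.
    If `call_set_ids` is None, every genotype is returned in header order.
    An unknown callSetId raises KeyError.
    """
    if call_set_ids is None:
        call_set_ids = sorted(call_set_index, key=call_set_index.get)
    return [(cid, genotypes[call_set_index[cid]]) for cid in call_set_ids]


# Usage with made-up data for a single record.
index = build_call_set_index(["NA001", "NA002", "NA003"])
row = [(0, 0), (0, 1), (1, 1)]
subset = select_genotypes(row, index, ["NA003", "NA001"])
```

Building the map once per variant set keeps each lookup constant-time, so only the requested columns are touched for every record.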

See the pysam source for details and documentation: https://github.com/pysam-developers/pysam/blob/master/pysam/cbcf.pyx

We also require testing for this, and for the case in which we do not specify the callSetIds. We should test the results using PyVCF. We must test:

  1. That we get back the correct information for each genotype;
  2. That we correctly get all genotypes when we specify no callSetIds;
  3. That we get back the correct calls in the correct order when we specify some callSetIds. See testSeachByCallSetIds in tests/unit/test_backends.py for similar tests.

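
The three checks above could be sketched roughly as follows. This is an illustration only: `get_calls` is a hypothetical toy backend, the sample names and genotypes are made up, and in a real test the expected values would come from parsing the same VCF file with PyVCF rather than from a hard-coded dictionary.

```python
import unittest

# Hypothetical stand-in for one VCF record, keyed by sample name.
SAMPLES = ["NA001", "NA002", "NA003"]
RECORD_GENOTYPES = {"NA001": (0, 0), "NA002": (0, 1), "NA003": (1, 1)}


def get_calls(call_set_ids=None):
    """Toy backend under test: returns (callSetId, genotype) pairs."""
    index = {name: i for i, name in enumerate(SAMPLES)}
    row = [RECORD_GENOTYPES[name] for name in SAMPLES]
    if call_set_ids is None:
        call_set_ids = SAMPLES
    return [(cid, row[index[cid]]) for cid in call_set_ids]


class TestSearchByCallSetIds(unittest.TestCase):

    def test_genotype_values(self):
        # 1. Each genotype matches the independently known value.
        for call_set_id, genotype in get_calls():
            self.assertEqual(genotype, RECORD_GENOTYPES[call_set_id])

    def test_no_call_set_ids_returns_all(self):
        # 2. No callSetIds means every call set is returned.
        self.assertEqual([cid for cid, _ in get_calls()], SAMPLES)

    def test_subset_in_requested_order(self):
        # 3. A subset comes back complete and in the requested order.
        calls = get_calls(["NA003", "NA001"])
        self.assertEqual(calls, [("NA003", (1, 1)), ("NA001", (0, 0))])
```

Saved as a module, this runs with `python -m unittest`.
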
shajoezhu commented 9 years ago

i am working on this

shajoezhu commented 9 years ago

Hey @jeromekelleher, are callSetIds for variants or for samples?

pgrosu commented 9 years ago

Joe, a CallSet has its own reference to a sampleId:

```
/**
A `CallSet` is a collection of variant calls for a particular sample.
It belongs to a `VariantSet`. This is equivalent to one column in VCF.
*/
record CallSet {

  /** The call set ID. */
  string id;

  /** The call set name. */
  union { null, string } name = null;

  /** The sample this call set's data was generated from. */
  union { null, string } sampleId;

  // ... remaining fields omitted from this excerpt
}
```

also

`Variant` and `CallSet` both belong to a `VariantSet`.
`VariantSet` belongs to a `Dataset`.
The variant set is equivalent to a VCF file.
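
To make the containment concrete, here is a rough Python sketch of those relationships. It is not the ga4gh-schemas definition: the classes are illustrative, only `id`, `name` and `sampleId` come from the record quoted above, and the `Variant` fields and the `dataset_id` attribute are assumptions.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class CallSet:
    """One sample's calls; corresponds to one column of a VCF file."""
    id: str
    name: Optional[str] = None
    sample_id: Optional[str] = None


@dataclass
class Variant:
    """One site; corresponds to one row of a VCF file (fields are illustrative)."""
    id: str
    reference_name: str
    start: int


@dataclass
class VariantSet:
    """Corresponds to one VCF file; belongs to a dataset."""
    id: str
    dataset_id: str
    call_sets: List[CallSet] = field(default_factory=list)
    variants: List[Variant] = field(default_factory=list)


# Made-up identifiers, for illustration only.
variant_set = VariantSet(
    id="vs1",
    dataset_id="ds1",
    call_sets=[CallSet(id="cs1", sample_id="NA001")],
    variants=[Variant(id="v1", reference_name="1", start=100)],
)
```

So a callSetId identifies a sample's column of calls, not an individual variant.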

Hope it helps, Paul

shajoezhu commented 9 years ago

cool! thanks. just wanna make sure

pgrosu commented 9 years ago

Sure, you're welcome :) It took me a while as well to clarify the definitions for myself. Here's a diagram accompanied by a few definitions that clarified things for me (which can be found on http://ga4gh.org/#/api):

[Diagram: variants]

A GAVariant represents a change in DNA sequence relative to some reference. For example, a variant could represent a SNP or an insertion. Variants belong to a GAVariantSet. This is equivalent to a row in VCF.

Hope it helps, Paul

dcolligan commented 9 years ago

Fixed in #264