broadinstitute / gamgee

A C++14 library for NGS data formats
http://broadinstitute.github.io/gamgee/
MIT License
40 stars 13 forks source link

Fill holes in Genotype encoding API #373

Open droazen opened 9 years ago

droazen commented 9 years ago

The Genotype::encode_genotype() API currently doesn't allow you to phase only some alleles and not others, and also has no function suitable for encoding multiple genotypes at once with phasing.

Note that if we end up removing the requirement to manually encode genotypes as proposed in https://github.com/broadinstitute/gamgee/issues/371, we will still need to provide the equivalent of the above functionality in VariantBuilder.

MauricioCarneiro commented 9 years ago

We need @vruano and @rpoplin in this discussion. Imagine they were writing a variant caller that is phasing aware. They will need to write these out in a sensible way. There is no good model API neither in Tribble or htslib for that. We need to come up with one and people will follow.

droazen commented 9 years ago

One possibility is to have a simple GenotypeAllele struct -- eg.,

struct GenotypeAllele {
    int32_t allele;
    bool phased;
};

Then clients could pass a vector to the builder instead of a vector. This would remove the need to specify phasing information separately, but the disadvantage is that the user's genotype vector could no longer be encoded in-place and passed to htslib -- we would always have to make a copy.

droazen commented 9 years ago

Spoke with Ryan and he feels that it would be very rare for a tool to need to create genotypes with mixed phasing like 0/1|2, so we definitely shouldn't slow down the common use cases for the sake of allowing this.

One possibility is to continue to have functions that take vector for the unphased case, and if phasing is desired you would either create a vector as proposed above, or pass in a companion vector with the phasing information.

MauricioCarneiro commented 9 years ago

I like what @rpoplin said. So anything that requires crazy phasing we can just do inefficiently. Until it becomes a use case for us. But keep the common case fast.

richardbrownsb commented 9 years ago

Hey, just found this online after a search to do with a requirement of mine. What is the current workflow supposed to be when trying to get phased alleles? I have looked at the set genotype functions, and they all call encode_genotypes with phased set to false. I just can't seem to find the right way to access this functionality.

Also, on a related point, what I am actually trying to achieve is to encode a vcf with some of the samples phased and others not - what is the most efficient way to achieve this?

richardbrownsb commented 9 years ago

I guess what I would really want is a set genotypes utility function, called with an additional vector of bools denoting phasing

richardbrownsb commented 9 years ago

Sorry, just seen on the most recent branch that development has stopped.