Closed aryarm closed 1 year ago
I'm keeping this as a draft PR because there are still a few things to do:
- [x] update the haptools data docs
- [x] add tests for multi-allelic reading and writing
- [x] figure out desired behavior for handling unphased and non-missing genotypes in `simgenotype` @mlamkin7
  - [x] and depending on what we decide, we might also want to document the decision?
- [x] add support for writing missing genotypes in the `GenotypesVCF` and `GenotypesAncestry` classes? and add tests for this
- [x] update the API docs
Melissa thinks it's best if we support writing missing genotypes and store them instead of removing the sample or variant. For handling unphased we can keep as is right now.
ok, great - I'll add support for missing genotypes soon
> For handling unphased we can keep as is right now.
ok, so we don't want to check that the reference panel is phased, then?
Correct
In PR #163, we started using the `GenotypesRefAlt` class in `simgenotype` to make it consistent with our other tools. Unfortunately, the `GenotypesRefAlt` class only supports biallelic genotypes. But since `simgenotype` had previously supported multi-allelic variants, this became a regression.

This PR adds support for multi-allelic variants in the `GenotypesRefAlt` class and its children by changing the `variants` property to store a variable-length list of alleles instead of assuming that there are only ever two alleles. This is officially a BREAKING change to the haptools data API, specifically for the `GenotypesRefAlt` class, the `GenotypesPLINK` class and the `GenotypesAncestry` class!

Also, the `GenotypesRefAlt` class will be officially renamed to `GenotypesVCF`! It's something I've wanted to do for a while and figured we might as well do it now while we're breaking everything, anyway.

Note that this PR does not add support for tandem repeats yet. You'll be able to read and write multi-allelic variants in the `GenotypesRefAlt` class but not much else besides that. For example, you shouldn't use this class for association analyses of multi-allelic variants because the `data` property of the class simply stores their index, not their dosage.
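
To illustrate the index-versus-dosage caveat, here is a minimal sketch in plain Python. It does not call haptools at all: the `alleles` and `genotypes` lists and the `allele_dosages` helper are hypothetical stand-ins for a variable-length allele list and for genotypes stored as allele indices, so treat the layout as an assumption rather than the library's actual representation.

```python
# Hypothetical toy data; nothing here is read through haptools.
# alleles[v] lists the alleles of variant v: REF first, then each ALT.
alleles = [("T", "C"), ("A", "G", "T")]

# genotypes[s][v] holds the two allele *indices* carried by sample s at variant v
genotypes = [
    [(0, 1), (0, 2)],  # sample 0
    [(1, 1), (2, 1)],  # sample 1
]


def allele_dosages(genotypes, alleles, variant_idx):
    """Return, per sample, the number of copies of each allele at one variant."""
    num_alleles = len(alleles[variant_idx])
    dosages = []
    for sample in genotypes:
        counts = [0] * num_alleles
        for allele_idx in sample[variant_idx]:
            counts[allele_idx] += 1
        dosages.append(counts)
    return dosages


# Biallelic variant: summing the indices happens to equal the ALT dosage
index_sums_biallelic = [sum(sample[0]) for sample in genotypes]

# Multi-allelic variant: summing the indices is not the dosage of any allele,
# because an index of 2 means "one copy of the second ALT", not "two copies"
index_sums_multiallelic = [sum(sample[1]) for sample in genotypes]

# Per-allele counts are what an association test would actually need
dosages_multiallelic = allele_dosages(genotypes, alleles, 1)
```

For a biallelic variant the two quantities coincide, which is why the distinction only matters once a variant has more than two alleles.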