CAST-genomics / haptools

Ancestry and haplotype aware simulation of genotypes and phenotypes for complex trait analysis
https://haptools.readthedocs.io
MIT License

fix: regression in multiallelic support for `simgenotype` #195

Closed aryarm closed 1 year ago

aryarm commented 1 year ago

In PR #163, we started using the GenotypesRefAlt class in simgenotype to make it consistent with our other tools. Unfortunately, the GenotypesRefAlt class only supports biallelic genotypes. But since simgenotype had previously supported multi-allelic variants, this became a regression.

This PR adds support for multi-allelic variants in the GenotypesRefAlt class and its children by changing the variants property to store a variable-length list of alleles instead of assuming that there are only ever two alleles. This is officially a BREAKING change to the haptools data API, specifically for the GenotypesRefAlt class, the GenotypesPLINK class and the GenotypesAncestry class!
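To illustrate the shape of that change, here is a minimal sketch. The `Variant` record and its field names are hypothetical and are not the actual haptools data structures. It only shows the idea of storing one variable-length tuple of alleles per variant instead of a fixed REF/ALT pair.

```python
from typing import NamedTuple, Tuple


class Variant(NamedTuple):
    """Hypothetical variant record; not the real haptools layout."""

    id: str
    chrom: str
    pos: int
    # REF comes first, followed by any number of ALT alleles
    alleles: Tuple[str, ...]


variants = [
    Variant("rs1", "1", 100, ("A", "G")),        # biallelic
    Variant("rs2", "1", 200, ("C", "T", "CT")),  # multi-allelic
]

# Code that used to assume exactly one ALT now has to handle any number.
allele_counts = [len(v.alleles) for v in variants]
alts = [v.alleles[1:] for v in variants]
print(allele_counts)
print(alts)
```

Anything that previously unpacked exactly two alleles per variant would need to be updated in a similar way, which is why this counts as a breaking change.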

Also, the GenotypesRefAlt class will be officially renamed to GenotypesVCF! It's something I've wanted to do for a while, and I figured we might as well do it now while we're breaking everything anyway.

Note that this PR does not add support for tandem repeats yet. You'll be able to read and write multi-allelic variants in the GenotypesRefAlt class but not much else besides that. For example, you shouldn't use this class for association analyses of multi-allelic variants because the data property of the class simply stores the index of each allele, not its dosage.
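A small NumPy sketch of why allele indices can't be treated as dosages. The array below is made up for illustration and assumes a (samples, variants, haplotypes) layout with VCF-style indices (0 = REF, 1 = first ALT, 2 = second ALT). It does not use the haptools classes themselves.

```python
import numpy as np

# Hypothetical allele indices for 3 samples at 1 tri-allelic variant.
# Axes: (samples, variants, haplotypes)
allele_idx = np.array(
    [
        [[0, 2]],  # REF / second ALT
        [[1, 1]],  # first ALT / first ALT
        [[2, 2]],  # second ALT / second ALT
    ],
    dtype=np.uint8,
)

# Summing indices as if they were dosages gives misleading values:
# samples 0 and 1 look identical, and sample 2 gets an impossible "4".
naive = allele_idx.sum(axis=2)


def allele_dosage(idx: np.ndarray, allele: int) -> np.ndarray:
    """Count the copies of one specific allele carried by each sample."""
    return (idx == allele).sum(axis=2)


print(naive.tolist())
print(allele_dosage(allele_idx, 1).tolist())
print(allele_dosage(allele_idx, 2).tolist())
```

For a biallelic variant the index and the ALT dosage happen to coincide, which is why this problem only shows up once multi-allelic variants are allowed.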

aryarm commented 1 year ago

I'm keeping this as a draft PR because there are still a few things to do:

  • [x] update the haptools data docs
  • [x] add tests for multi-allelic reading and writing
  • [x] figure out desired behavior for handling unphased and non-missing genotypes in simgenotype @mlamkin7
    • [x] and depending on what we decide, we might also want to document the decision?
  • [x] add support for writing missing genotypes in the GenotypesVCF and GenotypesAncestry classes? and add tests for this
  • [x] update the API docs

mlamkin7 commented 1 year ago

Melissa thinks it's best if we have support for writing missing genotypes and store that instead of removing the sample or variant. For handling unphased we can keep as is right now.
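As a rough sketch of what writing a missing genotype could look like, the snippet below formats a single VCF GT value. The `format_gt` helper and the use of `None` as the missing-allele marker are assumptions made for illustration and are not how haptools stores or writes genotypes. The only fixed parts are the VCF conventions: `.` for a missing allele, `|` for phased calls, and `/` for unphased calls.

```python
from typing import Optional, Sequence


def format_gt(alleles: Sequence[Optional[int]], phased: bool = True) -> str:
    """Format one sample's allele indices as a VCF GT string.

    Hypothetical helper: None marks a missing allele and is written as '.'.
    """
    sep = "|" if phased else "/"
    return sep.join("." if a is None else str(a) for a in alleles)


print(format_gt((0, 2)))
print(format_gt((0, None)))
print(format_gt((None, None), phased=False))
```

Keeping the missing call in the output this way preserves the sample and the variant, which is the behavior proposed above.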

aryarm commented 1 year ago

ok, great - I'll add support for missing genotypes soon

> For handling unphased we can keep as is right now.

ok, so we don't want to check that the reference panel is phased, then?

mlamkin7 commented 1 year ago

Correct