CAST-genomics / haptools

Ancestry and haplotype aware simulation of genotypes and phenotypes for complex trait analysis
https://haptools.readthedocs.io
MIT License
18 stars 4 forks source link

feat: convert `samples` arg in `Genotypes` classes into a set #230

Closed aryarm closed 9 months ago

aryarm commented 9 months ago

closes #228 and overrides it

background

The Genotypes.read and Genotypes.__init__ methods expect samples to be provided as a list but make no guarantee that the samples will be loaded in the ordering requested. (This is because the underlying libraries that we use do not make this guarantee either.)

We first attempted to solve this issue in #225 by automatically reordering after reading the genotypes. But then we realized this would double the memory of the read() method.

PR #228 takes a different approach to the problem by adding a reorder_samples argument to the Genotypes classes that would do the reordering only if it is requested. This approach would also have required us to do more work to implement extra checks (see https://github.com/CAST-genomics/haptools/pull/228#issuecomment-1816860996). It also got us asking, from a design perspective, what makes the samples argument any different than the variants argument, which is currently typed as a set?

proposal

We ultimately decided that a better design approach would be to just copy the type of the variants argument and make the samples argument into a set, as well. If a user passes anything that isn't a set, we warn them and request that they use the subset() method after loading the genotypes. This approach ended up being a bit cleaner since the GenotypesPLINK classes actually need it to be a set, anyway.

aryarm commented 9 months ago

@mlamkin7, what do you think of this? Can you double-check that I managed to make changes in all of the places that I needed to? Should I try to add any tests, since this PR doesn't really add any? Is there anything else I can try to do to make it more obvious that the samples will not follow the requested ordering or catch that mistake and warn about it before it happens?

mlamkin7 commented 9 months ago

@mlamkin7, what do you think of this? Can you double-check that I managed to make changes in all of the places that I needed to? Should I try to add any tests, since this PR doesn't really add any? Is there anything else I can try to do to make it more obvious that the samples will not follow the requested ordering or catch that mistake and warn about it before it happens?