Closed aryarm closed 9 months ago
@mlamkin7, what do you think of this? Can you double-check that I managed to make changes in all of the places that I needed to? Should I try to add any tests, since this PR doesn't really add any? Is there anything else I can try to do to make it more obvious that the samples will not follow the requested ordering or catch that mistake and warn about it before it happens?
@mlamkin7, what do you think of this? Can you double-check that I managed to make changes in all of the places that I needed to? Should I try to add any tests, since this PR doesn't really add any? Is there anything else I can try to do to make it more obvious that the samples will not follow the requested ordering or catch that mistake and warn about it before it happens?
closes #228 and overrides it
background
The
Genotypes.read
andGenotypes.__init__
methods expect samples to be provided as a list but make no guarantee that the samples will be loaded in the ordering requested. (This is because the underlying libraries that we use do not make this guarantee either.)We first attempted to solve this issue in #225 by automatically reordering after reading the genotypes. But then we realized this would double the memory of the
read()
method.PR #228 takes a different approach to the problem by adding a
reorder_samples
argument to theGenotypes
classes that would do the reordering only if it is requested. This approach would also have required us to do more work to implement extra checks (see https://github.com/CAST-genomics/haptools/pull/228#issuecomment-1816860996). It also got us asking, from a design perspective, what makes thesamples
argument any different than thevariants
argument, which is currently typed as a set?proposal
We ultimately decided that a better design approach would be to just copy the type of the
variants
argument and make thesamples
argument into a set, as well. If a user passes anything that isn't a set, we warn them and request that they use thesubset()
method after loading the genotypes. This approach ended up being a bit cleaner since the GenotypesPLINK classes actually need it to be a set, anyway.