CAST-genomics / haptools

Ancestry and haplotype aware simulation of genotypes and phenotypes for complex trait analysis
https://haptools.readthedocs.io
MIT License

ref: use binary search in `Breakpoints.population_array()` #123

Closed aryarm closed 2 years ago

aryarm commented 2 years ago

Note: this PR must be merged after #122


Overview

This PR refactors the `Breakpoints` class. Internally, it now uses binary search (via `np.searchsorted()`) when outputting an array suitable for `transform`. I also created a new pair of methods (`encode()` and `recode()`) which can be used within the `Breakpoints` class to encode the population labels as integers or to convert them back to labels. There is also a new `write()` method to help with writing breakpoints files.
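To illustrate the idea, here is a minimal sketch of a binary-search lookup with `np.searchsorted()`. It is not the code from this PR: the segment layout, the names `segments`, `ends`, `pops`, and `positions`, and the dictionary-based label encoding are all assumptions made for the example.

```python
import numpy as np

# Hypothetical population labels and an integer encoding for them
# (only loosely analogous to what encode()/recode() are described as doing)
labels = ("YRI", "CEU")
encoding = {label: code for code, label in enumerate(labels)}

# Hypothetical breakpoints for one strand of one chromosome: each segment
# carries a population label and an inclusive end position, sorted by position
segments = [("YRI", 100), ("CEU", 250), ("YRI", 400)]
ends = np.array([end for _, end in segments])
pops = np.array([encoding[pop] for pop, _ in segments], dtype=np.uint8)

# Variant positions whose ancestry we want to look up
positions = np.array([5, 100, 101, 300, 400])

# side="left" returns, for each position, the first index i with
# ends[i] >= position, i.e. the segment that contains the position.
# A position beyond the last end would yield len(ends) and fail on indexing.
idx = np.searchsorted(ends, positions, side="left")
pop_codes = pops[idx]

# Convert the integer codes back into labels
pop_names = [labels[code] for code in pop_codes]
```

Each lookup costs O(log n) in the number of segments, rather than a linear scan over the breakpoints for every variant.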

In the process of doing all of this, I may have discovered a bug with the old `population_array()` method! I didn't catch it before because my tests were written incorrectly. This PR fixes both the bug and the tests.

In addition, I added a bunch of documentation for the `Breakpoints` class within the API docs for the `data` module.

Future work

We may want to consider storing everything in a single flat numpy array instead of a dictionary of arrays. The current data structure for the `data` property complicates things and makes it harder to broadcast operations. Instead, we could use a single array in which sample and strand are additional fields.
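As a rough sketch of what such a layout might look like, the example below uses a numpy structured array. The field names other than `sample` and `strand`, the dtypes, and the sample IDs are all assumptions for illustration, not the actual `Breakpoints` schema.

```python
import numpy as np

# Hypothetical record layout: one row per breakpoint, with sample and
# strand stored as ordinary fields instead of dictionary keys and list indices
dtype = np.dtype([
    ("sample", "U10"),
    ("strand", np.uint8),
    ("pop", np.uint8),
    ("chrom", "U10"),
    ("bp", np.uint32),
])
flat = np.array(
    [
        ("sampleA", 0, 0, "1", 100),
        ("sampleA", 0, 1, "1", 250),
        ("sampleA", 1, 1, "1", 400),
        ("sampleB", 0, 0, "1", 400),
    ],
    dtype=dtype,
)

# Selecting one strand of one sample becomes a boolean mask
mask = (flat["sample"] == "sampleA") & (flat["strand"] == 0)
subset = flat[mask]

# Operations broadcast across every sample and strand at once
is_pop1 = flat["pop"] == 1
```

The trade-off is that per-sample access turns into a mask (or a sort plus a search) rather than a dictionary lookup, so whether this is a net win would depend on how the array is typically queried.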