Closed aryarm closed 1 year ago
Once this type of consideration is important, you shouldn't only be working in Python.
The underlying data representation is based on packed 2-bit values, so it's actually easier to write a function to return packed 2-bit values than to return int8_ts, and most code operating on these packed values is more naturally written in C/C++/Cython than Python.
I think this would be useful even though Python is suboptimal for this kind of data. Specifically, Python is great for prototyping methods that might eventually be implemented in a compiled language, and it would be really handy to be able to read in phased haplotype data for such use cases while keeping the memory requirements low.
@chrchang, what if Pgenlib had a method to return packed bit arrays directly when all of the requested genotypes are biallelic? Pgenlib could load them into an array that could be operated on later using numpy's elementwise bit operations, or, if anyone wanted to unpack them to int8, they could just use np.unpackbits.

Perhaps this is a bit too specific to our use case, though? In our situation, I was hoping to load a packed bit array into Python, perform some indexing and bitwise operations using numpy, and then write the resulting packed bit array to a new pgen file.

I'd be happy to fork and try to make a PR for that, if you would be willing to point me in the right direction.
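As a rough sketch of what the numpy side of that workflow could look like (the packed array here is fabricated from an unpacked 0/1 allele matrix, standing in for the hypothetical packed Pgenlib output):

```python
import numpy as np

# Fabricated phased biallelic haplotypes: one row per haplotype, 0/1 alleles.
alleles = np.array([[0, 1, 1, 0, 1, 0, 0, 1],
                    [1, 1, 0, 0, 1, 1, 0, 0]], dtype=np.uint8)

# Pack to 1 bit per allele (8 alleles per byte).
packed_haps = np.packbits(alleles, axis=1)

# Elementwise bitwise ops work directly on the packed representation,
# e.g. sites where the two haplotypes differ:
diff = np.bitwise_xor(packed_haps[0], packed_haps[1])

# And np.unpackbits recovers the 0/1 alleles when needed:
unpacked = np.unpackbits(packed_haps, axis=1)
assert np.array_equal(unpacked, alleles)
```

Note this is 1 bit per allele, which only covers the phased biallelic case; the on-disk pgen representation uses 2 bits per genotype, so the actual bit layout would differ.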
> I think this would be useful even though Python is suboptimal for this kind of data. Specifically, Python is great for prototyping methods that might eventually be implemented in a compiled language, and it would be really handy to be able to read in phased haplotype data for such use cases while keeping the memory requirements low.
You can already keep memory requirements low by reading a single variant at a time and converting to int8_t (or even a packed bitarray) before reading the next variant.
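A sketch of that variant-at-a-time pattern, with a hypothetical `read_variant` function standing in for the pgenlib per-variant read call (here it just fills the buffer with fabricated 0/1 allele codes):

```python
import numpy as np

N_VARIANTS, N_HAPS = 3, 16
rng = np.random.default_rng(0)

def read_variant(idx, out):
    # Hypothetical stand-in for a pgenlib read: fills the caller's buffer.
    out[:] = rng.integers(0, 2, size=out.shape)

buf = np.empty(N_HAPS, dtype=np.int32)  # one reused int32 read buffer
packed_rows = []
for v in range(N_VARIANTS):
    read_variant(v, buf)
    # Downcast and pack before the next read; only the packed row is kept.
    packed_rows.append(np.packbits(buf.astype(np.uint8)))
packed = np.vstack(packed_rows)  # shape (N_VARIANTS, N_HAPS // 8)
```

Peak memory is then one int32 row plus the packed matrix, rather than the full int32 genotype matrix.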
> You can already keep memory requirements low by reading a single variant at a time and converting to int8_t (or even a packed bitarray) before reading the next variant.
True, though it will be slow to implement in pure Python. But it sounds like this is out of scope for pgenlib? Totally fair if so
This is out of scope for Python pgenlib, yes, because Python is an awful tool for the job anyway if this amount of additional overhead actually matters.
The most accessible language that allows for reasonably-fast code here seems to be Go. I don't see much academic use, but if that starts to change I'm open to building out a Go pgenlib interface.
> Python is an awful tool
Awful convenient!
But how about reading the packed bitarrays in directly, as @aryarm suggested?
The point is that building this directly into Python pgenlib barely even makes a difference. You can already save memory by loading one variant at a time and then packing the output, and this doesn't take much additional code. That additional code is slow, but not as slow as any main loop that would actually operate on the data. This goes double if you attempt any bit-packing.
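For a sense of how little additional code the packing step takes, here is a numpy sketch of packing four 2-bit genotype values per byte. This is illustrative only, with the first genotype in the low-order bits; it does not claim to reproduce the exact pgen bit layout:

```python
import numpy as np

geno = np.array([0, 1, 2, 3, 3, 2, 1, 0], dtype=np.uint8)  # length % 4 == 0
g = geno.reshape(-1, 4)
# Four 2-bit values per byte, first value in the low-order bits.
packed = (g[:, 0] | (g[:, 1] << 2) | (g[:, 2] << 4) | (g[:, 3] << 6)).astype(np.uint8)

# Unpacking is the mirror image:
unpacked = np.stack([(packed >> s) & 0b11 for s in (0, 2, 4, 6)], axis=1).reshape(-1)
assert np.array_equal(unpacked, geno)
```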
In the read_alleles section of the python_api for Pgenlib it says: https://github.com/chrchang/plink-ng/blob/8203270cfc0769264bb884387c33b2eec6b77ba9/2.0/Python/python_api.txt#L85-L87

Our team is working with large datasets like the UK Biobank, and it would be very helpful to load alleles as int8_t instead of int32_t. Is that also outside the scope of the API? Can you provide any guidance on how to adapt pgenlib to support this?
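In the meantime, one workaround is to read into the documented int32 buffer and downcast immediately, so only the int8 copy is kept per variant. A minimal sketch, where `allele_int32` stands in for the buffer filled by read_alleles and the -9 missing code is a hypothetical placeholder:

```python
import numpy as np

allele_int32 = np.array([0, 1, 1, 0, -9, 1], dtype=np.int32)  # -9: hypothetical missing code

# Sanity-check that every value fits in int8 before downcasting.
assert allele_int32.min() >= np.iinfo(np.int8).min
assert allele_int32.max() <= np.iinfo(np.int8).max
allele_int8 = allele_int32.astype(np.int8)  # 4x smaller per element
```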