chrchang / plink-ng

A comprehensive update to the PLINK association analysis toolset. Beta testing of the first new version (1.90), focused on speed and memory efficiency improvements, is finishing up. Development is now focused on building out support for multiallelic, phased, and dosage data in PLINK 2.0.
https://www.cog-genomics.org/plink/2.0/

[pgenlib] support for `int8_t` in `read_alleles` and `read_alleles_and_phasepresent` #234

Closed aryarm closed 1 year ago

aryarm commented 1 year ago

In the `read_alleles` section of `python_api.txt` for Pgenlib, it says: https://github.com/chrchang/plink-ng/blob/8203270cfc0769264bb884387c33b2eec6b77ba9/2.0/Python/python_api.txt#L85-L87

Our team is working with large datasets like the UK Biobank, and it would be very helpful to load alleles as int8_t instead of int32_t. Is that also outside the scope of the API? Can you provide any guidance on how to adapt pgenlib to support this?

chrchang commented 1 year ago

Once this type of consideration is important, you shouldn't only be working in Python.

The underlying data representation is based on packed 2-bit values, so it's actually easier to write a function to return packed 2-bit values than to return int8_ts, and most code operating on these packed values is more naturally written in C/C++/Cython than Python.
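To make the packed 2-bit representation concrete, here is a minimal numpy sketch of unpacking four 2-bit values from one byte. The byte layout shown (lowest-order bits first, value 3 as a possible missing code) is an assumption for illustration, not taken verbatim from the pgen specification:

```python
import numpy as np

# Hypothetical illustration of a packed 2-bit layout: each byte holds
# four 2-bit values, lowest-order bits first.
packed = np.array([0b11100100], dtype=np.uint8)  # encodes 0, 1, 2, 3

# Unpack by shifting each byte right by 0, 2, 4, 6 bits and masking.
shifts = np.arange(0, 8, 2, dtype=np.uint8)
genotypes = ((packed[:, None] >> shifts) & 0b11).astype(np.int8).ravel()
print(genotypes)  # -> [0 1 2 3]
```

The reverse (packing int8 values back into 2-bit codes) is a similar shift-and-OR, which is the kind of loop that is more naturally written in C/C++/Cython, as noted above.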

rborder commented 1 year ago

I think this would be useful even though Python is suboptimal for this kind of data. Specifically, Python is great for prototyping methods that might eventually be implemented in a compiled language, and it would be really handy to be able to read in phased haplotype data for such use cases while keeping the memory requirements low.

aryarm commented 1 year ago

@chrchang, what if Pgenlib had a method to return packed bit arrays directly when all of the requested genotypes are biallelic? Pgenlib could load them into an array that could be operated upon later using numpy's elementwise bit operations

or, if anyone wanted to unpack them to an int8, they could just use np.unpackbits

perhaps this is a bit too specific to our use case, though?

in our situation, I was hoping to load a packed bit array into python, perform some indexing and bitwise operations using numpy, and then write the resulting packed bit array to a new pgen file
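The proposed workflow can be sketched with plain numpy, using a hand-made packed array in place of an actual Pgenlib read (the one-bit-per-allele framing and the stand-in data are assumptions for illustration):

```python
import numpy as np

# Stand-in for a packed read: one bit per phased allele call.
packed = np.packbits(np.array([1, 0, 1, 1, 0, 0, 1, 0], dtype=np.uint8))

# Unpack to one int8 per allele when individual values are needed...
alleles = np.unpackbits(packed).astype(np.int8)

# ...or operate directly on the packed bytes, e.g. flip every allele.
flipped = np.bitwise_not(packed)  # still packed: 1 byte per 8 alleles
assert np.array_equal(np.unpackbits(flipped), 1 - alleles)

# Repack before writing the result back out to a new file.
repacked = np.packbits(alleles)
assert np.array_equal(repacked, packed)
```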

I'd be happy to fork and try to make a PR for that, if you would be willing to point me in the right direction

chrchang commented 1 year ago

> I think this would be useful even though Python is suboptimal for this kind of data. Specifically, Python is great for prototyping methods that might eventually be implemented in a compiled language, and it would be really handy to be able to read in phased haplotype data for such use cases while keeping the memory requirements low.

You can already keep memory requirements low by reading a single variant at a time and converting to int8_t (or even a packed bitarray) before reading the next variant.
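This suggestion can be sketched as a per-variant loop. The reader below is a placeholder standing in for a Pgenlib read call (its name and behavior are invented for illustration); the point is that only one int32 scratch row is ever resident, while the retained matrix is int8:

```python
import numpy as np

def fake_read(variant_idx, out):
    """Placeholder for a per-variant Pgenlib read into an int32 buffer."""
    out[:] = (variant_idx + np.arange(out.shape[0])) % 3

sample_ct, variant_ct = 5, 3
buf32 = np.empty(sample_ct, dtype=np.int32)        # reused scratch buffer
geno8 = np.empty((variant_ct, sample_ct), dtype=np.int8)
for vidx in range(variant_ct):
    fake_read(vidx, buf32)
    geno8[vidx] = buf32                            # narrowing int32 -> int8 copy
print(geno8.nbytes)  # 15 bytes retained, vs 60 for an int32 matrix
```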

rborder commented 1 year ago

> You can already keep memory requirements low by reading a single variant at a time and converting to int8_t (or even a packed bitarray) before reading the next variant.

True, though it will be slow to implement in Python. But it sounds like this is out of scope for pgenlib? Totally fair if so.

chrchang commented 1 year ago

This is out of scope for Python pgenlib, yes, because Python is an awful tool for the job anyway if this amount of additional overhead actually matters.

The most accessible language that allows for reasonably-fast code here seems to be Go. I don't see much academic use, but if that starts to change I'm open to building out a Go pgenlib interface.

rborder commented 1 year ago

> Python is an awful tool

Awful convenient!

But how about reading the packed bitarrays in directly, as suggested by @aryarm?

chrchang commented 1 year ago

The point is that building this directly into Python pgenlib barely even makes a difference. You can already save memory by loading one variant at a time and then packing the output, and this doesn't take much additional code. That additional code is slow, but not as slow as any main loop that would actually operate on the data. This goes double if you attempt any bit-packing.