JamesYang007 / adelie

A fast and flexible Python package for solving group lasso and elastic net problems.
https://jamesyang007.github.io/adelie/
MIT License

`mmap` large SNP files #67

Closed by JamesYang007 3 months ago

JamesYang007 commented 4 months ago

Currently, the sparse nature of the large SNP files allows us to load the entire matrix in memory on a cluster. UK Biobank requires about 150GB for unphased calldata and about 250GB for phased calldata with ancestry (using the sparse formats given by `io.snp_unphased` and `io.snp_phased_ancestry`). In practice, no dataset requires more memory than this, so investigating `mmap` is very low priority. Still, it is the most general fallback if we ever run out of memory even with clever representation tricks.
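
For reference, a minimal sketch of the general idea using only Python's standard-library `mmap` module. The file layout here (a dense, column-major matrix with one byte per genotype) and the helpers `write_dense_uint8` and `read_column` are hypothetical and only for illustration; this is not the sparse format used by `io.snp_unphased` or `io.snp_phased_ancestry`, and not how adelie reads its files.

```python
import mmap
import os
import tempfile


def write_dense_uint8(path, columns):
    """Write equal-length columns of small non-negative ints back to back.

    Hypothetical toy layout: column-major, one byte per entry.
    """
    with open(path, "wb") as f:
        for col in columns:
            f.write(bytes(col))


def read_column(path, n_rows, j):
    """Return column j as a list of ints.

    The file is memory-mapped read-only, so the OS pages in only the
    bytes that the slice touches rather than the whole file.
    """
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            raw = mm[j * n_rows:(j + 1) * n_rows]
    return list(raw)


# Toy 4 x 3 genotype matrix, stored column by column.
columns = [[0, 1, 2, 0], [2, 0, 0, 1], [0, 0, 2, 1]]

fd, demo_path = tempfile.mkstemp(suffix=".bin")
os.close(fd)
try:
    write_dense_uint8(demo_path, columns)
    col0 = read_column(demo_path, n_rows=4, j=0)
    col2 = read_column(demo_path, n_rows=4, j=2)
finally:
    os.remove(demo_path)
```

The trade-off is that each access may hit disk, so this would likely be slower than the in-memory sparse representation whenever the data does fit in RAM.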