biocore / biom-format

The Biological Observation Matrix (BIOM) Format Project
http://biom-format.org

Use rng.choice without unpacked in subsample_without_replacement with 64-bit support #935

Closed sfiligoi closed 1 year ago

sfiligoi commented 1 year ago

Drastically reduces the memory needs when sums are large. Also allows sums to be >2^31.

sfiligoi commented 1 year ago

On an EMP-style BIOM table, the new without_replacement algorithm is about 2x faster (n=1000, on an EPYC 7302 CPU): 33s vs 58s

There is also a small reduction in memory consumption, but it is barely noticeable compared to the rest of the memory use when using the Table object.

For the record, the biom used for testing was mp.90.min25.deblur.withtax.onlytree_ACTUAL_overlap.biom

sfiligoi commented 1 year ago

For BIOM tables with very large per-column sums, it is an enabler; the old code would outright fail if a column sum was > 2^31 (e.g. rna_integrity_metaG_woltka_wolr2_with_plasmids_per_gene_clean_gene_counts_per_g_stool.biom).

(But just lifting that limit in the old code would not have helped: a test showed it would have needed over 15.1 TiB of RAM.)

The new code has a 2^63 limit, and the memory use is proportional to n, only.
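For readers who want the idea in runnable form, here is a minimal NumPy sketch of subsampling one column without unpacking it. It is an illustration under my own assumptions, not the code from this PR: the function name `subsample_counts` and the cumulative-sum lookup are hypothetical choices. It draws n distinct positions in [0, sum) with `rng.choice` and maps each position back to its feature, so no array with one entry per counted item is built explicitly.

```python
import numpy as np


def subsample_counts(counts, n, rng):
    """Draw n items without replacement from a vector of counts.

    Hypothetical helper for illustration only; not the biom-format API.
    """
    counts = np.asarray(counts, dtype=np.int64)
    total = int(counts.sum())
    if n > total:
        raise ValueError("n exceeds the column sum")
    # Pick n distinct positions in [0, total) rather than expanding the
    # counts into one array element per counted item.
    picks = rng.choice(total, size=n, replace=False, shuffle=False)
    # Position p belongs to the first feature whose cumulative count is > p.
    bounds = np.cumsum(counts)
    idx = np.searchsorted(bounds, picks, side="right")
    # Tally how many picks landed in each feature.
    return np.bincount(idx, minlength=counts.size)


rng = np.random.default_rng(0)
print(subsample_counts([5, 0, 10, 3], 7, rng))
```

The result always sums to n and never exceeds the original count of any feature. Because the counts are held as 64-bit integers, the same call also accepts column sums above 2^31.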

The running time is quite fast for small n, but gets slower with high n (on the same CPU as above): n=2000 takes 50s, n=6G takes 20min.

wasade commented 1 year ago

Thanks!!