biocore / biom-format

The Biological Observation Matrix (BIOM) Format Project
http://biom-format.org
Other
89 stars 95 forks source link

Faster merge #933

Closed wasade closed 1 year ago

wasade commented 1 year ago

Table._fast_merge was performing poorly on large tables. Here we revise the algorithm used.

List based, on large data, we get:

        User time (seconds): 2990.23
        Elapsed (wall clock) time (h:mm:ss or m:ss): 51:55.52
        Maximum resident set size (kbytes): 52802340
wasade commented 1 year ago

Version 2, precomputing nnz:

        User time (seconds): 4814.36
        Elapsed (wall clock) time (h:mm:ss or m:ss): 1:22:09
        Maximum resident set size (kbytes): 67324388
wasade commented 1 year ago

third version

        User time (seconds): 327.31
        Elapsed (wall clock) time (h:mm:ss or m:ss): 6:20.64
        Maximum resident set size (kbytes): 41402608
wasade commented 1 year ago

...output is consistent with the existing method on large data

In [1]: import biom

In [2]: exp = biom.load_table('/qmounts/qiita_data/BIOM/174098/57585_analysis_Metagenomic_Woltkav014DatabasescratchqpwoltkaWoLr2WoLr2
   ...: BIOMpergenebiom.biom')

In [3]: obs = biom.load_table('result.biom')

In [4]: exp
Out[4]: 3576008 x 1616 <class 'biom.table.Table'> with 454367286 nonzero entries (7% dense)

In [5]: obs
Out[5]: 3576008 x 1616 <class 'biom.table.Table'> with 454367286 nonzero entries (7% dense)

In [6]: exp.matrix_data.data[:10]
Out[6]: array([ 3.,  1.,  3.,  3.,  5.,  1., 33.,  7.,  2.,  1.])

In [7]: obs.matrix_data.data[:10]
Out[7]: array([ 3.,  1.,  3.,  3.,  5.,  1., 33.,  7.,  2.,  1.])

In [8]: import numpy as np

In [9]: np.allclose(exp.matrix_data.data, obs.matrix_data.data)
Out[9]: True

In [10]: np.allclose(exp.matrix_data.indptr, obs.matrix_data.indptr)
Out[10]: True

In [11]: np.allclose(exp.matrix_data.indices, obs.matrix_data.indices)
Out[11]: True
wasade commented 1 year ago

cc @ahdilmore @antgonza

antgonza commented 1 year ago

Really nice! Thank you for working on this. Looks good to me but just to confirm, and make sure I'm reading this correctly: if the user has n tables to be merged to a another table, the algorithm will perform much faster if others has all the tables to be merged vs. getting the another table and merging one by one; correct?

wasade commented 1 year ago

That is true here, and with the original implementation as well