biocore / biom-format

The Biological Observation Matrix (BIOM) Format Project
http://biom-format.org

Table.collapse performance issue #887

Closed qiyunzhu closed 1 year ago

qiyunzhu commented 1 year ago

I think that the Table.collapse method may not have the best performance in its current form. Woltka calls this method to collapse BIOM tables, and I found it to be quite slow on large tables, especially tables of functional units (ORFs, KOs, UniRefs, etc.), which are common in shotgun metagenomics. For example, collapsing 2,065,892 ORFs by 25 samples (a very small test dataset) took 14m03.86s and 7.45 GB of RAM. Real datasets would be slower still.

I did a simple benchmark using the Moving Pictures dataset, which has 545 ASVs mapped to 197 genera across 34 samples. Using biom-format's collapse took 217 ms ± 838 µs.

res = table.collapse(lambda id_, _: map_[id_], norm=False, axis='observation')

I did some research and found that there may be more efficient solutions, at least in certain scenarios. The most straightforward method may be Pandas' groupby, which is well-optimized and generic. In my benchmark it took 1.37 ms ± 23.8 µs.

res = table.to_dataframe(dense=True).groupby(map_).sum()

I found several articles discussing the underlying mechanisms and alternative implementations of groupby: here, here, and here. Drawing on these articles, I also came up with a pure NumPy solution, which isn't as fast as groupby but close. It took 2.38 ms ± 2.7 µs.

import numpy as np

# Group label for each observation
keys = [map_[x] for x in table.ids('observation')]
# Integer-encode each row's group
uniq, group = np.unique(keys, return_inverse=True)
# Accumulate each dense row into its group's output row
res = np.zeros((uniq.shape[0], table.shape[1]))
np.add.at(res, group, table.matrix_data.toarray())

Both the Pandas and the NumPy solutions share a heavy bottleneck: converting the sparse matrix underlying a BIOM table into a dense one. However, according to these articles, there appears to be a solution utilizing SciPy's sparse matrices directly. I don't fully understand it, but I suspect the aggregation could be done inside BIOM without densifying.
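The sparse approach hinted at in those articles can be sketched with a group-indicator matrix and one sparse product. This is only a toy illustration: the `data` and `keys` below are hypothetical stand-ins for `table.matrix_data` and the `map_` lookup, not actual BIOM API calls.

```python
import numpy as np
from scipy import sparse

# Toy stand-ins: 6 observations x 3 samples, each observation
# mapped to one of 2 groups (mirrors map_ / table.matrix_data above).
data = sparse.random(6, 3, density=0.5, random_state=0, format='csr')
keys = ['g1', 'g2', 'g1', 'g2', 'g1', 'g2']

# Integer-encode groups, then build a sparse (n_groups x n_obs)
# indicator matrix: entry (g, i) is 1 iff observation i is in group g.
uniq, group = np.unique(keys, return_inverse=True)
indicator = sparse.csr_matrix(
    (np.ones(len(group)), (group, np.arange(len(group)))),
    shape=(len(uniq), len(group)),
)

# One sparse-sparse product collapses all rows by group,
# without ever converting the table to a dense array.
collapsed = indicator @ data  # (n_groups x n_samples), still sparse
```

The product stays sparse end to end, so memory should scale with the number of nonzeros rather than the full table size.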

Attached is my code for benchmarking. @wasade

collapse.ipynb.zip

mortonjt commented 1 year ago

I can't comment on the general purpose of the collapse function, since it seems to come with quite a few bells and whistles. But if you just need a groupby function, you can replace it with a sparse matrix multiplication (https://stackoverflow.com/a/39651444), which should scale well with sparse data. It doesn't look like that is going on in the collapse code, but I would be curious whether it could improve the runtime.

wasade commented 1 year ago

Collapse is a very generalized function, and it is very likely that special cases can be optimized well. We also have not invested much in optimizing this method.

Jamie and/or Qiyun, would either of you be willing to issue a PR? If not, then this will likely be pushed off for a while unless the performance concern is critical for an application. Is it? The largest application I'm aware of is Qiita, but this is not its primary bottleneck (i.e., less bang for the buck to focus on).


mortonjt commented 1 year ago

I can do it sometime in Mar / Apr, since I've done this before (see here: https://github.com/mortonjt/ancomP).


wasade commented 1 year ago

This issue conflates one-to-many and one-to-one modes. The former underlies the original report about the Woltka collapse; that was a bug and is now fixed in #888.

For one-to-one, while it is true that a faster implementation is possible, the data presented here (as well as out-of-band) do not indicate that the current one-to-one performance is a meaningful bottleneck. Until there are data showing that it is, we will not be pursuing further optimization beyond what was already done in #866 and #761.