allenai / mmda

multimodal document analysis
Apache License 2.0
159 stars 18 forks source link

Speed up merge spans #233

Closed cmwilhelm closed 1 year ago

cmwilhelm commented 1 year ago

This is an optimization PR, and like all optimizations, it makes the code more confusing, I think. My apologies.

MergeSpans has a quadratic component that bites us very hard in certain circumstances, like automatically converting BoxGroups to SpanGroups. It is also a critical dependency in @comorado 's new tables/figures predictor.

This changeset replaces the nested for-loops in python with a series of operations against numpy matrices to the same end effect. Using tt profile and a dataset built from SPP API, I measured a >50% reduction in execution time.

Note: it is possible to get this down to n*lgn for the special case of tokens, since we have the ncls indices on doc. However, this yielded only marginal benefits for the worst ns we're seeing, and performed slightly worse elsewhere.

cmwilhelm commented 1 year ago

fyi also @geli-gel

cmwilhelm commented 1 year ago

@geli-gel your changes ended having had no effect on performance. That was a false lead based on error in my setup in the spp-client project that we discussed yesterday. As to whether reducing time by 50% is enough... well. It may not be, but it's the best I think I can do. :(