Closed cmwilhelm closed 1 year ago
fyi also @geli-gel
@geli-gel your changes ended having had no effect on performance. That was a false lead based on error in my setup in the spp-client project that we discussed yesterday. As to whether reducing time by 50% is enough... well. It may not be, but it's the best I think I can do. :(
This is an optimization PR, and like all optimizations, it makes the code more confusing, I think. My apologies.
MergeSpans has a quadratic component that bites us very hard in certain circumstances, like automatically converting BoxGroups to SpanGroups. It is also a critical dependency in @comorado 's new tables/figures predictor.
This changeset replaces the nested for-loops in python with a series of operations against numpy matrices to the same end effect. Using
tt profile
and a dataset built from SPP API, I measured a >50% reduction in execution time.Note: it is possible to get this down to n*lgn for the special case of
tokens
, since we have the ncls indices ondoc
. However, this yielded only marginal benefits for the worstn
s we're seeing, and performed slightly worse elsewhere.