apache / lucene

Apache Lucene open-source search software
https://lucene.apache.org/
Apache License 2.0

Stop sorting determinize powersets unnecessarily [LUCENE-9983] #11022

Closed asfimport closed 1 year ago

asfimport commented 3 years ago

Spinoff from #11020.

Today, our Operations.determinize implementation builds powersets of all subsets of NFA states that "belong" in the same determinized state, using this algorithm.

To hold each powerset, we use a malleable SortedIntSet and periodically freeze it to a FrozenIntSet, also sorted.  We pay a high price to keep these growing maps of int key, int value sorted by key, e.g. upgrading to a TreeMap once the map is large enough (> 30 entries).

But I think sorting is entirely unnecessary here!  Really all we need is the ability to add/delete keys from the map, and hashCode / equals (by key only – ignoring value!), and to freeze the map (a small optimization that we could skip initially).  We only use these maps to lookup in the (growing) determinized automaton whether this powerset has already been seen.

Maybe we could simply poach the IntIntScatterMap implementation from HPPC?  And then change its hashCode/equals to only use keys (not values).

This change should be a big speedup for the kinds of (admittedly adversarial) regexps we saw on #11020.
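
To illustrate the idea, here is a minimal hypothetical sketch (not HPPC's actual IntIntScatterMap, and omitting deletion, which a real version would need) of an open-addressing int-to-int map whose hashCode/equals consider only the key set. The per-key hashes are summed, so the result is order-independent and no sorting is required:

```java
// Hypothetical sketch: open-addressing int->int map whose hashCode/equals
// look only at the key set, ignoring values and insertion order.
final class KeyOnlyIntIntMap {
    private int[] keys = new int[16];        // capacity is a power of two
    private int[] values = new int[16];
    private boolean[] used = new boolean[16];
    private int size;

    static int mix(int k) {                  // murmur3-style finalizer
        k ^= k >>> 16; k *= 0x85ebca6b;
        k ^= k >>> 13; k *= 0xc2b2ae35;
        return k ^ (k >>> 16);
    }

    private int slot(int key) {              // linear probing
        int mask = keys.length - 1;
        int i = mix(key) & mask;
        while (used[i] && keys[i] != key) i = (i + 1) & mask;
        return i;
    }

    public void put(int key, int value) {
        if (size * 2 >= keys.length) grow();
        int i = slot(key);
        if (!used[i]) { used[i] = true; keys[i] = key; size++; }
        values[i] = value;
    }

    public int get(int key, int missing) {
        int i = slot(key);
        return used[i] ? values[i] : missing;
    }

    public boolean containsKey(int key) { return used[slot(key)]; }

    public int size() { return size; }

    private void grow() {
        int[] ok = keys, ov = values; boolean[] ou = used;
        keys = new int[ok.length * 2];
        values = new int[ov.length * 2];
        used = new boolean[ou.length * 2];
        for (int i = 0; i < ok.length; i++) {
            if (ou[i]) {
                int j = slot(ok[i]);
                used[j] = true; keys[j] = ok[i]; values[j] = ov[i];
            }
        }
    }

    @Override public int hashCode() {        // sum of mixed keys: order-independent
        int h = 0;
        for (int i = 0; i < keys.length; i++) if (used[i]) h += mix(keys[i]);
        return h;
    }

    @Override public boolean equals(Object o) {
        if (!(o instanceof KeyOnlyIntIntMap)) return false;
        KeyOnlyIntIntMap m = (KeyOnlyIntIntMap) o;
        if (m.size != size) return false;
        for (int i = 0; i < keys.length; i++)
            if (used[i] && !m.containsKey(keys[i])) return false;
        return true;
    }
}
```

Two maps holding the same keys then compare equal and hash identically regardless of the values stored or the order the keys were added, which is exactly the property the powerset lookup needs.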


Migrated from LUCENE-9983 by Michael McCandless (@mikemccand), updated Sep 30 2021 Pull requests: https://github.com/apache/lucene/pull/162, https://github.com/apache/lucene/pull/163

asfimport commented 3 years ago

Patrick Zhai (@zhaih) (migrated from JIRA)

Hi Mike,

So if I understand correctly, what we really need is a map from key (which is a state) to its count, removing the state when its count drops to 0 while iterating the intervals? And freeze seems necessary since we want to make a snapshot of the key set to use as a hash key?

I'm thinking about using an IntIntHashMap along with a FixedBitSet, so that we keep the count using the map and use the snapshot of the bitset as hash key. What do you think?

asfimport commented 3 years ago

Patrick Zhai (@zhaih) (migrated from JIRA)

Oh, I realized we're still gonna iterate over those frozen sets here, so maybe a bitset is not a good choice? What about just iterating over the keys and creating a FrozenIntSet based on that? Since we're gonna copy those keys anyway, it should only add a little overhead compared to the current implementation, while getting the benefit of a lightweight, sort-free data structure?

asfimport commented 3 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Hi @zhaih, thank you for jumping on this :)

[Disclaimer: @zhaih and I both work at Amazon, on customer facing product search, now 100% migrated to Apache Lucene!]

For the record, this code is attempting to implement the Powerset Construction for NFA -> DFA.

I think it's really three different data structures we need.

1. (currently SortedIntSet) Map<int,int> where the key is a state and the value is its "reference count". We increment/decrement by state here, removing the entry from the map when its ref count drops to 0. This thing is sort of a factory for 2 below: we periodically call freeze on this map to make a copy of just its keys, a set of ints, producing an instance of 2 below.

2. (currently FrozenIntSet) Set<int> to hold a single powerset – this can maybe just be a simple int[] or maybe packed form or something.

3. (currently HashMap<IntSet,Integer>) Map<Set<int>,int> – the keys in this map are the Set<int> from 2 and the values are int state numbers for the eventual (output) determinized automaton.

For 3 we need hash/equals on 2, and that is why we sort 1's keys today. But that sorting is costly, and I think a big part of the cost for regexps like the one in #11020, and likely even for non-adversarial regexps too.

If we stop the sorting, and use something like HPPC's IntHashSet (except, since this is Lucene's core, we cannot take a hard dependency, so we would need to poach/fork/specialize those sources), I think we can have O(N) hashCode() and equals(), the same big-oh cost those operations have today? And remove the O(N * log(N)) we are paying for 1 today.
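
For readers following along, the powerset construction being discussed can be sketched with plain boxed Java collections (this is an illustrative toy, not Lucene's Operations.determinize, which uses specialized int-keyed structures). It shows where the three structures above appear: a mutable working set (1), a frozen snapshot (2), and the snapshot-to-DFA-state map (3) whose lookups need hash/equals on the set:

```java
import java.util.*;

// Toy subset (powerset) construction: transitions[state][symbol] lists the
// next NFA states. Returns how many DFA states the construction produces.
final class SubsetConstruction {
    static int countDfaStates(int[][][] transitions, int numSymbols, int start) {
        Map<Set<Integer>, Integer> seen = new HashMap<>();   // structure 3
        Deque<Set<Integer>> work = new ArrayDeque<>();
        Set<Integer> startSet = Set.of(start);               // structure 2
        seen.put(startSet, 0);
        work.add(startSet);
        while (!work.isEmpty()) {
            Set<Integer> current = work.remove();
            for (int sym = 0; sym < numSymbols; sym++) {
                Set<Integer> next = new HashSet<>();         // structure 1 (mutable)
                for (int s : current)
                    for (int t : transitions[s][sym]) next.add(t);
                if (next.isEmpty()) continue;
                Set<Integer> frozen = Set.copyOf(next);      // "freeze" the powerset
                if (!seen.containsKey(frozen)) {             // hash/equals on the set
                    seen.put(frozen, seen.size());
                    work.add(frozen);
                }
            }
        }
        return seen.size();
    }
}
```

For an NFA over {a, b} accepting strings ending in "ab" (states 0, 1, 2, with 0 nondeterministically looping), the construction visits the subsets {0}, {0,1}, {0,2}, i.e. three DFA states. Every `seen.containsKey(frozen)` call is where sorted keys are paid for today and where key-only hashing would help.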

asfimport commented 3 years ago

Dawid Weiss (@dweiss) (migrated from JIRA)

I generally agree, but I'd like to confirm that it's actually the sorting that's costly here before trying to optimize. The algorithmic complexity is one thing, but if these sets are short (and they will be, right?) then it's a small constant. What you gain is that comparisons are fast later, on collisions. An alternative representation (sets) will require more memory, and comparing sets will be slower than comparing sorted arrays. It's all a tradeoff. What I'd do first is try to figure out whether the sorting is indeed the key problem here. If it is, switch to sets and see if it makes a difference. Then, as a final optimization, change the hash code of those sets to a long and distribute the hash better to limit the number of collisions - this will reduce the number of required hash set comparisons to a minimum.
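
The final optimization Dawid mentions could look roughly like this (a hypothetical sketch with made-up names, not code from the PR): cache a well-distributed 64-bit hash on each frozen set, and only fall back to element-wise comparison when the full 64-bit hashes collide.

```java
// Sketch: frozen int set with a precomputed, order-independent 64-bit hash.
// Element-wise comparison only runs on the rare 64-bit hash collision.
final class HashedIntSet {
    final int[] states;      // the frozen key set (stored unsorted)
    final long hash64;       // cached 64-bit hash

    HashedIntSet(int[] states) {
        this.states = states.clone();
        long h = 0;
        for (int s : states) h += mix64(s);  // sum keeps it order-independent
        this.hash64 = h;
    }

    static long mix64(long k) {              // splitmix64-style finalizer
        k = (k ^ (k >>> 30)) * 0xbf58476d1ce4e5b9L;
        k = (k ^ (k >>> 27)) * 0x94d049bb133111ebL;
        return k ^ (k >>> 31);
    }

    @Override public int hashCode() { return (int) (hash64 ^ (hash64 >>> 32)); }

    @Override public boolean equals(Object o) {
        if (!(o instanceof HashedIntSet)) return false;
        HashedIntSet s = (HashedIntSet) o;
        if (s.hash64 != hash64 || s.states.length != states.length) return false;
        // Rare path: membership test (quadratic, fine for short sets in a sketch).
        outer:
        for (int a : states) {
            for (int b : s.states) if (a == b) continue outer;
            return false;
        }
        return true;
    }
}
```

Since the 64-bit hash almost always differs for unequal sets, most negative equality checks cost a single long comparison.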

asfimport commented 3 years ago

Patrick Zhai (@zhaih) (migrated from JIRA)

Thanks @mikemccand and @dweiss. I've opened a PR based on IntIntHashMap: https://github.com/apache/lucene/pull/163

I've applied the test attached in #11020 to verify this PR helps. Seems it successfully reduces the time needed before throwing the exception from 6 min to 16 sec on my local machine (they both stopped at the same point as well).

I still kept the state array sorted when getting it, so we'll be slower when actually getting the array but way faster at putting/removing keys. I'm not quite sure why the speedup is this big, but my guess is we're doing way more operations, and spending way more time, increasing/decreasing state counts and putting/removing states from the set than introducing new states?

asfimport commented 3 years ago

Patrick Zhai (@zhaih) (migrated from JIRA)

I've added a simple static counter just for the adversarial test, and here's the stats:

So it seems to me my guess above holds: we're doing way more put/remove entry operations than anything else.

asfimport commented 3 years ago

Dawid Weiss (@dweiss) (migrated from JIRA)

from 6 min to 16 sec on my local machine

Ouch. That's what I call a nice improvement... I guess you know where to focus the attention now then.

asfimport commented 3 years ago

Bruno Roustant (@bruno-roustant) (migrated from JIRA)

How many states are manipulated? If the states are numbered from 0 to N, and we keep most of the states during the computation, or N is not too high, then should we use an array instead of a map, with array[state] being the "reference count"? We wouldn't have to sort the set of states for the equality check because it would already be in array order (skipping states with 0 references).
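
A minimal sketch of this idea (hypothetical names, assuming states are densely numbered from 0 and callers never decrement below zero):

```java
// Reference counts held in a plain array; the "sorted key set" falls out of
// array order for free when freezing.
final class ArrayRefCounts {
    private final int[] counts;
    private int size;                       // states with count > 0

    ArrayRefCounts(int numStates) { counts = new int[numStates]; }

    void incr(int state) { if (counts[state]++ == 0) size++; }
    void decr(int state) { if (--counts[state] == 0) size--; }

    /** Live states, already in sorted order: just scan the array. */
    int[] freeze() {
        int[] out = new int[size];
        int j = 0;
        for (int s = 0; s < counts.length; s++) if (counts[s] > 0) out[j++] = s;
        return out;
    }
}
```

The tradeoff is that freeze scans all N slots even when only a few states are live, which is why this helps only when N stays small.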

asfimport commented 3 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

In the sort of "opposite extreme" case, where someone calls det on an already "happens to be determinized" NFA (I think we already catch if someone tries to det an Automaton that we already previously det'd, and skip it?), I think we would see much more balanced incr/decr versus freeze?

The algorithmic complexity is one thing but if these sets are short (and they will be, right?) then it's a small constant.

Yeah, +1, they will "usually" be very short sets, I think, in the non-adversarial cases.

I think we are badly missing a "representative" set of "real-world" regexps to use as a benchmarking corpus, to make decisions about optimizations like this. I love that this adversarial regexp goes soooo much faster with @zhaih's PR, but I'm worried that it might then make the more normal, real-world, non-adversarial cases slower.

Does anyone know of an existing "corpus" of "real-world" regexps by any chance? ;)  I will open a dedicated issue for this.

asfimport commented 3 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

OK I opened #11025.

asfimport commented 3 years ago

Patrick Zhai (@zhaih) (migrated from JIRA)

+1 to having a set of regexps we can benchmark with; I'm also a little worried the PR might make the normal cases worse.

@bruno-roustant That is a good idea. I've tried using a 128-entry array as a map for the first 128 states and it doesn't help the adversarial cases (I also pulled out some stats and found that in adversarial cases there are actually far more states than that). But I think we might see some benefit in the normal cases once we have the benchmark set up.

asfimport commented 3 years ago

Patrick Zhai (@zhaih) (migrated from JIRA)

I just realized that we already have several tasks that compare the performance of regexp queries, such as here.

So I've done some benchmarking comparing the PR, as well as another commit based on the PR but with an additional 128-entry int array to make access to the counts of the first 128 states faster. The results showed that neither candidate has much QPS difference (within 1%) compared to the baseline on the "Wildcard" and "Prefix3" tasks.

If the benchmark results are reliable (meaning I didn't mess up the configuration etc.), I think the new PR won't affect the normal case much, and the additional optimization doesn't seem to have a visible benefit. So I think it might be better to start with just the IntIntHashMap to keep things simpler? I'll update the PR accordingly.

asfimport commented 3 years ago

Bruno Roustant (@bruno-roustant) (migrated from JIRA)

@zhaih do you have some stats about the numbers of NFA / DFA states manipulated?

asfimport commented 3 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

@zhaih, also realize that the actual execution of the RegexpQuery won't change with the optimizations we are trying here, i.e. the resulting determinized automaton is the same. So I think we really need a newish benchmark that just measures the cost of parsing/determinizing the regexp, e.g. simply calling new RegexpQuery("xxx").
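
Such a benchmark could be as simple as the harness below (a sketch; it uses java.util.regex.Pattern.compile as a stdlib stand-in for the construction cost, where luceneutil would call new RegexpQuery(...) instead):

```java
import java.util.function.Supplier;

// Minimal construction-cost harness: time only building/compiling the
// regexp, never executing it against any input.
final class ConstructionBench {
    static double avgNanos(Supplier<?> task, int warmup, int measured) {
        for (int i = 0; i < warmup; i++) task.get();     // let the JIT settle
        long start = System.nanoTime();
        for (int i = 0; i < measured; i++) task.get();
        return (System.nanoTime() - start) / (double) measured;
    }
}
```

A real benchmark would want many iterations and a proper harness (e.g. JMH) to avoid dead-code elimination and warmup artifacts, but the shape is the same: the timed region contains only construction.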

asfimport commented 3 years ago

Patrick Zhai (@zhaih) (migrated from JIRA)

@bruno-roustant in the adversarial test case, I added 3 static counters measuring the avg and max set sizes we've seen, and the result is we're seeing 1800+ states on average and 24000 states at most. I recorded the set size each time we call size() (basically every iteration) to calculate the average, so it might not be very accurate.

@mikemccand ah thanks my bad, I didn't realize determinize is called at construction time. I'll benchmark that.

asfimport commented 3 years ago

Patrick Zhai (@zhaih) (migrated from JIRA)

I constructed a file of 235k words, each with some part randomly replaced by a regex (like "apple" to "a[pl]*e").

Then I warmed up for 10 rounds and ran 20 rounds to measure the average time of constructing a RegexpQuery for those words. Here are the results I got:

       Baseline   IntIntHashMap   IntIntWormMap   int[128] + IntIntHashMap
Time     23.55        23.61           23.78               23.69

So in the normal case the original code and IntIntHashMap have very similar performance, while the other choices all seem to have some performance loss.
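
The input generation described above could look something like this hypothetical recreation (not the actual luceneutil code): pick a random slice of the word and replace it with a character class plus star, e.g. "apple" to "a[pl]*e".

```java
import java.util.Random;

// Sketch: mutate a word into a regexp by replacing a random inner slice
// with a character class of that slice's distinct characters, starred.
final class RegexCorpus {
    static String mutate(String word, Random rnd) {
        if (word.length() < 3) return word;               // too short to mutate
        int from = 1 + rnd.nextInt(word.length() - 2);    // keep first char
        int to = from + 1 + rnd.nextInt(word.length() - from - 1);
        StringBuilder cls = new StringBuilder();          // distinct chars of slice
        for (char c : word.substring(from, to).toCharArray())
            if (cls.indexOf(String.valueOf(c)) < 0) cls.append(c);
        return word.substring(0, from) + "[" + cls + "]*" + word.substring(to);
    }
}
```

For example, mutate("apple", rnd) can produce "a[pl]*e" when the slice "ppl" is chosen; words shorter than 3 characters are returned unchanged.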

asfimport commented 3 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Whoa, this sounds like a great start at a regexp benchmark!  Could you maybe open a PR to add that initial set of synthetic regexps to luceneutil?

Separately, it's kinda odd that the IntIntWormMap was not faster – I had high hopes for it.  It has such a cool name!

asfimport commented 3 years ago

Patrick Zhai (@zhaih) (migrated from JIRA)

Could you maybe open PR to add that initial set of synthetic regexps into luceneutil?

OK, opened one: https://github.com/mikemccand/luceneutil/pull/130
asfimport commented 3 years ago

ASF subversion and git services (migrated from JIRA)

Commit 48ff29c8f358f4dc4fad48997b8ebfde5d2e5751 in lucene's branch refs/heads/main from Patrick Zhai https://gitbox.apache.org/repos/asf?p=lucene.git;h=48ff29c

LUCENE-9983: Stop sorting determinize powersets unnecessarily (#163)

asfimport commented 3 years ago

ASF subversion and git services (migrated from JIRA)

Commit c6b9dd95c9f4b9ab9bac5904988432bba2ad4bd3 in lucene-solr's branch refs/heads/branch_8x from Patrick Zhai https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=c6b9dd9

Backport #10182 and LUCENE-9983 (#2517)

Co-authored-by: Mike <madrob@users.noreply.github.com>

asfimport commented 3 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

@zhaih I think this can be resolved now?

asfimport commented 3 years ago

Patrick Zhai (@zhaih) (migrated from JIRA)

@mikemccand yes we can close it. But it seems I can't close it, could you close it? thank you!