RegExp.toAutomaton high memory use [LUCENE-6046]

asfimport commented 9 years ago

When creating an automaton from an org.apache.lucene.util.automaton.RegExp, it's possible for the automaton to use so much memory it exceeds the maximum array size for java.

The following caused an OutOfMemoryError with a 32gb heap:

new RegExp("\\[\\[(Datei|File|Bild|Image):[^]]*alt=[^]|}]{50,200}").toAutomaton();

When increased to a 60gb heap, the following exception is thrown:

  1> java.lang.IllegalArgumentException: requested array size 2147483624 exceeds maximum array in java (2147483623)
  1>     __randomizedtesting.SeedInfo.seed([7BE81EF678615C32:95C8057A4ABA5B52]:0)
  1>     org.apache.lucene.util.ArrayUtil.oversize(ArrayUtil.java:168)
  1>     org.apache.lucene.util.ArrayUtil.grow(ArrayUtil.java:295)
  1>     org.apache.lucene.util.automaton.Automaton$Builder.addTransition(Automaton.java:639)
  1>     org.apache.lucene.util.automaton.Operations.determinize(Operations.java:741)
  1>     org.apache.lucene.util.automaton.MinimizationOperations.minimizeHopcroft(MinimizationOperations.java:62)
  1>     org.apache.lucene.util.automaton.MinimizationOperations.minimize(MinimizationOperations.java:51)
  1>     org.apache.lucene.util.automaton.RegExp.toAutomaton(RegExp.java:477)
  1>     org.apache.lucene.util.automaton.RegExp.toAutomaton(RegExp.java:426)

Migrated from LUCENE-6046 by Lee Hinman (@dakrone), resolved Nov 04 2014 Attachments: LUCENE-6046.patch (versions: 3)

asfimport commented 9 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Hmm, two bugs here.

First off, RegExp.toAutomaton is an inherently costly method: wasteful of RAM and CPU, doing minimize after each recursive operation, in order to build a DFA in the end. It's unfortunately quite easy to concoct regular expressions that make it consume ridiculous resources. I'll look at this example and see if we can improve it, but in the end it will always have its "adversarial cases" unless we give up on making the resulting automaton deterministic, which would be a very big change.

Maybe we should add adversary defenses to it, e.g. you set a limit on the number of states it's allowed to create, and it throws a RegExpTooHardException if it would exceed that?

Second off, ArrayUtil.oversize has the wrong (too large) value for MAX_ARRAY_LENGTH, which is a bug from #6906. Which JVM did you run this test on?

asfimport commented 9 years ago

Dawid Weiss (@dweiss) (migrated from JIRA)

Just a note – Russ Cox wrote a series of excellent articles about different approaches of implementing regexp scanners. http://swtch.com/\~rsc/regexp/regexp1.html

(There is no clear winner – both DFAs and NFA have advantages.)

asfimport commented 9 years ago

Lee Hinman (@dakrone) (migrated from JIRA)

@mikemccand I ran it with the following JVM:

java version "1.8.0_20"
Java(TM) SE Runtime Environment (build 1.8.0_20-b26)
Java HotSpot(TM) 64-Bit Server VM (build 25.20-b23, mixed mode)

asfimport commented 9 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Russ Cox wrote a series of excellent articles about different approaches of implementing regexp scanners.

Thanks Dawid, these are great.

Switching to NFA based matching would be a very large change ... I don't think we should pursue it here. Terms.intersect implementation for block tree is already very complex ... though I suppose of we could hide the "on the fly subset construction" (and convert regexp to a Thompson NFA) under an API, then Terms.intersect implementation wouldn't have to change much.

Still, there will always be adversarial cases no matter which approach we choose. I think for this issue we should allow passing in a "how much work are you willing to do" to RegExp.toAutomaton, and it throws an exc when it would exceed that.

asfimport commented 9 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Michael McCandless I ran it with the following JVM:

Thanks @dakrone.

I was wrong about the first bug: there is no bug in ArrayUtil.oversize. That exception just means RegExp is trying to create a too-big array ... so just the one bug :)

asfimport commented 9 years ago

Dawid Weiss (@dweiss) (migrated from JIRA)

I didn't mean to imply we should change the regexp implementation! :) This was just a pointer in case somebody wished to understand why regexps can explode. I actually wish there was an NFA-based regexp implementation for Java (with low-memory footprint) because this would make concatenating thousands of regexps (e.g., for pattern detection) much easier.

asfimport commented 9 years ago

Lee Hinman (@dakrone) (migrated from JIRA)

I think for this issue we should allow passing in a "how much work are you willing to do" to RegExp.toAutomaton, and it throws an exc when it would exceed that.

For what it's worth, I think this would be a good solution for us, much better than silently (from the user's perspective) freezing and then hitting an OOME.

asfimport commented 9 years ago

Nik Everett (@nik9000) (migrated from JIRA)

I'm working on a first cut of something that does that. Better regex implementation would be great but the biggest thing to me is being able to limit the amount of work the determinize operation performs. Its such a costly operation that I don't think it should ever be really abstracted from the user. Something like having determinize throw a checked exception when it performs too much work would make you keenly aware whenever you might be straying into exponential territory.

asfimport commented 9 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

OK I boiled down the adversarial regexp to this simpler still-adversarial version: [ac]*a[ac]{50,200}

I suspect this is a legitimate adversary and not a bug in our RegExp/automaton impl, i.e. the number of states in the DFA for this is exponential as a function of the 50/200.

asfimport commented 9 years ago

Nik Everett (@nik9000) (migrated from JIRA)

Oh yeah, its totally running into 2^n territory legitiately here. This is totally something that'd be rejected by a framework to prevent explosive growth during determination.

asfimport commented 9 years ago

Nik Everett (@nik9000) (migrated from JIRA)

First cut at a patch. Adds maxDeterminizedStates to Operations.determinize and pipes it through to tons of places. I think its important never to hide when determinize is called because of how potentially heavy it is. Forcing callers of MinimizationOperations.minimize, Operations.reverse, Operations.minus etc to specify maxDeterminizedStates makes it pretty clear that the automaton might be determinized during those processes.

I added an unchecked exception for when the Automaton can't be determinized within the specified number of state but I'm really tempted to change it to a checked exception to make it super duper obvious when determinization might occur.

asfimport commented 9 years ago

Nik Everett (@nik9000) (migrated from JIRA)

Oh - I'm still running the solr tests against this. I imagine they'll pass as they've been running fine for 30 minutes now but I should throw that out there in case someone gets them to fail with this before I do.

asfimport commented 9 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Patch, tests pass. I added a required "int maxStates" to RegExp.toAutomaton, and it threads this down to determinize, and throws RegExpTooHardExc if determinize would need to exceed that limit.

I didn't make it a checked exc; I had started that way but it percolates up high, e.g. into query parsers, and I think that's too much. The exception message itself should make it quite clear what went wrong at query time.

I also added this as an optional param to RegexpQuery default ctor, and defaulted it to 10000 states, and to QueryParserBase, with the same default.

asfimport commented 9 years ago

Nik Everett (@nik9000) (migrated from JIRA)

Oh no! I wrote a very similar patch! Sorry to duplicate effort there.

I found that 10,000 states wasn't quite enough to handle some of the tests so I went with 1,000,000 as the default. Its pretty darn huge but it does get the job done.

asfimport commented 9 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Woops, sorry, I didn't see you had a patch here! Thank you.

I like your patch: it's good to make all hidden usages of determinize visible. Let's start from your patch and merge anything from mine in? E.g. I think we can collapse minimizeHopcroft into just minimize...

I found that 10,000 states wasn't quite enough to handle some of the tests so I went with 1,000,000 as the default. Its pretty darn huge but it does get the job done.

Whoa, which tests needed 1M max states? I worry about passing a 1M state automaton to minimize...

asfimport commented 9 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

I like the test simplifications, and removing dead code from Operations.determinize.

Can we fix the exc thrown from Regexp to include the offending regular expression, and fix the test to confirm the message contains it? Maybe also add RegExp.toStringTree? I found it very useful while debugging the original regexp :)

I think QueryParserBase should also have a set/get for this option?

asfimport commented 9 years ago

Nik Everett (@nik9000) (migrated from JIRA)

TestDeterminizeLexicon wants to make an automata that accepts 5000 random strings. So 10,000 isn't enough states for it. I'll drop the default limit to 10,000 again and just feed a million to that test case.

asfimport commented 9 years ago

Nik Everett (@nik9000) (migrated from JIRA)

I'll certainly add the regexp string to the exception message. And I'll merge the toStringTree from your patch into mine if you'd like.

Yeah - QueryParserBase should have this option too.

The thing I found most useful for debugging this was to call toDot on the automata before and after normalization. I just looked at it and went, oh, of course you have to do it that way. No wonder the states explode. And then I read https://en.wikipedia.org/wiki/Powerset_construction and remembered it from my rusty CS degree.

asfimport commented 9 years ago

Nik Everett (@nik9000) (migrated from JIRA)

Next version with fixes based on Mike's feedback.

asfimport commented 9 years ago

Nik Everett (@nik9000) (migrated from JIRA)

A couple of updates: This affects version 4.9 as well. Probably all versions. But its impact is likely minor enough to only be worth adding to the 4.10 line.

A found a few test cases that need lots and lots of states. Any time you feed a couple hundred random unicode words to the automata you'll end up needing more than ten thousand states. I've updated those tests to ask for a million states and they caught a few places where I hadn't been as diligent in piping maxDeterminizedStates through.

asfimport commented 9 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Thanks Nik, your patch looks great; I'll fold in some more minor things from my patch and commit!

asfimport commented 9 years ago

ASF subversion and git services (migrated from JIRA)

Commit 1636716 from @mikemccand in branch 'dev/trunk' https://svn.apache.org/r1636716

LUCENE-6046: add maxDeterminizedStates to determinize to prevent exhausting CPU/RAM when the automaton is too difficult to determinize

asfimport commented 9 years ago

ASF subversion and git services (migrated from JIRA)

Commit 1636728 from @mikemccand in branch 'dev/branches/branch_5x' https://svn.apache.org/r1636728

LUCENE-6046: add maxDeterminizedStates to determinize to prevent exhausting CPU/RAM when the automaton is too difficult to determinize

asfimport commented 9 years ago

ASF subversion and git services (migrated from JIRA)

Commit 1636758 from @mikemccand in branch 'dev/trunk' https://svn.apache.org/r1636758

LUCENE-6046: fix test failure, add maxDeterminizedStates to AutomatonQuery and WildcardQuery too

asfimport commented 9 years ago

ASF subversion and git services (migrated from JIRA)

Commit 1636759 from @mikemccand in branch 'dev/branches/branch_5x' https://svn.apache.org/r1636759

LUCENE-6046: fix test failure, add maxDeterminizedStates to AutomatonQuery and WildcardQuery too

asfimport commented 9 years ago

ASF subversion and git services (migrated from JIRA)

Commit 1636762 from @mikemccand in branch 'dev/branches/lucene_solr_4_10' https://svn.apache.org/r1636762

LUCENE-6046: add maxDeterminizedStates to determinize to prevent exhausting CPU/RAM when the automaton is too difficult to determinize

asfimport commented 9 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Thanks Nik!

asfimport commented 9 years ago

ASF subversion and git services (migrated from JIRA)

Commit 1636830 from @mikemccand in branch 'dev/trunk' https://svn.apache.org/r1636830

LUCENE-6046: let this test determinize massive automata

asfimport commented 9 years ago

ASF subversion and git services (migrated from JIRA)

Commit 1636831 from @mikemccand in branch 'dev/branches/branch_5x' https://svn.apache.org/r1636831

LUCENE-6046: let this test determinize massive automata

asfimport commented 9 years ago

ASF subversion and git services (migrated from JIRA)

Commit 1636832 from @mikemccand in branch 'dev/branches/lucene_solr_4_10' https://svn.apache.org/r1636832

LUCENE-6046: let this test determinize massive automata

asfimport commented 9 years ago

ASF subversion and git services (migrated from JIRA)

Commit 1637054 from @rmuir in branch 'dev/trunk' https://svn.apache.org/r1637054

LUCENE-6046: let this test determinize massive automata

asfimport commented 9 years ago

ASF subversion and git services (migrated from JIRA)

Commit 1637055 from @rmuir in branch 'dev/branches/branch_5x' https://svn.apache.org/r1637055

LUCENE-6046: let this test determinize massive automata

asfimport commented 9 years ago

ASF subversion and git services (migrated from JIRA)

Commit 1637056 from @rmuir in branch 'dev/branches/lucene_solr_4_10' https://svn.apache.org/r1637056

LUCENE-6046: let this test determinize massive automata

asfimport commented 9 years ago

ASF subversion and git services (migrated from JIRA)

Commit 1637078 from @mikemccand in branch 'dev/trunk' https://svn.apache.org/r1637078

LUCENE-6046: remove det state limit for all AutomatonTestUtil.randomAutomaton since they can become biggish

asfimport commented 9 years ago

ASF subversion and git services (migrated from JIRA)

Commit 1637080 from @mikemccand in branch 'dev/branches/branch_5x' https://svn.apache.org/r1637080

LUCENE-6046: remove det state limit for all AutomatonTestUtil.randomAutomaton since they can become biggish

asfimport commented 9 years ago

ASF subversion and git services (migrated from JIRA)

Commit 1637082 from @mikemccand in branch 'dev/branches/lucene_solr_4_10' https://svn.apache.org/r1637082

LUCENE-6046: remove det state limit for all AutomatonTestUtil.randomAutomaton since they can become biggish

asfimport commented 9 years ago

Anshum Gupta (@anshumg) (migrated from JIRA)

Bulk close after 5.0 release.

apache / lucene

RegExp.toAutomaton high memory use [LUCENE-6046] #7108