apache / lucene

Apache Lucene open-source search software
https://lucene.apache.org/
Apache License 2.0
2.67k stars 1.03k forks source link

fix algorithmic worst-case in regeneration of URL tokenizer [LUCENE-9231] #10271

Open asfimport opened 4 years ago

asfimport commented 4 years ago

For the UAX29URLEmailTokenizer, the regeneration task is slow. It also requires a very large amount of heap space (I just increased mine after seeing it struggle under GC).

Maybe we can dig into the worst case and figure out what is happening, it seems to be an automaton issue:

"main" `#1` prio=5 os_prio=0 cpu=132097.25ms elapsed=135.75s tid=0x00007fb1d4018000 nid=0x19706 runnable  [0x00007fb1db3df000]
   java.lang.Thread.State: RUNNABLE
    at jflex.StateSet.add(StateSet.java:218)
    at jflex.NFA.closure(NFA.java:387)
    at jflex.NFA.epsilonFill(NFA.java:410)
    at jflex.NFA.complement(NFA.java:737)
    at jflex.NFA.insertNFA(NFA.java:1029)
    at jflex.NFA.insertNFA(NFA.java:971)
    at jflex.NFA.insertNFA(NFA.java:1029)
    at jflex.NFA.insertNFA(NFA.java:972)
    at jflex.NFA.insertNFA(NFA.java:987)
    at jflex.NFA.insertNFA(NFA.java:988)
    at jflex.NFA.insertNFA(NFA.java:987)
    at jflex.NFA.insertNFA(NFA.java:971)
    at jflex.NFA.insertNFA(NFA.java:1041)
    at jflex.NFA.insertNFA(NFA.java:987)
    at jflex.NFA.insertNFA(NFA.java:971)
    at jflex.NFA.insertNFA(NFA.java:971)
    at jflex.NFA.addRegExp(NFA.java:151)
    at jflex.LexParse$CUP$LexParse$actions.CUP$LexParse$do_action_part00000000(LexParse.java:1401)
    at jflex.LexParse$CUP$LexParse$actions.CUP$LexParse$do_action(LexParse.java:3415)
    at jflex.LexParse.do_action(LexParse.java:939)
    at java_cup.runtime.lr_parser.parse(lr_parser.java:699)
    at jflex.Main.generate(Main.java:73)
    at jflex.anttask.JFlexTask.execute(JFlexTask.java:72)

Stacks seem to be typically in jflex.StateSet.add(StateSet.java:218) and jflex.StateSet.complement(StateSet.java:173) and many operations, but always come from addRegExp .. insertNFA .. complement codepath.

Feels like something has a bad runtime, I wonder if we can fix it (or at least make it better, e.g. check for some GB ram heap minimum, print a warning how long it will take, etc)


Migrated from LUCENE-9231 by Robert Muir (@rmuir), updated Feb 18 2020 Attachments: LUCENE-9231_build_check.patch (versions: 2)

asfimport commented 4 years ago

Robert Muir (@rmuir) (migrated from JIRA)

cc @dweiss @sarowe . I haven't looked at the code or dug in much so far. Only wondering, maybe its a situation where we can sort things first to allow it to run faster (similar to the Daciuk/Mihov builder and FST.Builder in lucene)

asfimport commented 4 years ago

Dawid Weiss (@dweiss) (migrated from JIRA)

I did investigate it a few days ago. See this discussion, Robert.

https://github.com/jflex-de/jflex/issues/715

I haven't had the time to try Gerwin's suggestions yet.

asfimport commented 4 years ago

Robert Muir (@rmuir) (migrated from JIRA)

I'm sorry I missed your investigation. Thanks for the link and already opening the issue.

asfimport commented 4 years ago

Dawid Weiss (@dweiss) (migrated from JIRA)

I think this was buried deep in comments related to porting jflex to gradle... anyway I was also surprised how much memory it required - that's why I looked into it. Modifying the automaton's definition may help; I really hate that we even have to consider it since this file is so understandable now and will be so hairy once you fold-in prefixes into complex regex expressions... Seems to me that the regexp-NFA/ DFA compiler should be smart enough to handle it...

asfimport commented 4 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Yeah the required memory is killing me. At least I'm at the getDFA part. I had no idea how much RAM it needed. I definitely think the easiest win is to test heap size in the build and simply fail if its not gonna work out for you.

asfimport commented 4 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Gradle is already better than ant here:

> Task :lucene:analysis:common:jflexUAX29URLEmailTokenizerImpl
Regenerating UAX29URLEmailTokenizerImpl. This may take a long time (and requires tons of memory).
Regenerating JFlex:
  from: /home/rmuir/workspace/lucene-solr/lucene/analysis/common/src/java/org/apache/lucene/analysis/standard/UAX29URLEmailTokenizerImpl.jflex
    to: /home/rmuir/workspace/lucene-solr/lucene/analysis/common/src/java/org/apache/lucene/analysis/standard/UAX29URLEmailTokenizerImpl.java

I will look into a patch to check for max heap size.

asfimport commented 4 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Attached patch adds a helpful check to the build that fails much faster than waiting for some OOM.

asfimport commented 4 years ago

Robert Muir (@rmuir) (migrated from JIRA)

My main question is, is the current logic forking with 12G JVM? Sorry, I don't see any explicit fork anywhere, so I am kinda confused.

asfimport commented 4 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Yeah, my patch is useless. Gradle is forking here (somehow implicitly in a sneaky way I don't see). Ant is not forking.

So I hit the slow OOM with ant regenerate because that is really where the missing check needs to be, since that task run's in ant's JVM. Gradle behaves in a different way... I just wish I knew where the fork() was being defined :)

asfimport commented 4 years ago

Robert Muir (@rmuir) (migrated from JIRA)

I updated the patch to just be against ant. This way it OOMs fast. The gradle one will already just fork a 12GB jvm, so it doesnt need changes.

asfimport commented 4 years ago

Dawid Weiss (@dweiss) (migrated from JIRA)

I don't think you can run in-process with the gradle daemon. There is always a delegation to a separate isolated worker subprocess; if the JVM arguments differ from the worker's a new subprocess is launched.

asfimport commented 4 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Yes, and if it is a daemon, sometimes leaked permanently. I checked machine that was doing nothing all day: at 1300 it had two leaked gradle daemons sitting there since 0700 doing nothing. it is seriously worse than systemd and pulseaudio combined.

asfimport commented 4 years ago

Robert Muir (@rmuir) (migrated from JIRA)

I opened #10272 to take care of the daemon issue. This feature is not ready to be defaulted to "on", it leaks multiple gigabyte+ processes.