Add a PatternTokenizer that uses Lucene's RegExp implementation [LUCENE-7465]

asfimport commented 8 years ago

I think there are some nice benefits to a version of PatternTokenizer that uses Lucene's RegExp impl instead of the JDK's:

Lucene's RegExp is compiled to a DFA up front, so if a "too hard" RegExp is attempted the user discovers it up front instead of later on when a "lucky" document arrives
It processes the incoming characters as a stream, only pulling 128 characters at a time, vs the existing PatternTokenizer which currently reads the entire string up front (this has caused heap problems in the past)
It should be fast.

I named it SimplePatternTokenizer, and it still needs a factory and improved tests, but I think it's otherwise close.

It currently does not take a group parameter because Lucene's RegExps don't yet implement sub group capture. I think we could add that at some point, but it's a bit tricky.

This doesn't even have group=-1 support (like String.split) ... I think if we did that we should maybe name it differently (SimplePatternSplitTokenizer?).

Migrated from LUCENE-7465 by Michael McCandless (@mikemccand), resolved Feb 13 2017 Attachments: LUCENE-7465.patch (versions: 2)

asfimport commented 8 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Patch.

I added the SimplePatternTokenizerFactory as well.

SimplePatternTokenizer takes either a String (parses as a regexp and compiles it) or a DFA (expert user who pre-built their own automaton).

I folded in a nice idea from @rmuir to optimize the ascii code points even when using CharacterRunAutomaton.

It's quite fast, \~46% faster than PatternTokenizer when tokenizing 1 MB chunks from the English Wikipedia export, using a simplistic whitespace regexp \[^ ]+.

And it's nice that it doesn't read the entire input into heap!

asfimport commented 8 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Another iteration, adding SimplePatternSplitTokenizer. It's surprisingly different from the non-spit case, and sort of complex :) But it does pass its tests. I haven't compared performance to PatternTokenizer with group -1 yet. I'll see if I can simplify it, but I think this is otherwise close.

asfimport commented 8 years ago

Dawid Weiss (@dweiss) (migrated from JIRA)

Interesting that it's faster than PatternTokenizer! I haven't looked at the patch, Mike, I did some experiments recently with regexp benchmarking (for our internal needs) and fairly large regular expression patterns (over an even larger inputs). The native java pattern implementation always won by a large (and I mean: super large) margin over anything else I tried. I tried brics, re2 (java port), re2 (native implementation), Apache ORO (out of curiosity only, it didn't pass correctness tests for me). Brics wasn't too bad, but the gain from early detection of "too hard" DFA expressions was overshadowed by DFA expansion (very large automata in our case), so unless you don't have control over the patterns (in which case adversarial cases can be executed), it didn't make sense for me to switch.

Also, the fact that the java implementation was fast was quite surprising to me as we had a large number of alternatives in regular expressions and I thought these would nicely yield to automaton optimizations (pullup of prefix matching, etc.). In the end, it didn't seem to matter. So perhaps the performance is a factor of how complex the regular expressions are (and how they're benchmarked)? Don't know.

asfimport commented 8 years ago

David Smiley (@dsmiley) (migrated from JIRA)

Instead of adding another factory, what about adding an implementation hint parameter to PatternTokenizerFactory? e.g. method="lucene" or method="simple". Then I wonder if we might detect circumstances in which this new implementation is preferable? The motivation for one factory is similar to the WhitespaceTokenizer "rule" param. People know they have a regexp and want to tokenize on it... they could easily overlook a different factory than the one that is already named well and may appear in blogs/docs/examples.

asfimport commented 8 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

@dweiss is this a benchmark I could try to run? My regexp was admittedly trivial so it would be nice to have a beefier real-world regexp to play with ;)

The bench is also trivial (I pushed it to luceneutil).

When you tested dk.brics, did you call the RunAutomaton.setAlphabet? This should be a biggish speedup, especially if your regexp has many unique character start/end ranges. In Lucene's fork of dk.briks we automatically do that in the utf8 case, and with this patch, also for the first 256 unicode characters in the full character case.

asfimport commented 8 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Instead of adding another factory, what about adding an implementation hint parameter to PatternTokenizerFactory? e.g. method="lucene" or method="simple".

I agree this would be nice, but my worry about taking that approach is which one we default to? Maybe if we make it a required param? But then how to implement back compat?

Then I wonder if we might detect circumstances in which this new implementation is preferable?

I think such auto-detection (looking at the user's pattern and picking the engine) is a dangerous. Maybe a user is debugging a tricky regexp, and adding one new character causes us to pick a different engine or something. I think for now it should be a conscious choice?

asfimport commented 8 years ago

Dawid Weiss (@dweiss) (migrated from JIRA)

I'll try to repeat the experiment with Lucene's regexp when I have a spare moment. The benchmarks (or rather: data) cannot be shared, unfortunately, but it involved regexps with hundreds of alternatives and globs. Definitely not something that people can edit by hand.

asfimport commented 8 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Maybe you could share just the regexp :)

But, if you do repeat the test w/ Lucene, then try to use this patch if possible (just the XXXRunAutomaton changes), because it optimizes for code points < 256. Or, if your data is all non-ascii, then don't bother using this patch :)

asfimport commented 8 years ago

David Smiley (@dsmiley) (migrated from JIRA)

I agree this would be nice, but my worry about taking that approach is which one we default to? Maybe if we make it a required param? But then how to implement back compat?

Not a required param; default to current java regexp impl; no back-compat worry. It's the most flexible and so I think makes the best default.

I think such auto-detection (looking at the user's pattern and picking the engine) is dangerous.

Ok.

asfimport commented 8 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

default to current java regexp impl;

But I think that would mean this new impl would very rarely be used.

I think it's better to give it a separate factory so it has more visibility? If it really does work better for users over time, word will spread, new blog posts/docs written, etc.

asfimport commented 8 years ago

Dawid Weiss (@dweiss) (migrated from JIRA)

These regexps are generated from the data, so not so easy :) And the data (and the regexps) can contain Unicode characters as well. I'll go back to this, time permitting. I'm not saying the patch is wrong, just that j.u.Pattern was pretty darn fast, even for large-scale patterns (and inputs). I was in particular surprised at re2 (C implementation) performance being way lower than Java's. Of course there were no adversarial cases in the input.

asfimport commented 8 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

I like the separate factory idea better, it makes it easier to evolve those two impls separately, eg. in the case that we decide to deprecate PatternTokenizer or to move it to sandbox.

asfimport commented 8 years ago

Dawid Weiss (@dweiss) (migrated from JIRA)

Hi Mike. Sorry it took me so long. So, check out this example snippet:

  public static void main(String[] args) {
    String [] clauses = "(.*mervi)|(.*hectic)|(petrographic)|(terracing.*)|(3\\.65.*)|(.*mea.*)|(.*n0)|(researchbas)|(chamfer.*)|(.*danaher)|(.*immediacy)|(.*selec)|(.*transi)|(.*photoreaction)|(ceo2)|(asif)|(.*koo.*)|(lasso)|(allis)|(.*paleodata.*)|(needs.*)|(auser)|(micropterus.*)|(.*sdw)|(.*blp.*)|(cent)|(hybridoma)|(tai.*)|(ransac)|(.*gfptag)|(.*falt.*)|(tubular)|(.*closet.*)|(.*halted.*)|(plish.*)|(.*aauw)|(satisf)|(.*kolodn)|(.*glycidyl.*)|(phytodetritu.*)|(.*2r)|(.*remodeler)|(astronomi)|(.*maienschein)|(universityof)|(event\\(s)|(exacerbation)|(leidi.*)|(stemmer.*)|(.*arrow)|(.*domestic)|(.*maq.*)|(pluggable.*)|(scheiner.*)|(interpenetrate)|(.*diving)|(superscript.*)|(.*cherry.*)|(saddlepoint)|(pyrolit.*)|(prosser)|(nyberg)|(iceberg.*)|(.*hammer.*)|(india.*)|(fsa)|(.*x\\(u.*)|(klima)|(good.*)|(.*provid)|(.*streaked)|(.*oppenheimer.*)|(loyalty.*)|(.*caspi.*)|(.*,99)|(.*unaccompanied)|(subharmon)|(.*hillis.*)|(ferment)|(olli)|(.*storybook)|(1358)|(.*savi.*)|(contagion)|(.*freeness)|(.*500m)|(brudvig.*)|(.*genemark)|(.*jahren.*)|(aguirr)|(12345.*)|(.*prolic)|(seafood.*)|(.*remedy)|(.*mildred.*)|(.*bering.*)|(monolithically.*)|(disequilibrium)".split("\\|");
    for (int i = 1; i < clauses.length; i++) {
      String re = Arrays.stream(clauses) 
          .limit(i)
          .collect(Collectors.joining("|"));
      RegExp regExp = new RegExp(re);
      Automaton automaton = regExp.toAutomaton(10000);
      System.out.println("Clauses: " + i + ", states=" + automaton.getNumStates());
    }
  }

As you can see it's essentially a "prefix/suffix/exact" match. Unfortunately this is a very bad example to determinize, so I can't even sensibly benchmark it against other implementations (there can be hundreds or thousands of such clauses). But even this short snippet shows the severe penalty full determinization incurrs – try to run it and you'll see.

Btw. if you're looking into this again, piggyback a change to Operations.determinize and replace LinkedList with an ArrayDeque, it certainly won't hurt.

asfimport commented 8 years ago

Dawid Weiss (@dweiss) (migrated from JIRA)

On a happier note, if it's just a union of fixed-strings (a fsa, effectively) you're matching against then it's much much faster with Lucene (and Brics), of course (times in ms.):

JavaRegExpMatcher samples: 100000 time: 11026
JavaRegExpMatcher samples: 100000 time: 11046
JavaRegExpMatcher samples: 100000 time: 11036
LuceneRegExpMatcher samples: 100000 time: 19
LuceneRegExpMatcher samples: 100000 time: 19
LuceneRegExpMatcher samples: 100000 time: 18

asfimport commented 8 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Thank you for the example @dweiss. Indeed that's a hard regexp to determinize. It's interesting because the determinization requires many states, yet it minimizes to an apparently contained number of states (though many transitions).

E.g. at 30 clauses, determized form produced 7652 states and 136898 transitions, but after minimize that drops to 150 states and 2960 transitions. I tried to run dot on this FSA but it struggles :)

Net/net the DFA approach is not usable in some cases (like this one); such users must use the JDK implementation. Maybe we should explore an re2j version too.

Btw. if you're looking into this again, piggyback a change to Operations.determinize and replace LinkedList with an ArrayDeque, it certainly won't hurt.

Excellent, I'll fold that in!

asfimport commented 8 years ago

Dawid Weiss (@dweiss) (migrated from JIRA)

Maybe we should explore an re2j version too.

I think it'd be more interesting to actually write a (simple!) matcher on top of a non-determinized Automaton... Sure, it wouldn't be able to protect against an explosion of states at compile-time, but it'd still be possible to protect against it at runtime (if too many states need to be tracked within the automaton, we could throw a matching exception). Note that a non-deterministic automaton for the above regular expression is actually pretty simple!

asfimport commented 7 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Whoa, this issue almost dropped past the event horizon on my TODO list! I'll revive the patch and push soon ...

I think it'd be more interesting to actually write a (simple!) matcher on top of a non-determinized Automaton

I think this is interesting, but let's explore it on a future issue?

asfimport commented 7 years ago

Dawid Weiss (@dweiss) (migrated from JIRA)

I think this is interesting, but let's explore it on a future issue?

Absolutely!

asfimport commented 7 years ago

David Smiley (@dsmiley) (migrated from JIRA)

(Adrien) I like the separate factory idea better, it makes it easier to evolve those two impls separately, eg. in the case that we decide to deprecate PatternTokenizer or to move it to sandbox.

I think the factory isn't going to stand in the way of either tokenizer evolving. A problem with separate factories is that the name PatternTokenizerFactory is already an excellent name, nor does it have hints as to how it works. In general I don't like polluting the namespace with different implementations of effectively the same thing; the first impl to show up grabs the best name. The Factory provides an excellent opportunity to bridge these multiple implementations.

Yet alas, my arguments aren't swaying anyone so go ahead.

asfimport commented 7 years ago

ASF subversion and git services (migrated from JIRA)

Commit 93fa72f77bd024aa09eef043c65c64a6524613dc in lucene-solr's branch refs/heads/master from Mike McCandless https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=93fa72f

LUCENE-7465: add SimplePatternTokenizer and SimpleSplitPatternTokenizer, for tokenization using Lucene's regexp/automaton implementation

asfimport commented 7 years ago

ASF subversion and git services (migrated from JIRA)

Commit c24e03e6bf4d09e6f31eee8192bb6c0c4b2b6d27 in lucene-solr's branch refs/heads/branch_6x from Mike McCandless https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=c24e03e

LUCENE-7465: add SimplePatternTokenizer and SimpleSplitPatternTokenizer, for tokenization using Lucene's regexp/automaton implementation

asfimport commented 7 years ago

Steven Rowe (@sarowe) (migrated from JIRA)

My Jenkins found a reproducing seed on master for a TestRandomChains failure that implicates the new tokenizer:

   [junit4] Suite: org.apache.lucene.analysis.core.TestRandomChains
   [junit4]   2> TEST FAIL: useCharFilter=false text='puzoh \u6a8b\u59e2\u96aa\u85f0\u614a\u9010\u7782\u5547 \uef27\uda09\uddd2\u9b9c\u056e\u33f0 W\udb24\udce6> \u2d12\u2d23\u2d05\u2d1c\u2d23 *\ud9f0\udc74\uea94\ub9c6 pev trjrbvcwb tzzntfd y|)]){1 </p> gmabf'
   [junit4]   2> TEST FAIL: useCharFilter=false text='puzoh \u6a8b\u59e2\u96aa\u85f0\u614a\u9010\u7782\u5547 \uef27\uda09\uddd2\u9b9c\u056e\u33f0 W\udb24\udce6> \u2d12\u2d23\u2d05\u2d1c\u2d23 *\ud9f0\udc74\uea94\ub9c6 pev trjrbvcwb tzzntfd y|)]){1 </p> gmabf'
   [junit4]   2> TEST FAIL: useCharFilter=false text='puzoh \u6a8b\u59e2\u96aa\u85f0\u614a\u9010\u7782\u5547 \uef27\uda09\uddd2\u9b9c\u056e\u33f0 W\udb24\udce6> \u2d12\u2d23\u2d05\u2d1c\u2d23 *\ud9f0\udc74\uea94\ub9c6 pev trjrbvcwb tzzntfd y|)]){1 </p> gmabf'
   [junit4]   2> feb 14, 2017 2:13:13 P.M. com.carrotsearch.randomizedtesting.RandomizedRunner$QueueUncaughtExceptionsHandler uncaughtException
   [junit4]   2> WARNING: Uncaught exception in thread: Thread[Thread-17,5,TGRP-TestRandomChains]
   [junit4]   2> java.lang.AssertionError: finalOffset expected:<79> but was:<65>
   [junit4]   2>    at __randomizedtesting.SeedInfo.seed([3ABEF2F287EE4968]:0)
   [junit4]   2>    at org.junit.Assert.fail(Assert.java:93)
   [junit4]   2>    at org.junit.Assert.failNotEquals(Assert.java:647)
   [junit4]   2>    at org.junit.Assert.assertEquals(Assert.java:128)
   [junit4]   2>    at org.junit.Assert.assertEquals(Assert.java:472)
   [junit4]   2>    at org.apache.lucene.analysis.BaseTokenStreamTestCase.assertTokenStreamContents(BaseTokenStreamTestCase.java:293)
   [junit4]   2>    at org.apache.lucene.analysis.BaseTokenStreamTestCase.assertTokenStreamContents(BaseTokenStreamTestCase.java:308)
   [junit4]   2>    at org.apache.lucene.analysis.BaseTokenStreamTestCase.assertTokenStreamContents(BaseTokenStreamTestCase.java:312)
   [junit4]   2>    at org.apache.lucene.analysis.BaseTokenStreamTestCase.checkAnalysisConsistency(BaseTokenStreamTestCase.java:843)
   [junit4]   2>    at org.apache.lucene.analysis.BaseTokenStreamTestCase.checkRandomData(BaseTokenStreamTestCase.java:642)
   [junit4]   2>    at org.apache.lucene.analysis.BaseTokenStreamTestCase.access$000(BaseTokenStreamTestCase.java:66)
   [junit4]   2>    at org.apache.lucene.analysis.BaseTokenStreamTestCase$AnalysisThread.run(BaseTokenStreamTestCase.java:510)
   [junit4]   2> 
   [junit4]   2> feb 14, 2017 2:13:13 P.M. com.carrotsearch.randomizedtesting.RandomizedRunner$QueueUncaughtExceptionsHandler uncaughtException
   [junit4]   2> WARNING: Uncaught exception in thread: Thread[Thread-18,5,TGRP-TestRandomChains]
   [junit4]   2> java.lang.AssertionError: finalOffset expected:<79> but was:<65>
   [junit4]   2>    at __randomizedtesting.SeedInfo.seed([3ABEF2F287EE4968]:0)
   [junit4]   2>    at org.junit.Assert.fail(Assert.java:93)
   [junit4]   2>    at org.junit.Assert.failNotEquals(Assert.java:647)
   [junit4]   2>    at org.junit.Assert.assertEquals(Assert.java:128)
   [junit4]   2>    at org.junit.Assert.assertEquals(Assert.java:472)
   [junit4]   2>    at org.apache.lucene.analysis.BaseTokenStreamTestCase.assertTokenStreamContents(BaseTokenStreamTestCase.java:293)
   [junit4]   2>    at org.apache.lucene.analysis.BaseTokenStreamTestCase.assertTokenStreamContents(BaseTokenStreamTestCase.java:308)
   [junit4]   2>    at org.apache.lucene.analysis.BaseTokenStreamTestCase.assertTokenStreamContents(BaseTokenStreamTestCase.java:312)
   [junit4]   2>    at org.apache.lucene.analysis.BaseTokenStreamTestCase.checkAnalysisConsistency(BaseTokenStreamTestCase.java:843)
   [junit4]   2>    at org.apache.lucene.analysis.BaseTokenStreamTestCase.checkRandomData(BaseTokenStreamTestCase.java:642)
   [junit4]   2>    at org.apache.lucene.analysis.BaseTokenStreamTestCase.access$000(BaseTokenStreamTestCase.java:66)
   [junit4]   2>    at org.apache.lucene.analysis.BaseTokenStreamTestCase$AnalysisThread.run(BaseTokenStreamTestCase.java:510)
   [junit4]   2> 
   [junit4]   2> feb 14, 2017 2:13:13 P.M. com.carrotsearch.randomizedtesting.RandomizedRunner$QueueUncaughtExceptionsHandler uncaughtException
   [junit4]   2> WARNING: Uncaught exception in thread: Thread[Thread-19,5,TGRP-TestRandomChains]
   [junit4]   2> java.lang.AssertionError: finalOffset expected:<79> but was:<65>
   [junit4]   2>    at __randomizedtesting.SeedInfo.seed([3ABEF2F287EE4968]:0)
   [junit4]   2>    at org.junit.Assert.fail(Assert.java:93)
   [junit4]   2>    at org.junit.Assert.failNotEquals(Assert.java:647)
   [junit4]   2>    at org.junit.Assert.assertEquals(Assert.java:128)
   [junit4]   2>    at org.junit.Assert.assertEquals(Assert.java:472)
   [junit4]   2>    at org.apache.lucene.analysis.BaseTokenStreamTestCase.assertTokenStreamContents(BaseTokenStreamTestCase.java:293)
   [junit4]   2>    at org.apache.lucene.analysis.BaseTokenStreamTestCase.assertTokenStreamContents(BaseTokenStreamTestCase.java:308)
   [junit4]   2>    at org.apache.lucene.analysis.BaseTokenStreamTestCase.assertTokenStreamContents(BaseTokenStreamTestCase.java:312)
   [junit4]   2>    at org.apache.lucene.analysis.BaseTokenStreamTestCase.checkAnalysisConsistency(BaseTokenStreamTestCase.java:843)
   [junit4]   2>    at org.apache.lucene.analysis.BaseTokenStreamTestCase.checkRandomData(BaseTokenStreamTestCase.java:642)
   [junit4]   2>    at org.apache.lucene.analysis.BaseTokenStreamTestCase.access$000(BaseTokenStreamTestCase.java:66)
   [junit4]   2>    at org.apache.lucene.analysis.BaseTokenStreamTestCase$AnalysisThread.run(BaseTokenStreamTestCase.java:510)
   [junit4]   2> 
   [junit4]   2> Exception from random analyzer: 
   [junit4]   2> charfilters=
   [junit4]   2>   org.apache.lucene.analysis.MockCharFilter(java.io.StringReader@754b8c24)
   [junit4]   2>   org.apache.lucene.analysis.charfilter.HTMLStripCharFilter(org.apache.lucene.analysis.MockCharFilter@6f0a7841)
   [junit4]   2> tokenizer=
   [junit4]   2>   org.apache.lucene.analysis.pattern.SimplePatternSplitTokenizer(org.apache.lucene.util.automaton.Automaton@aa8d0c)
   [junit4]   2> filters=
   [junit4]   2> offsetsAreCorrect=true
   [junit4]   2> NOTE: reproduce with: ant test  -Dtestcase=TestRandomChains -Dtests.method=testRandomChainsWithLargeStrings -Dtests.seed=3ABEF2F287EE4968 -Dtests.slow=true -Dtests.locale=es-US -Dtests.timezone=America/Montreal -Dtests.asserts=true -Dtests.file.encoding=UTF-8
   [junit4] ERROR   1.17s J1 | TestRandomChains.testRandomChainsWithLargeStrings <<<
   [junit4]    > Throwable #1: java.lang.RuntimeException: some thread(s) failed
   [junit4]    >    at org.apache.lucene.analysis.BaseTokenStreamTestCase.checkRandomData(BaseTokenStreamTestCase.java:562)
   [junit4]    >    at org.apache.lucene.analysis.core.TestRandomChains.testRandomChainsWithLargeStrings(TestRandomChains.java:880)
   [junit4]    >    at java.lang.Thread.run(Thread.java:745)Throwable #2: com.carrotsearch.randomizedtesting.UncaughtExceptionError: Captured an uncaught exception in thread: Thread[id=70, name=Thread-17, state=RUNNABLE, group=TGRP-TestRandomChains]
   [junit4]    > Caused by: java.lang.AssertionError: finalOffset expected:<79> but was:<65>
   [junit4]    >    at __randomizedtesting.SeedInfo.seed([3ABEF2F287EE4968]:0)
   [junit4]    >    at org.apache.lucene.analysis.BaseTokenStreamTestCase.assertTokenStreamContents(BaseTokenStreamTestCase.java:293)
   [junit4]    >    at org.apache.lucene.analysis.BaseTokenStreamTestCase.assertTokenStreamContents(BaseTokenStreamTestCase.java:308)
   [junit4]    >    at org.apache.lucene.analysis.BaseTokenStreamTestCase.assertTokenStreamContents(BaseTokenStreamTestCase.java:312)
   [junit4]    >    at org.apache.lucene.analysis.BaseTokenStreamTestCase.checkAnalysisConsistency(BaseTokenStreamTestCase.java:843)
   [junit4]    >    at org.apache.lucene.analysis.BaseTokenStreamTestCase.checkRandomData(BaseTokenStreamTestCase.java:642)
   [junit4]    >    at org.apache.lucene.analysis.BaseTokenStreamTestCase.access$000(BaseTokenStreamTestCase.java:66)
   [junit4]    >    at org.apache.lucene.analysis.BaseTokenStreamTestCase$AnalysisThread.run(BaseTokenStreamTestCase.java:510)Throwable #3: com.carrotsearch.randomizedtesting.UncaughtExceptionError: Captured an uncaught exception in thread: Thread[id=72, name=Thread-19, state=RUNNABLE, group=TGRP-TestRandomChains]
   [junit4]    > Caused by: java.lang.AssertionError: finalOffset expected:<79> but was:<65>
   [junit4]    >    at __randomizedtesting.SeedInfo.seed([3ABEF2F287EE4968]:0)
   [junit4]    >    at org.apache.lucene.analysis.BaseTokenStreamTestCase.assertTokenStreamContents(BaseTokenStreamTestCase.java:293)
   [junit4]    >    at org.apache.lucene.analysis.BaseTokenStreamTestCase.assertTokenStreamContents(BaseTokenStreamTestCase.java:308)
   [junit4]    >    at org.apache.lucene.analysis.BaseTokenStreamTestCase.assertTokenStreamContents(BaseTokenStreamTestCase.java:312)
   [junit4]    >    at org.apache.lucene.analysis.BaseTokenStreamTestCase.checkAnalysisConsistency(BaseTokenStreamTestCase.java:843)
   [junit4]    >    at org.apache.lucene.analysis.BaseTokenStreamTestCase.checkRandomData(BaseTokenStreamTestCase.java:642)
   [junit4]    >    at org.apache.lucene.analysis.BaseTokenStreamTestCase.access$000(BaseTokenStreamTestCase.java:66)
   [junit4]    >    at org.apache.lucene.analysis.BaseTokenStreamTestCase$AnalysisThread.run(BaseTokenStreamTestCase.java:510)Throwable #4: com.carrotsearch.randomizedtesting.UncaughtExceptionError: Captured an uncaught exception in thread: Thread[id=71, name=Thread-18, state=RUNNABLE, group=TGRP-TestRandomChains]
   [junit4]    > Caused by: java.lang.AssertionError: finalOffset expected:<79> but was:<65>
   [junit4]    >    at __randomizedtesting.SeedInfo.seed([3ABEF2F287EE4968]:0)
   [junit4]    >    at org.apache.lucene.analysis.BaseTokenStreamTestCase.assertTokenStreamContents(BaseTokenStreamTestCase.java:293)
   [junit4]    >    at org.apache.lucene.analysis.BaseTokenStreamTestCase.assertTokenStreamContents(BaseTokenStreamTestCase.java:308)
   [junit4]    >    at org.apache.lucene.analysis.BaseTokenStreamTestCase.assertTokenStreamContents(BaseTokenStreamTestCase.java:312)
   [junit4]    >    at org.apache.lucene.analysis.BaseTokenStreamTestCase.checkAnalysisConsistency(BaseTokenStreamTestCase.java:843)
   [junit4]    >    at org.apache.lucene.analysis.BaseTokenStreamTestCase.checkRandomData(BaseTokenStreamTestCase.java:642)
   [junit4]    >    at org.apache.lucene.analysis.BaseTokenStreamTestCase.access$000(BaseTokenStreamTestCase.java:66)
   [junit4]    >    at org.apache.lucene.analysis.BaseTokenStreamTestCase$AnalysisThread.run(BaseTokenStreamTestCase.java:510)
   [junit4]   2> NOTE: test params are: codec=CheapBastard, sim=RandomSimilarity(queryNorm=false): {}, locale=es-US, timezone=America/Montreal
   [junit4]   2> NOTE: Linux 4.1.0-custom2-amd64 amd64/Oracle Corporation 1.8.0_77 (64-bit)/cpus=16,threads=1,free=358995304,total=524288000

asfimport commented 7 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Thanks @sarowe, I'll dig.

asfimport commented 7 years ago

ASF subversion and git services (migrated from JIRA)

Commit 5ca3ca205234694224341331980216a07c8e518b in lucene-solr's branch refs/heads/master from Mike McCandless https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=5ca3ca2

LUCENE-7465: use the right random instance otheerwise we hit creepy test failures

asfimport commented 7 years ago

ASF subversion and git services (migrated from JIRA)

Commit 74f208d716ef25aaead97e674c5296f2be54eb76 in lucene-solr's branch refs/heads/branch_6x from Mike McCandless https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=74f208d

LUCENE-7465: use the right random instance otheerwise we hit creepy test failures

asfimport commented 7 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

OK I pushed a fix... sneaky wrong random instance usage.

asfimport commented 7 years ago

Steven Rowe (@sarowe) (migrated from JIRA)

Another reproducing TestRandomChains master seed, from https://jenkins.thetaphi.de/job/Lucene-Solr-master-Linux/19011/:

   [junit4] Suite: org.apache.lucene.analysis.core.TestRandomChains
   [junit4]   2> TEST FAIL: useCharFilter=false text='Sy \ud98b\udc04\uff52\u0384\u942fP\u040a\u0004\u0455 |uh)a)mrB- '
   [junit4]   2> Exception from random analyzer: 
   [junit4]   2> charfilters=
   [junit4]   2>   org.apache.lucene.analysis.charfilter.HTMLStripCharFilter(java.io.StringReader@29890127, [<ACRONYM_DEP>])
   [junit4]   2> tokenizer=
   [junit4]   2>   org.apache.lucene.analysis.pattern.SimplePatternTokenizer(org.apache.lucene.util.automaton.Automaton@9bd88db)
   [junit4]   2> filters=
   [junit4]   2> offsetsAreCorrect=true
   [junit4]   2> NOTE: reproduce with: ant test  -Dtestcase=TestRandomChains -Dtests.method=testRandomChainsWithLargeStrings -Dtests.seed=821B9B2715E2264F -Dtests.multiplier=3 -Dtests.slow=true -Dtests.locale=el-GR -Dtests.timezone=America/Lima -Dtests.asserts=true -Dtests.file.encoding=US-ASCII
   [junit4] FAILURE 0.15s J2 | TestRandomChains.testRandomChainsWithLargeStrings <<<
   [junit4]    > Throwable #1: java.lang.AssertionError: finalOffset expected:<24> but was:<23>
   [junit4]    >    at __randomizedtesting.SeedInfo.seed([821B9B2715E2264F:E84024364CAC06BC]:0)
   [junit4]    >    at org.apache.lucene.analysis.BaseTokenStreamTestCase.assertTokenStreamContents(BaseTokenStreamTestCase.java:293)
   [junit4]    >    at org.apache.lucene.analysis.BaseTokenStreamTestCase.assertTokenStreamContents(BaseTokenStreamTestCase.java:308)
   [junit4]    >    at org.apache.lucene.analysis.BaseTokenStreamTestCase.assertTokenStreamContents(BaseTokenStreamTestCase.java:312)
   [junit4]    >    at org.apache.lucene.analysis.BaseTokenStreamTestCase.checkAnalysisConsistency(BaseTokenStreamTestCase.java:843)
   [junit4]    >    at org.apache.lucene.analysis.BaseTokenStreamTestCase.checkRandomData(BaseTokenStreamTestCase.java:642)
   [junit4]    >    at org.apache.lucene.analysis.BaseTokenStreamTestCase.checkRandomData(BaseTokenStreamTestCase.java:540)
   [junit4]    >    at org.apache.lucene.analysis.core.TestRandomChains.testRandomChainsWithLargeStrings(TestRandomChains.java:880)
   [junit4]    >    at java.lang.Thread.run(Thread.java:745)
   [junit4]   2> NOTE: test params are: codec=CheapBastard, sim=RandomSimilarity(queryNorm=true): {}, locale=el-GR, timezone=America/Lima
   [junit4]   2> NOTE: Linux 4.4.0-53-generic amd64/Oracle Corporation 1.8.0_121 (64-bit)/cpus=12,threads=1,free=457314736,total=508952576

asfimport commented 7 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Thanks @sarowe; I'll have a look.

asfimport commented 7 years ago

ASF subversion and git services (migrated from JIRA)

Commit 2d03aa21a2b674d36e201f6309e646f37771b73b in lucene-solr's branch refs/heads/master from Mike McCandless https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=2d03aa2

LUCENE-7465: fix corner case in SimplePattern/SplitTokenizer when lookahead hits end of input

asfimport commented 7 years ago

ASF subversion and git services (migrated from JIRA)

Commit c3028b32207b8837cdaf29918edd4e0cdc9621ad in lucene-solr's branch refs/heads/branch_6x from Mike McCandless https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=c3028b3

LUCENE-7465: fix corner case in SimplePattern/SplitTokenizer when lookahead hits end of input

asfimport commented 7 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

That test failure was actually a real bug in both SimplePatternTokenizer and SimpleSplitPatternTokenizer! Yay for TestRandomChains ;) I pushed a fix.

apache / lucene

Add a PatternTokenizer that uses Lucene's RegExp implementation [LUCENE-7465] #8517