apache / lucene

Apache Lucene open-source search software
https://lucene.apache.org/
Apache License 2.0

Allow CharTokenizer-derived tokenizers and KeywordTokenizer to configure the max token length [LUCENE-7705] #8756

Closed. asfimport closed this issue 7 years ago.

asfimport commented 7 years ago

SOLR-10186

@ErickErickson: Is there a good reason that we hard-code a 256 character limit for CharTokenizer? Changing this limit currently requires people to copy/paste incrementToken into some new class, since incrementToken is final. KeywordTokenizer can easily change the default (which is also 256 bytes), but doing so requires code rather than being configurable in the schema. For KeywordTokenizer, this is Solr-only. For the CharTokenizer classes (WhitespaceTokenizer, UnicodeWhitespaceTokenizer and LetterTokenizer) and their factories, it would take adding a constructor to the base class in Lucene and using it in the factory. Any objections?


Migrated from LUCENE-7705 by Amrit Sarkar, resolved May 28 2017
Attachments: LUCENE-7705, LUCENE-7705.patch (versions: 10)
Linked issues:

asfimport commented 7 years ago

Amrit Sarkar (migrated from JIRA)

I have put together a patch in SOLR-10186 and introduced a new constructor in CharTokenizer and the related tokenizer factories, which takes maxCharLen along with the attribute factory as parameters.
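
For concreteness, here's a rough sketch of how such a constructor might be used (parameter names, order and exact types are illustrative and may differ from the patch):

import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.util.AttributeFactory;

public class MaxTokenLenExample {
  public static void main(String[] args) {
    // Existing no-arg constructor: token length capped at the hard-coded 256 chars.
    WhitespaceTokenizer defaultLimit = new WhitespaceTokenizer();

    // Proposed: pass the attribute factory plus a configurable max token length.
    WhitespaceTokenizer longTokens =
        new WhitespaceTokenizer(AttributeFactory.DEFAULT_ATTRIBUTE_FACTORY, 1024);
  }
}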

Kindly provide your feedback and any comments on introducing new constructors in the classes. Thanks.

asfimport commented 7 years ago

Erick Erickson (@ErickErickson) (migrated from JIRA)

Patch that fixes up a few comments and regularizes maxChars* to maxToken* and the like. I enhanced a test to cover tokens longer than 256 characters.

There was a problem with LowerCaseTokenizerFactory: the getMultiTermComponent method constructed a LowerCaseFilterFactory with the original arguments, including maxTokenLen, which then threw an error. There's a nocommit in there for the nonce; what's the right thing to do here?

@amrit sarkar Do you have any ideas for a more elegant solution? The nocommit is there because this feels just too hacky, but it does prove that this is the problem.

It seems like we should close SOLR-10186 and just make the code changes here. With this patch I successfully tested adding fields with tokens both longer and shorter than 256 characters, so I don't think there's anything beyond this patch to do on the Solr side. I suppose we could add some maxTokenLen bits to some of the schemas just to exercise that (which would have found the LowerCaseTokenizerFactory problem).

asfimport commented 7 years ago

Erick Erickson (@ErickErickson) (migrated from JIRA)

I think the patch I uploaded is the result of applying your most recent patch for SOLR-10186, but can you verify? We should probably consolidate the two; I suggest we close the Solr one as a duplicate and continue iterating here.

Erick

asfimport commented 7 years ago

Erick Erickson (@ErickErickson) (migrated from JIRA)

These two tests fail:

org.apache.lucene.analysis.core.TestUnicodeWhitespaceTokenizer.testParamsFactory
org.apache.lucene.analysis.core.TestRandomChains (suite)

The TestUnicodeWhitespaceTokenizer failure is because I added "to" to one of the exception messages; a trivial fix.

No idea what's happening with TestRandomChains though.

And my nocommit ensures that 'ant precommit' will fail too; that's rather the point.

asfimport commented 7 years ago

Erick Erickson (@ErickErickson) (migrated from JIRA)

Oops, forgot to "git add" on the new test file.

asfimport commented 7 years ago

Amrit Sarkar (migrated from JIRA)

Erick,

I was able to get all the tests to pass with the current patch uploaded, with minor corrections and fixes in the existing test classes.

modified:   lucene/analysis/common/src/java/org/apache/lucene/analysis/core/KeywordTokenizerFactory.java
modified:   lucene/analysis/common/src/java/org/apache/lucene/analysis/core/LetterTokenizer.java
modified:   lucene/analysis/common/src/java/org/apache/lucene/analysis/core/LetterTokenizerFactory.java
modified:   lucene/analysis/common/src/java/org/apache/lucene/analysis/core/LowerCaseTokenizer.java
modified:   lucene/analysis/common/src/java/org/apache/lucene/analysis/core/LowerCaseTokenizerFactory.java
modified:   lucene/analysis/common/src/java/org/apache/lucene/analysis/core/UnicodeWhitespaceTokenizer.java
modified:   lucene/analysis/common/src/java/org/apache/lucene/analysis/core/WhitespaceTokenizer.java
modified:   lucene/analysis/common/src/java/org/apache/lucene/analysis/core/WhitespaceTokenizerFactory.java
modified:   lucene/analysis/common/src/java/org/apache/lucene/analysis/util/CharTokenizer.java
new file:   lucene/analysis/common/src/test/org/apache/lucene/analysis/core/TestKeywordTokenizer.java
modified:   lucene/analysis/common/src/test/org/apache/lucene/analysis/core/TestRandomChains.java
modified:   lucene/analysis/common/src/test/org/apache/lucene/analysis/core/TestUnicodeWhitespaceTokenizer.java
modified:   lucene/analysis/common/src/test/org/apache/lucene/analysis/util/TestCharTokenizers.java

Test failure fixes:

  1. org.apache.lucene.analysis.core.TestRandomChains (suite):

    Added the four failing tokenizer constructors to the brokenConstructors map to bypass them for now. This class checks which arguments are legal for the constructors and builds certain maps beforehand to check against later. It doesn't take boxing/unboxing of primitive types into account; when a constructor takes a parameter declared as "java.lang.Integer", the map-building code unboxes it to "int", and the check then fails because "int.class" and "java.lang.Integer.class" don't match, which doesn't make much sense. Either we fix how the maps are created or we skip these constructors for now (see the small illustration after this list).

  2. the getMultiTermComponent method constructed a LowerCaseFilterFactory with the original arguments including maxTokenLen, which then threw an error:

    Not sure what corrected that, but I see no suite failing, not even TestFactories, which I suppose was throwing the error about incompatible constructors/no method found, etc. Kindly verify whether we are still facing the issue or whether we need to harden the test cases for it.
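
A tiny self-contained illustration of the primitive-vs-wrapper mismatch described in item 1 (the class name here is made up for illustration):

public class BoxingMismatch {
  public static void main(String[] args) {
    // int.class and Integer.class are distinct Class objects, so a lookup
    // keyed on one will never match the other:
    System.out.println(int.class);                   // prints: int
    System.out.println(Integer.class);               // prints: class java.lang.Integer
    System.out.println(int.class == Integer.class);  // false
    System.out.println(int.class == Integer.TYPE);   // true (TYPE is the primitive's Class)
  }
}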

asfimport commented 7 years ago

Erick Erickson (@ErickErickson) (migrated from JIRA)

Actually, I didn't encounter the error in LowerCaseFilterFactory until I tried it out from a fully-compiled Solr instance with maxTokenLen in the managed_schema file. I was thinking that it might make sense to add maxTokenLen to a couple of the schemas used by some of the test cases, leaving it at the default value of 256, just to get some test coverage. I think this is really the difference between a test case at the Lucene level and one based on the schema at the Solr level.

asfimport commented 7 years ago

Amrit Sarkar (migrated from JIRA)

Roger that! I will incorporate test-cases for the stated tokenizers and try to fix the issue without removing maxTokenLength from the method.

asfimport commented 7 years ago

Amrit Sarkar (migrated from JIRA)

Erick,

I wrote the test cases and it is a problem, but removing "maxTokenLen" from the original arguments that initialize LowerCaseFilterFactory makes sense, and it is not a hack. We have to remove the argument before the FilterFactory init somewhere, and it is better to do it where we make the call. I am not inclined to remove it in the FilterFactory init or in the AbstractAnalysisFactory call. So we are left with two options: either we don't provide the maxTokenLen option for LowerCaseTokenizer, or we remove the extra argument as you have done in getMultiTermComponent().

Let me know your thoughts.

asfimport commented 7 years ago

Erick Erickson (@ErickErickson) (migrated from JIRA)

Removing the extra arguments in getMultiTermComponent() is certainly better than doing it in any of the superclasses; there you'd have the possibility of interfering with someone else's filter that coincidentally had a maxTokenLen parameter that should legitimately be passed through.

I guess that removing it in getMultiTermComponent() is OK. At least having the place that gets the maxTokenLength argument (i.e. the factory) be responsible for removing it before passing the args on to the filter keeps things from sprawling.

Although the other possibility is to just pass an empty map rather than munge the original one. LowerCaseFilterFactory's first act is to check whether the map is empty, after all. Something like:

return new LowerCaseFilterFactory(Collections.EMPTY_MAP); (untested).

I see no justification for passing the original args in this particular case anyway; I'd guess it was just convenient. I think I like the EMPTY_MAP now that I think about it, but neither option is really all that superior IMO. The EMPTY_MAP will be slightly more efficient, but I doubt it's really measurable.

asfimport commented 7 years ago

Amrit Sarkar (migrated from JIRA)

Erick,

For every tokenizer init, two parameters are already included in their arguments as below:

{class=solr.LowerCaseTokenizerFactory, luceneMatchVersion=7.0.0}

which AbstractAnalysisFactory consumes when it is instantiated:

    originalArgs = Collections.unmodifiableMap(new HashMap<>(args));
    System.out.println("orgs:: " + originalArgs);  // debug output added while investigating
    String version = get(args, LUCENE_MATCH_VERSION_PARAM);
    if (version == null) {
      luceneMatchVersion = Version.LATEST;
    } else {
      try {
        luceneMatchVersion = Version.parseLeniently(version);
      } catch (ParseException pe) {
        throw new IllegalArgumentException(pe);
      }
    }
    args.remove(CLASS_NAME);  // consume the class arg
  }

The class parameter is harmless, so we don't have to worry about it, but the constructor does look up luceneMatchVersion, which is a kind of sanity check on versions; I'm not sure anything important happens in the Version::parseLeniently(version) call. If we can confirm that, we can pass an empty map there.

asfimport commented 7 years ago

Erick Erickson (@ErickErickson) (migrated from JIRA)

Ah, good catch. So yeah, we'll have to remove the maxTokenLen from the original args and pass the modified map through.
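
A minimal sketch of that approach inside LowerCaseTokenizerFactory (illustrative only; the variable name is assumed and the actual patch may differ in details):

import java.util.HashMap;
import java.util.Map;

// Inside LowerCaseTokenizerFactory:
@Override
public AbstractAnalysisFactory getMultiTermComponent() {
  // Copy the original args and drop the tokenizer-only parameter so that shared
  // args such as luceneMatchVersion still reach LowerCaseFilterFactory.
  Map<String, String> filterArgs = new HashMap<>(getOriginalArgs());
  filterArgs.remove("maxTokenLen");
  return new LowerCaseFilterFactory(filterArgs);
}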

asfimport commented 7 years ago

Amrit Sarkar (migrated from JIRA)

I made some adjustments, including removing maxTokenLen from the LowerCaseFilterFactory init, and added thorough test cases for the tokenizers at the Solr level: TestMaxTokenLenTokenizer.java

    modified:   lucene/analysis/common/src/java/org/apache/lucene/analysis/core/KeywordTokenizerFactory.java
    modified:   lucene/analysis/common/src/java/org/apache/lucene/analysis/core/LetterTokenizer.java
    modified:   lucene/analysis/common/src/java/org/apache/lucene/analysis/core/LetterTokenizerFactory.java
    modified:   lucene/analysis/common/src/java/org/apache/lucene/analysis/core/LowerCaseTokenizer.java
    modified:   lucene/analysis/common/src/java/org/apache/lucene/analysis/core/LowerCaseTokenizerFactory.java
    modified:   lucene/analysis/common/src/java/org/apache/lucene/analysis/core/UnicodeWhitespaceTokenizer.java
    modified:   lucene/analysis/common/src/java/org/apache/lucene/analysis/core/WhitespaceTokenizer.java
    modified:   lucene/analysis/common/src/java/org/apache/lucene/analysis/core/WhitespaceTokenizerFactory.java
    modified:   lucene/analysis/common/src/java/org/apache/lucene/analysis/util/CharTokenizer.java
    new file:   lucene/analysis/common/src/test/org/apache/lucene/analysis/core/TestKeywordTokenizer.java
    modified:   lucene/analysis/common/src/test/org/apache/lucene/analysis/core/TestRandomChains.java
    modified:   lucene/analysis/common/src/test/org/apache/lucene/analysis/core/TestUnicodeWhitespaceTokenizer.java
    modified:   lucene/analysis/common/src/test/org/apache/lucene/analysis/util/TestCharTokenizers.java
    new file:   solr/core/src/test-files/solr/collection1/conf/schema-tokenizer-test.xml
    new file:   solr/core/src/test/org/apache/solr/util/TestMaxTokenLenTokenizer.java

I think we have covered everything; all tests passed.

asfimport commented 7 years ago

Erick Erickson (@ErickErickson) (migrated from JIRA)

Patch looks good. I'm going to hang back on committing this until we figure out SOLR-10229 (control schema proliferation). The additional schema you put in here is about the only way currently to test Solr schemas, so that's perfectly appropriate. I'd just like to use this as a test case for what it would take to move constructing schemas to inside the tests rather than have each new case like this require another schema that we then have to maintain.

But if SOLR-10229 takes very long I'll just commit this one and we can work out the rest later.

asfimport commented 7 years ago

Erick Erickson (@ErickErickson) (migrated from JIRA)

May or may not depend on SOLR-10229, but we want to approach them both together.

asfimport commented 7 years ago

Erick Erickson (@ErickErickson) (migrated from JIRA)

@rmuir @mikemccand Wanted to check whether the discussion at #8813 applies here or not.

asfimport commented 7 years ago

Robert Muir (@rmuir) (migrated from JIRA)

I think the #8813 discussion is unrelated: it has to do with exposing such things again at the Analyzer level, vs. going the CustomAnalyzer route.

asfimport commented 7 years ago

Erick Erickson (@ErickErickson) (migrated from JIRA)

Thanks!

asfimport commented 7 years ago

Erick Erickson (@ErickErickson) (migrated from JIRA)

Fixed a couple of precommit issues; otherwise the patch is the same.

Amrit:

this test always fails for me: TestMaxTokenLenTokenizer

assertQ("Check the total number of docs", req("q", "letter:lett"), "//result[@numFound=0]");

Looking at the code, numFound should be 1, I believe. The problem is that both the index-time and query-time analysis trim the term to 3 characters, so finding a document when searching for "lett" here is perfectly legitimate. In fact, any search token, no matter how long and no matter what follows "let", will succeed. I think all the rest of the tests for fields in this set will fail for a similar reason when checking for search terms longer than the token length. Do you agree?

If you agree, let's add a few tests explicitly showing this so that future people looking at the code will know it's intended behavior, i.e. add lines like:

// Anything that matches the first three letters should be found when maxLen=3
assertQ("Check the total number of docs", req("q", "letter:letXyz"), "//result[@numFound=1]");

Or I somehow messed up the patch.

asfimport commented 7 years ago

Amrit Sarkar (migrated from JIRA)

Erick,

I have absolutely no idea how I uploaded a wrong/incomplete patch. I certainly understand the above; I had incorporated two different configurations to show the difference in the first place, and I have refined them as per your latest comments.

There is one serious issue I have been facing for a day: all the tests pass in the IntelliJ IDE, but they fail when I run "ant -Dtestcase=TestMaxTokenLenTokenizer test". I don't know what to do about that.

The latest patch you uploaded is incomplete, as the newly created files are not part of it. I have worked on the previous one (dated: 27-07-17). You may need to redo the precommit changes on top of the latest patch.

Sorry for the hiccup.

asfimport commented 7 years ago

Erick Erickson (@ErickErickson) (migrated from JIRA)

Precommit passes; let's use this patch as a basis going forward. The "ant precommit" task will show you all the failures.

So I'm not sure what's up with the test; "ant -Dtestcase=TestMaxTokenLenTokenizer test" works just fine for me. What errors do you see?

asfimport commented 7 years ago

Amrit Sarkar (migrated from JIRA)

Yes Erick, I saw the "ant precommit" errors: tabs instead of spaces, got it.

I am still seeing this:

   [junit4] Tests with failures [seed: C3F5B66314F27B5E]:
   [junit4]   - org.apache.solr.util.TestMaxTokenLenTokenizer.testSingleFieldSameAnalyzers
   [junit4]   2> NOTE: reproduce with: ant test  -Dtestcase=TestMaxTokenLenTokenizer -Dtests.method=testSingleFieldSameAnalyzers -Dtests.seed=C3F5B66314F27B5E -Dtests.slow=true -Dtests.locale=fr-CA -Dtests.timezone=Asia/Qatar -Dtests.asserts=true -Dtests.file.encoding=UTF-8
   [junit4] ERROR   0.10s | TestMaxTokenLenTokenizer.testSingleFieldSameAnalyzers <<<
   [junit4]    > Throwable #1: java.lang.RuntimeException: Exception during query
   [junit4]    >    at __randomizedtesting.SeedInfo.seed([C3F5B66314F27B5E:A927890C4C11AB91]:0)
   [junit4]    >    at org.apache.solr.SolrTestCaseJ4.assertQ(SolrTestCaseJ4.java:896)
   [junit4]    >    at org.apache.solr.util.TestMaxTokenLenTokenizer.testSingleFieldSameAnalyzers(TestMaxTokenLenTokenizer.java:104)
   [junit4]    >    at java.lang.Thread.run(Thread.java:745)
   [junit4]    > Caused by: java.lang.RuntimeException: REQUEST FAILED: xpath=//result[@numFound=1]
   [junit4]    >    xml response was: <?xml version="1.0" encoding="UTF-8"?>
   [junit4]    > <response>
   [junit4]    > <lst name="responseHeader"><int name="status">0</int><int name="QTime">11</int></lst><result name="response" numFound="0" start="0"></result>
   [junit4]    > </response>
   [junit4]    >    request was:q=letter0:lett&wt=xml
   [junit4]    >    at org.apache.solr.SolrTestCaseJ4.assertQ(SolrTestCaseJ4.java:889)
   [junit4]    >    ... 40 more

But if it is working for you, I am good.

You didn't include the newly created files again in the latest patch, so I have posted a new one with "precommit" sorted and all the files included.

asfimport commented 7 years ago

Erick Erickson (@ErickErickson) (migrated from JIRA)

Sorry 'bout that, I was rushing out the door this morning and didn't do the git add bits.

We'll get it all right yet.

asfimport commented 7 years ago

Amrit Sarkar (migrated from JIRA)

Erick, we got everything right this time.

asfimport commented 7 years ago

Erick Erickson (@ErickErickson) (migrated from JIRA)

Amrit:

I'm using LUCENE-7705 (note no .patch extension). Is that the right one?

asfimport commented 7 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Why do the tokenizer ctor args take Integer instead of int? They check for null and throw an exception in that case, but I think it's better to encode the correct argument type in the method signature. This would also mean you can re-enable TestRandomChains for the newly-added ctors.

I think that if basic tokenizers like WhitespaceTokenizer have new ctors that take an int and throw an exception on illegal values for this parameter (because some aren't permitted), then these new ctors really could use a bit better docs than:

@param maxTokenLen max token length the tokenizer will emit

Instead, maybe something like:

@param maxTokenLen maximum token length the tokenizer will emit. Must not be negative.
@throws IllegalArgumentException if maxTokenLen is invalid.

Also, I am concerned about the allowable value ranges and the disabling of tests as far as correctness goes. If int is used instead of Integer, this test can be re-enabled, which may find/prevent crazy bugs. E.g. I worry a bit about what these tokenizers will do if maxTokenLen is 0 (which is allowed): will they loop forever, crash, etc.?

The maximum value of the range is also suspicious. What happens with any CharTokenizer-based tokenizer if I supply a value > IO_BUFFER_SIZE? Maybe it behaves correctly, maybe not. I think Integer.MAX_VALUE is a trap for a number of reasons: enormous values will likely only blow up anyway (terms are limited to 32766 bytes in the index), and if they don't, they may create hellacious threadlocal memory usage in buffers, bad/malicious input could take out JVMs, etc.

Maybe follow the example of StandardTokenizer here:

/** Absolute maximum sized token */
public static final int MAX_TOKEN_LENGTH_LIMIT = 1024 * 1024;
...
   * @throws IllegalArgumentException if the given length is outside of the
   *  range [1, {@value #MAX_TOKEN_LENGTH_LIMIT}].
...
if (length < 1) {
  throw new IllegalArgumentException("maxTokenLength must be greater than zero");
} else if (length > MAX_TOKEN_LENGTH_LIMIT) {
  throw new IllegalArgumentException("maxTokenLength may not exceed " + MAX_TOKEN_LENGTH_LIMIT);
}
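
Applied to the new CharTokenizer ctor, that validation pattern would look roughly like the following (a sketch only: the constant is borrowed from StandardTokenizer, and the field name and message wording are assumed rather than taken from the patch):

public CharTokenizer(AttributeFactory factory, int maxTokenLen) {
  super(factory);
  // Reject non-positive values and values above the StandardTokenizer-style cap.
  if (maxTokenLen <= 0 || maxTokenLen > MAX_TOKEN_LENGTH_LIMIT) {
    throw new IllegalArgumentException("maxTokenLen must be greater than 0 and less than "
        + MAX_TOKEN_LENGTH_LIMIT + " passed: " + maxTokenLen);
  }
  this.maxTokenLen = maxTokenLen;
}
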
asfimport commented 7 years ago

Erick Erickson (@ErickErickson) (migrated from JIRA)

One nit and one suggestion and one question in addition to Robert's comments:

The nit: there's a pattern of a bunch of these:

updateJ("{\"add\":{\"doc\": {\"id\":1,\"letter\":\"letter\"}},\"commit\":{}}",null); . . then: assertU(commit());

It's unnecessary to do the commit with the updateJ calls; the commit at the end will take care of it all. It's a little less efficient to commit with each doc. Frankly, I doubt that'd be measurable performance-wise, but let's take them out anyway.

The suggestion: when we do stop accruing characters, e.g. in CharTokenizer, let's log an INFO-level message to that effect, something like:

log.info("Splitting token at {} chars", maxTokenLen);

That way people will have a clue where to look. I think INFO is appropriate rather than WARN or ERROR, since it's perfectly legitimate to truncate input; I'm thinking of OCR text, for instance. Maybe dump the token we've accumulated so far?

I worded it as "splitting" because (and there are tests for this) the next token picks up where the first left off. So if the max length is 3 and the input is "whitespace", we get several tokens as a result:

"whi", "tes", "pac", and "e".

I suppose that means the offsets are also incremented. Is that really what we want here, or should we instead throw away the extra tokens? @rmuir, what do you think is correct? This is not a change in behavior; the current code does the same thing, just with a hard-coded 255 limit. I'm checking whether this is intended behavior.

If we do want to throw away the extra, we could spin through the buffer until we encounter a non-token character, then return the part shorter than maxTokenLen. If we did that, we could also log the entire token and the truncated version if we wanted.

Otherwise precommit passes and all tests pass.

asfimport commented 7 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Sorry, I think it's a very bad idea to log tokens in Lucene's analysis chains for any reason.

asfimport commented 7 years ago

Erick Erickson (@ErickErickson) (migrated from JIRA)

Fair enough; I suppose you could get log files 20 bazillion bytes long.

WDYT about generating multiple tokens when the length is exceeded?

asfimport commented 7 years ago

Robert Muir (@rmuir) (migrated from JIRA)

I don't think it's within the scope of this issue.

asfimport commented 7 years ago

Erick Erickson (@ErickErickson) (migrated from JIRA)

Final patch, added CHANGES.txt etc. Incorporates @rmuir's suggestions, thanks Robert!

  1. Integer parameters are now int.
  2. Put tokenizers back in TestRandomChains (well, really took out the special handling of these tokenizers).
  3. Added tests for a length of 0 and for values > StandardTokenizer.MAX_TOKEN_LENGTH_LIMIT (1M, which is still very large).
  4. Added better comments in the javadocs.
  5. Added a test for input > I/O buffer size.

Unless there are objections I'll commit this tomorrow.

asfimport commented 7 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Tests such as this are not effective: if no exception is thrown, the test will still pass.

+    try {
+      new LetterTokenizer(newAttributeFactory(), 0);
+    } catch (Exception e) {
+      assertEquals("maxTokenLen must be greater than 0 and less than 1048576 passed: 0", e.getMessage());
+    }

I would use expectThrows in this case instead.
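
Something like the following (a sketch, assuming the ctor rejects 0 with an IllegalArgumentException carrying the message quoted above):

IllegalArgumentException e = expectThrows(IllegalArgumentException.class, () -> {
  new LetterTokenizer(newAttributeFactory(), 0);
});
assertEquals("maxTokenLen must be greater than 0 and less than 1048576 passed: 0", e.getMessage());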

asfimport commented 7 years ago

Erick Erickson (@ErickErickson) (migrated from JIRA)

Final patch; incorporates R. Muir's comments (thanks!). What bugs me is that I know that pattern is invalid but didn't catch it... Siiigggghhhh.

Committing momentarily, will backport to 6.7.

asfimport commented 7 years ago

Erick Erickson (@ErickErickson) (migrated from JIRA)

Oh bother, the commit numbers:

Commit to master: 906679adc80f0fad1e5c311b03023c7bd95633d7
Commit to 6x: 3db93a1f0980bd098af892630c6dc95c4c962e14

asfimport commented 7 years ago

Tomas Eduardo Fernandez Lobbe (@tflobbe) (migrated from JIRA)

TestMaxTokenLenTokenizer seems to be failing in Jenkins

1 tests failed.
FAILED:  org.apache.solr.util.TestMaxTokenLenTokenizer.testSingleFieldSameAnalyzers

Error Message:
Exception during query

Stack Trace:
java.lang.RuntimeException: Exception during query
        at __randomizedtesting.SeedInfo.seed([FE4BE1CA39C9E0DA:9499DEA5612A3015]:0)
        at org.apache.solr.SolrTestCaseJ4.assertQ(SolrTestCaseJ4.java:895)
        at org.apache.solr.util.TestMaxTokenLenTokenizer.testSingleFieldSameAnalyzers(TestMaxTokenLenTokenizer.java:104)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1713)
        at com.carrotsearch.randomizedtesting.RandomizedRunner$8.evaluate(RandomizedRunner.java:907)
        at com.carrotsearch.randomizedtesting.RandomizedRunner$9.evaluate(RandomizedRunner.java:943)
        at com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:957)
        at com.carrotsearch.randomizedtesting.rules.SystemPropertiesRestoreRule$1.evaluate(SystemPropertiesRestoreRule.java:57)
        at org.apache.lucene.util.TestRuleSetupTeardownChained$1.evaluate(TestRuleSetupTeardownChained.java:49)
        at org.apache.lucene.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:45)
        at org.apache.lucene.util.TestRuleThreadAndTestName$1.evaluate(TestRuleThreadAndTestName.java:48)
        at org.apache.lucene.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:64)
        at org.apache.lucene.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:47)
        at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
        at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:368)
        at com.carrotsearch.randomizedtesting.ThreadLeakControl.forkTimeoutingTask(ThreadLeakControl.java:817)
        at com.carrotsearch.randomizedtesting.ThreadLeakControl$3.evaluate(ThreadLeakControl.java:468)
        at com.carrotsearch.randomizedtesting.RandomizedRunner.runSingleTest(RandomizedRunner.java:916)
        at com.carrotsearch.randomizedtesting.RandomizedRunner$5.evaluate(RandomizedRunner.java:802)
        at com.carrotsearch.randomizedtesting.RandomizedRunner$6.evaluate(RandomizedRunner.java:852)
        at com.carrotsearch.randomizedtesting.RandomizedRunner$7.evaluate(RandomizedRunner.java:863)
        at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
        at com.carrotsearch.randomizedtesting.rules.SystemPropertiesRestoreRule$1.evaluate(SystemPropertiesRestoreRule.java:57)
        at org.apache.lucene.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:45)
        at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
        at org.apache.lucene.util.TestRuleStoreClassName$1.evaluate(TestRuleStoreClassName.java:41)
        at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
        at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
        at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
        at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
        at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
        at org.apache.lucene.util.TestRuleAssertionsRequired$1.evaluate(TestRuleAssertionsRequired.java:53)
        at org.apache.lucene.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:47)
        at org.apache.lucene.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:64)
        at org.apache.lucene.util.TestRuleIgnoreTestSuites$1.evaluate(TestRuleIgnoreTestSuites.java:54)
        at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
        at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:368)
        at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.RuntimeException: REQUEST FAILED: xpath=//result[@numFound=1]
        xml response was: <?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader"><int name="status">0</int><int name="QTime">0</int></lst><result name="response" numFound="0" start="0"></result>
</response>

        request was:q=letter0:lett&wt=xml
        at org.apache.solr.SolrTestCaseJ4.assertQ(SolrTestCaseJ4.java:888)
        ... 40 more

asfimport commented 7 years ago

Steven Rowe (@sarowe) (migrated from JIRA)

Here's another reproducing failure https://jenkins.thetaphi.de/job/Lucene-Solr-6.x-Linux/3612:

   [junit4]   2> NOTE: reproduce with: ant test  -Dtestcase=TestMaxTokenLenTokenizer -Dtests.method=testSingleFieldSameAnalyzers -Dtests.seed=FE4BE1CA39C9E0DA -Dtests.multiplier=3 -Dtests.slow=true -Dtests.locale=vi-VN -Dtests.timezone=Australia/NSW -Dtests.asserts=true -Dtests.file.encoding=UTF-8
   [junit4] ERROR   0.07s J1 | TestMaxTokenLenTokenizer.testSingleFieldSameAnalyzers <<<
   [junit4]    > Throwable #1: java.lang.RuntimeException: Exception during query
   [junit4]    >    at __randomizedtesting.SeedInfo.seed([FE4BE1CA39C9E0DA:9499DEA5612A3015]:0)
   [junit4]    >    at org.apache.solr.SolrTestCaseJ4.assertQ(SolrTestCaseJ4.java:895)
   [junit4]    >    at org.apache.solr.util.TestMaxTokenLenTokenizer.testSingleFieldSameAnalyzers(TestMaxTokenLenTokenizer.java:104)
   [junit4]    >    at java.lang.Thread.run(Thread.java:748)
   [junit4]    > Caused by: java.lang.RuntimeException: REQUEST FAILED: xpath=//result[@numFound=1]
   [junit4]    >    xml response was: <?xml version="1.0" encoding="UTF-8"?>
   [junit4]    > <response>
   [junit4]    > <lst name="responseHeader"><int name="status">0</int><int name="QTime">0</int></lst><result name="response" numFound="0" start="0"></result>
   [junit4]    > </response>
   [junit4]    >    request was:q=letter0:lett&wt=xml
   [junit4]    >    at org.apache.solr.SolrTestCaseJ4.assertQ(SolrTestCaseJ4.java:888)
[...]
   [junit4]   2> NOTE: test params are: codec=Asserting(Lucene62): {lowerCase0=TestBloomFilteredLucenePostings(BloomFilteringPostingsFormat(Lucene50(blocksize=128))), whiteSpace=Lucene50(blocksize=128), letter=BlockTreeOrds(blocksize=128), lowerCase=Lucene50(blocksize=128), unicodeWhiteSpace=BlockTreeOrds(blocksize=128), letter0=FST50, unicodeWhiteSpace0=FST50, keyword0=TestBloomFilteredLucenePostings(BloomFilteringPostingsFormat(Lucene50(blocksize=128))), id=TestBloomFilteredLucenePostings(BloomFilteringPostingsFormat(Lucene50(blocksize=128))), keyword=Lucene50(blocksize=128), whiteSpace0=TestBloomFilteredLucenePostings(BloomFilteringPostingsFormat(Lucene50(blocksize=128)))}, docValues:{}, maxPointsInLeafNode=579, maxMBSortInHeap=5.430197160407458, sim=RandomSimilarity(queryNorm=false,coord=crazy): {}, locale=vi-VN, timezone=Australia/NSW
   [junit4]   2> NOTE: Linux 4.10.0-21-generic amd64/Oracle Corporation 1.8.0_131 (64-bit)/cpus=8,threads=1,free=234332328,total=536870912

asfimport commented 7 years ago

Erick Erickson (@ErickErickson) (migrated from JIRA)

Hmmm, Steve's seed succeeds on my MacBook Pro, and I beasted this test 100 times on my Mac Pro without failures. Not quite sure what to do next...

I wonder if SOLR-10562 is biting us again? Although that doesn't really make sense as the line before this failure finds the same document.

I'll beast this on a different system I have access to when I can (it's shared).

asfimport commented 7 years ago

Steven Rowe (@sarowe) (migrated from JIRA)

My seed reproduces for me both on Linux and on my MacBook Pro (Sierra 10.12.5, Oracle JDK 1.8.0_112). Note that the original failure was on branch_6x (and that's where I repro'd).

asfimport commented 7 years ago

Erick Erickson (@ErickErickson) (migrated from JIRA)

OK, I see what's happening. I noted earlier that, the way this has always been implemented, multiple tokens are emitted when the token length is exceeded. In this case, the token sent in the doc is "letter", so two tokens are emitted: "let" and "ter", with positions incremented in between, I think.

The search is against "lett". For some reason, the parsed query in 6x is: PhraseQuery(letter0:"let t")

while in master it's: letter0:let letter0:t

Even this is wrong; it just happens to succeed because the default operator is OR, so even though the tokens in the index do not include a bare "t", the doc is found by chance, not by design.

I think the right solution is to stop emitting tokens for a particular value once maxTokenLen is exceeded. I'll raise a new JIRA and we can debate it there.

This is not a change in behavior resulting from the changes in this JIRA; the tests just expose something that's always been the case but nobody noticed.

asfimport commented 7 years ago

Amrit Sarkar (migrated from JIRA)

Following the discussion on #8908 and Erick's pointers, I introduced autoGeneratePhraseQueries="false" on all the FieldType definitions for this test case. All test scenarios now pass on both master and branch_6x. Patch attached.

asfimport commented 7 years ago

ASF subversion and git services (migrated from JIRA)

Commit 15a8a2415280d50c982fcd4fca893a3c3224da14 in lucene-solr's branch refs/heads/master from Erick https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=15a8a24

LUCENE-7705: Allow CharTokenizer-derived tokenizers and KeywordTokenizer to configure the max token len (test fix)

asfimport commented 7 years ago

ASF subversion and git services (migrated from JIRA)

Commit 2eacf13def4dc9fbea1de9c79150c05682b0cdec in lucene-solr's branch refs/heads/master from Erick https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=2eacf13

LUCENE-7705: Allow CharTokenizer-derived tokenizers and KeywordTokenizer to configure the max token length, fix test failure.

asfimport commented 7 years ago

ASF subversion and git services (migrated from JIRA)

Commit e4a43cf59a12ca39eb8278cc2533d409d792185a in lucene-solr's branch refs/heads/branch_6x from Erick https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=e4a43cf

LUCENE-7705: Allow CharTokenizer-derived tokenizers and KeywordTokenizer to configure the max token len (test fix)

(cherry picked from commit 15a8a24)