Long terms should generate a RuntimeException, not just infoStream [LUCENE-5472]

asfimport commented 10 years ago

As reported on the solr-user list, when a term is greater then 2^15 bytes it is silently ignored at indexing time – a message is logged in to infoStream if enabled, but no error is thrown.

seems like we should change this behavior (if nothing else starting in 5.0) to throw an exception.

Migrated from LUCENE-5472 by Chris M. Hostetter (@hossman), resolved Mar 05 2014 Attachments: LUCENE-5472.patch (versions: 4)

asfimport commented 10 years ago

Chris M. Hostetter (@hossman) (migrated from JIRA)

Easy steps to reproduce using the example configs...

hossman@frisbee:\~$ perl -le 'print "a,aaa"; print "z," . ("Z" x 32767);' | curl 'http://localhost:8983/solr/update?header=false&fieldnames=name,long_s&rowid=id&commit=true' -H 'Content-Type: application/csv' --data-binary `@-` 
<?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader"><int name="status">0</int><int name="QTime">572</int></lst>
</response>
hossman@frisbee:\~$ curl 'http://localhost:8983/solr/select?q=*:*&fl=id,name&wt=json&indent=true'{
  "responseHeader":{
    "status":0,
    "QTime":12,
    "params":{
      "fl":"id,name",
      "indent":"true",
      "q":"*:*",
      "wt":"json"}},
  "response":{"numFound":2,"start":0,"docs":[
      {
        "name":"a",
        "id":"0"},
      {
        "name":"z",
        "id":"1"}]
  }}
hossman@frisbee:\~$ curl 'http://localhost:8983/solr/select?q=long_s:*&wt=json&indent=true'
{
  "responseHeader":{
    "status":0,
    "QTime":1,
    "params":{
      "indent":"true",
      "q":"long_s:*",
      "wt":"json"}},
  "response":{"numFound":1,"start":0,"docs":[
      {
        "name":"a",
        "long_s":"aaa",
        "id":"0",
        "_version_":1459225819107819520}]
  }}

asfimport commented 10 years ago

Chris M. Hostetter (@hossman) (migrated from JIRA)

Things i'm confident of...

the limit is IndexWriter.MAX_TERM_LENGTH
it is not configurable

a message is written to the infoStream by DocFieldProcessor when this is exceeded...

  if (docState.maxTermPrefix != null && docState.infoStream.isEnabled("IW")) {
    docState.infoStream.message("IW", "WARNING: document contains at least one immense term (whose UTF8 encoding is longer than the max length " + DocumentsWriterPerThread.MAX_TERM_LENGTH_UTF8 + "), all of which were skipped.  Please correct the analyzer to not produce such terms.  The prefix of the first immense term is: '" + docState.maxTermPrefix + "...'");
    docState.maxTermPrefix = null;
  }
}

Things i think i understand, but am not certain of...

by the time DocumentsWriterPerThread sees this problem, and logs this to the infoStream, it's already too late to throw an exception up the call stack (because it's happening in another thread)

Rough idea only half considered...

update the tokenstream producers in Solr to explicitly check the terms they are about to return and throw an exception if they exceed this length (mention using LengthFilter in this error message)
this wouldn't help if people use their own concrete Analyzer class – but it would solve the problem with things like StrField, or anytime analysis factories are used
we could conceivable wrap any user configured concrete Analyzer class to do this check – but i'm not sure we should, since it will add cycles and the Analyzer might already be well behaved.

thoughts?

asfimport commented 10 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Actually, I think Lucene should just throw an exc when this happens? DWPT should be the right place (it isn't a different thread)...

Separately this limit is absurdly large

asfimport commented 10 years ago

Chris M. Hostetter (@hossman) (migrated from JIRA)

I think Lucene should just throw an exc when this happens? ... (it isn't a different thread)

I wasn't entire sure about that – and since it currently does an infoStream but does not throw an exception, i assumed that was because of the threading.

If you think we should convert this to a LUCENE issue and throw a RuntimeException i'm all for that.

asfimport commented 10 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

If you think we should convert this to a LUCENE issue and throw a RuntimeException i'm all for that.

+1

I don't think this leniency is good: it hides that your app is indexing trash.

asfimport commented 10 years ago

Chris M. Hostetter (@hossman) (migrated from JIRA)

revising summary & description in prep for converting to LUCENE issue

asfimport commented 10 years ago

Chris M. Hostetter (@hossman) (migrated from JIRA)

Here's a quick pass at trying to fix this along with a test.

at the moment the test fails because i didn't see any immediately obvious way to get the fieldname into the exception message, and that seems kind of key to making it useful (yes a byte prefix of the term is there, but for most people indexing text that's not going to be immediately helpful to them to understand where to look for the long term)

I haven't dug down deeper to see if it would be safe/easy to just add the fieldname to docState.maxTermPrefix (as a prefix on the prefix) nor have i run any other tests to see if throwing an exception here breaks any other existing tests that happen to depend on big ass terms being silently ignored.

asfimport commented 10 years ago

Chris M. Hostetter (@hossman) (migrated from JIRA)

nor have i run any other tests to see if throwing an exception here breaks any other existing tests that happen to depend on big ass terms being silently ignored.

no surprises jumped out at me when running all tests, although there was an un-surprising failure from TestIndexWriter.testWickedLongTerm which asserts the exact opposite of the new test i added here: that a doc with a "wicket long" term can still be added.

Assuming people think the overall change should happen, removing/fixing that test would be trivial.

asfimport commented 10 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Thanks Hoss!

I tweaked the patch a bit, to get the field name into the exc message, and moved where we throw the exception. I also fixed testWickedLongTerm to verify the exc is thrown, but other documents just indexed are not lost.

asfimport commented 10 years ago

Chris M. Hostetter (@hossman) (migrated from JIRA)

I tweaked the patch a bit

Looks pretty good, although "maxTermField" in DocumentsWriterPerThread looks like a new unused variable ... cruft from a different approach you were going to take?

Do you think this should be committed only to trunk, or trunk & 4x - or is it too much of a runtime behavior change to put on 4x?

Can/should we make throwing an exception dependent on the Version used in the IndexWriterConfig? (it's not clear to me if enough IW context is passed down to DocInverterPerField currently to even access the Version)

asfimport commented 10 years ago

Robert Muir (@rmuir) (migrated from JIRA)

I would prefer we just change it in trunk and 4.x, rather than something tricky.

asfimport commented 10 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

cruft from a different approach you were going to take?

Woops, yes, nuke it please :)

Do you think this should be committed only to trunk, or trunk & 4x - or is it too much of a runtime behavior change to put on 4x?

I think a hard break here is fine; I think it's a service to our users if we notify them that they are indexing trash.

asfimport commented 10 years ago

Varun Thacker (@vthacker) (migrated from JIRA)

Attached new patch -

Removed unused variable in DWPT
Added Solr Tests

asfimport commented 10 years ago

Chris M. Hostetter (@hossman) (migrated from JIRA)

I would prefer we just change it in trunk and 4.x, rather than something tricky.

Works for me – we can note it in the backcompat section

Added Solr Tests

Thanks for writing those tests Varun!

I've tweaked the patch slightly...

renamed Varun's Solr TestIndexing -> TestExceedMaxTermLength so it's more clear what we're testing
fixed a variable typo in the test (longFielValue -> longFieldValue)
fixed patch so that "ant precommit" would pass (String.format was being used w/o a Locale)
removed a stale nocommit comment i'd put in lucene's TestExceedMaxTermLength that Mike fixed a few patches ago
fixed the jdocs for IndexWriter.addDocument to refer to MAX_TERM_LENGTH (ironically it already incorrectly said that terms which were too long would cause an IllegalArgumentException

I think this is ready? speak now or i'll commit tomorrow.

asfimport commented 10 years ago

ASF subversion and git services (migrated from JIRA)

Commit 1574595 from hossman@apache.org in branch 'dev/trunk' https://svn.apache.org/r1574595

LUCENE-5472: IndexWriter.addDocument will now throw an IllegalArgumentException if a Term to be indexed exceeds IndexWriter.MAX_TERM_LENGTH

asfimport commented 10 years ago

ASF subversion and git services (migrated from JIRA)

Commit 1574637 from hossman@apache.org in branch 'dev/branches/branch_4x' https://svn.apache.org/r1574637

LUCENE-5472: IndexWriter.addDocument will now throw an IllegalArgumentException if a Term to be indexed exceeds IndexWriter.MAX_TERM_LENGTH (merge r1574595)

asfimport commented 10 years ago

Chris M. Hostetter (@hossman) (migrated from JIRA)

thanks everybody!

asfimport commented 10 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

Close issue after release of 4.8.0

apache / lucene

Long terms should generate a RuntimeException, not just infoStream [LUCENE-5472] #6535