apache / lucene

Apache Lucene open-source search software
https://lucene.apache.org/
Apache License 2.0
2.59k stars 1.01k forks source link

Long terms should generate a RuntimeException, not just infoStream [LUCENE-5472] #6535

Closed asfimport closed 10 years ago

asfimport commented 10 years ago

As reported on the solr-user list, when a term is greater then 2^15 bytes it is silently ignored at indexing time – a message is logged in to infoStream if enabled, but no error is thrown.

seems like we should change this behavior (if nothing else starting in 5.0) to throw an exception.


Migrated from LUCENE-5472 by Chris M. Hostetter (@hossman), resolved Mar 05 2014 Attachments: LUCENE-5472.patch (versions: 4)

asfimport commented 10 years ago

Chris M. Hostetter (@hossman) (migrated from JIRA)

Easy steps to reproduce using the example configs...

hossman@frisbee:\~$ perl -le 'print "a,aaa"; print "z," . ("Z" x 32767);' | curl 'http://localhost:8983/solr/update?header=false&fieldnames=name,long_s&rowid=id&commit=true' -H 'Content-Type: application/csv' --data-binary `@-` 
<?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader"><int name="status">0</int><int name="QTime">572</int></lst>
</response>
hossman@frisbee:\~$ curl 'http://localhost:8983/solr/select?q=*:*&fl=id,name&wt=json&indent=true'{
  "responseHeader":{
    "status":0,
    "QTime":12,
    "params":{
      "fl":"id,name",
      "indent":"true",
      "q":"*:*",
      "wt":"json"}},
  "response":{"numFound":2,"start":0,"docs":[
      {
        "name":"a",
        "id":"0"},
      {
        "name":"z",
        "id":"1"}]
  }}
hossman@frisbee:\~$ curl 'http://localhost:8983/solr/select?q=long_s:*&wt=json&indent=true'
{
  "responseHeader":{
    "status":0,
    "QTime":1,
    "params":{
      "indent":"true",
      "q":"long_s:*",
      "wt":"json"}},
  "response":{"numFound":1,"start":0,"docs":[
      {
        "name":"a",
        "long_s":"aaa",
        "id":"0",
        "_version_":1459225819107819520}]
  }}
asfimport commented 10 years ago

Chris M. Hostetter (@hossman) (migrated from JIRA)

Things i'm confident of...

Things i think i understand, but am not certain of...

Rough idea only half considered...

thoughts?

asfimport commented 10 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Actually, I think Lucene should just throw an exc when this happens? DWPT should be the right place (it isn't a different thread)...

Separately this limit is absurdly large

asfimport commented 10 years ago

Chris M. Hostetter (@hossman) (migrated from JIRA)

I think Lucene should just throw an exc when this happens? ... (it isn't a different thread)

I wasn't entire sure about that – and since it currently does an infoStream but does not throw an exception, i assumed that was because of the threading.

If you think we should convert this to a LUCENE issue and throw a RuntimeException i'm all for that.

asfimport commented 10 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

If you think we should convert this to a LUCENE issue and throw a RuntimeException i'm all for that.

+1

I don't think this leniency is good: it hides that your app is indexing trash.

asfimport commented 10 years ago

Chris M. Hostetter (@hossman) (migrated from JIRA)

revising summary & description in prep for converting to LUCENE issue

asfimport commented 10 years ago

Chris M. Hostetter (@hossman) (migrated from JIRA)

Here's a quick pass at trying to fix this along with a test.

at the moment the test fails because i didn't see any immediately obvious way to get the fieldname into the exception message, and that seems kind of key to making it useful (yes a byte prefix of the term is there, but for most people indexing text that's not going to be immediately helpful to them to understand where to look for the long term)

I haven't dug down deeper to see if it would be safe/easy to just add the fieldname to docState.maxTermPrefix (as a prefix on the prefix) nor have i run any other tests to see if throwing an exception here breaks any other existing tests that happen to depend on big ass terms being silently ignored.

asfimport commented 10 years ago

Chris M. Hostetter (@hossman) (migrated from JIRA)

nor have i run any other tests to see if throwing an exception here breaks any other existing tests that happen to depend on big ass terms being silently ignored.

no surprises jumped out at me when running all tests, although there was an un-surprising failure from TestIndexWriter.testWickedLongTerm which asserts the exact opposite of the new test i added here: that a doc with a "wicket long" term can still be added.

Assuming people think the overall change should happen, removing/fixing that test would be trivial.

asfimport commented 10 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Thanks Hoss!

I tweaked the patch a bit, to get the field name into the exc message, and moved where we throw the exception. I also fixed testWickedLongTerm to verify the exc is thrown, but other documents just indexed are not lost.

asfimport commented 10 years ago

Chris M. Hostetter (@hossman) (migrated from JIRA)

I tweaked the patch a bit

Looks pretty good, although "maxTermField" in DocumentsWriterPerThread looks like a new unused variable ... cruft from a different approach you were going to take?

Do you think this should be committed only to trunk, or trunk & 4x - or is it too much of a runtime behavior change to put on 4x?

Can/should we make throwing an exception dependent on the Version used in the IndexWriterConfig? (it's not clear to me if enough IW context is passed down to DocInverterPerField currently to even access the Version)

asfimport commented 10 years ago

Robert Muir (@rmuir) (migrated from JIRA)

I would prefer we just change it in trunk and 4.x, rather than something tricky.

asfimport commented 10 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

cruft from a different approach you were going to take?

Woops, yes, nuke it please :)

Do you think this should be committed only to trunk, or trunk & 4x - or is it too much of a runtime behavior change to put on 4x?

I think a hard break here is fine; I think it's a service to our users if we notify them that they are indexing trash.

asfimport commented 10 years ago

Varun Thacker (@vthacker) (migrated from JIRA)

Attached new patch -

  1. Removed unused variable in DWPT
  2. Added Solr Tests
asfimport commented 10 years ago

Chris M. Hostetter (@hossman) (migrated from JIRA)

I would prefer we just change it in trunk and 4.x, rather than something tricky.

Works for me – we can note it in the backcompat section

Added Solr Tests

Thanks for writing those tests Varun!

I've tweaked the patch slightly...

I think this is ready? speak now or i'll commit tomorrow.

asfimport commented 10 years ago

ASF subversion and git services (migrated from JIRA)

Commit 1574595 from hossman@apache.org in branch 'dev/trunk' https://svn.apache.org/r1574595

LUCENE-5472: IndexWriter.addDocument will now throw an IllegalArgumentException if a Term to be indexed exceeds IndexWriter.MAX_TERM_LENGTH

asfimport commented 10 years ago

ASF subversion and git services (migrated from JIRA)

Commit 1574637 from hossman@apache.org in branch 'dev/branches/branch_4x' https://svn.apache.org/r1574637

LUCENE-5472: IndexWriter.addDocument will now throw an IllegalArgumentException if a Term to be indexed exceeds IndexWriter.MAX_TERM_LENGTH (merge r1574595)

asfimport commented 10 years ago

Chris M. Hostetter (@hossman) (migrated from JIRA)

thanks everybody!

asfimport commented 10 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

Close issue after release of 4.8.0