Closed by asfimport 10 years ago
Chris M. Hostetter (@hossman) (migrated from JIRA)
Easy steps to reproduce using the example configs...
hossman@frisbee:~$ perl -le 'print "a,aaa"; print "z," . ("Z" x 32767);' | curl 'http://localhost:8983/solr/update?header=false&fieldnames=name,long_s&rowid=id&commit=true' -H 'Content-Type: application/csv' --data-binary @-
<?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader"><int name="status">0</int><int name="QTime">572</int></lst>
</response>
hossman@frisbee:~$ curl 'http://localhost:8983/solr/select?q=*:*&fl=id,name&wt=json&indent=true'
{
"responseHeader":{
"status":0,
"QTime":12,
"params":{
"fl":"id,name",
"indent":"true",
"q":"*:*",
"wt":"json"}},
"response":{"numFound":2,"start":0,"docs":[
{
"name":"a",
"id":"0"},
{
"name":"z",
"id":"1"}]
}}
hossman@frisbee:~$ curl 'http://localhost:8983/solr/select?q=long_s:*&wt=json&indent=true'
{
"responseHeader":{
"status":0,
"QTime":1,
"params":{
"indent":"true",
"q":"long_s:*",
"wt":"json"}},
"response":{"numFound":1,"start":0,"docs":[
{
"name":"a",
"long_s":"aaa",
"id":"0",
"_version_":1459225819107819520}]
}}
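The asymmetry in the results above comes from Lucene's term-length cap: the second document's value is 32767 bytes of UTF-8, one byte over the maximum indexed term length of 32766 bytes (BYTE_BLOCK_SIZE - 2), so its long_s field was silently dropped while the rest of the document still indexed. A minimal, self-contained check of that arithmetic (the constant value is Lucene's; the class name is just illustrative):

```java
import java.nio.charset.StandardCharsets;

public class ImmenseTermRepro {
    // Lucene's cap on a single indexed term: 32766 bytes of UTF-8 (BYTE_BLOCK_SIZE - 2).
    static final int MAX_TERM_LENGTH_UTF8 = 32766;

    public static void main(String[] args) {
        // The repro above indexes "Z" repeated 32767 times into long_s.
        String term = "Z".repeat(32767); // Java 11+
        int utf8Len = term.getBytes(StandardCharsets.UTF_8).length;
        System.out.println(utf8Len);                        // 32767
        System.out.println(utf8Len > MAX_TERM_LENGTH_UTF8); // true: term is skipped
    }
}
```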
Chris M. Hostetter (@hossman) (migrated from JIRA)
Things I'm confident of...
if (docState.maxTermPrefix != null && docState.infoStream.isEnabled("IW")) {
  docState.infoStream.message("IW", "WARNING: document contains at least one immense term "
      + "(whose UTF8 encoding is longer than the max length "
      + DocumentsWriterPerThread.MAX_TERM_LENGTH_UTF8 + "), all of which were skipped. "
      + "Please correct the analyzer to not produce such terms. "
      + "The prefix of the first immense term is: '" + docState.maxTermPrefix + "...'");
  docState.maxTermPrefix = null;
}
}
Things I think I understand, but am not certain of...
A rough idea, only half considered...
Thoughts?
Michael McCandless (@mikemccand) (migrated from JIRA)
Actually, I think Lucene should just throw an exception when this happens? DWPT should be the right place (it isn't a different thread)...
Separately, this limit is absurdly large.
Chris M. Hostetter (@hossman) (migrated from JIRA)
I think Lucene should just throw an exception when this happens? ... (it isn't a different thread)
I wasn't entirely sure about that – and since it currently logs to infoStream but does not throw an exception, I assumed that was because of the threading.
If you think we should convert this to a LUCENE issue and throw a RuntimeException, I'm all for that.
Michael McCandless (@mikemccand) (migrated from JIRA)
If you think we should convert this to a LUCENE issue and throw a RuntimeException, I'm all for that.
+1
I don't think this leniency is good: it hides that your app is indexing trash.
Chris M. Hostetter (@hossman) (migrated from JIRA)
Revising summary & description in preparation for converting this to a LUCENE issue.
Chris M. Hostetter (@hossman) (migrated from JIRA)
Here's a quick pass at trying to fix this along with a test.
At the moment the test fails because I didn't see any immediately obvious way to get the field name into the exception message, and that seems key to making it useful (yes, a byte prefix of the term is there, but for most people indexing text, that alone won't tell them where to look for the long term).
I haven't dug deeper to see whether it would be safe/easy to just add the field name to docState.maxTermPrefix (as a prefix on the prefix), nor have I run any other tests to see whether throwing an exception here breaks existing tests that happen to depend on huge terms being silently ignored.
Chris M. Hostetter (@hossman) (migrated from JIRA)
nor have I run any other tests to see whether throwing an exception here breaks existing tests that happen to depend on huge terms being silently ignored.
No surprises jumped out at me when running all tests, although there was an unsurprising failure from TestIndexWriter.testWickedLongTerm, which asserts the exact opposite of the new test I added here: that a doc with a "wicked long" term can still be added.
Assuming people think the overall change should happen, removing/fixing that test would be trivial.
Michael McCandless (@mikemccand) (migrated from JIRA)
Thanks Hoss!
I tweaked the patch a bit, to get the field name into the exception message, and moved where we throw the exception. I also fixed testWickedLongTerm to verify the exception is thrown, but other documents just indexed are not lost.
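The shape of that final behavior can be sketched with plain-Java stand-ins (the guard method and class names here are illustrative, not Lucene's actual internals; only the 32766-byte limit and the thrown IllegalArgumentException reflect what the patch really does):

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class ImmenseTermGuardSketch {
    static final int MAX_TERM_LENGTH_UTF8 = 32766; // Lucene's IndexWriter.MAX_TERM_LENGTH

    // Illustrative stand-in for the check the patch adds: reject the document,
    // naming the field and a short prefix of the offending term.
    static void checkTermLength(String field, byte[] termBytes) {
        if (termBytes.length > MAX_TERM_LENGTH_UTF8) {
            String prefix = new String(termBytes, 0, Math.min(30, termBytes.length),
                                       StandardCharsets.UTF_8);
            throw new IllegalArgumentException(
                "Document contains at least one immense term in field=\"" + field
                + "\" (whose UTF8 encoding is longer than the max length "
                + MAX_TERM_LENGTH_UTF8 + "); prefix: '" + prefix + "...'");
        }
    }

    public static void main(String[] args) {
        byte[] huge = new byte[32767];
        Arrays.fill(huge, (byte) 'Z');
        try {
            checkTermLength("long_s", huge);
        } catch (IllegalArgumentException e) {
            // The field name now appears in the message, which was the sticking point.
            System.out.println(e.getMessage().contains("long_s")); // true
        }
    }
}
```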
Chris M. Hostetter (@hossman) (migrated from JIRA)
I tweaked the patch a bit
Looks pretty good, although "maxTermField" in DocumentsWriterPerThread looks like a new, unused variable... cruft from a different approach you were going to take?
Do you think this should be committed only to trunk, or trunk & 4x - or is it too much of a runtime behavior change to put on 4x?
Can/should we make throwing an exception dependent on the Version used in the IndexWriterConfig? (it's not clear to me if enough IW context is passed down to DocInverterPerField currently to even access the Version)
Robert Muir (@rmuir) (migrated from JIRA)
I would prefer we just change it in trunk and 4.x, rather than something tricky.
Michael McCandless (@mikemccand) (migrated from JIRA)
cruft from a different approach you were going to take?
Woops, yes, nuke it please :)
Do you think this should be committed only to trunk, or trunk & 4x - or is it too much of a runtime behavior change to put on 4x?
I think a hard break here is fine; I think it's a service to our users if we notify them that they are indexing trash.
Varun Thacker (@vthacker) (migrated from JIRA)
Attached new patch -
Chris M. Hostetter (@hossman) (migrated from JIRA)
I would prefer we just change it in trunk and 4.x, rather than something tricky.
Works for me – we can note it in the backcompat section
Added Solr Tests
Thanks for writing those tests Varun!
I've tweaked the patch slightly...
I think this is ready? Speak now or I'll commit tomorrow.
ASF subversion and git services (migrated from JIRA)
Commit 1574595 from hossman@apache.org in branch 'dev/trunk' https://svn.apache.org/r1574595
LUCENE-5472: IndexWriter.addDocument will now throw an IllegalArgumentException if a Term to be indexed exceeds IndexWriter.MAX_TERM_LENGTH
ASF subversion and git services (migrated from JIRA)
Commit 1574637 from hossman@apache.org in branch 'dev/branches/branch_4x' https://svn.apache.org/r1574637
LUCENE-5472: IndexWriter.addDocument will now throw an IllegalArgumentException if a Term to be indexed exceeds IndexWriter.MAX_TERM_LENGTH (merge r1574595)
Chris M. Hostetter (@hossman) (migrated from JIRA)
Thanks, everybody!
Uwe Schindler (@uschindler) (migrated from JIRA)
Close issue after release of 4.8.0
As reported on the solr-user list, when a term's UTF-8 encoding is longer than the max term length (just under 2^15 bytes), it is silently ignored at indexing time – a message is logged to infoStream if enabled, but no error is thrown.
Seems like we should change this behavior (if nothing else, starting in 5.0) to throw an exception.
Migrated from LUCENE-5472 by Chris M. Hostetter (@hossman), resolved Mar 05 2014 Attachments: LUCENE-5472.patch (versions: 4)