Closed asfimport closed 11 years ago
Robert Muir (@rmuir) (migrated from JIRA)
good god no.
DocValues are not stored fields...
This reinforces the value of the limit!
selckin (migrated from JIRA)
OK, from the talks I watched on DocValues and other info I gathered, it seemed like they would be a good fit. I guess I really missed the point somewhere; I can't find much info in the javadocs either. But I guess this belongs on the user list and I shouldn't pollute this issue.
David Smiley (@dsmiley) (migrated from JIRA)
Does the "closed" status and "not a problem" resolution mean that @mikemccand's improvements in his patch here (the ones that don't change the limit) won't get applied? They looked good to me. And to you?
Michael McCandless (@mikemccand) (migrated from JIRA)
I still think we should fix the limitation in core; this way apps that want to store large binary fields per-doc are able to use a custom DVFormat.
Yonik Seeley (@yonik) (migrated from JIRA)
> I still think we should fix the limitation in core; this way apps that want to store large binary fields per-doc are able to use a custom DVFormat.
+1 arbitrary limits are not a feature.
Michael McCandless (@mikemccand) (migrated from JIRA)
Another iteration on the patch:
- I added constants MAX_BINARY_FIELD_LENGTH to Lucene4{0,2}DocValuesFormat, and then reference them in the Writer/Consumer to catch too-big values.
- I added another test to BaseDocValuesFormatTestCase, to test the exact maximum length value.
- I fixed that test failure, by passing the String field to the codecAcceptsHugeBinaryValues method, and adding a _TestUtil helper method to check this.
An alternative to the protected method would be to have two separate tests in the base class: one verifying a clean IllegalArgumentException is thrown when the value is too big, and another verifying huge binary values can be indexed successfully. Then I'd fix each DVFormat's test to subclass and @Ignore whichever base test is not appropriate.
But I don't think this would simplify things much? I.e., TestDocValuesFormat would still need logic that depends on the default codec.
Steven Rowe (@sarowe) (migrated from JIRA)
Bulk move 4.4 issues to 4.5 and 5.0
Shai Erera (@shaie) (migrated from JIRA)
Patch looks good. I prefer the current approach in the test (the 'protected' method).
Also, you have a leftover printout in Lucene40DocValuesWriter after the "if (b.length > MAX_BINARY)" check - should it be removed or commented out?
+1 to commit.
David Smiley (@dsmiley) (migrated from JIRA)
Cool; I didn't know the Facet42 codec supports large doc values. It looks like I can use it without faceting; I'll have to try that.
+1 to commit.
Michael McCandless (@mikemccand) (migrated from JIRA)
New patch.
Thanks for the review, Shai; I removed that leftover print and synced the patch to trunk. I think it's ready ... I'll wait a few days.
Yonik Seeley (@yonik) (migrated from JIRA)
I'm confused by the following comment:
+ /** Maximum length for each binary doc values field,
+ * because we use PagedBytes with page size of 16 bits. */
+ public static final int MAX_BINARY_FIELD_LENGTH = (1 << 16) + 1;
But this patch removes the PagedBytes limitation, right?
After this patch, are there any remaining code limitations that prevent raising the limit, or is it really just self-imposed via:
+ if (v.length > Lucene42DocValuesFormat.MAX_BINARY_FIELD_LENGTH) {
+ throw new IllegalArgumentException("DocValuesField \"" + field.name + "\" is too large, must be <= " + Lucene42DocValuesFormat.MAX_BINARY_FIELD_LENGTH);
+ }
Yonik Seeley (@yonik) (migrated from JIRA)
Another user has hit this arbitrary limit: http://markmail.org/message/sotbq6xpib4xwozz If it is arbitrary at this point, we should simply remove it.
Michael McCandless (@mikemccand) (migrated from JIRA)
> I'm confused by the following comment:
I fixed the comment; the limit exists because those DVFormats use PagedBytes.fillSlice, which cannot read a value spanning more than two pages.
New patch w/ that fix ...
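To see where the patch's (1 << 16) + 1 figure comes from, here is a toy model of a paged byte store. This is an illustrative assumption, not Lucene's actual PagedBytes code: values live in fixed-size pages, and a fillSlice-style read may touch at most two consecutive pages, so the worst case is a value starting at the last byte of a page and running through one whole following page.

```java
// Toy model (an assumption, NOT Lucene's real PagedBytes): with 16-bit
// pages and reads limited to two consecutive pages, the largest value
// that is safe regardless of where it starts is one byte (the tail of
// the starting page) plus one full page.
public class PagedBytesLimitSketch {
    static final int PAGE_BITS = 16;
    static final int PAGE_SIZE = 1 << PAGE_BITS; // 65536 bytes per page

    // Bytes readable from a given start offset: the rest of the starting
    // page, plus at most one whole following page.
    static int maxReadableLength(int startOffsetInPage) {
        return (PAGE_SIZE - startOffsetInPage) + PAGE_SIZE;
    }

    public static void main(String[] args) {
        // Worst case: the value begins at the last byte of a page.
        int worstCase = maxReadableLength(PAGE_SIZE - 1);
        System.out.println(worstCase);     // 65537
        System.out.println((1 << 16) + 1); // the patch's constant, also 65537
    }
}
```

Under this model the two numbers coincide, which is consistent with the comment in the quoted patch hunk above.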
Michael McCandless (@mikemccand) (migrated from JIRA)
Actually, I'm going to roll the limit for 40/42 DVFormats back to what IndexWriter currently enforces ... this way we don't get into a situation where different 40/42 indices out there were built with different limits enforced.
ASF subversion and git services (migrated from JIRA)
Commit 1514669 from @mikemccand in branch 'dev/trunk' https://svn.apache.org/r1514669
LUCENE-4583: IndexWriter no longer places a limit on length of DV binary fields (individual codecs still have their limits, including the default codec)
ASF subversion and git services (migrated from JIRA)
Commit 1514848 from @mikemccand in branch 'dev/branches/branch_4x' https://svn.apache.org/r1514848
LUCENE-4583: IndexWriter no longer places a limit on length of DV binary fields (individual codecs still have their limits, including the default codec)
Adrien Grand (@jpountz) (migrated from JIRA)
4.5 release -> bulk close
I didn't see any documented limit on the size of a bytes-based DocValues field value. It appears the limit is 32K, although I didn't get any friendly error telling me that was the limit. 32K is kind of small IMO; I suspect this limit is unintended and as such is a bug. The following test fails:
Migrated from LUCENE-4583 by David Smiley (@dsmiley), 1 vote, resolved Aug 16 2013 Attachments: LUCENE-4583.patch (versions: 8)