apache / lucene

Apache Lucene open-source search software
https://lucene.apache.org/
Apache License 2.0
StraightBytesDocValuesField fails if bytes > 32k [LUCENE-4583] #5648

Closed asfimport closed 11 years ago

asfimport commented 11 years ago

I didn't see any documented limit on the size of a bytes-based DocValues field value. The limit appears to be 32k, although I didn't get a friendly error telling me so. 32k is kind of small IMO; I suspect this limit is unintended and as such is a bug. The following test fails:

  public void testBigDocValue() throws IOException {
    Directory dir = newDirectory();
    IndexWriter writer = new IndexWriter(dir, writerConfig(false));

    Document doc = new Document();
    BytesRef bytes = new BytesRef((4 + 4) * 4097); // 32,776 bytes fails; (4 + 4) * 4096 = 32,768 works
    bytes.length = bytes.bytes.length; // byte data doesn't matter
    doc.add(new StraightBytesDocValuesField("dvField", bytes));
    writer.addDocument(doc);
    writer.commit();
    writer.close();

    DirectoryReader reader = DirectoryReader.open(dir);
    DocValues docValues = MultiDocValues.getDocValues(reader, "dvField");
    //FAILS IF BYTES IS BIG!
    docValues.getSource().getBytes(0, bytes);

    reader.close();
    dir.close();
  }
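For reference, the sizes in the test above work out as follows. This is only a sketch of the arithmetic; `TestValueSizes` and `sizeInBytes` are hypothetical names, and the report does not state the exact limit, only that `(4+4)*4096` bytes works and `(4+4)*4097` bytes fails.

```java
/** Hypothetical helper that mirrors the size expression used in the test. */
final class TestValueSizes {
  /** Size in bytes of the BytesRef allocated in the test: (4 + 4) * count. */
  static int sizeInBytes(int count) {
    return (4 + 4) * count;
  }
}
```

`TestValueSizes.sizeInBytes(4096)` is 32768, which equals `1 << 15`, and `TestValueSizes.sizeInBytes(4097)` is 32776, so the failing value is the first one in this test that exceeds 32 KiB.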

Migrated from LUCENE-4583 by David Smiley (@dsmiley), 1 vote, resolved Aug 16 2013 Attachments: LUCENE-4583.patch (versions: 8)

asfimport commented 11 years ago

Robert Muir (@rmuir) (migrated from JIRA)

good god no.

DocValues are not stored fields...

This reinforces the value of the limit!

asfimport commented 11 years ago

selckin (migrated from JIRA)

OK, from the talks I watched on them and the other info I gathered, it seemed like it would be a good fit. I guess I really missed the point somewhere. I can't find much info in the javadocs either, but I guess this is a question for the user list and I shouldn't pollute this issue.

asfimport commented 11 years ago

David Smiley (@dsmiley) (migrated from JIRA)

Do the "closed" status and the resolution change to "not a problem" mean that @mikemccand's improvements in his patch here (which don't change the limit) won't get applied? They looked good to me. And you?

asfimport commented 11 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

I still think we should fix the limitation in core; this way apps that want to store large binary fields per-doc are able to use a custom DVFormat.

asfimport commented 11 years ago

Yonik Seeley (@yonik) (migrated from JIRA)

> I still think we should fix the limitation in core; this way apps that want to store large binary fields per-doc are able to use a custom DVFormat.

+1 arbitrary limits are not a feature.

asfimport commented 11 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Another iteration on the patch:

An alternative to the protected method would be to have two separate tests in the base class: one verifying that a clean IllegalArgumentException is thrown when the value is too big, and another verifying that huge binary values can be indexed successfully. I'd then fix each DVFormat's test to subclass and @Ignore whichever base test is not appropriate.

But I don't think this would simplify things much. I.e., TestDocValuesFormat would still need logic to check depending on the default codec.
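The "protected method" approach discussed above could look roughly like the sketch below. It is not the code from the patch: the class and method names (`BaseBinaryValueCheck`, `supportsHugeBinaryValues`, `write`) are hypothetical, and the formats are replaced by plain Java stand-ins so the example runs without Lucene.

```java
/** Hypothetical base check; subclasses describe one format each. */
abstract class BaseBinaryValueCheck {
  /** Subclasses override this to declare whether their format accepts huge values. */
  protected boolean supportsHugeBinaryValues() {
    return true;
  }

  /** Stand-in for indexing a value; expected to throw IllegalArgumentException if too big. */
  protected abstract void write(byte[] value);

  /** Returns normally only if the format behaves as it declares. */
  final void checkHugeValue(int length) {
    byte[] value = new byte[length];
    try {
      write(value);
    } catch (IllegalArgumentException expected) {
      if (supportsHugeBinaryValues()) {
        throw new AssertionError("format claims to accept huge values but rejected one", expected);
      }
      return; // a clean rejection is the correct outcome for a limited format
    }
    if (!supportsHugeBinaryValues()) {
      throw new AssertionError("expected IllegalArgumentException for a huge value");
    }
  }
}

/** Stand-in for a format with a size limit (the limit value is an assumption). */
final class LimitedFormatCheck extends BaseBinaryValueCheck {
  private static final int ASSUMED_MAX = 1 << 15;

  @Override
  protected boolean supportsHugeBinaryValues() {
    return false;
  }

  @Override
  protected void write(byte[] value) {
    if (value.length > ASSUMED_MAX) {
      throw new IllegalArgumentException("value is too large, must be <= " + ASSUMED_MAX);
    }
  }
}

/** Stand-in for a format without a size limit. */
final class UnlimitedFormatCheck extends BaseBinaryValueCheck {
  @Override
  protected void write(byte[] value) {
    // accepts any length
  }
}
```

With this shape a single base test serves every format: `new LimitedFormatCheck().checkHugeValue(100_000)` passes because the value is rejected cleanly, and `new UnlimitedFormatCheck().checkHugeValue(100_000)` passes because the value is accepted.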

asfimport commented 11 years ago

Steven Rowe (@sarowe) (migrated from JIRA)

Bulk move 4.4 issues to 4.5 and 5.0

asfimport commented 11 years ago

Shai Erera (@shaie) (migrated from JIRA)

Patch looks good. I prefer the current way of the test (the 'protected' method).

Also, you have a printout in Lucene40DocValuesWriter after the "if (b.length > MAX_BINARY)" - remove/comment?

+1 to commit.

asfimport commented 11 years ago

David Smiley (@dsmiley) (migrated from JIRA)

Cool; I didn't know of the Facet42 codec with its support for large doc values. Looks like I can use it without faceting. I'll have to try that.

+1 to commit.

asfimport commented 11 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

New patch.

Thanks for the review Shai; I removed that leftover print and sync'd patch to trunk. I think it's ready ... I'll wait a few days.

asfimport commented 11 years ago

Yonik Seeley (@yonik) (migrated from JIRA)

I'm confused by the following comment:

+  /** Maximum length for each binary doc values field,
+   *  because we use PagedBytes with page size of 16 bits. */
+  public static final int MAX_BINARY_FIELD_LENGTH = (1 << 16) + 1;

But this patch removes the PagedBytes limitation, right?

After this patch, are there any remaining code limitations that prevent raising the limit, or is it really just self imposed via

+      if (v.length > Lucene42DocValuesFormat.MAX_BINARY_FIELD_LENGTH) {
+        throw new IllegalArgumentException("DocValuesField \"" + field.name + "\" is too large, must be <= " + Lucene42DocValuesFormat.MAX_BINARY_FIELD_LENGTH);
+      }

asfimport commented 11 years ago

Yonik Seeley (@yonik) (migrated from JIRA)

Another user has hit this arbitrary limit: http://markmail.org/message/sotbq6xpib4xwozz If it is arbitrary at this point, we should simply remove it.

asfimport commented 11 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

> I'm confused by the following comment:

I fixed the comment; it's because those DVFormats use PagedBytes.fillSlice, which cannot handle more than 2 pages.

New patch w/ that fix ...
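To illustrate why a slice reader that touches at most two pages caps the value length, here is a simplified model. It is not Lucene's `PagedBytes`; `TwoPageStore` and its methods are hypothetical. Under this model the largest length that is readable from any start offset is `pageSize + 1`, which would match the `(1 << 16) + 1` constant quoted earlier if pages hold `1 << 16` bytes (an assumption on my part).

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical, simplified paged byte store; not Lucene's PagedBytes. */
final class TwoPageStore {
  private final int pageSize;
  private final List<byte[]> pages = new ArrayList<>();
  private long size; // total number of bytes appended so far

  TwoPageStore(int pageBits) {
    this.pageSize = 1 << pageBits;
  }

  /** Largest length readable from any start offset while touching at most two pages. */
  int maxSliceLength() {
    return pageSize + 1;
  }

  /** Appends bytes, spilling across as many pages as needed. */
  void append(byte[] data) {
    for (byte b : data) {
      int offset = (int) (size % pageSize);
      if (offset == 0) {
        pages.add(new byte[pageSize]); // start a new page
      }
      pages.get(pages.size() - 1)[offset] = b;
      size++;
    }
  }

  /** Copies length bytes starting at start, reading from at most two pages. */
  byte[] fillSlice(long start, int length) {
    if (length < 0 || length > maxSliceLength()) {
      throw new IllegalArgumentException("length must be in [0, " + maxSliceLength() + "]: " + length);
    }
    if (start < 0 || start + length > size) {
      throw new IndexOutOfBoundsException("slice exceeds stored data");
    }
    byte[] out = new byte[length];
    if (length == 0) {
      return out;
    }
    int page = (int) (start / pageSize);
    int offset = (int) (start % pageSize);
    int fromFirst = Math.min(length, pageSize - offset);
    System.arraycopy(pages.get(page), offset, out, 0, fromFirst);
    if (fromFirst < length) {
      // The remainder is at most pageSize bytes, so it fits in the next page.
      System.arraycopy(pages.get(page + 1), 0, out, fromFirst, length - fromFirst);
    }
    return out;
  }
}
```

With `pageBits = 2` (4-byte pages) and ten bytes `0..9` appended, `fillSlice(3, 5)` returns bytes `3..7` by reading one byte from the first page and four from the second, while `fillSlice(0, 6)` is rejected because it could require a third page from some offsets.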

asfimport commented 11 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Actually, I'm going to roll the limit for 40/42 DVFormats back to what IndexWriter currently enforces ... this way we don't get into a situation where different 40/42 indices out there were built with different limits enforced.

asfimport commented 11 years ago

ASF subversion and git services (migrated from JIRA)

Commit 1514669 from @mikemccand in branch 'dev/trunk' https://svn.apache.org/r1514669

LUCENE-4583: IndexWriter no longer places a limit on length of DV binary fields (individual codecs still have their limits, including the default codec)
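The division of responsibility described in this commit message can be sketched as follows: the writer passes binary values through unchecked, and each format decides whether to reject them. All names here (`BinaryValueSink`, `LimitedSink`, `UnlimitedSink`, `SketchWriter`) are hypothetical stand-ins rather than Lucene classes, and the limit passed to `LimitedSink` is an arbitrary illustrative value.

```java
/** Hypothetical stand-in for a doc values format that receives binary values. */
interface BinaryValueSink {
  void addBinaryValue(String fieldName, byte[] value);
}

/** A format that enforces its own maximum length. */
final class LimitedSink implements BinaryValueSink {
  private final int maxLength;
  int accepted;

  LimitedSink(int maxLength) {
    this.maxLength = maxLength;
  }

  @Override
  public void addBinaryValue(String fieldName, byte[] value) {
    if (value.length > maxLength) {
      throw new IllegalArgumentException(
          "DocValuesField \"" + fieldName + "\" is too large, must be <= " + maxLength);
    }
    accepted++;
  }
}

/** A format that accepts values of any length. */
final class UnlimitedSink implements BinaryValueSink {
  int accepted;

  @Override
  public void addBinaryValue(String fieldName, byte[] value) {
    accepted++;
  }
}

/** Stand-in for the writer: it applies no length check of its own. */
final class SketchWriter {
  private final BinaryValueSink sink;

  SketchWriter(BinaryValueSink sink) {
    this.sink = sink;
  }

  void addBinaryValue(String fieldName, byte[] value) {
    sink.addBinaryValue(fieldName, value); // any limit is the format's decision
  }
}
```

In this sketch, a `SketchWriter` backed by `new LimitedSink(1 << 15)` rejects a 32,776-byte value with an `IllegalArgumentException`, while the same writer code backed by an `UnlimitedSink` accepts it.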

asfimport commented 11 years ago

ASF subversion and git services (migrated from JIRA)

Commit 1514848 from @mikemccand in branch 'dev/branches/branch_4x' https://svn.apache.org/r1514848

LUCENE-4583: IndexWriter no longer places a limit on length of DV binary fields (individual codecs still have their limits, including the default codec)

asfimport commented 10 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

4.5 release -> bulk close