StraightBytesDocValuesField fails if bytes > 32k [LUCENE-4583] #5648

Closed: asfimport closed this issue 11 years ago

asfimport commented 11 years ago

I didn't observe any documented limitation on the size of a bytes-based DocValues field value. It appears that the limit is 32k, although I didn't get any friendly error telling me that was the limit. 32k is kind of small IMO; I suspect this limit is unintended and as such is a bug. The following test fails:

  public void testBigDocValue() throws IOException {
    Directory dir = newDirectory();
    IndexWriter writer = new IndexWriter(dir, writerConfig(false));

    Document doc = new Document();
    BytesRef bytes = new BytesRef((4+4)*4097); // 32776 bytes; (4+4)*4096 = 32768 works
    bytes.length = bytes.bytes.length; // the byte data itself doesn't matter
    doc.add(new StraightBytesDocValuesField("dvField", bytes));
    writer.addDocument(doc);
    writer.commit();
    writer.close();

    DirectoryReader reader = DirectoryReader.open(dir);
    DocValues docValues = MultiDocValues.getDocValues(reader, "dvField");
    //FAILS IF BYTES IS BIG!
    docValues.getSource().getBytes(0, bytes);

    reader.close();
    dir.close();
  }

Migrated from LUCENE-4583 by David Smiley (@dsmiley), 1 vote, resolved Aug 16 2013. Attachments: LUCENE-4583.patch (versions: 8)

asfimport commented 11 years ago

Yonik Seeley (@yonik) (migrated from JIRA)

I'm guessing that limit is due to the implementation with PagedBytes? These limits were sensible when applied to indexed values, but obviously not so much to stored values (unless we decide that DocValues are only meant for smallish values and document the limit).

asfimport commented 11 years ago

Robert Muir (@rmuir) (migrated from JIRA)

The most important thing: if this implementation (or if we decide dv itself) should be limited, then it should check this at index-time and throw a useful exception.

asfimport commented 11 years ago

David Smiley (@dsmiley) (migrated from JIRA)

FWIW the app that triggered this has a document requiring ~68k, but there's a long tail down such that most documents only need 8 bytes. As a hack, I could use multiple fields to break out of the 32k limit and concatenate them together (yuck). It'd be great if this 32k limit wasn't there.
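
A rough sketch of the multi-field workaround described above, purely illustrative: the helper and the chunked field-naming scheme are made up here, not part of Lucene.

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.StraightBytesDocValuesField;
    import org.apache.lucene.util.BytesRef;

    // hypothetical helper: split one large value across several DV fields of at most chunkSize bytes each
    static void addChunkedDocValue(Document doc, String baseName, BytesRef value, int chunkSize) {
      int chunk = 0;
      for (int off = 0; off < value.length; off += chunkSize, chunk++) {
        int len = Math.min(chunkSize, value.length - off);
        doc.add(new StraightBytesDocValuesField(baseName + "_" + chunk,
            new BytesRef(value.bytes, value.offset + off, len)));
      }
      // the reading side then has to look up baseName_0, baseName_1, ... and concatenate them again
    }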

asfimport commented 11 years ago

Robert Muir (@rmuir) (migrated from JIRA)

I'm not concerned if the limit is 10 bytes.

If it is, then it is what it is.

It's just important that IW throws an exception at index time when any such limit is exceeded.

asfimport commented 11 years ago

Robert Muir (@rmuir) (migrated from JIRA)

The only correct bugfix here is adding the missing check.

Any discussion about extending the supported lengths belongs in another issue.

asfimport commented 11 years ago

Yonik Seeley (@yonik) (migrated from JIRA)

Looking at the issue name and the problem that David ran into, this issue is certainly about more than a missing check during indexing. Small hidden limits can still cause stuff to blow up in production - the user may not have thought to test anything above 32K. Small limits need to be documented.

Like David, I also suspect that the limit was unintended and represents a bug. The question is on a practical level, how easy is it to raise the limit, and are there any negative consequences of doing so? If it's not easy (or there are negative consequences), I think it's OK to leave it at 32K and document it as a current limitation. Off the top of my head, I can't really think of use cases that would require more, but perhaps others might?

Of course we should also fail early if someone tries to add a value above that limit.

asfimport commented 11 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Again, I really think adding the check and documenting current limits is what should happen here.

Just as the length of indexed terms is limited to 32k, so here: it's a bigger issue to try to deal with this (especially in the current DV situation).

I also think that if you are putting very large values in DV, it's perfectly acceptable to require a custom codec for that kind of situation; the one we provide can be general purpose.

I don't think we should try to cram in a huge change to the limits masqueraded as a bug... the bug is not recognizing the limit at index time (and of course documenting it in the codec or IndexWriter, depending on where it is; currently I think it's the codec).

asfimport commented 11 years ago

Yonik Seeley (@yonik) (migrated from JIRA)

I don't think we should try to cram in a huge change to the limits masqueraded as a bug...

If the limit was not intentional then it was certainly a bug (not just a missing check). Now we have to figure out what to do about it.

This issue is about deciding what the limit should be (and 32K may be fine, depending on the trade-offs, as I said), and then documenting and enforcing that limit. For example, your previous "I'm not concerned if the limit is 10 bytes." would get a -1 from me as "clearly not big enough... let's fix it".

asfimport commented 11 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Just like changing the length here gets a -1 from me. Pretty simple.

asfimport commented 11 years ago

Yonik Seeley (@yonik) (migrated from JIRA)

Just like changing the length here gets a -1 from me. Pretty simple.

Regardless of how easy or hard it might be, regardless of what use cases are brought up (I was waiting to hear from David at least, since he hit it), regardless of what the trade-offs might be involved with changing an unintended/accidental limit? That's just silly.

It also doesn't make sense to -1 something "here" vs somewhere else. +1s/-1s are for code changes, regardless of where they are discussed.

asfimport commented 11 years ago

Barakat Barakat (migrated from JIRA)

The limitation comes from PagedBytes. When PagedBytes is created it is given a number of bits to use per block. The blockSize is set to (1 << blockBits). From what I've seen, classes that use PagedBytes usually pass in 15 as the blockBits. This leads to the 32768 byte limit.
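
A quick sketch of the arithmetic Barakat describes, using the numbers from his explanation and from the failing test above:

    int blockBits = 15;
    int blockSize = 1 << blockBits;     // 32768 bytes per block
    int failing = (4 + 4) * 4097;       // 32776 bytes, just over one block
    int working = (4 + 4) * 4096;       // 32768 bytes, still fits in a single block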

The fillSlice function of PagedBytes.Reader will return a slice of bytes that is either inside one block or overlapping two blocks. If you give it a length that is over the block size it will hit an out-of-bounds exception. For the project I am working on, we need more than 32k bytes for our DocValues. We need that much rarely, but we still need that much to keep the search functioning. I fixed this for our project by changing fillSlice to this:

http://pastebin.com/raw.php?i=TCY8zjAi

Unit test: http://pastebin.com/raw.php?i=Uy29BGGJ

After placing this in our Solr instance, the search no longer crashes and returns the correct values when a document has a DocValues field of more than 32k bytes. As far as I know there is no limit now. I haven't noticed a performance hit. It shouldn't really affect performance unless you have many of these large DocValues fields. Thank you to David for his help with this.

Edit: This only works when start == 0. Seeing if I can fix it.

asfimport commented 11 years ago

Yonik Seeley (@yonik) (migrated from JIRA)

For the project I am working on, we need more than 32k bytes for our DocValues. We need that much rarely, but we still need that much to keep the search functioning.

Thanks! I suspected one generic use case would be "normally small, but hard to put an upper bound on". That's great if the only issue really is PagedBytes.fillSlice()! That definitely shouldn't have any performance impact, since the first "if" will always be true in the common case.

Edit: This only works when start == 0. Seeing if I can fix it.

A simple loop might be easiest to understand, rather than calculating with DIV and MOD?

asfimport commented 11 years ago

Barakat Barakat (migrated from JIRA)

I updated the code to work when start isn't zero. The code can still crash if you ask for a length that goes beyond the total size of the paged bytes, but I'm not sure how you guys like to prevent things like that. The code seems to be working fine with our Solr core so far. I am new to posting patches and writing unit tests in Java, so please let me know if there is anything wrong with the code.

asfimport commented 11 years ago

Barakat Barakat (migrated from JIRA)

I recently switched from 4.1 to 4.3, and my patch needed to be updated because of the changes to DocValues. The problem was almost fixed for BinaryDocValues, but it just needed one little change. I've attached a patch that removes the BinaryDocValues exception when the length is over BYTE_BLOCK_SIZE (32k), fixes ByteBlockPool#readBytes:348, and changes the TestDocValuesIndexing#testTooLargeBytes test to check for accuracy.

asfimport commented 11 years ago

Robert Muir (@rmuir) (migrated from JIRA)

I don't think we should just bump the limit like this.

the patch is not safe: Some codecs rely upon this limit (e.g. they use 64k-size pagedbytes and other things).

But in general I'm not sure what the real use cases are for storing > 32KB inside a single document value.

asfimport commented 11 years ago

Shai Erera (@shaie) (migrated from JIRA)

A user hit this limitation on the user list a couple weeks ago while trying to index a very large number of facets: http://markmail.org/message/dfn4mk3qe7advzcd. So this is one use case, and it also appears there is an exception thrown at indexing time?

asfimport commented 11 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Over 6,000 facets per doc?

Yes there is an exception thrown at index time. I added this intentionally and added a test that ensures you get it.

asfimport commented 11 years ago

Shai Erera (@shaie) (migrated from JIRA)

Over 6,000 facets per doc?

Right. Not sure how realistic his case is (something about proteins, he explains it here: http://markmail.org/message/2wxavktzyxjtijqe), and whether he should enable facet partitions or not, but he agreed these are extreme cases and expects only a few docs like that. At any rate, you asked for a use case :). Not saying we should support it, but it is one. If it can be supported without too much pain, then perhaps we should. Otherwise, I think we can live with the limitation and leave it for the app to worry about a workaround / write its own Codec.

asfimport commented 11 years ago

Robert Muir (@rmuir) (migrated from JIRA)

For the faceting use case, we have SortedSetDocValuesField, which can hold 2B facets per field, and you can have... maybe 2B fields per doc.

So the limitation here is not docvalues.

BINARY datatype isn't designed for faceting.
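
For context, a minimal sketch of the SortedSet approach Robert refers to; the field name and facet values are made up here.

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.SortedSetDocValuesField;
    import org.apache.lucene.util.BytesRef;

    Document doc = new Document();
    // one SortedSetDocValuesField instance per facet value; they all share the same field name
    doc.add(new SortedSetDocValuesField("facets", new BytesRef("Author/Smiley")));
    doc.add(new SortedSetDocValuesField("facets", new BytesRef("Year/2013")));
    doc.add(new SortedSetDocValuesField("facets", new BytesRef("Format/PDF")));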

asfimport commented 11 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

I think we should fix the limit in core, and then existing codecs should enforce their own limits, at indexing time?

This way users that sometimes need to store > 32 KB binary doc value can do so, with the right DVFormat.

asfimport commented 11 years ago

Shai Erera (@shaie) (migrated from JIRA)

For the faceting use case, we have SortedSetDocValuesField

No, this is for one type of faceting, no hierarchies etc. It's also slower than the BINARY DV method ...

BINARY datatype isn't designed for faceting.

Maybe, but that's the best we have for now.

asfimport commented 11 years ago

Robert Muir (@rmuir) (migrated from JIRA)

No, this is for one type of faceting, no hierarchies etc. It's also slower than the BINARY DV method ...

It's not inherently slower. I just didn't spend a month inlining vInt codes or writing custom codecs like you did for the other faceting method. Instead it's the simplest thing that can possibly work right now.

I will call you guys out on this every single time you bring it up.

I'm -1 to bumping the limit

asfimport commented 11 years ago

David Smiley (@dsmiley) (migrated from JIRA)

I looked over the patch carefully and stepped through the relevant code in a debugging session and I think it's good.

I don't see why this arbitrary limit was here. The use case where Barakat hit this is related to storing values per document so that a custom ValueSource I wrote can examine them. It's for spatial multi-value fields, and some businesses (docs) have a ton of locations out there (e.g. McDonald's). FWIW very few docs have such large values, and if it were to become slow then I have ways to more cleverly examine a subset of the returned bytes. I think it's silly to force the app to write a Codec (even if it's trivial) to get out of this arbitrary limit.
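
A hypothetical sketch of the kind of packing David describes, with each location encoded as 4+4 bytes (the same 4+4 that appears in the failing test); the encoding and method name are illustrative, not the actual field code.

    import java.nio.ByteBuffer;
    import org.apache.lucene.util.BytesRef;

    // pack many (lat, lon) pairs into a single binary doc value, 8 bytes per location
    static BytesRef encodeLocations(float[] lats, float[] lons) {
      ByteBuffer buf = ByteBuffer.allocate(lats.length * (4 + 4));
      for (int i = 0; i < lats.length; i++) {
        buf.putFloat(lats[i]);  // 4 bytes
        buf.putFloat(lons[i]);  // 4 bytes
      }
      return new BytesRef(buf.array());
    }
    // a document with ~4100 locations already exceeds the 32k block size: 4100 * 8 = 32800 bytes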

asfimport commented 11 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

I iterated on Barakat's patch: improved the test, and added enforcement of the limit in the appropriate DocValuesFormat impls.

Disk, SimpleText, CheapBastard and (I think ... still need a test here) Facet42DVFormat don't have a limit, but Lucene40/42 still do.

I'm -1 to bumping the limit

Are you also against just fixing the limit in the core code (IndexWriter/BinaryDocValuesWriter) and leaving the limit enforced in the existing DVFormats (my patch)?

I thought that was a good compromise ...

This way at least users can still build their own / use DVFormats that don't have the limit.

asfimport commented 11 years ago

David Smiley (@dsmiley) (migrated from JIRA)

the patch is not safe: Some codecs rely upon this limit (e.g. they use 64k-size pagedbytes and other things).

Rob, can you please elaborate?

asfimport commented 11 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Are you also against just fixing the limit in the core code (IndexWriter/BinaryDocValuesWriter) and leaving the limit enforced in the existing DVFormats (my patch)?

I thought that was a good compromise ...

This way at least users can still build their own / use DVFormats that don't have the limit.

I'm worried about a few things:

- I think the limit is ok, because in my eyes its the limit of a single term. I feel that anyone arguing for increasing the limit only has abuse cases (not use cases) in mind. I'm worried about making dv more complicated for no good reason.
- I'm worried about opening up the possibility of bugs and index corruption (e.g. clearly MULTIPLE people on this issue dont understand why you cannot just remove IndexWriter's limit without causing corruption).
- I'm really worried about the precedent: once these abuse-case-fans have their way and increase this limit, they will next argue that we should do the same for SORTED, maybe SORTED_SET, maybe even inverted terms. They will make arguments that its the same as binary, just with sorting, and why should sorting bring in additional limits. I can easily see this all spinning out of control. I think that most people hitting the limit are abusing docvalues as stored fields, so the limit is providing a really useful thing today actually, and telling them they are doing something wrong.

The only argument i have for removing the limit is that by expanding BINARY's possible abuse cases (in my opinion, thats pretty much all its useful for), we might prevent additional complexity from being added elsewhere to DV in the long-term.

asfimport commented 11 years ago

Jack Krupansky (migrated from JIRA)

abusing docvalues as stored fields

Great point. I have to admit that I still don't have a 100% handle on the use case(s) for docvalues vs. stored fields, even though I've asked on the list. I mean, sometimes the chatter seems to suggest that dv is the successor to stored values. Hmmm... in that case, I should be able to store the full text of a 24 MB PDF file in a dv. Now, I know that isn't true.

Maybe we just need to start with some common use cases, based on size: tiny (16 bytes or less), small (256 or 1024 bytes or less), medium (up to 32K), and large (upwards of 1MB, and larger.) It sounds like large implies stored field.

A related "concern" is dv or stored fields that need a bias towards being in memory and in the heap, vs. a bias towards being "off heap". Maybe the size category is the hint: tiny and small bias towards on-heap, medium and certainly large bias towards off-heap. If people are only going towards DV because they think they get off-heap, then maybe we need to reconsider the model of what DV vs. stored is really all about. But then that leads back to DV somehow morphing out of column-stride fields.

asfimport commented 11 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

I'm worried about a few things: I think the limit is ok, because in my eyes its the limit of a single term. I feel that anyone arguing for increasing the limit only has abuse cases (not use cases) in mind. I'm worried about making dv more complicated for no good reason.

I guess I see DV binary as more like a stored field, just stored column stride for faster access. Faceting (and I guess spatial) encode many things inside one DV binary field.

I'm worried about opening up the possibility of bugs and index corruption (e.g. clearly MULTIPLE people on this issue dont understand why you cannot just remove IndexWriter's limit without causing corruption).

I agree this is a concern and we need to take it slow, add good test coverage.

I'm really worried about the precedent: once these abuse-case-fans have their way and increase this limit, they will next argue that we should do the same for SORTED, maybe SORTED_SET, maybe even inverted terms. They will make arguments that its the same as binary, just with sorting, and why should sorting bring in additional limits. I can easily see this all spinning out of control. I think that most people hitting the limit are abusing docvalues as stored fields, so the limit is providing a really useful thing today actually, and telling them they are doing something wrong.

I don't think we should change the limit for sorted/set nor terms: I think we should raise the limit ONLY for BINARY, and declare that DV BINARY is for these "abuse" cases. So if you really really want sorted set with a higher limit then you will have to encode yourself into DV BINARY.

The only argument i have for removing the limit is that by expanding BINARY's possible abuse cases (in my opinion, thats pretty much all its useful for), we might prevent additional complexity from being added elsewhere to DV in the long-term.

+1

asfimport commented 11 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

I have to admit that I still don't have a 100% handle on the use case(s) for docvalues vs. stored fields, even though I've asked on the list. I mean, sometimes the chatter seems to suggest that dv is the successor to stored values. Hmmm... in that case, I should be able to store the full text of a 24 MB PDF file in a dv. Now, I know that isn't true.

The big difference is that DV fields are stored column-stride, so you can decide on a field-by-field basis whether it will be in RAM, on disk, etc., and you get faster access if you know you just need to work with one or two fields.

Vs stored fields, where all fields for one document are stored "together".

Each has different tradeoffs, so it's really up to the app to decide which is best... if you know you need 12 fields loaded for each document you are presenting on the current page, stored fields are probably best.

But if you need one field to use as a scoring factor (eg maybe you are boosting by recency) then column-stride is better.
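
A small illustration of that trade-off; the field names and the recency example are made up, not taken from the patch.

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.NumericDocValuesField;
    import org.apache.lucene.document.StoredField;

    Document doc = new Document();
    // stored fields: everything needed to render the hit, fetched together per document
    doc.add(new StoredField("title", "Some title"));
    doc.add(new StoredField("url", "http://example.com/some-doc"));
    // doc value: column-stride, cheap to read for every hit when boosting by recency
    doc.add(new NumericDocValuesField("publishDate", System.currentTimeMillis()));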

asfimport commented 11 years ago

David Smiley (@dsmiley) (migrated from JIRA)

Quoting Mike:

I don't think we should change the limit for sorted/set nor terms: I think we should raise the limit ONLY for BINARY, and declare that DV BINARY is for these "abuse" cases. So if you really really want sorted set with a higher limit then you will have to encode yourself into DV BINARY.

+1. DV Binary is generic, for applications to use as they see fit. There is no use case to abuse. If this issue passes, I'm not going to then ask for terms > 32k or something silly like that.

asfimport commented 11 years ago

Robert Muir (@rmuir) (migrated from JIRA)

I can compromise with this.

However, I don't like the current patch.

I don't think we should modify ByteBlockPool. Instead, I think BinaryDocValuesWriter should do the following:

- buffer the bytes with PagedBytes instead of ByteBlockPool
- store the lengths instead as absolute offsets with MonotonicAppendingLongBuffer (this should be more efficient)

The only thing ByteBlockPool gives is some "automagic" ram accounting, but this is not as good as it looks anyway.

Anyway I can help with this, tomorrow.

asfimport commented 11 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

+1 to cutover to PagedBytes!

asfimport commented 11 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

New patch, cutting over to PagedBytes. I also revived .getDataInput/Output (from 4.x).

I also added javadocs about these limits in DocValuesType.

There are still nocommits to resolve ... the biggest one is whether we should fix default DV impl to accept values > 32 KB ... right now it throws IAE.

asfimport commented 11 years ago

Robert Muir (@rmuir) (migrated from JIRA)

There are still nocommits to resolve ... the biggest one is whether we should fix default DV impl to accept values > 32 KB ... right now it throws IAE.

I do not think this should change. This disagrees with your earlier comments: it's already spinning out of control, just as I mentioned it might.

asfimport commented 11 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

store the lengths instead as absolute offsets with MonotonicAppendingLongBuffer (this should be more efficient)

I just had a quick discussion about this with Robert, and since AppendingLongBuffer stores deltas from the minimum value of the block (and not 0), AppendingLongBuffer is better (ie. faster and more compact) than MonotonicAppendingLongBuffer to store lengths. This means that if all lengths are 7, 8 or 9 in a block, it will only require 2 bits per value instead of 4.
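
The arithmetic behind that example, written out in plain Java just to show where the bit counts come from:

    // lengths within one block: 7, 8, 9
    long min = 7, max = 9;
    int bitsForRawLengths = 64 - Long.numberOfLeadingZeros(max);            // values up to 9 need 4 bits
    int bitsForDeltasFromMin = 64 - Long.numberOfLeadingZeros(max - min);   // deltas 0..2 need 2 bits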

asfimport commented 11 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

I do not think this should change.

OK I just removed that nocommit.

I just had a quick discussion about this with Robert, and since AppendingLongBuffer stores deltas from the minimum value of the block (and not 0), AppendingLongBuffer is better (ie. faster and more compact) than MonotonicAppendingLongBuffer to store lengths. This means that if all lengths are 7, 8 or 9 in a block, it will only require 2 bits per value instead of 4.

Ahh, OK: I switched back to AppendingLongBuffer.

New patch. I factored testHugeBinaryValues up into the BaseDocValuesFormatTestCase base class, and added a protected method so the codecs that don't accept huge binary values can say so. I also added a test case for Facet42DVFormat, and cut back to AppendingLongBuffer.

I downgraded the nocommit about the spooky unused PagedBytes.blockEnds to a TODO ... this class is somewhat dangerous because e.g. you can use the copyUsingLengthPrefix method and then get a .getDataOutput and get corrupted bytes out.

asfimport commented 11 years ago

Robert Muir (@rmuir) (migrated from JIRA)

This one is looking pretty good, but needs a few fixes. At least, we should fix the tests:

NOTE: reproduce with: ant test -Dtestcase=TestDocValuesFormat -Dtests.method=testHugeBinaryValues -Dtests.seed=5F68849875ABAC05 -Dtests.slow=true -Dtests.locale=es_PY -Dtests.timezone=Canada/Mountain -Dtests.file.encoding=ISO-8859-1

asfimport commented 11 years ago

David Smiley (@dsmiley) (migrated from JIRA)

I like the new test, Mike – in particular it doesn't mandate a failure if the codec accepts > 32k.

I want to make sure it's clear what the logic is behind the decisions being made by Mike & Rob on this thread regarding the limits for binary doc values (not other things). Firstly there is no intrinsic technical limitation that the Lucene42Consumer has on these values to perhaps 2GB (not sure but "big"). Yet it is being decided to artificially neuter it to 32k. I don't see anything in this thread establishing a particular intended use of binary DocValues; I see it as being as general purpose as stored values, with different performance characteristics (clearly it's column-stride, for example). The particular use I established earlier would totally suck if it had to use stored values. And the reason for this limit... I'm struggling to find the arguments in this thread, but it appears to be that, hypothetically, newer clever encodings might evolve in the future that simply can't handle more than 32k. If that's it, then wouldn't such a new implementation simply have its own, different limit, leaving both as reasonable choices for the application? If that isn't it, then what is the reason?

asfimport commented 11 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Firstly there is no intrinsic technical limitation that the Lucene42Consumer has on these values to perhaps 2GB (not sure but "big"). Yet it is being decided to artificially neuter it to 32k.

This is absolutely not correct. The fact that this limitation exists (based on the PagedBytes block size) and that nobody sees it makes me really want to rethink what we are doing here: I don't want to allow index corruption, sorry.

asfimport commented 11 years ago

David Smiley (@dsmiley) (migrated from JIRA)

I don't follow. From the javadocs, I get the impression that PagedBytes can handle basically any data size. blockBits appears to tune the block size, but I don't see how that limits the total capacity in any significant way. The only method I see that appears to have a limit is copy(BytesRef bytes, BytesRef out), which is only used when uninverting doc terms, which doesn't apply to binary DocValues.

asfimport commented 11 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Just look at the javadocs for the only PagedBytes method that Lucene42's binary DV producer actually uses:

     * ...
     * Slices spanning more than one block are not supported.
     * ...
     */
    public void fillSlice(BytesRef b, long start, int length) {

asfimport commented 11 years ago

David Smiley (@dsmiley) (migrated from JIRA)

Aha; thanks for the clarification. I see it now. And I see that after I commented out the limit check, the assertion was hit. I didn't hit this assertion with Barakat's patch when I last ran it; weird, but whatever.

BTW ByteBlockPool doesn't really have this limit, notwithstanding the bug that Barakat fixed in his patch. It's not a hard limit, as BBP.append() and readBytes() will conveniently loop for you, whereas if code uses PagedBytes then you could loop on fillSlice() yourself to support big values. That is a bona-fide bug in ByteBlockPool, that it didn't implement that loop correctly, and it should be fixed if not in this issue then in another.

So a DocValues codec that supports large binary values could be nearly identical to the current codec but call fillSlice() in a loop, and only for variable-sized binary values (just like BBP's algorithm); that would basically be the only change. Do you support such a change? If not, then why not (a technical reason, please)? If you can't support such a change, then would you also object to the addition of a new codec that simply lifted this limit, as I proposed? Note that would potentially include a bunch of duplicated code just to call fillSlice() in a loop; I propose it would be simpler and more maintainable to not limit binary docvalues to 32k.
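
For concreteness, a sketch of that fillSlice()-in-a-loop idea: a hypothetical helper, not the actual codec code, which only ever asks fillSlice() for a slice contained in a single block.

    import org.apache.lucene.util.BytesRef;
    import org.apache.lucene.util.PagedBytes;

    // read `length` bytes starting at `start`, block-sized piece by block-sized piece
    static void readLargeSlice(PagedBytes.Reader reader, long start, int length,
                               int blockSize, BytesRef result) {
      result.bytes = new byte[length];
      result.offset = 0;
      result.length = length;
      BytesRef scratch = new BytesRef();
      int copied = 0;
      while (copied < length) {
        int leftInBlock = blockSize - (int) ((start + copied) % blockSize);
        int chunk = Math.min(length - copied, leftInBlock);
        reader.fillSlice(scratch, start + copied, chunk);
        System.arraycopy(scratch.bytes, scratch.offset, result.bytes, copied, chunk);
        copied += chunk;
      }
    }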

asfimport commented 11 years ago

Barakat Barakat (migrated from JIRA)

Just to clarify, before 4.3 I was fixing the "bug" by changing PagedBytes#fillSlice to the first patch I posted in this issue.

asfimport commented 11 years ago

Robert Muir (@rmuir) (migrated from JIRA)

No, I don't support changing this codec. It's an all-in-memory one (which is an unfortunate default, but must be until various algorithms in the grouping/join/etc. packages are improved such that we can safely use something more like DiskDV as a default). Other all-memory implementations like DirectPostingsFormat/MemoryPostings have similar limitations, even the specialized faceting one (e.g. an entire segment cannot have more than 2GB total bytes).

I don't want to add a bunch of stuff in a loop here or any of that, because it only causes additional complexity for the normal case, and I think it's unreasonable to use a RAM docvalues impl if you have more than 32KB per-document cost anyway. Sorry, that's just crazy, and I don't think we should add any additional trappy codec to support that.

So if you want ridiculously huge per-document values, just use DiskDV, which supports that. These abuse cases are extreme: if you really really want that all in RAM, then use it with FileSwitchDirectory.

I mentioned before that I was worried about this issue spinning out of control, and it appears this has taken place. Given these developments, I'd rather we not change the current limit at all.

asfimport commented 11 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

David can you open a separate issue about changing the limit for existing codecs? The default DVFormat today is all-in-RAM, and I think it's OK for it to have limits, and e.g. if/when we change the default to DiskDV, it has no limit.

I think this issue should focus solely on fixing core/indexer to not enforce the limit (i.e., moving the limit enforcing to those DVFormats that have it).

asfimport commented 11 years ago

David Smiley (@dsmiley) (migrated from JIRA)

I can understand that an all-in-RAM codec has size sensitivities. In that light, I can also understand that 32KB per document is a lot. The average per-document variable-bytes length for Barakat's index is a measly 10 bytes. The maximum is around 69k. Likewise for the user Shai referenced on the list who was using it for faceting: it's only the worst-case document(s) that exceeded 32KB.

Might the "new PagedBytes(16)" in Lucene42DocValuesProducer.loadBinary() be made configurable, i.e. make 16 configurable? And/or perhaps make loadBinary() protected, so another codec extending this one can keep the change somewhat minimal.

Mike, in your latest patch, one improvement that could be made is instead of Lucene42DocValuesConsumer assuming the limit is "ByteBlockPool.BYTE_BLOCK_SIZE - 2" (which it technically is but only by coincidence), you could instead reference a calculated constant shared with the actual code that has this limit which is Lucene42DocValuesProducer.loadBinary(). For example, set the constant to 2^16-2 but then add an assert in loadBinary that the constant is consistent with the PagedBytes instance's config. Or something like that.
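
A minimal sketch of that shared-constant idea; the names and the assert placement are illustrative, not the actual patch.

    // single source of truth, derived from the producer's PagedBytes configuration
    static final int BLOCK_BITS = 16;                                  // matches "new PagedBytes(16)" in loadBinary()
    static final int MAX_BINARY_VALUE_LENGTH = (1 << BLOCK_BITS) - 2;  // the 2^16-2 constant suggested above

    // consumer side: enforce the shared limit at index time with a useful message
    void checkBinaryValue(String fieldName, BytesRef value) {
      if (value.length > MAX_BINARY_VALUE_LENGTH) {
        throw new IllegalArgumentException("DocValues field \"" + fieldName + "\" is too large: "
            + value.length + " bytes, the maximum is " + MAX_BINARY_VALUE_LENGTH);
      }
    }

    // producer side (loadBinary): assert the PagedBytes block size still matches the advertised limit
    // assert (1 << blockBits) - 2 >= MAX_BINARY_VALUE_LENGTH;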

David can you open a separate issue about changing the limit for existing codecs?

Uh... all the discussion has been here, so it seems too late to me. And I'm probably done making my arguments. I can't be more convincing than pointing out the 10-byte average figure for my use case.

asfimport commented 11 years ago

Robert Muir (@rmuir) (migrated from JIRA)

You convinced me, don't worry: you convinced me we shouldn't do anything on this whole issue at all. Because the stuff you outlined here is absolutely the wrong path for us to be going down.

asfimport commented 11 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Mike, in your latest patch, one improvement that could be made is instead of Lucene42DocValuesConsumer assuming the limit is "ByteBlockPool.BYTE_BLOCK_SIZE - 2" (which it technically is but only by coincidence), you could instead reference a calculated constant shared with the actual code that has this limit which is Lucene42DocValuesProducer.loadBinary(). For example, set the constant to 2^16-2 but then add an assert in loadBinary that the constant is consistent with the PagedBytes instance's config. Or something like that.

+1

But, again, let's keep this issue focused on not enforcing a limit in the core indexing code.

Per-codec limits are separate issues.

asfimport commented 11 years ago

David Smiley (@dsmiley) (migrated from JIRA)

Ok. I have nothing further to say in this issue; I welcome your improvements, Mike.

asfimport commented 11 years ago

selckin (migrated from JIRA)

A few comments up someone asked for a use case; shouldn't something like http://www.elasticsearch.org/guide/reference/mapping/source-field/ be a perfect thing to use BinaryDocValues for?

I was trying to store something similar using DiskDocValuesFormat and hit the 32k limit.