DocValues field broken on large indexes [LUCENE-4547]

asfimport commented 12 years ago

I tried to write a test to sanity check #5602 (first running against svn revision 1406416, before the change).

But i found docvalues is already broken here for large indexes that have a PackedLongDocValues field:

final int numDocs = 500000000;
for (int i = 0; i < numDocs; ++i) {
  if (i == 0) {
    field.setLongValue(0L); // force > 32bit deltas
  } else {
    field.setLongValue(1<<33L); 
  }
  w.addDocument(doc);
}
w.forceMerge(1);
w.close();
dir.close(); // checkindex

[junit4:junit4]   2> WARNING: Uncaught exception in thread: Thread[Lucene Merge Thread #0,6,TGRP-Test2GBDocValues]
[junit4:junit4]   2> org.apache.lucene.index.MergePolicy$MergeException: java.lang.ArrayIndexOutOfBoundsException: -65536
[junit4:junit4]   2>    at __randomizedtesting.SeedInfo.seed([5DC54DB14FA5979]:0)
[junit4:junit4]   2>    at org.apache.lucene.index.ConcurrentMergeScheduler.handleMergeException(ConcurrentMergeScheduler.java:535)
[junit4:junit4]   2>    at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:508)
[junit4:junit4]   2> Caused by: java.lang.ArrayIndexOutOfBoundsException: -65536
[junit4:junit4]   2>    at org.apache.lucene.util.ByteBlockPool.deref(ByteBlockPool.java:305)
[junit4:junit4]   2>    at org.apache.lucene.codecs.lucene40.values.FixedStraightBytesImpl$FixedBytesWriterBase.set(FixedStraightBytesImpl.java:115)
[junit4:junit4]   2>    at org.apache.lucene.codecs.lucene40.values.PackedIntValues$PackedIntsWriter.writePackedInts(PackedIntValues.java:109)
[junit4:junit4]   2>    at org.apache.lucene.codecs.lucene40.values.PackedIntValues$PackedIntsWriter.finish(PackedIntValues.java:80)
[junit4:junit4]   2>    at org.apache.lucene.codecs.DocValuesConsumer.merge(DocValuesConsumer.java:130)
[junit4:junit4]   2>    at org.apache.lucene.codecs.PerDocConsumer.merge(PerDocConsumer.java:65)

Migrated from LUCENE-4547 by Robert Muir (@rmuir), resolved Feb 08 2013 Attachments: LUCENE-4547.patch, test.patch Linked issues:

SOLR-3855
- 5106
- 5782
- 4803
- 5108
- 5606
- SOLR-5231

asfimport commented 12 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Here was my initial test, just screwing around.

I ran with 'ant test -Dtestcase=Test2GBDocValues -Dtests.nightly=true -Dtests.heapsize=5G'

asfimport commented 12 years ago

Robert Muir (@rmuir) (migrated from JIRA)

There is even a out-of-coffee bug in the test, its only using like 2 bits per value :) So this is really even worse.

I'm not sure we should be using ByteBlockPool etc here. I think it shouldnt be used outside of the indexer.

asfimport commented 12 years ago

Robert Muir (@rmuir) (migrated from JIRA)

editing description: I think it affects more than PackedIntValues actually?

I think the bug is in how FixedStraightBytesImpl uses byteblockpool.

So this means the problem should be way more widespread: e.g. if you have lots of documents in general I think you are fucked (as norms should trip it too).

asfimport commented 12 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Another bug is that I had to pass tests.heapsize at all.

I think its bad that docvalues gobbles up so much ram when merging. Cant we merge this stuff from disk?

asfimport commented 12 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

I'm having a look at the branch, and it looks great! I like the fact that there are less types and that values are buffered into memory so that the doc values format can make decisions depending on the number of distinct values, ... Still, I have some questions on what you plan to do with this branch:

do you plan to use this branch to:
fix other issues such as #4935?
merge the FieldCache / FunctionValues / DocValues.Source APIs?
are you going to remove DocValues.Type.FLOAT_*?
are SimpleDVConsumer and SimpleDocValuesFormat going to replace PerDocConsumer and DocValuesFormat?
are you going to remove hasArray/getArray?
will there still be a direct=true|false option at load-time or will it depend on the format impl (potentially with a PerFieldPerDocProducer similarly to the postings formats)?

asfimport commented 12 years ago

Robert Muir (@rmuir) (migrated from JIRA)

These are hard questions. My personal goal here for this prototype (currently SimpleText only!) was to:

Making merging use (significantly) less RAM, to fix this bug.
Make it easier to write docvalues codecs, to encourage innovations (e.g. FST impls, etc etc)
Simplify the types to make it easier on the user.

the consumer api I think is simpler (part of #2), but I would like to (in the future) simplify the producer API too. I'm not sure if we should do it here though? anyway we can think about the issues you raised one by one and do them separately on their own issues.

fix other issues such as #4935?

Its my opinion we should do this sooner than later.

merge the FieldCache / FunctionValues / DocValues.Source APIs?

This really needs to be addressed, but I think not here. Its horrific that algorithms like grouping, sorting, and maybe faceting have to be duplicated for 2 different things (fieldcache and docvalues).

are you going to remove DocValues.Type.FLOAT_*?

I think the 3 types we have here are enough. Someone can do a float or double type "on top of" the "number" type we have. Lucene is already doing this today: look at norms. I think lucene should just have a number type that stores bits.

are SimpleDVConsumer and SimpleDocValuesFormat going to replace PerDocConsumer and DocValuesFormat?

This is the idea, once we are happy with the APIs we would implement the 4.0 ones with these apis.

are you going to remove hasArray/getArray?

I don't care about this. I am unsure similarity impls should be calling this though, definitely at least it would be better for them to fall-back: I just cant bring myself to fix it until #4935 is fixed :)

will there still be a direct=true|false option at load-time or will it depend on the format impl (potentially with a PerFieldPerDocProducer similarly to the postings formats)?

I don't want to change this in the branch. Personally i feel like a codec/segmentreader/etc should generally only manage direct, producer exposing the same "stats" (minimum, maximum, fixed, whatever) that the consumer apis get (which will also make merging more efficient!) default source impl can be something nice, read the direct impl into a packed ints, and so on. Codec could override to e.g. just slurp in their on-disk packed ints directly. So codec still has control of the in-memory RAM representation, i think this is important. But i think codec and segmentreader should somehow not be in control of caching: this should be elsewhere (FieldCache.DOCVALUES.xxx????)...

asfimport commented 12 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

About the current SimpleDVConsumer API: NumericDocValuesConsumer addNumericField(FieldInfo field, long minValue, long maxValue) only allows for compression based on the width of the range but there are many other ways to compress values (maybe there are very few unique values, maybe the last 3 bits are always the same, etc...). Now that we buffer values into memory, what would you think of changing the API to pass an iterable of longs instead so that the doc values format can make better decisions?

asfimport commented 12 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

I like that idea! So codec could iterate once, gathering whatever stats it needs, and then iterate again to do the writing. Should we not include the long min/maxValues then...?

asfimport commented 12 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

So codec could iterate once, gathering whatever stats it needs, and then iterate again to do the writing.

Yep.

Should we not include the long min/maxValues then...?

I think so? And we could do something similar with addBinaryField and addSortedField. I think there are many possible optimizations based on how much lengths vary, whether bytes refs share prefixes or not, ...

asfimport commented 12 years ago

Simon Willnauer (@s1monw) (migrated from JIRA)

FYI - I just added simple numeric & binary impls for Lucene41 test demo passes for Lucene41 execpt of the sorted test (no impls yet)

asfimport commented 12 years ago

Robert Muir (@rmuir) (migrated from JIRA)

What would the flush/merge api look like? Would it get simple or more complicated? Could we still require certain stats from the Producer, so that we can have a default, efficient in-RAM Source impl?

I think there are many possible optimizations based on how much lengths vary, whether bytes refs share prefixes or not, ...

maybe, but arguably we should do the simplest possible thing that can work given the codecs we have today. When designing these apis, to me these are the only ones that exist...

asfimport commented 12 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

The API could look like

class DocValuesConsumer {

  // add all values in a single call
  void addNumericField(FieldInfo field,  Collection<Long> values); // values.size() == numDocs
  void addBinaryField(FieldInfo field, Collection<BytesRef> values); // values.size() == numDocs
  void addSortedField(FieldInfo field, Collection<BytesRef> ordToValue, Collection<Long> docToOrd); // docToOrd.size() == numDocs, ordToValue.size() == valueCount

  // same merge API

}

I don't see why merging would be more complicated but maybe I'm missing something? The default merge impls would need to use lazy collection impls in order to remain memory-efficient.

Could we still require certain stats from the Producer, so that we can have a default, efficient in-RAM Source impl?

Why would we need stats to make a Source impl efficient?

maybe, but arguably we should do the simplest possible thing that can work given the codecs we have today

To me it looks as simple as the current API.

asfimport commented 12 years ago

Robert Muir (@rmuir) (migrated from JIRA)

We don't need to be passing numDocs today: this should be removed because its duplicated with SegmentWriteState
Merging becomes more complex because the api requires this collection view: today it does not.
We need stats to make the Source impl efficient so we can e.g. use packed integers for Number type. Bye bye arrays.

I don't think we should make anything more flexible than necessary until things like the producer api are sorted out. Otherwise its difficult to see the tradeoffs.

asfimport commented 12 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

I don't think we should make anything more flexible than necessary until things like the producer api are sorted out.

I can wait for the producer API. I just wanted to point out that by exposing all the values at once, the DocValuesFormat could make more clever choices than by just exposing the width of the range of values.

asfimport commented 12 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

I committed an initial attempt at a simpler producer API ... it's rough!! But at least TestDemoDocValue passes w/ SimpleText.

I moved SimpleText's "loaded into RAM" wrappers up into SimpleDVProducer; this way a codec only must impl the direct source and can impl in-RAM source if it wants to.

SegmentCoreReaders now does the caching of in-RAM sources.

Simon I temporarily disabled the Lucene41DV producer ... I'll go get it working too ...

asfimport commented 11 years ago

Simon Willnauer (@s1monw) (migrated from JIRA)

hey folks,

I looked at the branch and I would want to suggest we move a little slower here. we are doing too many things at once. Like I really don't like the trend to make FieldCache the single source for caching. FieldCache has many problems in my opininon like is uses this DEFAULT singleton, has a single way of how things are cached per reader some users might want to use different access to DV like in ES we don't use FieldCache at all for many reasons.I think we are going into the right direction here but exposing everything through FC is a no-go IMO. I do see why we should merge the interfaces and expose un-inverted fields via the new DV interface - nice! but hiding it behind FC is no good. I also don't like the way how "in-ram" DV are exposed. I don't think we should have newRAMInstance() on the interface. Lets keep the interface clean and don't mix in how it is represented. I'd rather vote for dedicated producers or SimpleDocValuesProducer#getNumericDocValues(boolean inMemory). Then we can still do caching on top. The producer should really be simple and shouldn't do caching. We can also separate the default in-memory impls in a simple helper class with methods like static NumericDocValues load(NumericDocValues directDocValues)

asfimport commented 11 years ago

Robert Muir (@rmuir) (migrated from JIRA)

If you don't want like FieldCache in ES, then use something else.

Its really broken that things like grouping and faceting are coded to a separate API. This makes DV unsuccessful. If we arent going to fix DocValues and Faceting APIs then perhaps we shoudl consider removing DocValues completely from lucene.

Please try to give the branch some time. I committed some work last night very late and got tired and went to sleep. Its not "ready" and i'm not threatening to commit to trunk.

I created this branch to develop publicly. I don't have to do that: I can develop privately instead and not deal with complaints about my every step before I even say things are ready. I just thought it would make it easier for other people to help.

asfimport commented 11 years ago

Simon Willnauer (@s1monw) (migrated from JIRA)

Its really broken that things like grouping and faceting are coded to a separate API. This makes DV unsuccessful.

don't get me wrong I think this adds a lot of value. BUT, if the only way to get a DV instance that is cached is FC then this entire thing is inconsistent. We don't ask people to cache TermEnum which can be heavy if you use MemoryPostings for instance. We neither do for StoredFields nor do we have a caching layer that is not used by default. All I am arguing for is that if somebody wants to use DV it should be simple to do so. The distinction between in-memory and on disk should not be in FC.

If we arent going to fix DocValues and Faceting APIs then perhaps we shoudl consider removing DocValues completely from lucene.

you mean remove fieldcache?

asfimport commented 11 years ago

Simon Willnauer (@s1monw) (migrated from JIRA)

Just an idea though... while we are on it should we maybe add a 4th type that allows multiple values. that way we can just pull DV from any field and uninvert if needed?

asfimport commented 11 years ago

Shai Erera (@shaie) (migrated from JIRA)

while we are on it should we maybe add a 4th type that allows multiple values

+1. That might allow using DVs for faceted search.

asfimport commented 11 years ago

Robert Muir (@rmuir) (migrated from JIRA)

I don't think this is necessary. Someone can do this "on top" of a binary impl themselves.

asfimport commented 11 years ago

Simon Willnauer (@s1monw) (migrated from JIRA)

Hey folks,

I thought about the IN-RAM vs. ON-Disk distinction we have in DV at this point and how we distinguish API wise. IMO calling $TypeDocValues#newRAMInstance() with different behavior if the instance is already a ram instance is kind of ugly API wise as well as having a binary distinction here might not be sufficient. From my point of view it would logically make most sense to allow the codec to decide if it is in ram or not or if only parts of the values are in memory like in the sorted case where you might wanna use a FST holding a subset of the values. Now giving the control entirely to the code might not be practical. Think about merging where you really don't want to load into memory you should be able to tell don't pull into memory. We can do this already today if we pass in IOContext. Yet, IOContext is the wrong level since its a reader wide setting and might not be true for all fields in the case we open a reader for handling searches. Yet, the idea of IOContext is basically to pass information about the access pattern where merge means sequential access. We might want to use something similar for docvalues that allows us to leave most of the decisions to the codec but if a user decides he really needs stuff in memory he can still pass in something like AccessPattern.SEQUENTIAL and load the values into an auxiliary datastructure. This would allow the codec to optimize under the hood but not making any promises if it's in ram or on disk if AccessPattern.DEFAULT is passed.

asfimport commented 11 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

I think multi-valued case could be compelling, but we should probably do later / outside this branch. EG FieldCache already supports this (DocTermOrds). It's true that app could do this on top of Binary DV, but I think it's useful enough that a real impl would be worthwhile (eg for facets).

asfimport commented 11 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

I think letting the codec control in-RAM vs on-disk is a great idea!

Why not let merging load values into RAM if your DVFormat is a RAM-backed impl? The codec can always override merging if it wants to ...

asfimport commented 11 years ago

Simon Willnauer (@s1monw) (migrated from JIRA)

I think letting the codec control in-RAM vs on-disk is a great idea!

actually that is not what I was saying and I strongly discourage that we require people to make ram vs. on disk decisions ahead of time. Most of those decisions need to be made dynamically based on ram availability and growth.

what I was saying is that the user should provide its intend so the codec can optimize.

asfimport commented 11 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

I think letting the codec control in-RAM vs on-disk is a great idea!

actually that is not what I was saying and I strongly discourage that we require people to make ram vs. on disk decisions ahead of time.

I think this is actually a clean way to do it, and it matches what we do with other codec parts. Eg with postings you pick MemoryPF if you have the free RAM and want fast lookups for that field, else you pick an on-disk postings format.

Most of those decisions need to be made dynamically based on ram availability and growth.

I think making dynamic decisions based on ram availability and growth is a more expert use case; eg in Lucene today we don't give you that: Deleted docs, norms, field cache entries, doc values (if you sort by them), terms index are all loaded into RAM. So the only control users have now is which fields they index/sort on...

If we give control to the codec over whether the DV format is in RAM or on disk or something in between (like the terms index), and we make a PerFieldDVFormat so you can easily switch impls by field, then users can make the decisions themselves, field by field.

If a given field will be used for sorting or faceting, they can use the fast RAM-based format, but if they are tight on RAM and have lots of scoring factors, maybe they use the disk-based impl for those fields.

If an expert app really need to pick & choose ram vs disk dynamically, depending on how many other indices are open and how much RAM they are using, etc., they can always make a custom DV format ...

asfimport commented 11 years ago

Simon Willnauer (@s1monw) (migrated from JIRA)

If an expert app really need to pick & choose ram vs disk dynamically, depending on how many other indices are open and how much RAM they are using, etc., they can always make a custom DV format ...

what I am worried about is the lack of communication between the app and the codec. something like this is going to be a major hassle. all I am asking about is to pass in "hints" to the codec what I need at a certain point per field. We can't do this and I think we shouldn't allow this. its an encoding / decoding layer and it should be simple. pushing what you call "experts" to write their own codecs is a major trap I think. writing a codec is last resort and causes major trouble for non-lucene devs IMO. This is expertexpert :)

I really like the idea of perfieldDV and I think we should do it. I am just not a big fan of making up-front decisions for this stuff when it comes to on-disk vs. ram. PostingsFormat is a different story, the on disk (low ram useage) have such a perf characteristics that you very unlikely need something else useing lots of ram. For sorting, grouping or scoring you will certainly need that.

asfimport commented 11 years ago

Simon Willnauer (@s1monw) (migrated from JIRA)

one way of merging the two approaches would be a simple boolean that forces on-disk access. that way we can solve the following problems:

default merge impls. a format doesn't need to override merge if its an in-ram by default impl
people can build auxiliary structures on top without worrying about ram
leave decisions to the codec what in-memory structure or on-disk structure it uses by default

this way we can keep the api clean and support expert users. We can even make this package private so that "normal" users won't see it?

asfimport commented 11 years ago

Robert Muir (@rmuir) (migrated from JIRA)

This is sounding too complicated. I think it sounds ok to remove the distinction of ram/on-disk and just have codec. If someone wants to do expert stuff in their codec thats fine, but we don't need it in our impls or abstract APIS.

Lets start with just getting the basics working here and the integration with the rest of lucene simple.

Today (in trunk) on-disk access is a pipe dream (thus, this issue) because the codec api is responsible for too much.

Expert users and flexibility should be our last priority.

asfimport commented 11 years ago

Simon Willnauer (@s1monw) (migrated from JIRA)

This is sounding too complicated.

a single boolean is too complicated? all I ask for is a way to prevent loading into ram if not necessary. We had this in 4.0 and I think we should make this work in 4.1 too. remember this is a different use-case than postings. I really don't think I ask for much here.

asfimport commented 11 years ago

Robert Muir (@rmuir) (migrated from JIRA)

a single boolean is too complicated?

I think it is, I feel like it really confuses the API and makes writing codecs harder.

I think it would be better if the codec impl determined this, just like MemoryPostings and so on. So I'd rather have Per-field dv wrapper that configures this.

For example someone would use a different implementation for their solr __version field than they would use for a scoring factor, and maybe a different implementation for a sort field than a faceting one.

I don't think there is a use case to be able to access a single field's values both from RAM and on disk, and for the codec to have to deal with that. It makes things currently very complicated.

We had this in 4.0 and I think we should make this work in 4.1 too.

I don't think thats necessarily true. In 4.0 the one DV impl we had could do a lot, but the codec API is very difficult. I actually contributed to a lot of the codec apis in Lucene, and as a committer I was unable to figure out how to write a working DV impl to this api. I think this says a lot.

I'd rather have a simpler codec API, that enables innovation so that we can see cool shit in the future, like implementations geared at sorting and faceting that use less RAM, and so on.

If someone really needs more fine-grained control than per-field codec API, then there are other ways to achieve that: FileSwitchDirectory, adding such APIs to their own codec, etc. But I'm not sure its mainstream and should be required by all codecs.

asfimport commented 11 years ago

David Smiley (@dsmiley) (migrated from JIRA)

I just got an error at search time fetching DocValues in 4.0.0:

SEVERE: null:java.lang.ArrayIndexOutOfBoundsException
        at org.apache.lucene.util.PagedBytes$Reader.fillSlice(PagedBytes.java:97)
        at org.apache.lucene.codecs.lucene40.values.VarStraightBytesImpl$VarStraightSource.getBytes(VarStraightBytesImpl.java:273)

I pass in a scratch BytesRef, and give a docId local to the segment.

Could it be related to this issue? This bug is freaking me out a bit as I may be forced to abandon DocValues.

asfimport commented 11 years ago

Robert Muir (@rmuir) (migrated from JIRA)

David: I'm not sure. the problem I opened this issue for actually relates to things like merging.

Can you do a manual bounds check, as docvalues at read time doesn't have explicit checks (IndexOutOfBounds is expected/best effort if you pass a wrong docid):

    if (docID < 0 || docID >= reader.maxDoc())

Otherwise if thats not the problem, do you have a test or something you could upload to an issue?

asfimport commented 11 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Just a quick recap on where things stand on the branch:

We have the DV 2.0 API, shadowing DV 1.0 API.
We have one codec (SimpleText) that implements it, passes tests
CheckIndex does basic tests of DV 2.0, and we also have TestDemoDocValue, but nothing else is cutover yet.
Lucene41 codec's impl is I think close but was failing some tests (not sure why yet)
We have a MemoryDV but it's very RAM inefficient now
We have Norms 2.0 API too, shadowing current norms, and only SimpleText implements it (but should be easy to get Lucene41 to impl it too).
We need to cut over all uses/tests of DV 1.0 / norms 1.0 and then remove DV/norms 1.0 shadow code.
There are still tons and tons of nocommits ...

asfimport commented 11 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Applyable patch to trunk r1442822.

I think this is close: jenkins-blasted, benchmarked, beefed up tests and added many new ones, several codecs with different tradeoffs, per-field configuration, 4.0 file format compat, more efficient 4.2 format, and so on.

asfimport commented 11 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

Big +1!

I especially like the fact that doc values

now use the same interface as FieldCache,
can be configured on a per-field basis thanks to a SPI,
are first buffered in memory (with RAM accounting) so that the codec has more opportunities to perform optimizations.

asfimport commented 11 years ago

Commit Tag Bot (migrated from JIRA)

[trunk commit] Robert Muir http://svn.apache.org/viewvc?view=revision&revision=1443717

LUCENE-4547: DocValues improvements

asfimport commented 11 years ago

Commit Tag Bot (migrated from JIRA)

[branch_4x commit] Robert Muir http://svn.apache.org/viewvc?view=revision&revision=1443834

LUCENE-4547: DocValues improvements

asfimport commented 11 years ago

Markus Jelsma (migrated from JIRA)

Revision 1443717 breaks custom collector code such as:

int[] ints = FieldCache.DEFAULT.getInts(context.reader(), this.fieldName, false);

Do you suggest we should keep the returned instance of Ints but where is the concrete Ints class? Can't seem to find it.

Thanks

asfimport commented 11 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

where is the concrete Ints class?

This class is an inner class of FieldCache: oal.search.FieldCache.Ints. Then you should be able to fix your code by replacing ints[i] with ints.get( i ).

asfimport commented 11 years ago

Markus Jelsma (migrated from JIRA)

Oh i see, FieldCacheImpl returns and overrides Ints.get(int docId).

asfimport commented 11 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

Closed after release.

apache / lucene

DocValues field broken on large indexes [LUCENE-4547] #5613

5106

5782

4803

5108

5606