apache / lucene

Apache Lucene open-source search software
https://lucene.apache.org/
Apache License 2.0

ArrayIndexOutOfBoundsException when iterating over TermDocs [LUCENE-2666] #3740

Open asfimport opened 14 years ago

asfimport commented 14 years ago

A user got this very strange exception, and I managed to get the index that it happens on. Basically, iterating over the TermDocs causes an ArrayIndexOutOfBoundsException (AIOOBE). I easily reproduced it using the FieldCache, which does exactly that (the field in question is indexed as numeric). Here is the exception:

Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 114
    at org.apache.lucene.util.BitVector.get(BitVector.java:104)
    at org.apache.lucene.index.SegmentTermDocs.next(SegmentTermDocs.java:127)
    at org.apache.lucene.search.FieldCacheImpl$LongCache.createValue(FieldCacheImpl.java:501)
    at org.apache.lucene.search.FieldCacheImpl$Cache.get(FieldCacheImpl.java:183)
    at org.apache.lucene.search.FieldCacheImpl.getLongs(FieldCacheImpl.java:470)
    at TestMe.main(TestMe.java:56)

It happens on the following segment:

    _26t docCount: 914 delCount: 1 delFileName: _26t_1.del

And as you can see, it smells like a corner case (it fails for document number 912; the AIOOBE comes from the deleted docs). The code to recreate it is simple:

    FSDirectory dir = FSDirectory.open(new File("index"));
    IndexReader reader = IndexReader.open(dir, true);

    IndexReader[] subReaders = reader.getSequentialSubReaders();
    for (IndexReader subReader : subReaders) {
        // grab the private SegmentInfo field via reflection to identify the segment
        Field field = subReader.getClass().getSuperclass().getDeclaredField("si");
        field.setAccessible(true);
        SegmentInfo si = (SegmentInfo) field.get(subReader);
        System.out.println("--> " + si);
        if (si.getDocStoreSegment().contains("_26t")) {
            // this is the problematic one...
            System.out.println("problematic one...");
            FieldCache.DEFAULT.getLongs(subReader, "__documentdate", FieldCache.NUMERIC_UTILS_LONG_PARSER);
        }
    }

Here is the result of a check index on that segment:

    8 of 10: name=_26t docCount=914
        compound=true
        hasProx=true
        numFiles=2
        size (MB)=1.641
        diagnostics = {optimize=false, mergeFactor=10, os.version=2.6.18-194.11.1.el5.centos.plus, os=Linux, mergeDocStores=true, lucene.version=3.0.2 953716 - 2010-06-11 17:13:53, source=merge, os.arch=amd64, java.version=1.6.0, java.vendor=Sun Microsystems Inc.}
        has deletions [delFileName=_26t_1.del]
        test: open reader.........OK [1 deleted docs]
        test: fields..............OK [32 fields]
        test: field norms.........OK [32 fields]
        test: terms, freq, prox...ERROR [114]
    java.lang.ArrayIndexOutOfBoundsException: 114
        at org.apache.lucene.util.BitVector.get(BitVector.java:104)
        at org.apache.lucene.index.SegmentTermDocs.next(SegmentTermDocs.java:127)
        at org.apache.lucene.index.SegmentTermPositions.next(SegmentTermPositions.java:102)
        at org.apache.lucene.index.CheckIndex.testTermIndex(CheckIndex.java:616)
        at org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:509)
        at org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:299)
        at TestMe.main(TestMe.java:47)
        test: stored fields.......ERROR [114]
    java.lang.ArrayIndexOutOfBoundsException: 114
        at org.apache.lucene.util.BitVector.get(BitVector.java:104)
        at org.apache.lucene.index.ReadOnlySegmentReader.isDeleted(ReadOnlySegmentReader.java:34)
        at org.apache.lucene.index.CheckIndex.testStoredFields(CheckIndex.java:684)
        at org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:512)
        at org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:299)
        at TestMe.main(TestMe.java:47)
        test: term vectors........ERROR [114]
    java.lang.ArrayIndexOutOfBoundsException: 114
        at org.apache.lucene.util.BitVector.get(BitVector.java:104)
        at org.apache.lucene.index.ReadOnlySegmentReader.isDeleted(ReadOnlySegmentReader.java:34)
        at org.apache.lucene.index.CheckIndex.testTermVectors(CheckIndex.java:721)
        at org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:515)
        at org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:299)
        at TestMe.main(TestMe.java:47)

The creation of the index does not do anything fancy (all defaults), though there is usage of the near-real-time feature (IndexWriter#getReader), which does complicate deleted-docs handling. It seems as if the deleted docs got written without matching the number of docs. Sadly, I don't have something that recreates it from scratch, but I do have the index if someone wants to have a look at it (mail me directly and I will provide a download link).
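
For reference, the near-real-time pattern mentioned above looks roughly like this on 3.0.x (a minimal sketch; writer is the application's IndexWriter, whose setup is elided):

    // Sketch of the NRT pattern (Lucene 3.0.x): the reader comes straight from
    // the writer and also sees not-yet-committed adds and deletes, which is why
    // deleted-docs handling is more involved on this path.
    IndexReader nrtReader = writer.getReader();
    try {
        // ... search over nrtReader ...
    } finally {
        nrtReader.close(); // each getReader() call returns a reader the caller must close
    }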

I will continue to investigate why this might happen; just wondering if someone has stumbled on this exception before. We are using Lucene 3.0.2.


Migrated from LUCENE-2666 by Shay Banon (@kimchy), 1 vote, updated Feb 07 2011. Attachments: checkindex-out.txt

asfimport commented 14 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

This looks like index corruption – somehow the deleted docs bit vector is too small for that segment. We have to get to the root cause of how the corruption happened.

E.g., if you can enable IndexWriter's infoStream, then get the corruption to happen, and post the resulting log...

Also, try enabling assertions... it may catch the corruption sooner.
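
A minimal sketch of both on 3.0.x (the index path and analyzer below are placeholders):

    // Route IndexWriter's internal diagnostics to a PrintStream so the events
    // leading up to the corruption are captured.
    IndexWriter writer = new IndexWriter(
            FSDirectory.open(new File("index")),      // placeholder path
            new StandardAnalyzer(Version.LUCENE_30),  // placeholder analyzer
            IndexWriter.MaxFieldLength.UNLIMITED);
    writer.setInfoStream(System.out); // or a PrintStream over a log file

    // Assertions are a JVM flag, enabled per package, e.g.:
    //   java -ea:org.apache.lucene... MyApp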

Can you describe how you use Lucene? Do you do any direct file IO in the index dir? (eg, for backup/restore or something).

Are you certain only one writer is open on the index? (Do you disable Lucene's locking?)

Which OS, filesystem, java impl are you using?

asfimport commented 13 years ago

Nick Pellow (migrated from JIRA)

Hi,

I am getting this issue as well. We are doing quite a lot of updates during indexing; could this be causing the problem?

This seems to have happened only when we deployed to our Linux test server; it didn't appear to occur on Mac OS X during development, with the same data set.

Does this only affect Lucene 3.0.2? Would a rollback be a good workaround?

The exact stack trace:

java.lang.ArrayIndexOutOfBoundsException: 5475
        at org.apache.lucene.util.BitVector.get(BitVector.java:104)
        at org.apache.lucene.index.SegmentTermDocs.next(SegmentTermDocs.java:127)
        at org.apache.lucene.index.SegmentTermPositions.next(SegmentTermPositions.java:102)
        at org.apache.lucene.index.SegmentTermDocs.skipTo(SegmentTermDocs.java:207)
        at org.apache.lucene.search.PhrasePositions.skipTo(PhrasePositions.java:52)
        at org.apache.lucene.search.PhraseScorer.advance(PhraseScorer.java:120)
        at org.apache.lucene.search.IndexSearcher.searchWithFilter(IndexSearcher.java:249)
        at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:218)
        at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:199)
        at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:177)
        at org.apache.lucene.search.MultiSearcher$MultiSearcherCallableWithSort.call(MultiSearcher.java:410)
        at org.apache.lucene.search.MultiSearcher.search(MultiSearcher.java:230)
        at org.apache.lucene.search.Searcher.search(Searcher.java:49)

asfimport commented 13 years ago

Nick Pellow (migrated from JIRA)

I've also noticed this occurring since I started using a numeric field and accessing its field cache for boosting.

asfimport commented 13 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Can you run CheckIndex on this index and post the result? And, enable assertions.

And if possible turn on IndexWriter's infoStream and capture/post the output leading up to the corruption.
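
For reference, CheckIndex can be run from the command line (java -ea:org.apache.lucene... org.apache.lucene.index.CheckIndex <indexDir>) or from code; a minimal sketch on 3.0.x, with a placeholder index path:

    // Run all of CheckIndex's per-segment tests against the index directory.
    Directory dir = FSDirectory.open(new File("index")); // placeholder path
    CheckIndex checker = new CheckIndex(dir);
    checker.setInfoStream(System.out); // print the per-segment details
    CheckIndex.Status status = checker.checkIndex();
    System.out.println(status.clean ? "index is clean" : "index has problems");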

Many updates during indexing is just fine... and I don't know whether rolling back to older Lucene releases will help (until we've isolated the issue). But: maybe try rolling forward to 3.0.3? It's possible you're hitting a bug fixed in 3.0.3 (though this doesn't ring a bell for me).

asfimport commented 13 years ago

Nick Pellow (migrated from JIRA)

Hi Michael,

Thanks for the update. I have added an infoStream to the writer and triggered a re-index. Unfortunately, I didn't see the corruption occur this time. I am about to deploy to a different environment so will let you know.

We have already upgraded to Lucene 3.0.3, unfortunately.

Hopefully we will see the problem re-occur and be able to capture the necessary output to track down the problem.

I've also added a call to writer.prepareCommit(). Previously, only writer.commit() was being called. Could that have an effect?

Cheers, Nick

asfimport commented 13 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

OK thanks. Hopefully we can catch this under infoStream's watch.

Not calling prepareCommit is harmless – IW simply calls it for you under the hood when commit() is called, if you hadn't already called prepareCommit().

The two APIs are separate in case you want to involve Lucene in a two-phase commit with other resources.
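
A rough sketch of that two-phase pattern (otherResource here is a purely hypothetical second participant):

    // prepareCommit() does the expensive work (flush + fsync) without making
    // anything visible; commit() then just publishes the new commit point.
    try {
        writer.prepareCommit();   // phase 1: Lucene votes "ready"
        otherResource.prepare();  // hypothetical second participant's phase 1
        writer.commit();          // phase 2: publish the commit point
        otherResource.commit();
    } catch (Exception e) {
        writer.rollback();        // discards changes since the last commit and closes the writer
        throw e;
    }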

asfimport commented 13 years ago

Nick Pellow (migrated from JIRA)

Hi Michael,

We managed to catch this happening again. I've created a bug for our project over at http://jira.atlassian.com/browse/CRUC-5486 (since I can't seem to upload the log to this JIRA instance).

My hunch is that this occurs if a search is performed at the same time as a re-index, and a Lucene cache is potentially not being closed/cleared correctly. It appears that a restart of the application makes this problem go away.

Cheers, Nick.

asfimport commented 13 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Thanks Nick; I'll look at the log.

Aside: you should be able to attach files here... not sure why you saw otherwise...

asfimport commented 13 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Nick, the infoStream output looks healthy – I don't see any exceptions. Can you post the output from CheckIndex against the index that corresponds to this infoStream?

asfimport commented 13 years ago

Nick Pellow (migrated from JIRA)

Hi Michael,

Thanks for looking at that log. I ran CheckIndex on the corrupt index - and have attached the output here. It doesn't appear to have detected any problems.

Do you think this problem could be caused by a cache not being flushed correctly?

Cheers, Nick

asfimport commented 13 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Hmmm — given that exception, I would expect CheckIndex to have also seen this issue.

Searching at the same time as indexing shouldn't cause this. Lucene doesn't cache postings, but does cache metadata for the term, though I can't see how that could lead to this exception.

Could this also be a hardware issue? Do you see the problem on more than one machine?

asfimport commented 13 years ago

Nick Pellow (migrated from JIRA)

Hi Michael,

We have now seen this issue on more than one machine, so I don't think it is a hardware issue. We are using the ConcurrentMergeScheduler on the writer; I'm not sure whether that has known issues. A restart definitely 'fixes' this problem, though.

The stack-trace is:

java.lang.ArrayIndexOutOfBoundsException: 3740
    at org.apache.lucene.util.BitVector.get(BitVector.java:104)
    at org.apache.lucene.index.SegmentTermDocs.next(SegmentTermDocs.java:127)
    at org.apache.lucene.index.SegmentTermPositions.next(SegmentTermPositions.java:102)
    at org.apache.lucene.search.PhrasePositions.next(PhrasePositions.java:41)
    at org.apache.lucene.search.PhraseScorer.init(PhraseScorer.java:147)
    at org.apache.lucene.search.PhraseScorer.nextDoc(PhraseScorer.java:78)
    at org.apache.lucene.search.DisjunctionSumScorer.initScorerDocQueue(DisjunctionSumScorer.java:101)
    at org.apache.lucene.search.DisjunctionSumScorer.<init>(DisjunctionSumScorer.java:85)
    at org.apache.lucene.search.BooleanScorer2$1.<init>(BooleanScorer2.java:154)
    at org.apache.lucene.search.BooleanScorer2.countingDisjunctionSumScorer(BooleanScorer2.java:149)
    at org.apache.lucene.search.BooleanScorer2.makeCountingSumScorerNoReq(BooleanScorer2.java:218)
    at org.apache.lucene.search.BooleanScorer2.makeCountingSumScorer(BooleanScorer2.java:208)
    at org.apache.lucene.search.BooleanScorer2.<init>(BooleanScorer2.java:101)
    at org.apache.lucene.search.BooleanQuery$BooleanWeight.scorer(BooleanQuery.java:336)
    at org.apache.lucene.search.function.CustomScoreQuery$CustomWeight.scorer(CustomScoreQuery.java:359)
    at org.apache.lucene.search.BooleanQuery$BooleanWeight.scorer(BooleanQuery.java:306)
    at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:210)
    at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:170)
    at org.apache.lucene.search.MultiSearcher$MultiSearcherCallableNoSort.call(MultiSearcher.java:363)
    at org.apache.lucene.search.MultiSearcher.search(MultiSearcher.java:208)
    at org.apache.lucene.search.Searcher.search(Searcher.java:98)

I am going to spend some time trying to reproduce this locally today, with a debugger attached.

Cheers, Nick

asfimport commented 13 years ago

Nick Pellow (migrated from JIRA)

Hi Michael,

We've done some analysis on how we are using Lucene, and we believe that reader.reopen() is most likely causing certain data structures to be shared, creating an inconsistency that leads to this exception.

The latest stack trace we are getting is in the comment above.

Given this information would you have any more clues for us?

Thank you very much for your help so far, greatly appreciated. Nick

asfimport commented 13 years ago

Nick Pellow (migrated from JIRA)

Hi Michael,

We have a memory dump of the instance that is affected by this. Would you know the best place to start looking for the possibly outdated BitVector? We could make this available to you if you wish - all 1.8GB of it though :(

Cheers, Nick

asfimport commented 13 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Nick, are you running Lucene w/ asserts enabled? Are you able to take a src patch and run it through your test? If so, I can add some verbosity/asserts and we can try to narrow this down.

It does sound like somehow the wrong delete BitVector is getting associated w/ a SegmentReader.

It looks like you don't use NRT readers, right? I.e., you always commit() from IW and then do IR.reopen()?
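
For reference, that non-NRT refresh pattern looks roughly like this on 3.0.x (a sketch; writer and reader are the application's existing instances):

    // Publish pending changes, then refresh the reader against the new commit.
    writer.commit();
    IndexReader newReader = reader.reopen(); // returns 'reader' itself if nothing changed
    if (newReader != reader) {
        reader.close(); // release the old point-in-time view
        reader = newReader;
    }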

asfimport commented 13 years ago

Nick Pellow (migrated from JIRA)

Hi Michael,

This issue was entirely a problem with our code, and I doubt Lucene could have done a better job.

The problem was that on upgrade of the index (done when fields have changed etc), we recreate the index in the same location using IndexWriter.create(directory, analyzer, true, MAX_FIELD_LENGTH).

Some code had been added just before this, however, that deleted every single file in the directory (the developer was worried that IndexWriter.create was leaving old files lying around). This meant that some other thread performing a search could have seen a corrupt index, thus causing the AIOOBE.

I'm glad we got to the bottom of this, and very much so that it was not a bug in Lucene!

Thanks again for helping us track this down.

Best Regards, Nick Pellow

asfimport commented 13 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Ahh, thanks for bringing closure, Nick! Although I'm a little confused how removing files from the index while readers are using it could lead to those exceptions...

Note that it's perfectly fine to pass create=true to IW, over an existing index, even while readers are using it; IW will gracefully remove the old files itself, even if open IRs are still using them. IW just makes a new commit point that drops all references to prior segments...
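
A minimal sketch of that safe rebuild-in-place pattern on 3.0.x (the path and analyzer are placeholders):

    // Rebuilding in place without deleting files by hand: create=true starts a
    // new, empty commit point, and IndexWriter removes the now-unreferenced old
    // files itself; open readers keep their point-in-time view meanwhile.
    IndexWriter writer = new IndexWriter(
            FSDirectory.open(new File("index")),      // placeholder path
            new StandardAnalyzer(Version.LUCENE_30),  // placeholder analyzer
            true,                                     // create: drop all prior segments
            IndexWriter.MaxFieldLength.UNLIMITED);
    // ... re-add documents, then writer.commit() to publish the rebuilt index ...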