Closed asfimport closed 18 years ago
Otis Gospodnetic (migrated from JIRA)
Since the bug submitter hasn't followed up to this 'repeatable' issue in almost two months, I shall assume that this was not a bug in Lucene, but a misuse of Lucene.
There is also no self-contained code that demonstrates the bug.
If this is indeed a Lucene bug, please re-open this bug entry and provide a self-contained unit test that demonstrates the problem.
Arvind Srinivasan (migrated from JIRA)
More Data Integrity Issue: Docs out of Order
Hi,

Seeing a similar issue to the one reported in: http://nagoya.apache.org/bugzilla/show_bug.cgi?id=23650

On examining the segments, the following inconsistencies were found:

(a) The merging segments had doc numbers greater than maxDoc. Don't know how it got into this state, but this occurs using standard Lucene code.

(b) Strangely, some documents had terms with zero frequency. And when it occurred, the zero-frequency term had several postings as (docid 0). Example (docid freq), with MaxDoc = 7749 and NO DELETION. Merging msgBody; text=it; sz=2 - the field name is msgBody, the term is "it", and two segments have the term:

(0 0)(0 0)(0 0)..........(0 0)(4 6)(5 3)(6 1)(9 1)(10 2)(12 1)......
...(6791 2)(6794 3)(6796 2)(6798 16)(6801 1)(6805 1)(6806 5)(6808 1)(6810 1)(6815 2)(6816 3)(6817 1)(6818 1)(6821 4)(6822 1)(6824 3)(6826 4)(6828 1)(6829 3)
(12549 2) doc exceeds count 7749 (13570 1) doc exceeds count 7749 (14896 1) doc exceeds count 7749 (15028 1) doc exceeds count 7749 (15357 1) doc exceeds count 7749 (15427 1) doc exceeds count 7749 (15534 1) doc exceeds count 7749 (15535 1) doc exceeds count 7749 (15653 1) doc exceeds count 7749 (16530 1) doc exceeds count 7749 (17108 1) .......

(c) Also, the zero frequency was not limited to document 0; there was another instance.
One workaround that seemed to resolve the issue was to: (a) keep maxDoc as a member variable in SegmentMergeInfo and ignore/throw an exception when an inconsistent state is detected.

ADD to SegmentMerger, just before the "docs out of order" check:

    if (postings.freq() == 0) {
      continue;
    }
    if (doc >= smi.maxDoc) {
      //sbLog.append("doc exceeds count \r\n " + smi.maxDoc);
      continue;
    }
At least putting in a check would not corrupt the segments and would get us closer to the real problem of why freq=0 and the doc number exceeds maxDoc. Note, the code has had the fix to the other segment corruption issue that I previously reported (namely, using a zero-length file).
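For anyone who wants to check whether an existing index shows the two anomalies described above, the following is a rough diagnostic sketch (not part of the patch) using only the old IndexReader/TermEnum/TermDocs API; the index path is a placeholder:

    // Hypothetical consistency scan: flag zero-frequency postings and doc numbers >= maxDoc.
    IndexReader reader = IndexReader.open("/path/to/index");
    int maxDoc = reader.maxDoc();
    TermEnum terms = reader.terms();
    while (terms.next()) {
      Term term = terms.term();
      TermDocs docs = reader.termDocs(term);
      while (docs.next()) {
        if (docs.freq() == 0) {
          System.out.println(term + ": zero-frequency posting at doc " + docs.doc());
        }
        if (docs.doc() >= maxDoc) {
          System.out.println(term + ": doc " + docs.doc() + " exceeds maxDoc " + maxDoc);
        }
      }
      docs.close();
    }
    terms.close();
    reader.close();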
Arvind Srinivasan (migrated from JIRA)
Created an attachment (id=15405) Analysis of corrupted segments and suggestions.
Otis Gospodnetic (migrated from JIRA)
Arvind, thanks for following up on this issue. From your report I can see that the index really ends up containing invalid data, but I've never even seen this happen myself. Could you please put together sample code that gets the index in this state?
Yonik Seeley (@yonik) (migrated from JIRA)
I've never seen this... can anyone reproduce with Lucene 1.9? CCing this to Arvind's email...
Otis Gospodnetic (@otisg) (migrated from JIRA)
2 years later, I still haven't seen this error.
Jarrod Cuzens (migrated from JIRA)
First I think that Lucene is great and it manages to do an incredible job. However, this issue is also causing us significant problems. We originally had an AOP interceptor that would update documents in our HTTP threads however when we started to see this issue we were concerned that it was caused by multiple threads accessing the index at the same time. We put extra concurrency controls on the updates using the LuceneIndexAccess API that was posted into bugzilla by another user. This issue still remained after we added the extra concurrency control.
Since then we have abandoned the AOP approach completely and moved the index rebuild to a scheduled thread which collects things that were modified from the database (using datestamp versioning) and then proceeds to update their lucene indexes. We hoped this would solve the problem because only 1 thread in 1 process would be modifying the index at any given time. Alas, we are still getting docs out of order exceptions. It is difficult for us to reproduce as it mainly happens in production and we cannot provide a testcase for it (I wish we could!).
I know that this must be a difficult issue because it is difficult to reproduce. I can't think of anything strange that we are doing with the indexes (one process, one thread modifying). This probably isn't much help but I just wanted to let you know that we are also experiencing the problem.
Doug Cutting (@cutting) (migrated from JIRA)
File corruption could cause this. Please look in your system logs to see if there are any reports of problems accessing the drive that stores this index.
Jarrod Cuzens (migrated from JIRA)
Hi Doug,
Thanks for your suggestion (and great work on Lucene!). I looked in the logs and could not find any indication of corruption. In addition we actually have the system running in a cluster where each node in the cluster has its own lucene search index. We had the issue on a different machine before we moved to the cluster and we now see it on both machines in our cluster :(. Next time it occurs I will get the indexes and try to post them here if you think that would be helpful.
Just some additional info: 1) We are running SUSE 10 Linux. 2) We run two Quartz jobs: one job runs every 2 minutes and updates Lucene with changes from the db if necessary; the other job runs at 3:30 AM and does full index rebuilds. We use the LuceneIndexAccess API when working with the IndexReader/IndexWriter. Only one thread should modify the index at any time; although it is possible that the full rebuild job and the update job execute concurrently, they shouldn't both modify the index due to the LuceneIndexAccess API's synchronization. 3) Currently when doing searches we create a new IndexSearcher every time a search is to be performed; we do not use the LuceneIndexAccess API.
When I refer to the LuceneIndexAccess API I am referring to the contribution by Mark Schreiber: http://www.blizzy.de/lucene/lucene-indexaccess-0.1.0.zip
Thanks for any help!
Rafal Majewski (migrated from JIRA)
Hi,
I have recently experienced the same problem on release 1.4.3. It happened in production (more than once). A filesystem index directory is used. One application is responsible for indexing, another for searching the index (it also uses an IndexReader for query rewriting). Access to the index directory (which is a singleton) is synchronized within each application's scope. These two applications create separate directory instances and access it independently. Unfortunately, the code doesn't lead to repeatable occurrences of this exception. I assume that these two applications having unsynchronized access to the index might cause the problem, but I have received information that it had also occurred with the searching application being down.
Thanks in advance for any help.
Jarrod Cuzens (migrated from JIRA)
I am posting our corrupted index (I have to do it in two parts because it is 14.5M). I looked at it in Luke, but Luke doesn't really have any tools to help diagnose corruption issues. At this point we are considering changing our system to do builds on one machine, test them, and then have them distributed to the other machines in our cluster.
If anybody could look at this it would be greatly appreciated! Thanks! Jarrod
Jarrod Cuzens (migrated from JIRA)
Second part. :)
Ales Justin (migrated from JIRA)
We have been using Lucene 1.4.3 for half a year now, but all of a sudden this issue appeared.
We have synchronized access to the index - a synchronized singleton handles all write/read operations.
Is there any progress on finding the cause?
Jason Lambert (migrated from JIRA)
I was having this problem intermittently while indexing over multiple threads and I have found that the following steps can cause this error (with Lucene 1.3 and 1.4.x):
a. Open an IndexReader (#1) over an existing index (this reader is used for searching while updating the index)
b. Using this reader (#1), do a search for the document(s) that you would like to update; obtain their document ID numbers
c. Create an IndexWriter and add several new documents to the index (for me, this writing is done in other threads) (*)
d. Close the IndexWriter (*)
e. Open another IndexReader (#2) over the index
f. Delete the previously found documents by their document ID numbers using reader #2
g. Close the #2 reader
h. Create another IndexWriter (#2) and re-add the updated documents
i. Close the IndexWriter (#2)
j. Close the original IndexReader (#1) and open a new reader for general searching

If I ensure that the steps marked with an asterisk (*) do not happen (i.e. using thread synchronization), I never get this error. Otherwise, it will happen intermittently while closing the second IndexWriter (#2) (sometimes I get an ArrayIndexOutOfBoundsException during the deletion). These 'extra' writes cause the initial 'segments' file to be updated, after which it is re-read while opening the second IndexReader (#2).
This can end up deleting documents using possibly non-existent IDs, most likely causing the index corruption that this error signals.
Jed Wesley-Smith (migrated from JIRA)
We have seen this one as well. We don't have the same usage as above; we only ever delete documents with IndexReader.deleteDocuments(Term).
We are using Lucene 1.9.1
It occurs in two places, inside IndexWriter.addDocument():
java.lang.IllegalStateException: docs out of order
    at org.apache.lucene.index.SegmentMerger.appendPostings([Lorg/apache/lucene/index/SegmentMergeInfo;I)I(Optimized Method)
    at org.apache.lucene.index.SegmentMerger.mergeTermInfo([Lorg/apache/lucene/index/SegmentMergeInfo;I)V(Optimized Method)
    at org.apache.lucene.index.SegmentMerger.mergeTermInfos()V(Optimized Method)
    at org.apache.lucene.index.SegmentMerger.mergeTerms()V(Optimized Method)
    at org.apache.lucene.index.SegmentMerger.merge()I(Optimized Method)
    at org.apache.lucene.index.IndexWriter.mergeSegments(II)V(IndexWriter.java:681)
    at org.apache.lucene.index.IndexWriter.mergeSegments(I)V(IndexWriter.java:658)
    at org.apache.lucene.index.IndexWriter.maybeMergeSegments()V(IndexWriter.java:646)
    at org.apache.lucene.index.IndexWriter.addDocument(Lorg/apache/lucene/document/Document;Lorg/apache/lucene/analysis/Analyzer;)V(IndexWriter.java:453)
    at org.apache.lucene.index.IndexWriter.addDocument(Lorg/apache/lucene/document/Document;)V(IndexWriter.java:436)
and inside IndexWriter.close():
java.lang.IllegalStateException: docs out of order
    at org.apache.lucene.index.SegmentMerger.appendPostings([Lorg/apache/lucene/index/SegmentMergeInfo;I)I(Optimized Method)
    at org.apache.lucene.index.SegmentMerger.mergeTermInfo([Lorg/apache/lucene/index/SegmentMergeInfo;I)V(Optimized Method)
    at org.apache.lucene.index.SegmentMerger.mergeTermInfos()V(Optimized Method)
    at org.apache.lucene.index.SegmentMerger.mergeTerms()V(Optimized Method)
    at org.apache.lucene.index.SegmentMerger.merge()I(Optimized Method)
    at org.apache.lucene.index.IndexWriter.mergeSegments(II)V(IndexWriter.java:681)
    at org.apache.lucene.index.IndexWriter.mergeSegments(I)V(IndexWriter.java:658)
    at org.apache.lucene.index.IndexWriter.flushRamSegments()V(IndexWriter.java:628)
    at org.apache.lucene.index.IndexWriter.close()V(IndexWriter.java:375)
The second one exposes a problem in the close() method: the index write.lock is not released when exceptions are thrown in close(), causing subsequent attempts to open an IndexWriter to fail.
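A defensive pattern an application could use around this (a hedged sketch, not an official recommendation; `dir` and `writer` are assumed to already exist) is to clear the write lock explicitly if close() throws, using the static helpers available on IndexReader in the 1.4/1.9 line:

    try {
      writer.close();
    } catch (IllegalStateException e) {
      // close() failed after the write lock was acquired; clear it so a later IndexWriter can open
      if (IndexReader.isLocked(dir)) {
        IndexReader.unlock(dir);
      }
      throw e;
    }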
Michael McCandless (@mikemccand) (migrated from JIRA)
I just resolved #1681 as a dup of this issue.
It would be awesome if we could get a test case that shows this happening. Enough people seem to hit it that it seems likely something is lurking out there so I'd like to get it fixed!!
Jed Wesley-Smith (migrated from JIRA)
We have now seen this in a number of customer sites since upgrading JIRA to use Lucene 1.9.1. The JIRA report is here: http://jira.atlassian.com/browse/JRA-11861
We only seem to have seen it since the upgrade from 1.4.3 to 1.9.1; we hadn't seen it before then.
This is now a major issue for us; it is hitting a number of our customers. I am trying to generate a repeatable test for it as a matter of urgency.
As a follow-up, we sometimes see the old ArrayIndexOutOfBoundsException in BitVector.get() (BitVector.java:63).
Will post more if I find something worth sharing.
Jed Wesley-Smith (migrated from JIRA)
We also see an ArrayIndexOutOfBoundsException in the SegmentReader.isDeleted() method:

java.lang.ArrayIndexOutOfBoundsException
    at org.apache.lucene.index.SegmentReader.isDeleted(I)Z(Optimized Method)
    at org.apache.lucene.index.SegmentMerger.mergeFields()I(Optimized Method)
    at org.apache.lucene.index.SegmentMerger.merge()I(Optimized Method)
    at org.apache.lucene.index.IndexWriter.mergeSegments(II)V(IndexWriter.java:681)
Michael McCandless (@mikemccand) (migrated from JIRA)
OK: I finally found one way that this corruption can occur! I will create a unit test & commit a fix.
If you delete by document number, and that document number is larger than maxDoc but only by a little, then the call to deletedDocs.set(num) may in fact succeed (ie, no exception), but it will have set bits that are "out of bounds" in the BitVector's bits array.
This is because the bits array is an array of bytes and so you can have up to 7 of these unused bits at the end. Once this has happened, any attempt to merge this segment will hit the "docs out of order" exception because the BitVector's count() method will count these "illegally set" bits and thus make the SegmentMerger think too many docs are deleted.
Unfortunately, this case only occurs if you use deleteDocument(int), so I can't yet explain how this happens when using only deleteDocument(Term).
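To make the mechanism concrete, here is a tiny standalone illustration (not Lucene's actual BitVector, just the same byte[] layout) of why a doc number a few past maxDoc is accepted silently and then inflates the deleted-doc count:

    // Minimal sketch: bits live in a byte[], so up to 7 slots past `size`
    // still map into the last byte and set() succeeds without an exception.
    class TinyBitVector {
      private final byte[] bits;
      private final int size;

      TinyBitVector(int size) {
        this.size = size;
        this.bits = new byte[(size >> 3) + 1];
      }

      void set(int bit) {
        // pre-fix behavior being illustrated: no "bit >= size" bounds check
        bits[bit >> 3] |= 1 << (bit & 7);
      }

      int count() {
        int c = 0;
        for (byte b : bits) {
          c += Integer.bitCount(b & 0xFF); // counts the "illegally set" bits too
        }
        return c;
      }
    }

    // new TinyBitVector(10).set(12) throws nothing (12 >> 3 == 1 is a valid index),
    // yet count() afterwards reports a deleted doc that the segment does not have.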
Michael McCandless (@mikemccand) (migrated from JIRA)
I've committed a fix for this one case to the trunk.
I'm leaving the issue open so folks above can try the fix and confirm whether or not this fixes their cases.
Jed (or any other folks who have hit this above and are still listening!), the fix is really trivial and would be easy to back-port to prior releases: just run "svn diff -r 494135:494136" from a Lucene checkout to see the changes.
If you are willing/able to try this in one of the environments where you keep hitting this issue, that would be awesome: if this is in fact your root cause, then you would see an ArrayIndexOutOfBoundsException at the point that the delete of a too-large docNum occurred (rather than silent corruption and the above exception much later that you now see); and if it's not your root cause after testing the fix, then we would know for sure to look for another cause here.
Are you sure that you only ever do IndexReader.deleteDocuments(Term) and not deleteDocuments(int docNum)? I still can't explain how this error could happen without using that second method.
Doron Cohen (migrated from JIRA)
Amazed by this long-lasting bug report, I was going down similar routes to Mike, and I noticed 3 things:
(1) the sequence of ops brought by Jason is wrong:
a Open an IndexReader (#1) over an existing index (this reader is used for searching while updating the index)
b Using this reader (#1) do a search for the document(s) that you would like to update; obtain their document ID numbers
c Create an IndexWriter and add several new documents to the index (for me, this writing is done in other threads) ⭐
d Close the IndexWriter ⭐
e Open another IndexReader (#2) over the index
f Delete the previously found documents by their document ID numbers using reader #2
g Close the #2 reader
h Create another IndexWriter (#2) and re-add the updated documents
i Close the IndexWriter #2
j Close the original IndexReader (#1) and open a new reader for general searching
The problem here is that the docIDs found in (b) may be altered in step (d), so step (f) would delete the wrong docs. In particular, it might attempt to delete ids that are out of range. This might expose exactly the BitVector problem, and would explain the whole thing, but I too cannot see how it explains the delete-by-term case.
(2) BitVector's silent ignoring of attempts to delete slightly-out-of-bound docs that fall in the higher byte - this is the problem that Mike fixed. I think the fix is okay - though some applications might now get exceptions they did not get in the past - but I believe this is for their own good. However, when I first ran into this I didn't notice that BitVector.size() would become wrong as a result of this - nice catch Mike!
I think, however, that the test Mike added does not expose the docs-out-of-order bug - I tried this test without the fix and it only fails on the "gotException" assert - if you comment out this assert, the test passes.
The following test would expose the out-of-order bug - it fails with "docs out of order" before the fix, and succeeds with it.
    public void testOutOfOrder() throws IOException {
      String tempDir = System.getProperty("java.io.tmpdir");
      if (tempDir == null) {
        throw new IOException("java.io.tmpdir undefined, cannot run test: " + getName());
      }
      File indexDir = new File(tempDir, "lucenetestindexTemp");
      Directory dir = FSDirectory.getDirectory(indexDir, true);
      boolean create = true;
      int numDocs = 0;
      int maxDoc = 0;
      while (numDocs < 100) {
        IndexWriter iw = new IndexWriter(dir, anlzr, create); // anlzr: the analyzer field of the test class
        create = false;
        iw.setUseCompoundFile(false);
        for (int i = 0; i < 2; i++) {
          Document d = new Document();
          d.add(new Field("body", "body" + i, Store.NO, Index.UN_TOKENIZED));
          iw.addDocument(d);
        }
        iw.optimize();
        iw.close();
        IndexReader ir = IndexReader.open(dir);
        numDocs = ir.numDocs();
        maxDoc = ir.maxDoc();
        assertEquals(numDocs, maxDoc);
        // attempt deletes from maxDoc+7 down to maxDoc-1: the too-large ones should be rejected
        for (int i = 7; i >= -1; i--) {
          try {
            ir.deleteDocument(maxDoc + i);
          } catch (ArrayIndexOutOfBoundsException e) {
            // expected for doc numbers beyond maxDoc once the bounds check is in place
          }
        }
        ir.close();
      }
    }
Mike, do you agree?
(3) The maxDoc() computation in SegmentReader is based (on some paths) on RandomAccessFile.length(). IIRC I saw cases (in a previous project) where File.length() or RAF.length() (not sure which of the two) did not always reflect the real length if the system was very busy IO-wise, unless FD.sync() was called (with a performance hit).
This post seems relevant - RAF.length over 2GB in NFS - http://forum.java.sun.com/thread.jspa?threadID=708670&messageID=4103657
Not sure if this can be the case here but at least we can discuss whether it is better to always store the length.
Jed Wesley-Smith (migrated from JIRA)
Hi Michael,
This is awesome, I have prepared a patched 1.9.1: http://jira.atlassian.com/secure/attachment/19390/lucene-core-1.9.1-atlassian-patched-2007-01-09.jar
Unfortunately we don't have a repeatable test for this so we will have to distribute to afflicted customers and - well, pray I guess. We have been seeing this sporadically in our main JIRA instance http://jira.atlassian.com so we will hopefully not observe it now.
We do only use the deleteDocuments(Term) method, so we are not sure whether this will truly fix our problem, but we note that that method calls deleteDocument(int) based on the TermDocs returned for the Term - and maybe they can be incorrect???
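For context, delete-by-term in that line of Lucene is essentially a loop over the term's postings that deletes each matching doc number on the same reader; a rough sketch (from memory, not the exact source; `reader` and `term` are placeholders) looks like:

    // Roughly what IndexReader.deleteDocuments(Term) does internally: the doc
    // numbers come from this reader's own TermDocs, so they should be valid for it.
    TermDocs docs = reader.termDocs(term);
    int deleted = 0;
    try {
      while (docs.next()) {
        reader.deleteDocument(docs.doc());
        deleted++;
      }
    } finally {
      docs.close();
    }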
Out of interest, apart from changing from 1.4.3 to 1.9.1, in the JIRA 3.7 release we changed our default merge factor to 4 from 10. We hadn't seen this problem before, and suddenly we have had a reasonable number of occurrences.
Jed Wesley-Smith (migrated from JIRA)
Alas, this doesn't appear to be the problem. We are still getting it, but we do at least have a little more info. We added the doc and lastDoc to the IllegalArgEx and we are getting very strange numbers:
java.lang.IllegalStateException: docs out of order (-1764 < 0)
    at org.apache.lucene.index.SegmentMerger.appendPostings([Lorg/apache/lucene/index/SegmentMergeInfo;I)I(SegmentMerger.java:335)
    at org.apache.lucene.index.SegmentMerger.mergeTermInfo([Lorg/apache/lucene/index/SegmentMergeInfo;I)V(SegmentMerger.java:298)
    at org.apache.lucene.index.SegmentMerger.mergeTermInfos()V(SegmentMerger.java:272)
    at org.apache.lucene.index.SegmentMerger.mergeTerms()V(SegmentMerger.java:236)
    at org.apache.lucene.index.SegmentMerger.merge()I(SegmentMerger.java:89)
    at org.apache.lucene.index.IndexWriter.mergeSegments(II)V(IndexWriter.java:681)
    at org.apache.lucene.index.IndexWriter.mergeSegments(I)V(IndexWriter.java:658)
    at org.apache.lucene.index.IndexWriter.maybeMergeSegments()V(IndexWriter.java:646)
    at org.apache.lucene.index.IndexWriter.addDocument(Lorg/apache/lucene/document/Document;Lorg/apache/lucene/analysis/Analyzer;)V(IndexWriter.java:453)
    at org.apache.lucene.index.IndexWriter.addDocument(Lorg/apache/lucene/document/Document;)V(IndexWriter.java:436)
where doc = -1764 and lastDoc is zero
Michael McCandless (@mikemccand) (migrated from JIRA)
Jed, thanks for testing the fix!
> Alas, this doesn't appear to be the problem. We are still getting it, but we do at least have a little more info. We added the doc and lastDoc to the IllegalArgEx and we are getting very strange numbers:
>
> java.lang.IllegalStateException: docs out of order (-1764 < 0)
> ...
>
> where doc = -1764 and lastDoc is zero
OK, so we've definitely got something else at play here (bummer!). That (negative doc number) is very strange. I will keep looking at this. I will prepare a patch on 1.9.1 that adds some more instrumentation so we can get more details when you hit this ...
> We do only use the deleteDocuments(Term) method, so we are not sure whether this will truly fix our problem, but we note that that method calls deleteDocument(int) based on the TermDocs returned for the Term - and maybe they can be incorrect???
Right, but I had thought the docNum's coming in from this path would be correct. It sounds like we have another issue at play here that can somehow get even these doc numbers messed up.
> Out of interest, apart from changing from 1.4.3 to 1.9.1, in the JIRA 3.7 release we changed our default merge factor to 4 from 10. We hadn't seen this problem before, and suddenly we have had a reasonable number of occurrences.
Interesting. Maybe try changing back to 4 and see if it suppresses the bug? Might give us [desperately needed] more data to cling to here! On the 1.4.3 -> 1.9.1 change, some of the cases above were even pre-1.4.x (though they could have been from yet another root cause or maybe filesystem) so it's hard to draw hard conclusions on this front.
Michael McCandless (@mikemccand) (migrated from JIRA)
Doron,
> (1) the sequence of ops brought by Jason is wrong:
> ...
>
> Problem here is that the docIDs found in (b) may be altered in step (d) and so step (f) would delete the wrong docs. In particular, it might attempt to delete ids that are out of the range. This might expose exactly the BitVector problem, and would explain the whole thing, but I too cannot see how it explains the delete-by-term case.
Right, the case I fixed only happens when the Lucene deleteDocument(int docNum) is [slightly] mis-used. Ie if you are "playing by the rules" you would never have hit this bug. And this particular use case is indeed incorrect: doc numbers are only valid to the one reader that you got them from.
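For readers following along, the "play by the rules" update pattern on this API looks roughly like the sketch below (a hedged illustration: `dir` is an existing Directory, and the field names and analyzer are placeholders): delete by Term on a fresh reader, close it, then re-add through a writer, so no doc number ever crosses from one reader to another:

    IndexReader reader = IndexReader.open(dir);
    reader.deleteDocuments(new Term("id", "42")); // doc numbers never leave this reader
    reader.close();                               // flushes the deletions

    IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer(), false);
    Document doc = new Document();
    doc.add(new Field("id", "42", Field.Store.YES, Field.Index.UN_TOKENIZED));
    doc.add(new Field("body", "updated contents", Field.Store.NO, Field.Index.TOKENIZED));
    writer.addDocument(doc);                      // re-add the updated document
    writer.close();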
> I think however that the test Mike added does not expose the docs out of order bug - I tried this test without the fix and it only fail on the "gotException" assert - if you comment this assert the test pass.
Huh, I see my test case (in IndexReader) indeed hitting the original "docs out of order" exception. If I take the current trunk and comment out the (one line) bounds check in BitVector.set and run that test, it hits the "docs out of order" exception.
Are you sure you updated the change (to tighten the check to a <= from a <) to index/SegmentMerger.java? Because, I did indeed find that the test failed to fail when I first wrote it (but should have). So in digging why it didn't fail as expected, I found that the check for "docs out of order" missed the boundary case of the same doc number twice in a row. Once I fixed that, the test failed as expected.
> (3) maxDoc() computation in SegmentReader is based (on some paths) in RandomAccessFile.length(). IIRC I saw cases (in previous project) where File.length() or RAF.length() (not sure which of the two) did not always reflect real length, if the system was very busy IO wise, unless FD.sync() was called (with performance hit).
Yes I saw this too. From the follow-on discussion it sounds like we haven't found a specific known JVM bug here. Still, it does make me nervous that we rely on file length to derive maxDoc.
In general I think we should rely on as little as possible from the file system (there are so many cross platform issues/differences) and instead explicitly store things like maxDoc into the index. I will open a separate Jira issue to track this. Also I will record this path in the instrumentation patch for 1.9.1 just to see if we are actually hitting something here (I think unlikely but possible).
Michael McCandless (@mikemccand) (migrated from JIRA)
OK, I created #1842 for the "maxDoc should be explicitly stored in the index" issue.
Michael McCandless (@mikemccand) (migrated from JIRA)
Jed, one question: when you tested the fix, you fully rebuilt your index from scratch, right? Just want to verify that. You have to re-index because once the index is corrupted it will eventually hit the "docs out of order" exception even if you fix the original cause.
OK I've prepared a patch off 1.9.1 (just attached it). The patch passes all unit tests on 1.9.1.
It has the changes I committed to the trunk yesterday, plus instrumentation (messages printed to a PrintStream) to catch places where doc numbers are not correct.
All messages I added print to a newly added infoStream static member of SegmentMerger. You can do SegmentMerger.setInfoStream(...) to change it (it defaults to System.err).
Jed if you could get the error to re-occur with this patch and then post the resulting messages, that would be great. Hopefully it gives us enough information to find the source here or at least to have another iteration with yet more instrumentation. Thanks!
Jed Wesley-Smith (migrated from JIRA)
Hi Michael,
Thanks for the patch, applied and recreated. Attached is the log.
To be explicit, we are recreating the index via the IndexWriter ctor with the create flag set and then completely rebuilding the index. We are not completely deleting the entire directory. There ARE old index files (*.cfs & *.del) in the directory with updated timestamps that are months old. If I completely recreate the directory the problem does go away. This is a fairly trivial "fix", but we are still investigating as we want to know if this is indeed the problem, how we have come to make it prevalent, and what the root cause is.
Thanks for all the help everyone.
Jed Wesley-Smith (migrated from JIRA)
BTW. We have looked at all the open files referenced by the VM when the indexing errors occur, and there does not seem to be any reference to the old index segment files, so I am not sure how those files are influencing this problem.
Doron Cohen (migrated from JIRA)
Jed, is it possible that when re-creating the index, while the IndexWriter is constructed with create=true, the FSDirectory is opened with create=false? I suspect so, because otherwise the old .del files would have been deleted. If indeed so, newly created segments, which have the same names as segments in previous (bad) runs, would read the (bad) old .del files when opened. This would possibly expose the bug fixed by Michael. I may be over-speculating here, but if this is the case, it can also explain why changing the merge factor from 4 to 10 exposed the problem.
In fact, let me speculate even further: if indeed, when creating the index from scratch, the FSDirectory is (mistakenly) opened with create=false, then as long as you always repeated the same sequence of adding and deleting docs you were likely to almost never suffer from this mistake, because segments created with the same names as (old) .del files simply see docs as deleted before the docs are actually deleted by the program. The search behaves wrongly, not finding these docs before they are actually deleted, but no exception is thrown when adding docs. However, once the merge factor was changed from 4 to 10, the matching between old .del files and new segments (with the same names) was broken, and the out-of-order exception appeared.
...and if this is not the case, we would need to look for something else...
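Expressed as code, the scenario described above and the simple way to avoid it would look roughly like this (a hedged sketch; the paths are placeholders):

    // Suspected failure mode: the Directory is opened with create=false, so stale
    // _XXX.del files survive, while the IndexWriter "re-creates" the index on top of them.
    Directory dir = FSDirectory.getDirectory("/path/to/index", false);
    IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer(), true);
    // ... rebuild; a newly written segment _XXX may later match a leftover _XXX.del

    // Avoidance: create the Directory itself with create=true (or delete the old files first),
    // so no leftover .del files remain to be matched against new segments.
    Directory freshDir = FSDirectory.getDirectory("/path/to/index", true);
    IndexWriter freshWriter = new IndexWriter(freshDir, new StandardAnalyzer(), true);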
Michael McCandless (@mikemccand) (migrated from JIRA)
OK, from that indexing-failure.log (thanks Jed!) I can see that indeed there are segments whose maxDoc() is much smaller than deletedDocs.count(). This then leads to negative doc numbers when merging these segments.
Jed when you say "there are old files (*.cfs & *.del) in this directory with updated timestamps that are months old" what do you mean by "with updated timestamps"? Which timestamp is months old and which one is updated?
OK, assuming Jed you are indeed sending "create=false" when creating the Directory and then passing that directory to IndexWriter with create=true, I think we now have this case fully explained (thanks Doron): your old _*.del files are being incorrectly opened & re-used by Lucene, when they should not be.
Lucene (all released versions but not the trunk version, see below) does a simple fileExists("_XXX.del") call to determine if a segment XXX has deletes.
But when that _XXX.del is a leftover from a previous index, it very likely doesn't "match" the newly created _XXX segment. (Especially if merge factor has changed but also if order of operations has changed, which I would expect in this use case).
If that file exists, Lucene assumes it's for this segment and so opens it and uses it. If it happens that this _XXX.del file has more documents in it than the newly created _XXX.cfs segment, then negative doc numbers will result (and then later cause the "docs out of order" exception). If it happens that the _XXX.del file has fewer documents than the newly created _XXX.cfs segment then you'll hit ArrayIndexOutOfBounds exceptions in calls to isDeleted(...). If they are exactly equal then you'd randomly see some of your docs got deleted.
Note that the trunk version of Lucene has already fixed this bug (as part of lockless commits):
Whether a segment has deletions or not is now explicitly stored in the segments file, rather than relying on a "fileExists(...)" call. So, if an old _XXX.del existed in the filesystem, the newly created _XXX segment would not open it.
Furthermore, the trunk version of Lucene uses a new IndexFileDeleter class to remove any unreferenced index files. This means it would have removed these old *.cfs and *.del files even in the case where a directory was created with "create=false" and the IndexWriter was created with "create=true".
To summarize:
There was one case where, if you gave slightly illegal doc numbers (within 7 of the actual maxDoc), Lucene might silently accept the call but would corrupt your index, only to be seen later as a "docs out of order" IllegalStateException when the segment is merged. This was just a missing boundary case check. This case is now fixed in the trunk (you get an ArrayIndexOutOfBoundsException if the doc number is too large).
There is also another case, that only happens if you have old _*.del files leftover from a previous index while re-creating a new index.
The workaround is simple here: always open the Directory with create=true (or, remove the directory contents yourself before hand). (IndexWriter does this if you give it a String or File with create=true).
This is really a bug in Lucene, but given that it's already fixed in the trunk, and the workaround is simple, I'm inclined to not fix it in prior releases and instead publicize the issue (I will do so on java-user).
But, I will commit two additional IllegalStateException checks to the trunk when a segment is first initialized: 1) check that the two different sources of "maxDoc" (fieldsReader.size() and si.docCount) are the same, and 2) check that the number of pending deletions does not exceed maxDoc(). When an index has an inconsistency, I think the earlier it's detected the better.
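The shape of those two checks, as a hedged sketch rather than the committed patch, would be something like:

    // Run when a segment is first opened, so inconsistency is reported immediately
    // instead of surfacing later as "docs out of order" during a merge.
    void checkSegmentConsistency(int fieldsReaderSize, int siDocCount, int pendingDeletes) {
      if (fieldsReaderSize != siDocCount) {
        throw new IllegalStateException("doc counts differ for segment: fieldsReader="
            + fieldsReaderSize + " vs segmentInfo.docCount=" + siDocCount);
      }
      if (pendingDeletes > siDocCount) {
        throw new IllegalStateException("number of pending deletes (" + pendingDeletes
            + ") exceeds maxDoc (" + siDocCount + ")");
      }
    }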
Michael McCandless (@mikemccand) (migrated from JIRA)
> BTW. We have looked at all the open files referenced by the VM when the indexing errors occur, and there does not seem to be any reference to the old index segment files, so I am not sure how those files are influencing this problem.
Jed, just to answer this question: the _XXX.del files are opened very briefly because the contents of the file are loaded / cached in memory, and then the file handle is closed. I don't think the _XXX.cfs files are affecting this issue (they are not opened).
Jed Wesley-Smith (migrated from JIRA)
Michael, Doron, you guys are legends!
Indeed, the problem was using only the IndexWriter with create=true to recreate the index. Creating a new Directory with create=true does fix the problem. The javadoc for this constructor is fairly explicit that it should recreate the index for you (no caveat), so I would consider that a bug, but - given that head fixes it - not one that requires any action.
Thanks guys for the prompt attention, excellent and thorough analysis.
Michael McCandless (@mikemccand) (migrated from JIRA)
Phew! I'm glad we finally got to the bottom of this one.
Thank you for your persistent and fast testing iterations, Jed; this issue has been open for far too long!
I will send a summary email to java-user and resolve this issue, finally.
Michael McCandless (@mikemccand) (migrated from JIRA)
Actually, this reminds me that, as of lockless commits, there is one important tradeoff on which "create=true" to use (the case on windows where you want to re-create the index but readers are currently using it). I will call out this difference in the javadocs.
Although, why do we even have a "create" parameter in the directory? I think it's confusing (and dangerous, pre-trunk, due to this issue) to have two ways of doing the same thing?
Logically, I don't think a Directory should take the responsibility of deleting old files (including old lock files). It should be a clean interface for doing so, but I think the IndexWriter alone should be the class that deletes files from the directory.
With lockless commits this has become an important difference, ie, the new IndexFileDeleter class (used by IndexWriter) handles re-trying files that are in use (on Windows), whereas FSDirectory will throw an exception if create=true and there are index files in use.
I think we should deprecate the "create" argument to FSDirectory.getDirectory(*) and leave only the create argument in IndexWriter's constructors. Am I missing something? Is there are a reason not to do this?
Michael McCandless (@mikemccand) (migrated from JIRA)
Resolving this now, finally (I'll move the two "create" arguments discussion to a separate issue if we decide to go forward with that):
Fixed (in the trunk) to catch boundary cases of incorrect docNum's to deleteDocuments.
Fixed (in the trunk) to do earlier "IllegalState" checks to catch index corruption sooner. Also fixed the existing IllegalState check to catch missing boundary cases.
The re-using of old _XXX.del files is already fixed with lockless commits (in trunk). This remains open for past releases, but the workaround is simple and I've now publicized this on java-user.
Michael McCandless (@mikemccand) (migrated from JIRA)
Closing all issues that were resolved for 2.1.
Hello, I cannot find out why (and what) is happening all the time. I get an exception:

java.lang.IllegalStateException: docs out of order
    at org.apache.lucene.index.SegmentMerger.appendPostings(SegmentMerger.java:219)
    at org.apache.lucene.index.SegmentMerger.mergeTermInfo(SegmentMerger.java:191)
    at org.apache.lucene.index.SegmentMerger.mergeTermInfos(SegmentMerger.java:172)
    at org.apache.lucene.index.SegmentMerger.mergeTerms(SegmentMerger.java:135)
    at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:88)
    at org.apache.lucene.index.IndexWriter.mergeSegments(IndexWriter.java:341)
    at org.apache.lucene.index.IndexWriter.optimize(IndexWriter.java:250)
    at Optimize.main(Optimize.java:29)
It happens in both 1.2 and 1.3rc1 (by the way, what happened to that release? I cannot find it in the downloads or the version list in this form). Everything seems OK: I can search the index, but I cannot optimize it. Even worse, after this exception, every time I add new documents and close the IndexWriter, a new segment is created! I think it contains all the documents added before, judging by its size.
My index is quite big: 500,000 docs, about 5 GB in the index directory.
It is repeatable: I drop the index and reindex everything. Afterwards I add a few docs, try to optimize, and receive the above exception.
My documents' structure is:

    static Document indexIt(String id_strony, Reader reader, String data_wydania,
                            String id_wydania, String id_gazety, String data_wstawienia) {
      Document doc = new Document();
    }
Sincerely, legez
Migrated from LUCENE-140 by legez, 5 votes, resolved Jan 11 2007
Attachments: ASF.LICENSE.NOT.GRANTED--bug23650.txt, corrupted.part1.rar, corrupted.part2.rar, indexing-failure.log, LUCENE-140-2007-01-09-instrumentation.patch