Track FieldInfo per segment instead of per-IW-session [LUCENE-2881]

asfimport commented 13 years ago

Currently FieldInfo is tracked per IW session to guarantee consistent global field-naming / ordering. IW carries FI instances over from previous segments which also carries over field properties like isIndexed etc. While having consistent field ordering per IW session appears to be important due to bulk merging stored fields etc. carrying over other properties might become problematic with Lucene's Codec support. Codecs that rely on consistent properties in FI will fail if FI properties are carried over.

The DocValuesCodec (DocValuesBranch) for instance writes files per segment and field (using the field id within the file name). Yet, if a segment has no DocValues indexed in a particular segment but a previous segment in the same IW session had DocValues, FieldInfo#docValues will be true since those values are reused from previous segments.

We already work around this "limitation" in SegmentInfo with properties like hasVectors or hasProx which is really something we should manage per Codec & Segment. Ideally FieldInfo would be managed per Segment and Codec such that its properties are valid per segment. It also seems to be necessary to bind FieldInfoS to SegmentInfo logically since its really just per segment metadata.

Migrated from LUCENE-2881 by Simon Willnauer (@s1monw), resolved Mar 30 2011 Attachments: lucene-2881.patch (versions: 5), LUCENE-2881.patch (versions: 4) Linked issues:

3400
- 4057

asfimport commented 13 years ago

Michael Busch (migrated from JIRA)

It also seems to be necessary to bind FieldInfoS to SegmentInfo logically since its really just per segment metadata.

Yeah, I was going to do this in the realtime branch, because I don't think we can get DWPTs working correctly without SegmentInfos having private FIs.

asfimport commented 13 years ago

Simon Willnauer (@s1monw) (migrated from JIRA)

Yeah, I was going to do this in the realtime branch, because I don't think we can get DWPTs working correctly without SegmentInfos having private FIs.

i think we should do that on trunk and then merge to RT - do you have time to work on this soon?

asfimport commented 13 years ago

Michael Busch (migrated from JIRA)

It would probably make sense to have a new class (maybe an extension of SegmentInfo) for in-memory (not-yet-flushed) segments that references the corresponding FieldInfos and SegmentDeletes. That'd be better I think that adding another map SegmentInfo -> FieldInfos and we could then also remove the SegmentInfo -> SegmentDeletes map (in BufferedDeletes).

asfimport commented 13 years ago

Michael Busch (migrated from JIRA)

i think we should do that on trunk and then merge to RT - do you have time to work on this soon?

Yeah I agree. Hmm maybe I can spend some hours tonight on this, otherwise I don't think I'll have much time before Thursday.

asfimport commented 13 years ago

Simon Willnauer (@s1monw) (migrated from JIRA)

Yeah I agree. Hmm maybe I can spend some hours tonight on this, otherwise I don't think I'll have much time before Thursday.

Michael, if you start something I can work on it tomorrow for a while too. That way we can get it in quickly ;)

asfimport commented 13 years ago

Michael Busch (migrated from JIRA)

* Creates for every segment a new FieldInfos

Changes FieldInfos, so that the FieldInfo numbers within a single FieldInfos don't have to be contiguous - this allows using the same numbering as the previous segment(s), even if not all fields are present in the new segment
Adds a global fieldName -> fieldNumber map; if possible when a new field is added to a FieldInfo it tries to use an already assigned number for that field

All tests pass. Though I need to verify if the global map works correctly (it'd probably be good to add a test for that). Also it'd be nice to remove hasVectors and hasProx from SegmentInfo, but we could also do that in a separate issue.

asfimport commented 13 years ago

Simon Willnauer (@s1monw) (migrated from JIRA)

Michael, this looks very good!

All tests pass. Though I need to verify if the global map works correctly (it'd probably be good to add a test for that). Also it'd be nice to remove hasVectors and hasProx from SegmentInfo, but we could also do that in a separate issue.

I agree we should make FieldInfos a member of SegmentInfo and remove the hasVectors / hasProx an push them down. I tried to apply this patch to the DocValues branch but I didn't get very far - I haven't merged for a while that killed me ;( I need to push that to after my vacation. I really hoped that I can get further but it didn't work out ;(

asfimport commented 13 years ago

Michael Busch (migrated from JIRA)

New patch that removes the tracking of 'hasVectors' and 'hasProx' in SegmentInfo. Instead SegmentInfo now has a reference to its corresponding FieldInfos.

For backwards-compatibility reasons we can't completely remove the hasVectors and hasProx bytes from the serialized SegmentInfo yet. Eg. if someone uses addIndexes(Directory...) to add external old pre-4.0 segments to a new index, we "upgrade" the SegmentInfo to the latest version. However, we don't modify the FieldInfos of that segment, instead we just copy it over to the new dir. So the hasVector and hasProx bits in the FieldInfos might not be accurate and we have to keep those bits in the SegmentInfo instead. Not an ideal solution, but we can remove it entirely in Lucene 5.0 :). The alternative would be to rewrite the FieldInfos instead of just copying the files, but then we have to rewrite the cfs files.

All core & contrib tests pass.

asfimport commented 13 years ago

Simon Willnauer (@s1monw) (migrated from JIRA)

New patch that removes the tracking of 'hasVectors' and 'hasProx' in SegmentInfo. Instead SegmentInfo now has a reference to its corresponding FieldInfos.

Wow! nice work Michael! I like how you preserve bw compat in SegmentInfo and FieldInfos is now bound to SegmentInfo - yay! this solves two problems at once for DocValues branch.

bq.The alternative would be to rewrite the FieldInfos instead of just copying the files, but then we have to rewrite the cfs files.

I think copying over is fine. Ideally we will move all those boolean etc to the codec level so that we don't need that at all. Once stored fields and vectors are written by the codec we can push all that into PreFlex codec (maybe!?) and get rid of the bw compat code.

I think you should commit that patch. I'll port to docvalues and run some tests that rely on this issue.

asfimport commented 13 years ago

Simon Willnauer (@s1monw) (migrated from JIRA)

I gave the patch another glance - here are a couple of very minor comments:

maybe we should return Iterable<FieldInfo> from FieldInfos#getFieldInfoIterator() this would make the iteration syntactically more javaish {code}

for (FieldInfo info : getFieldInfoIterator()) { // do something with it }

{code}

If we return Iterable<FieldInfo> should we name it getFieldInfos?
Maybe we can simply implement Iterable<FieldInfo>?
Maybe we can rename SI#clearFilesCache() to SI#clearCache() or simply SI#clear() this would make thing less coupled to the SI internal cache but rather something that is called to clear internal state after flush?

asfimport commented 13 years ago

Michael Busch (migrated from JIRA)

Thanks for reviewing!

I think you should commit that patch. I'll port to docvalues and run some tests that rely on this issue.

I just want to add another tests for the global fieldname->number map, after that I think it'll be ready to commit. Will do that tonight :)

asfimport commented 13 years ago

Michael Busch (migrated from JIRA)

Maybe we can simply implement Iterable<FieldInfo>?

good idea - done.

Maybe we can rename SI#clearFilesCache()

Actually I renamed it intentionally, because all this method does is really clearing the files cache. SI has a separate reset() method for resetting its state entirely.

asfimport commented 13 years ago

Michael Busch (migrated from JIRA)

New patch that adds a new junit for testing that field numbering is consistent across segments. It tests two cases: 1) one IW is used to write two segments; 2) two IWs are used to write two segments.
And it also tests that addIndexes(Directory...) doesn't mess up the field numbering of the external segment.

All tests pass. I'll commit this in a day or two if nobody objects.

asfimport commented 13 years ago

Simon Willnauer (@s1monw) (migrated from JIRA)

All tests pass. I'll commit this in a day or two if nobody objects.

+1

asfimport commented 13 years ago

Simon Willnauer (@s1monw) (migrated from JIRA)

I just integrated this patch to the docvalues branch! It works like a charm! Nice work Michael, this brings docValues a huge step closer. All tests pass which failed before, FieldInfo is reliable now!

asfimport commented 13 years ago

Michael Busch (migrated from JIRA)

Awesome, thanks for letting me know! I hope I'll be able to say the same about the RT branch after I tried it there... :)

asfimport commented 13 years ago

Simon Willnauer (@s1monw) (migrated from JIRA)

I actually have some intermitted failures on docvalues which seem to be caused by some mixed up codec IDs. I found one which fixed most of the cases in FieldInfo#clone() where the codec ID is lost. But there must be some other situation where the codec ID gets mixed up. I will be AFK for 10 days at least so I have to leave you alone with that. Seems that we need a test for that though. The docValues test bring that up quickly :(

asfimport commented 13 years ago

Michael Busch (migrated from JIRA)

I fixed a bug in FieldInfos that could lead to wrong field numbers, that might have been related to the wrong behavior you're seeing, Simon.

About codecIds: I made the fix to FieldInfo.clone() to set the codecId on the clone. I also made FieldInfo.codecId private and added getter and setter. The setter checks whether the new value for codecId is different from the previous one, and throws in exception in that case (unless it was set to the default 0 before, which I think means Preflex codec).

All tests pass. Please let me know if that fixes your problem. If not then you should at least see the new exception that I added, which might make debugging easier.

asfimport commented 13 years ago

Simon Willnauer (@s1monw) (migrated from JIRA)

I fixed a bug in FieldInfos that could lead to wrong field numbers, that might have been related to the wrong behavior you're seeing, Simon

ah that could be the reason. I will need to patch my branch again to see if your patch helps. Will do tomorrow.

About codecIds: I made the fix to FieldInfo.clone() to set the codecId on the clone. I also made FieldInfo.codecId private and added getter and setter. The setter checks whether the new value for codecId is different from the previous one, and throws in exception in that case (unless it was set to the default 0 before, which I think means Preflex codec).

The clone issue seems to be fixed in your latest patch. While the setter seems kind of wrong. Lemme explain how that numbering works. If you create a SI for a flushed segment SegmentCodecs creates an ordinal for each codec and sets it to the corresponding fields. The ordinal (codecID) is the array index in the SegmentCodec's Codec array which holds the codec instance used for that field. So codecID = 0 is a valid value for segments having PreFlex or a >= 4.0 codec. But if we open a segment that is pre-flex there will only be one codec for the entire segment with codecID=0, thats why this is assigned. (Note: I need to document this where its set!) I think we should initialize the codecID with a different value and replace the this.codecId ![= 0 check with something like this.codecId ](= 0 check with something like this.codecId )= -1. Maybe we should just use an assert here instead of an exception, this is somewhat internal though.

All tests pass. Please let me know if that fixes your problem. If not then you should at least see the new exception that I added, which might make debugging easier.

Will do tomorrow! What exactly was the problem with the previous patch beside the codecID clone issue?

asfimport commented 13 years ago

Michael Busch (migrated from JIRA)

I think we should initialize the codecID with a different value and replace the this.codecId != 0 check with something like this.codecId != -1.

Yeah, I had the same though. I changed it to use -1 and use an assert now instead of throwing the exception. (will post the new patch shortly)

What exactly was the problem with the previous patch beside the codecID clone issue?

Not sure if that's what caused your codecID issues, but the previous patch had a problem with assigning field numbers. It could happen that a global number for a FieldInfo was acquired, but that number wasn't available anymore in the local FieldInfos. I think this would be quite rare, but now I'm preventing this from happening.

asfimport commented 13 years ago

Michael Busch (migrated from JIRA)

Uses -1 now as initial value for codecID.
updated to current trunk

Let me know if it works without problems now in the doc values branch, Simon!

asfimport commented 13 years ago

Simon Willnauer (@s1monw) (migrated from JIRA)

Uses -1 now as initial value for codecID.

alright that looks good!

Let me know if it works without problems now in the doc values branch, Simon!

this looks good now. I think we need to add a CHANGES.TXT entry for this issue since it changes the runtime behavior. +1 to commit Nice work!

asfimport commented 13 years ago

Michael Busch (migrated from JIRA)

Committed revision 1073110.

Thanks for reviewing the patch, Simon!

asfimport commented 13 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

One question: we seem to have lost DocFieldProcessorPerThread.trimFields?

If an app indexes docs where each doc has different fields... does this mean we are not going to reclaim the DFPPerField for the fields that are no longer seen after flushing? Ie, would it be a memory leak? Somewhere we have a test case for this I think.

asfimport commented 13 years ago

Michael Busch (migrated from JIRA)

One question: we seem to have lost DocFieldProcessorPerThread.trimFields?

I actually renamed it to doAfterFlush(). It now resets the whole hashmap in DocFieldProcessorPerThread, because we don't want to carry over any field settings into the next segment anymore with per-segment FieldInfos. I think this should be fine?

asfimport commented 13 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

I actually renamed it to doAfterFlush(). It now resets the whole hashmap in DocFieldProcessorPerThread, because we don't want to carry over any field settings into the next segment anymore with per-segment FieldInfos. I think this should be fine?

Ahh, good, I think you're right! I missed that we clear this on flush.

asfimport commented 13 years ago

Michael Busch (migrated from JIRA)

I mentioned on dev that assigning the same field number across segments is best effort now and wanted to explain in greater detail here how it works:

There is now a global fieldName <-> fieldNumber bi-map in FieldInfos, which contains all fieldName/number pairs seen in a IndexWriter session. It is passed into each new FieldInfos that is created in the same IndexWriter session.

Also, when a new IndexWriter is opened, the FieldInfos of all segments are read and the global map created - this is tested in a new unit test this issue adds.

A FieldInfos has in addition to the reference to the global map also a "private" map, which holds all FieldInfo objects that belong to the corresponding segment (remember there's now a 1-1 mapping SegmentInfo->FieldInfos).

Now the fieldNumber assignment strategy works as follows: If a new FI is added to FieldInfos, the global map is checked for the number of that field. If the field name hasn't been seen before, the smallest number available in the local map is picked (to keep the numbers dense).
Otherwise, if we have seen the field before, the global number is used. The problem now might be, that the global number might already be taken in the local FieldInfos. In this case the global and local numbers for the same fieldName would differ. This is not a problem in terms of correctness, but could prevent that field from being efficiently bulk-merged.

With DocumentsWriterPerThreads (DWPTs) in mind I don't see how we could guarantee consistent field numbering across DWPTs, that's why I implemented it in this "best effort" way.

Here's an example on how we can get into a situation where a field would get different numbers in different segments: segment_1 has fields A and B, therefore these mappings A -> 1, B -> 2. Now in segment_2 the first field we add is C, which hasn't been seen ever before, so we pick locally number 1 for it. Then we add the next document which has field A, but since number 1 is already taken, it would get a different number than in segment_1. This means A would not get bulk merged.

Hmm, after writing this example down I'm realizing that it would be better to just always pick the next available global field number for a new field, then, at least until we get DWPTs, we should never get different numbers across segments, I think? The disadvantage would be that FieldInfos could have "gaps" in the numbers. I implemented the current approach because I wanted to avoid those gaps, but having them would probably not be a big deal?

asfimport commented 13 years ago

Simon Willnauer (@s1monw) (migrated from JIRA)

Thanks for the clarification, michael!

... the smallest number available in the local map is picked (to keep the numbers dense). ...

Oh man I was not aware of this. I got totally confused... see next comment

Hmm, after writing this example down I'm realizing that it would be better to just always pick the next available global field number for a new field, then, at least until we get DWPTs, we should never get different numbers across segments, I think? The disadvantage would be that FieldInfos could have "gaps" in the numbers. I implemented the current approach because I wanted to avoid those gaps, but having them would probably not be a big deal?

This is how I thought it works but I obviously got confused by global vs. local this is also why I had trouble to understand how that failure could ever have happened. But after looking at the code again this makes sense ;) I don't see any problems in FieldInfo number gaps. this should work just fine and guarantee the bulk copy just for now at least.

asfimport commented 13 years ago

Michael Busch (migrated from JIRA)

Reopening to make the described improvement that ensures consistent field numbers.

asfimport commented 13 years ago

Michael Busch (migrated from JIRA)

I don't see any problems in FieldInfo number gaps. this should work just fine and guarantee the bulk copy just for now at least.

I was thinking that we probably write field numbers as VInts in a lot of places, and it would therefore be less efficient to have gaps... but this is probably negligible.

asfimport commented 13 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

This was an impressive change :) Not only did we shift FieldInfos under SegmentInfo, we also fixed a given FieldInfos to allow for sparse mapping (ie, it only contains certain field numbers). Various places previously assumed this was a dense mapping.

I think even in the DWPT case we should try to assign consistent field numbers? I'm already worried enough about the possible perf hit DWPT will put on normal indexing and the NRT case, since it basically pushes merging out from RAM and onto the "normal" more costly disk based merging. If we also suddenly risk these merges not being bulk merges, that's even worse.

Can't we sync globally on the assignment of field name -> number (the global map lookup)? And FieldInfos per-DWPT would share the same global map. Wouldn't that keep us consistent in the DWPT case?

asfimport commented 13 years ago

Michael Busch (migrated from JIRA)

Can't we sync globally on the assignment of field name -> number (the global map lookup)? And FieldInfos per-DWPT would share the same global map. Wouldn't that keep us consistent in the DWPT case?

Yes. The global map is already shared across DWPTs and the lookup is synchronized on the global map. I think if we change the logic to always pick the next available global number we would increase the likelihood that fields get bulk-merged.

It can't be perfect though, because e.g. if you use addIndexes() to add an external segment that has a different field number assignment. That's why we definitely have to keep the code that can fallback to a different local number if it's not possible to use the global number in a segment. But I agree that we should optimize for the "normal" indexing case. And it seems like we all agree that field number gaps are fine.

asfimport commented 13 years ago

Simon Willnauer (@s1monw) (migrated from JIRA)

For the record, robert reverted the changes made by this issue since we have been experiencing a fair bit of problems lately.

eventually reproducible with:

ant test -Dtestcase=SimpleFacetsTest -Dtestmethod=testFacetSingleValued -Dtests.seed=-4971136915249645135:5200209917417531291 -Dtests.multiplier=3

ant test -Dtestcase=SimpleFacetsTest -Dtestmethod=testFacetSingleValuedFcs -Dtests.seed=-4971136915249645135:-3738166620811568832 -Dtests.multiplier=3

ant test -Dtestcase=SimpleFacetsTest -Dtestmethod=testFacetPrefixMultiValued -Dtests.seed=-4971136915249645135:4594369826150277150 -Dtests.multiplier=3

ant test -Dtestcase=SimpleFacetsTest -Dtestmethod=testFacetPrefixSingleValued -Dtests.seed=-4971136915249645135:-7702531001769827248 -Dtests.multiplier=3

ant test -Dtestcase=SimpleFacetsTest -Dtestmethod=testFacetPrefixSingleValuedFcs -Dtests.seed=-4971136915249645135:698398490325732548 -Dtests.multiplier=3

I found the problem causing this where certain field numbers got mixed up when the FieldInfos get build initially in IndexWriter and a segment is loaded first which had gaps in its field numbering. FieldInfos is ignoring the FieldInfo's number if the FieldInfo does not exist yet and tries to assigne a new "local" field number. But if the next available field number x while the actual FI's number was > x+1 the new added FI will be set to x instead.

in other words, lets say we have 2 segments:

 seg1 : { fields : [(a:0, c:2)] } 
 seg2 : { fields : [(a:0, b:1, c:2)] }

if we load seg1's FI we end up with

fields : [(a:0, c:1)]

then we add seg2's FI's and end up with

fields : [(a:0, c:1, b:2)]

this will also explain the TestNRTThreads.testNRTThreads failure where bulkMerge could not be applied due to different field numbers across segments.

I will upload a patch tomorrow.

asfimport commented 13 years ago

Simon Willnauer (@s1monw) (migrated from JIRA)

Here is an updated patch against trunk that fixes the problems we had with SimpleFacetsTest. I added a testcase that triggered anyNonBulkMerges to be false as well as field number to be mixed up across segments. This patch also tracks the fieldNumbers globally within a IW session.

I have been running tests with while(true) { ant clean test } on a top level directory without any failure for 12 hours now. Aside of that the testcase I added failed reliably with the previous version and it explains all failures we have see so far especially the one with NRTThreads.

asfimport commented 13 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Unfortunately, I'm still seeing the assert trip (that there are no non-bulk merges) in TestNRTThreads. Took beast a while to repro but eventually it did...

I'm also seeing this in TestIndexWriterExceptions:

NOTE: reproduce with: ant test -Dtestcase=TestIndexWriterExceptions -Dtestmethod=testDocumentsWriterExceptions -Dtests.seed=-1056365072156857559:-3700694839465596125

There was 1 failure:
1) testDocumentsWriterExceptions(org.apache.lucene.index.TestIndexWriterExceptions)
java.io.FileNotFoundException: _1.tvx
    at org.apache.lucene.store.RAMDirectory.fileLength(RAMDirectory.java:147)
    at org.apache.lucene.store.MockDirectoryWrapper.fileLength(MockDirectoryWrapper.java:524)
    at org.apache.lucene.index.SegmentInfo.sizeInBytes(SegmentInfo.java:290)
    at org.apache.lucene.index.LogMergePolicy.sizeBytes(LogMergePolicy.java:211)
    at org.apache.lucene.index.LogByteSizeMergePolicy.size(LogByteSizeMergePolicy.java:45)
    at org.apache.lucene.index.LogMergePolicy.useCompoundFile(LogMergePolicy.java:165)
    at org.apache.lucene.index.DocumentsWriter.flush(DocumentsWriter.java:629)
    at org.apache.lucene.index.IndexWriter.doFlush(IndexWriter.java:2538)
    at org.apache.lucene.index.IndexWriter.flush(IndexWriter.java:2503)
    at org.apache.lucene.index.IndexWriter.closeInternal(IndexWriter.java:1077)
    at org.apache.lucene.index.IndexWriter.close(IndexWriter.java:1041)
    at org.apache.lucene.index.IndexWriter.close(IndexWriter.java:1005)
    at org.apache.lucene.index.TestIndexWriterExceptions.testDocumentsWriterExceptions(TestIndexWriterExceptions.java:565)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:44)
    at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:15)
    at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:41)
    at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:20)
    at org.junit.rules.TestWatchman$1.evaluate(TestWatchman.java:48)
    at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:28)
    at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:31)
    at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:76)
    at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1213)
    at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1145)
    at org.junit.runners.ParentRunner$3.run(ParentRunner.java:193)
    at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:52)
    at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:191)
    at org.junit.runners.ParentRunner.access$000(ParentRunner.java:42)
    at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:184)
    at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:28)
    at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:31)
    at org.junit.runners.ParentRunner.run(ParentRunner.java:236)
    at org.junit.runners.Suite.runChild(Suite.java:128)
    at org.junit.runners.Suite.runChild(Suite.java:24)
    at org.junit.runners.ParentRunner$3.run(ParentRunner.java:193)
    at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:52)
    at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:191)
    at org.junit.runners.ParentRunner.access$000(ParentRunner.java:42)
    at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:184)
    at org.junit.runners.ParentRunner.run(ParentRunner.java:236)
    at org.junit.runner.JUnitCore.run(JUnitCore.java:157)
    at org.junit.runner.JUnitCore.run(JUnitCore.java:136)
    at org.junit.runner.JUnitCore.run(JUnitCore.java:117)
    at org.junit.runner.JUnitCore.runMain(JUnitCore.java:98)
    at org.junit.runner.JUnitCore.runMainAndExit(JUnitCore.java:53)
    at org.junit.runner.JUnitCore.main(JUnitCore.java:45)

FAILURES!!!

The failure reproduces for me (with the above seed)...

asfimport commented 13 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Also, I think we should understand why Solr's SimpleFacetsTest failed from the previous patch and hopefully make a standalone test showing the problem (and that we fixed it!).

asfimport commented 13 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

This is a big patch! Some comments...:

Do we even use FieldInfos.add(Document)? Maybe we can remove it...
FieldNumberBiMap.addOrGet doesn't need to take the FieldInfoBiMap? (It's an unused param).
Why create the FieldInfoBiMap class? Ie, why not "merge" that back into FieldInfos itself (like it used to be)? (I understand why we need FieldNumberBiMap – so we can share a single instance across FieldInfos).
We mix up autoboxing then unboxing, eg FieldInfos.addOrUpdateInternal takes int preferredFieldNumber, which is boxed when calling localFieldInfos.nextFieldNumber, then manually unboxed on calling globalFieldNumbers.addOrGet.
On working with a pre-4.0 index that has non-congruent assignments across segments... I fear we may not necessarily ever "stabilize" on a fixed global name/number bimap, because we re-compute this map on every IW init? Ie, when you first open 4.0 IW on pre-4.0 index, you'll compute a certain global map, and write new segs with those bindings. But say some fields differed in their assignment... but then some of those conflicting segs are merged. Later, when you open the IW again, you'll get a different global map? And write new segments conflicting with the previous new segments you had written?
The fact that SegmentInfo.clearFilesCache is now a public API and consumer is responsible for knowing when to call this is... spooky. Previously this cache was a fully private thing (ie invalidated whenever a change was made to the SegmentInfo). But I don't see any way around it; since we now embed a FieldInfos and that FieldInfos (hasVectors) could change... maybe, whenever we call this method, add a comment explaining why?
It makes me nervous that the API that's allowed to pick a new field number (FieldInfo.addInternal) is the same API that used when reading a FieldInfos from _X.fnm (when we better not pick a different field number!). In theory of course those field numbers will never conflict and we'll always get our preferred field number... but still. Maybe add an assert in FieldInfos.read() verifying we always get that field number?
The call to localFieldInfos.setIfNotSet in FieldInfos.addInternal makes me nervous... is it actually possible for it to already be set, to a conflicting binding? Shouldn't it always match? (Ie, above, in addOrUpdateInternal, we just consulted the global map to get the binding). Can't we assert the global binding is either not present (and we add it) or if it is present it "matches"?
Why do we have FieldInfos.clearVectors? Nobody should call that...?
It's not great that we open then close the CFS reader inside SegmentInfo, just to read the FieldInfos. Ie, this means on opening an SR we will open this CFS reader twice... it also means that opening a SegmentInfos is quite a bit more costly than it used to be. EG creating an IW must now go and open/close a CFS reader per-segment... not sure what we can really do about that though... maybe, we should store the FieldInfos inside the segments file? Hmmm....
Shouldn't IndexWriter.getFieldInfos(SegmentInfo) use the SegmentInfo's fieldInfos rather than loading it again from the directory...?

Indentation is also off in various places, for us lonely people who still use Emacs ;)

asfimport commented 13 years ago

Michael Busch (migrated from JIRA)

maybe, we should store the FieldInfos inside the segments file? Hmmm....

I had the same thought while adding the ref to FieldInfos to SegmentInfo. Actually this is probably the right thing to do. At the same time we could switch to a human-readable format :)

I fear we may not necessarily ever "stabilize" on a fixed global name/number bimap, because we re-compute this map on every IW init?

We could also store the global map on disk? addIndexes() would have to ignore the global map from the external index(es).

asfimport commented 13 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

I had the same thought while adding the ref to FieldInfos to SegmentInfo. Actually this is probably the right thing to do. At the same time we could switch to a human-readable format

Human-readable format would be sweet :)

Though I'm still generally nervous that this means just opening the segments file will become quite a bit more costly. Apps that have many fields will be especially penalized, though, apps really should not be creating so many fields.

We could also store the global map on disk?

That's an interesting idea? That'd ensure stability on the bindings, even for pre-4.0 indices. This way a pre-4.0 index would gradually work itself towards being fully consistent...

addIndexes() would have to ignore the global map from the external index(es).

Well, addIndexes(IR[]) would get fully remapped to the correct bindings? (since it's a real merge).

But, yes, addIndexes(Dir[]) would not – they are just file-copied. Hmm, they'd also presumably have a different global map, so if we stored the index global map in the Directory, how would we resolve conflicts on the incoming addIndexes...? I guess the local mapping for the incoming segments would override the global one, on conflict.

asfimport commented 13 years ago

Yonik Seeley (@yonik) (migrated from JIRA)

w.r.t. storing FielInfos in the segments file - don't we get some consistency benefit from the segments file being small (i.e. it's all in one disk block so it either gets written in it's entirety or not, and readers will see all of it, or not)?

asfimport commented 13 years ago

Simon Willnauer (@s1monw) (migrated from JIRA)

Unfortunately, I'm still seeing the assert trip (that there are no non-bulk merges) in TestNRTThreads. Took beast a while to repro but eventually it did...

this is a problem we have seen before. This is an exception where we write a single doc segment and fail inverting the doc after FI has been updated. Since hasVectors has been moved to FI out of SI this is inconsistent and the TV tries to open the files since hasVectors is true.

This is also why buschmi added clearVectors (just as a workaround afaik)

Also, I think we should understand why Solr's SimpleFacetsTest failed from the previous patch and hopefully make a standalone test showing the problem (and that we fixed it!).

man I still try to trigger it in isolation but I can spend too much time on that right now. Got sucked into Tiered Flushing ;)

About the other comments (the big bulletpoint list) - in the meanwhile I think we should split this up in several smaller issues.

start with resetting the FI after flush to make the flags consistent for each segment
move FIs into SI with hasVectors etc still in SI
factor hasVectors out of SI into FI
introduce a global field num map (maybe on realtime first?) and / or store global map
... We can still use this patch as a PoC does that make sense?

asfimport commented 13 years ago

Michael Busch (migrated from JIRA)

w.r.t. storing FielInfos in the segments file - don't we get some consistency benefit from the segments file being small (i.e. it's all in one disk block so it either gets written in it's entirety or not, and readers will see all of it, or not)?

I think we have three options here: 1) Store FIs inside of compound file (current patch) 2) Store FIs in SI 3) Store FIs in own file outside of compound file

Each has disadvantages:

1) more expensive to open SegmentInfos, as it now has to open cfs files to load FIs 2) SI files becomes bigger 3) one more file descriptor per segment - but can be closed as soon as FIs was read into memory

asfimport commented 13 years ago

Simon Willnauer (@s1monw) (migrated from JIRA)

Attached next iteration.

merged FieldInfoBiMap into FieldInfos
fixed AutoBoxing issues
reverted the deprecation of SegmentInfo#hasVector / hasProx since they where causing many failures due to missing files on exceptions. This also allowed me to make SegmentInfo.clearFilesCache private again. *the FieldInfoBiMap now asserts that if you add a field via setIfNotSet that if either the number or the name exists they are the same as in the mapping.
added lazy loading of FieldInfos in SegmentInfo to prevent opening the CFS file while opening SegmentInfo. SR#CoreReader now passes in the already opened CFSReader to load the FieldInfos
when we open a IW we load the BiMap directly from the SegmentInfos
SegmentInfos now owns the FieldInfoBiMap and writes the map upon commit into an additional .fnx file outside of the CFS file. The SegmentInfos persist the bimap only if it has changed otherwise references the previously written map.

I still have one nocommit since I want to add more tests for the persisted global map. Beside that all tests pass and I would appreciate a review though. Mike would you be so kind and let beast break it?

asfimport commented 13 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Mike would you be so kind and let beast break it?

Beast ran Lucene's core+contrib tests 398 times!

There was one failure, but I suspect it's unrelated. I'll open a separate issue...

asfimport commented 13 years ago

Simon Willnauer (@s1monw) (migrated from JIRA)

Beast ran Lucene's core+contrib tests 398 times!

awesome. Thanks for letting it chew on that patch for a while.

I added more tests to ensure field numbers are stable and some jdoc comments to package private apis. This patch also (hopefully) solves all indentation issues.

asfimport commented 13 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Patch looks much better – I think it's close! Comments:

I think we should put header (id + version, ie CodecUtil.write/readHeader) on fnx file?
When does FieldNumberBiMap init from another...? (We have ctor that takes a FieldNumberBiMap in... but I don't see who uses that).
Who uses BiMap.entries()? Looks like just for testing? Can you add comment saying "just for testing.."?
FieldInfos ctor that loads from index file name but makes a new bimap seems spooky...? Is this only used by tests now...?
The passed in "version" to SIS.readGlobalFieldMap is unused
I'm a little worried that we name the new file _X.fnx, because it will appear that this file 'belongs' to segment X, which is dangerous because in some recovery cases we will remove all files associated w/ a given segment (ie, X.*). Maybe, we can name it without the leading ? Ie, 0.fnx, 1.fnx, etc.?
In IW.getGlobalFieldNumberMap... shouldn't that "legacy" logic be moved inside SegmentInfos? Ie it's a bit weird now how we first pull an (empty) biMap from the legacy SegmentInfos then we go and populate it ourselves...? Then we don't need this method in IW (we just ask the SIS for the bimap).
Should FieldInfo.putInternal(FieldInfo) assert that the global bimap "matches"? (Ie, it makes no effort to update the global bimap).
The addition of "si.hasProx = hasProx" in SegmentInfo.clone isn't needed; we pass hasProx to the ctor of that si.
Why do we default SegmentInfos.format now...? Seems spooky?
In SegmentInfos.rollbackCommit shouldn't we set the pendingMapVersion to -1?

Should we backport this to 3.x (after sufficient aging)?

asfimport commented 13 years ago

Simon Willnauer (@s1monw) (migrated from JIRA)

I think we should put header (id + version, ie CodecUtil.write/readHeader) on fnx file?

man, I told myself to add it about 10 times during that patch :)

When does FieldNumberBiMap init from another...?

ah legacy

I'm a little worried that we name the new file _X.fnx, because it will appear that this file 'belongs' to segment X, which is dangerous because in some recovery cases we will remove all files associated w/ a given segment (ie, X.*). Maybe, we can name it without the leading ? Ie, 0.fnx, 1.fnx, etc.?

right, good catch! that we can simple remove and then it should be clear that its not a file belonging to a certain segment

In IW.getGlobalFieldNumberMap... shouldn't that "legacy" logic be....

yeah thats is where is should be though. I will move it in.

The addition of "si.hasProx = hasProx" in SegmentInfo.clone isn't...

true I will remove

Why do we default SegmentInfos.format now...? Seems spooky?

this hasn't been used in SIS before so I think it didn't matter before. Yet, I check the format in files() so if you create the SIS without reading it its set to 0. I can certainly make that work with default to 0 but it seemed just natural to have it assigned the current_format. I think its fine....

In SegmentInfos.rollbackCommit shouldn't we set the pendingMapVersion to -1

ah good catch! thanks

I will fix those issues and upload another patch. Thanks mike for reviewing!!!!

asfimport commented 13 years ago

Simon Willnauer (@s1monw) (migrated from JIRA)

Should we backport this to 3.x (after sufficient aging)?

I think we should let it bake in first though. Maybe we can also factor out the hasVectors in another issues and then backport both once they have been random-tested for a little while.

asfimport commented 13 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Why do we default SegmentInfos.format now...? Seems spooky?

this hasn't been used in SIS before so I think it didn't matter before. Yet, I check the format in files() so if you create the SIS without reading it its set to 0. I can certainly make that work with default to 0 but it seemed just natural to have it assigned the current_format. I think its fine....

Ahh, I see: it's for the case where we make a new SIS() in RAM, because we'll now look @ the format in files(). OK this sounds right then.

Should we backport this to 3.x (after sufficient aging)?

I think we should let it bake in first though. Maybe we can also factor out the hasVectors in another issues and then backport both once they have been random-tested for a little while.

Definitely let it bake!

Also, I have lots of pending backports to 3.2... which this patch likely overlaps on, so we should try to do them "in order" to reduce conflicts I think.

asfimport commented 13 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Hmm, but: if we open a pre-4.0 SIS, make changes, write a new SIS (commit), do we change that new instance's format to 4.0 (in RAM)? Ie, so that files() is correct...?

apache / lucene

Track FieldInfo per segment instead of per-IW-session [LUCENE-2881] #3955

3400

4057