apache / lucene

Apache Lucene open-source search software
https://lucene.apache.org/
Apache License 2.0
2.63k stars 1.02k forks source link

Make IndexReader really read-only in Lucene 4.0 [LUCENE-3606] #4680

Closed asfimport closed 12 years ago

asfimport commented 12 years ago

As we change API completely in Lucene 4.0 we are also free to remove read-write access and commits from IndexReader. This code is so hairy and buggy (as investigated by Robert and Mike today) when you work on SegmentReader level but forget to flush in the DirectoryReader, so its better to really make IndexReaders readonly.

Currently with IndexReader you can do things like:


Migrated from LUCENE-3606 by Uwe Schindler (@uschindler), resolved Dec 09 2011 Attachments: LUCENE-3606.patch, LUCENE-3606-deprecations3x.patch (versions: 3), LUCENE-3606-hideMethodAgain.patch Linked issues:

asfimport commented 12 years ago

Simon Willnauer (@s1monw) (migrated from JIRA)

+1 - this would be awesome

asfimport commented 12 years ago

Yonik Seeley (@yonik) (migrated from JIRA)

+1 for read-only readers!

asfimport commented 12 years ago

Robert Muir (@rmuir) (migrated from JIRA)

+1

asfimport commented 12 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

I appreciate the purity of a truly read-only IndexReader, but there are things deleting from IR can do that IW cannot:

Now... just because IR can do these things that IW cannot.... doesn't mean they are compelling / important. Ie maybe no app out there relies on these special things and so we can lose this functionality? I'm not sure...

On removing IR.setNorm, I agree we should just remove it: in trunk now you can make a custom sim that boost norms "live", so we won't lose any functionality. The setNorm code (and the implications – all the per-field normGen, reference counting, scary SR.reopen, etc.) is truly hairy; would be great to nuke it all.

asfimport commented 12 years ago

Robert Muir (@rmuir) (migrated from JIRA)

well, just like live changing of scoring factors with setNorm, the first two of these things could be done by the user, with Bits/liveDocs/filters right?

asfimport commented 12 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

Mike, I agree that this might affect some apps, but as I said in the introduction: I want to only do this for 4.0 aka Trunk, so we are free to break some strange internal never-documented behaviour. By changing from atomic to per-segment search in 2.9 we already broke lots of incorrectly implemented stuff in Lucene, and these IndexReader deletion behaviour "features" are far less often used than Filters tied to top-level IndexSearcher instead the IR passed to getDocIdSet().

Delete by doc id is dangerous and should be prevented. But you can still do this: Create a custom Filter and do the deletion work in getDocIdSet(IndexReader) and pass it as Query to IndexWriter using ConstantScoreQuery. Using the IndexReader passed to getDocIdSet() you can do any funny things on it, just to produce valid docIds. IW will take care of deleteing them after getDocIdSet() returns.

As Robert said, with FilterIndexReader you can also delete documents without delay (same like in contrib/misc: PKIndexSplitter, MultiPassIndexSplitter) by just overlaying another Bits. To make those hidden documents disappear on disk, too use the aproach decsribed before.

And of course, let's nuke my favourite hate-candidate: setNorms()

asfimport commented 12 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

OK I agree! It looks like in 4.0 it's possible to achieve these same capabilities. So let's nuke away :)

asfimport commented 12 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

OK, I will work on this as soon as I can (next weekend). I will be gald to remove the copy-on-write setNorm stuff in Lucene40 codec and make Lucene3x codec completely read-only (only reading the newest norm file). I hope Robert will possibly help me :-)

asfimport commented 12 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Hi, I will help too. I think norms itself is a pretty big project to clean up and its a good one to do first.

We don't have to do it this way, but here is my idea of a way we could do it in commitable steps:

  1. remove the setNorm first from IR, and fix all tests.
  2. rename NormsWriter to NormsConsumer, rote refactor of norms i/o code into codec as NormsFormat (yes with just one default, and just reads whole byte[])
  3. remove IndexFileNames constant and default implementation handles files(), including .sNNN hairiness
  4. create SimpleText implementation

Then even more cleanups:

  1. split Default implementation to Preflex (with all hairiness like .sNNN) and Lucene40 (clean implementation)
  2. clean up 'behind the scenes' api, e.g. NormsFormat presents docvalues API (hardcoded at fixed bytes), SegmentReader does getArray(). IndexReader still returns just byte[]
  3. finally, "holy grail" where similarities can declare the normalization factor(s) they need, using byte/float/int whatever, and its all unified with the docvalues api. IndexReader.norms() maybe goes away here, and maybe NormsFormat too.
asfimport commented 12 years ago

Robert Muir (@rmuir) (migrated from JIRA)

finally, "holy grail" where similarities can declare the normalization factor(s) they need, using byte/float/int whatever, and its all unified with the docvalues api. IndexReader.norms() maybe goes away here, and maybe NormsFormat too.

Thinking about this: a clean way to do it would be for Similarity to get a new method:

ValueType getValueType();

and we would change:

byte computeNorm(FieldInvertState state);

to:

void computeNorm(FieldInvertState state, PerDocFieldValues norm);

Sims that want to encode multiple index-time scoring factors separately could just use BYTES_FIXED_STRAIGHT. This should be only for some rare sims anyway, because a Sim can pull named 'application' specific scoring factors from IR.perDocValues() today already.

Its not too crazy either since sims are already doing their own encoding, so e.g. default sim would just use FIXED_INTS_8.

People that don't want to mess with bytes or smallfloat could use things like FLOAT_32 if they want and need this.

we would just change FieldInfo.omitNorms to instead be FieldInfo.normValueType, which is the value type of the norm (null if its omitted, just like docValueType).

Preflex FieldInfosReader would just set FIXED_INTS_8 or null, based on whether the fieldinfos had omitNorms or not. it doesnt support any other types...

Finally then, sims would be own their scoring factors, and we could even remove omitNorms from Field/FieldType etc (just use the correct scoring algorithm for the field, if you don't want norms, use a sim that doesn't need them for scoring)

This would remove the awkward/messy situation where every similarity implementation we have has to 'downgrade' itself to handle things like if the user decided to omit parts of their formula!

asfimport commented 12 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

I created a branch, because this job is horrible: https://svn.apache.org/repos/asf/lucene/dev/branches/lucene3606

asfimport commented 12 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

As first step, I removed setNorm/doSetNorm from all IndexReaders, deleted the NormModifier from contrib/misc and modified tests.

I am not sure, if TestBackwardsCompatibility really checks that .sXX files are read correctly, we may need to add tests, as we no longer check the writing of norms anymore in the standard tests. So when reading old indexes with modified norms, we must ensure that the modified norms are read.

When changing tests I already removed tests that opened IndexReader in RW mode (like TestBackwards), even if not only norms were modified. But as deletions/commit will also be removed later, this is easier than fixing the test at all.

TestIndexReaderReopen should maybe modified to modify the Index using IndexWriter instead of IndexReader to test reopen functionality better. I removed the whole modifyIndex method from this test, so it is now not going deep enough. We should maybe revert the commit on this file and change modifyIndex to use IndexWriter.

asfimport commented 12 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

TestIndexReaderReopen should maybe modified to modify the Index using IndexWriter instead of IndexReader to test reopen functionality better. I removed the whole modifyIndex method from this test, so it is now not going deep enough. We should maybe revert the commit on this file and change modifyIndex to use IndexWriter.

Done.

asfimport commented 12 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

Committed removal of IndexReader.deleteDocument*() and undeleteAll() to branch in revision: 1210153:

asfimport commented 12 years ago

Simon Willnauer (@s1monw) (migrated from JIRA)

Thinking about this: a clean way to do it would be for Similarity to get a new method:

I have been talking to robert about moving norms to IDV earlier and the biggest issues we had in our discussion were when Sims have more than one value for a document since we'd need to add custom (nested) fields or similar things which'd be yet another mess further down the road. I think it's totally valid to restrict to a single DV and if you need more than one you simply use a byte variant and impl you custom decoding / encoding in your sim (that seems more performant too IMO).

Once we moved over to this new API ie. no custom norm code anymore we can actually make use of IDV directly, norms would be just another CFS file and each fields norms would just be a "virtual" file in the IDV CFS. All loading and writing could/would be done by the codecs IDV.

Yet, I think once we have this we should even go further and remove omitNorms from the Field entirely and let the similarity decide if a field has norms or not. This would remove a lot of hairiness in the code too. I'd really like to see that!

asfimport commented 12 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

Committed removal of IR.commit(...).

Still todo:

asfimport commented 12 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

Clean up IR.open(...) methods to no longer accept readOnly and IndexDeletionPolicy: Core, Contrib, Solr (Modules need fixing, benchmark was broken before, too)

New branch revision: 1210305

asfimport commented 12 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

Robert and I committed more fixes to the remaining tests to the branch. I also added the test of #4694 to this branch and fixed FilterIndexReader.

asfimport commented 12 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

I tried to fix the remaining tests today, but this seems impossible without IndexReader.deleteDocument(int docId). Some of those tests with commented out by nocommits are so old that its impossible to understand what they are really testing (especially TestAddIndexes and TestIndexWriterMerging). I would simply delete them, because all this stuff is heavyily random tested otherwise (those "old tests" have no randomization at all).

The remaining nocommits are:

./src/java/org/apache/lucene/index/codecs/lucene40/Lucene40NormsReader.java:      // nocommit: change to a real check? see #4693
./src/java/org/apache/lucene/index/SegmentReader.java:  // nocommit: move the whole modification stuff to IW
./src/java/org/apache/lucene/index/SegmentReader.java:  // end nocommit
./src/java/org/apache/lucene/index/SegmentReader.java:  // nocommit: remove deletions from SR
./src/java/org/apache/lucene/index/SegmentReader.java:  // nocommit: remove deletions from SR
./src/java/org/apache/lucene/index/SegmentReader.java:  // nocommit: is this needed anymore by IndexWriter?
./src/java/org/apache/lucene/index/SegmentReader.java:  // nocommit: remove deletions from SR
./src/java/org/apache/lucene/index/SegmentReader.java:  // nocommit: remove deletions from SR
./src/java/org/apache/lucene/index/SegmentReader.java:  // nocommit: remove deletions from SR
./src/java/org/apache/lucene/index/SegmentReader.java:  // nocommit: remove deletions from SR
./src/java/org/apache/lucene/index/SegmentReader.java:  // nocommit: remove deletions from SR
./src/java/org/apache/lucene/index/SegmentReader.java:  // nocommit: remove deletions from SR
./src/test/org/apache/lucene/index/TestAddIndexes.java:  /* nocommit: reactivate these tests
./src/test/org/apache/lucene/index/TestDeletionPolicy.java:  /* nocommit: fix this test, I don't understand it!
./src/test/org/apache/lucene/index/TestIndexWriterMerging.java:  /* nocommit: Fix tests to use an id and delete by term
./src/test/org/apache/lucene/index/TestParallelReaderEmptyIndex.java:  /* nocommit: Fix tests to use an id and delete by term
./src/test/org/apache/lucene/index/TestSizeBoundedForceMerge.java:  /* nocommit: Fix tests to use an id and delete by term
./src/test/org/apache/lucene/index/TestSizeBoundedForceMerge.java:  /* nocommit: Fix tests to use an id and delete by term

The parts in SegmentReader should be made TODO and a new issue should be opened, which removed RW from SegmentReader (Mike?). The tests should be deleted as described above. Otherwise the branch seems finalized otherwise so I would like to merge back to trunk asap.

asfimport commented 12 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Tests are all fixed... i think this one is close.

asfimport commented 12 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

Thanks to Robert for fixing the last tests.

This is the patch of the branch merged back to trunk. I will try to commit this ASAP, as it might get outdated very fast.

Please review the changes!

After committing I will keep the issue open and add deprecations to Lucene 3.x on most methods removed in trunk.

asfimport commented 12 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

+1! Nice work guys... I love all the minuses :)

asfimport commented 12 years ago

Simon Willnauer (@s1monw) (migrated from JIRA)

+1 I love that Norms are under codec now!

asfimport commented 12 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

Committed to trunk revision: 1212292

asfimport commented 12 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

Patch that hides an accidently visible method IndexReader.open with Directory and IndexCommit again.

Committed in trunk revision: 1212294

asfimport commented 12 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

I will post a patch later adding the needed deprecations to 3.x! This issue is kept open until the deprecations are solved.

asfimport commented 12 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

Here a patch that deprecates the removed methods in 3x:

I did not fix the tests to not use deprecated readOnly=true IR.open()! :(

I found a serious API glitch in 3.x with openIfChanged:

asfimport commented 12 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

Updated patch, some javadoc changes (deprecated method replacements) and one more missing method.

asfimport commented 12 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

Final patch including changes.txt. Will commit this now.

asfimport commented 12 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

Committed 3.x revision: 1212539

asfimport commented 12 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

Merged changes to trunk revision: 1212545

This issue is now finished, thanks to Robert for the great help during fixing tests and funny discussions :-) edit ...and for porting norms to codec! edit