apache / lucene

Apache Lucene open-source search software
https://lucene.apache.org/
Apache License 2.0

Support Filtering Segments During Merge [LUCENE-4560] #5626

Open asfimport opened 11 years ago

asfimport commented 11 years ago

Spun off from #5623

It is desirable to be able to filter segments during merge. Most often, a full reindex of content is not possible. Merging segments can have negative consequences when fields have different options (the most restrictive option is forced during merge). Being able to filter segments during merges would allow gradually migrating indexed data to new index settings, and would support pruning/enhancing existing data gradually.

Use Cases:

A patch will be forthcoming.


Migrated from LUCENE-4560 by Tim Smith, updated Dec 20 2012.
Attachments: LUCENE-4560.patch, LUCENE-4560-simple.patch

asfimport commented 11 years ago

Tim Smith (migrated from JIRA)

Attaching patch

The patch adds a MergedSegmentFilter base class and a config setter on IndexWriterConfig, akin to IndexReaderWarmer (by all means, suggest better names).

SegmentMerger will use this (if specified) to filter any segments being merged.

A test case is included that uses the filter to remove an indexed field during merge.
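
The code itself is in the LUCENE-4560.patch attachment; from the description, the extension point presumably looks roughly like the sketch below (the class and setter names come from this comment, but the exact signatures are assumptions):

// Hypothetical shape of the proposed extension point; the real code is in
// the LUCENE-4560.patch attachment, and these signatures are assumptions.
public abstract class MergedSegmentFilter {
  /** Wrap (or return unchanged) each segment reader about to be merged. */
  public abstract AtomicReader filter(AtomicReader in) throws IOException;
}

// Configured on IndexWriterConfig, akin to setMergedSegmentWarmer(...):
iwc.setMergedSegmentFilter(new MergedSegmentFilter() {
  @Override
  public AtomicReader filter(AtomicReader in) throws IOException {
    return new RemoveFieldReader(in); // e.g. drop an indexed field during merge
  }
});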

asfimport commented 11 years ago

Shai Erera (@shaie) (migrated from JIRA)

Hi Tim. While in general I'm not against the idea (and I think that, in general, having some more control over the merge stage is needed), may I ask why you can't, e.g., do this (borrowing from your patch):

IndexWriter writer = new IndexWriter(newDirectory, new IndexWriterConfig(Version.LUCENE_40, analyzer));
writer.addIndexes(new RemoveFieldReader(oldReader)); // RemoveFieldReader taken from your patch

That will accomplish, I believe, exactly what you want, no?

The benefit of your approach is that the filtering is done in-place, i.e. there is no need to add to a new directory and then switch old/new dirs. But it may also inadvertently introduce bugs, e.g. if someone mistakenly decides to remove a field, or worse, removes the wrong field ... w/ the addIndexes approach, you can do the process offline, investigate the resulting index, and once you're content with it, make the switch.

I can see the benefits in both approaches, but I think that the addIndexes approach is safer, as it's not 'online' and does not change the source directory. I'm not sure how 'online' this process needs to be though. How often do you remove fields, or change index options? That's a fairly serious decision IMO, and should be done w/ care and lots of testing. Doing that in-place may be dangerous.

About the patch, it's very simple and clean, which is a good thing! I would make RemoveFieldReader extend FilterAtomicReader, to save you some lines of code, even though it's just a test class.

If you do (and others agree) want to continue w/ the online filtering approach, then perhaps, instead of introducing a MergedSegmentFilter, we could make SegmentMerger pluggable, with a few extension points that allow you to allocate your own AtomicReader ... just a thought. I know it's not directly related to this issue, but if we're going to open segment merging up for some serious hacking, let's do it w/ all intentions :).

asfimport commented 11 years ago

Shai Erera (@shaie) (migrated from JIRA)

Thinking about this some more, I really don't think it's a 'gradual' thing that you do to the index.

So I'm beginning to think that this process should not be an incremental/gradual/online thing, but rather an addIndexes type of process that you run once and know that you're done with it, until the next time you need to rewrite the index w/o actually re-indexing the content.

BTW, did you take a look at #3706? It is about adding a FilteringCodec which filters the data that it writes/reads. Could it help you here? If so, I think that it has better chances of getting committed than the approach in this issue (Codecs are already an extension point...).

asfimport commented 11 years ago

Tim Smith (migrated from JIRA)

My base requirement here is that this be an online process. As such, the addIndexes approach is really not useful as I see it: it requires 2x disk space as well as completely new index directories, and it does not play well with upgrading a user's existing index.

What I see as needed is the ability to gradually migrate indexes such that any individual segment is itself consistent. Currently, merging of indexes can result in loss of indexed data or otherwise break consistency, as in #5623.

It is 100% OK if all segments have not been processed, as I can identify each segment's settings at index open/search time and optionally filter/search/read segments differently.

It is true that once you start using this SegmentMergeFilter, you pretty much have to keep using it forever. I don't see this as an issue: when dealing with supporting old indexes, you constantly have to support migration of data that was indexed using old code. For instance, as time goes on, my "MergeSegmentFilter" will do more, supporting migration of more and more old index formats/config settings to the latest indexing format/settings.

At a quick glance, FilteringCodec looks like it applies to writing new content, not reading existing indexes? It doesn't seem like that would do the trick here. I would need some way to have the index writer wrap the codec for existing segments in order to inject my custom filtering during merging. That would be logically identical to the patch provided, but would potentially result in a much more complex patch.

asfimport commented 11 years ago

Shai Erera (@shaie) (migrated from JIRA)

I don't see why you need to gradually migrate segments, rather than doing it as a one-off thing when you decide to change the format of the index.

Is it really important to do this gradually? Or, if there were a tool which allowed you to do an in-place upgrade of certain segments, would that be something to consider? For instance, you can do something similar to:

Directory dir;        // directory with indexed documents
IndexWriter writer;   // your instance of IndexWriter, opened on dir
IndexReader r = YourIndexReader.open(dir); // your filtering reader over the existing index
writer.deleteAll();   // drop the old segments (invisible to searches until commit)
writer.addIndexes(r); // rewrite the filtered content back into the same directory
writer.commit();      // atomically publish the rewritten index

This is all transactional, so until you commit, searches don't see any of this work.

Note however that while it's seemingly done in-place, you need to copy all the segments, even if they don't need to be upgraded.

I guess that I just can't think of a good reason to do a gradual upgrade of segments. Whenever I had to upgrade old indexes, it was done as a one-off process and that's it. E.g. IndexUpgrader is such a tool; it upgrades the index in place.
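
For reference, IndexUpgrader can be invoked programmatically or from the command line; a minimal sketch (the index path and Version constant are placeholders for your setup):

import java.io.File;
import org.apache.lucene.index.IndexUpgrader;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

// Rewrites all segments in place to the current index format; equivalent to
// `java -cp lucene-core.jar org.apache.lucene.index.IndexUpgrader <indexDir>`.
Directory dir = FSDirectory.open(new File("/path/to/index"));
new IndexUpgrader(dir, Version.LUCENE_40).upgrade();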

Having said that, if others think that it might be useful to let one extend e.g. IndexWriter by providing different instances than SegmentReader (hard-coded in IW), then I prefer that route to the MergedSegmentFilter. Today it's SegmentMerger, tomorrow it will be something else. If we want to handle it, let's handle it from the root. SegmentMerger itself really has nothing to do with filtering readers. That way, you could write something like IndexUpgrader (or UpgradeMergePolicy) and upgrade the index as a one-off process, in place, touching only needed segments.

asfimport commented 11 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

We had something similar in the past (called PayloadProcessor), which was removed completely in 4.0 (without a "replacement"). The reason was that the same thing can be implemented inside a FilterAtomicReader and used with IW#addIndexes(IndexReader...). I agree with Shai that this should be enough for most cases, especially as gradually merging segments can corrupt your index if you have an error.

If you really want to merge in-place: your patch has nice ideas from my perspective; only the "wrapping" should be done in the MergePolicy and not at the IndexWriter level (the number of settings in IWConfig is already too big). So the main thing that needs to be done here is:

Another possible approach, without modifications to Lucene core, is:

Uwe

asfimport commented 11 years ago

Tim Smith (migrated from JIRA)

The "gradual" approach is very much required. Its possible that a config change by a user will result in the need to do a filtered reader on a merge.

For instance, suppose you index a field without offsets, then shut down and start up again with indexing of offsets enabled. Currently, this situation will result in the newly indexed offsets being obliterated on merge (#5623), with no possible way to save them.
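
Concretely, the conflicting configurations look something like this sketch (the field setup is illustrative; the constants are from the 4.0-era FieldInfo.IndexOptions API):

import org.apache.lucene.document.FieldType;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.FieldInfo.IndexOptions;

// Before the restart: "body" is indexed without offsets.
FieldType noOffsets = new FieldType(TextField.TYPE_STORED);
noOffsets.setIndexOptions(IndexOptions.DOCS_AND_FREQS_AND_POSITIONS);

// After the restart: the same field is now indexed with offsets.
FieldType withOffsets = new FieldType(TextField.TYPE_STORED);
withOffsets.setIndexOptions(IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS);

// When segments written under the two configurations are merged, the merged
// FieldInfo takes the most restrictive option, so the offsets are dropped.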

Especially in this case, the addIndexes() approach is way too costly just for a small configuration change. Small config changes shouldn't require the equivalent of a full optimize to take effect.

Also, I argue that any addIndexes() approach is even more dangerous and just as prone to corruption. It can result in the same filtering of readers as the attached patch provides, but it modifies the entire index, thereby causing any corruption to be much more widespread. (Of course, either way it is up to the person implementing their custom filter to guarantee that no corruption occurs and that their code produces consistent indexes.)

I will look into the MergePolicy approach. Offhand, it looks like this may still require a patch, as the SegmentMerger is currently only aware of SegmentReaders for merging; however, I may be able to add my own SegmentInfos to the OneMerge, replacing the codec with a wrapped codec that will apply my filtering. It'll be about a week before I can get back to testing this; I'll report back then.

asfimport commented 11 years ago

Shai Erera (@shaie) (migrated from JIRA)

Small config changes shouldn't require the equivalent of a full optimize to take effect.

I wouldn't call these "small config changes" :). And if you give your users the ability to randomly decide to turn off tracking offsets ... well, then either your users are all highly involved Lucene committers/users, or you're putting too much power in their hands.

I also can't tell how often such a configuration would be changed in your world; it's definitely not common in mine. Removing fields that are no longer needed is one thing. Changing IndexOptions is a totally different thing.

Also, I argue that any addIndexes() approach is even more dangerous and just as prone to corruption

Right, but because of how addIndexes works, you get the chance to review the resulting index before you choose to publish it. If anything is not entirely as expected, you discard the operation, fix the code, and start over. The source index remains intact.

asfimport commented 11 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

Offhand, it looks like this may still require a patch, as the SegmentMerger is currently only aware of SegmentReaders for merging

This is not true: you can merge any atomic reader! There may be some optimizations for SegmentReaders, but generally any type of atomic reader can be merged into an index (e.g. with addIndexes(IndexReader...), which is the proposal from Shai and myself).

Also, I argue that any addIndexes() approach is even more dangerous and just as prone to corruption.

It can result in the same filtering of readers as the attached patch provides, but it modifies the entire index, thereby causing any corruption to be much more widespread. (Of course, either way it is up to the person implementing their custom filter to guarantee that no corruption occurs and that their code produces consistent indexes.)

Read my comment carefully: you can trigger a merge of just the segments that you really want to change. The code would look like:
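
(The original snippet was not migrated with the issue; based on the surrounding description it presumably resembled this sketch, where 'writer' is an IndexWriter open on 'dir' and the excluded field name is illustrative:)

// Reconstruction of the described workflow (assumptions noted above): wrap
// every existing segment in a filtering reader, drop the current segments,
// and add the wrapped readers back, all within one transaction.
DirectoryReader reader = DirectoryReader.open(dir);
List<AtomicReader> wrapped = new ArrayList<AtomicReader>();
for (AtomicReaderContext ctx : reader.leaves()) {
  // exclude "someField" from the rewritten segments (negate = true)
  wrapped.add(new FieldFilterAtomicReader(ctx.reader(),
      Collections.singleton("someField"), true));
}
writer.deleteAll();                                      // invisible until commit()
writer.addIndexes(wrapped.toArray(new AtomicReader[0])); // merges into one new segment
writer.commit();                                         // publish atomically
reader.close();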

This will trigger something like a forceMerge(1); your resulting index will have one segment (it is optimized). This approach is as heavy as your merge approach, because with your do-it-on-merge you have to at least forcefully merge all segments to upgrade your index (e.g. call forceMerge(1)).

asfimport commented 11 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

One addition about your test case extending FilterAtomicReader: the filter reader there does not filter doc values or stored fields. We currently have a full-featured FieldFilterAtomicReader (http://lucene.apache.org/core/4_0_0/test-framework/org/apache/lucene/index/FieldFilterAtomicReader.html) in the Lucene test-framework that can take a list of fields to exclude or include. It is used in some merge tests. I have already proposed making this really useful class public.

asfimport commented 11 years ago

Tim Smith (migrated from JIRA)

- offsets can be used for highlighting
- users want to configure highlighting per field
- users don't always know what fields they want to highlight and may change this setting frequently
- setting "highlighting=true" on a field should be fully possible without a full reindex being required (old documents of course will not be highlighted, or may default to a slower highlighting method that does not require offsets)
- slowly refeeding old documents will allow users to get full functionality for old docs as well; however, refeeding may take weeks and should not impact indexing of new content

I can't proactively always enable offsets on the off chance that highlighting will be enabled in the future, as this implies additional disk requirements.

This is the primary use case that spawned this ticket. Right now, due to the merging behavior, I cannot use indexed offsets for highlighting, as a settings change will result in merges destroying the offsets.

This filtering merge reader approach also fulfills other requirements I have for migrating old indexed content to use new features, so it would be a win-win for me to use it to ensure consistency and conformance with my schema.

asfimport commented 11 years ago

Robert Muir (@rmuir) (migrated from JIRA)

That's not really true, Tim.

You must be VERY CAREFUL when deciding upon index options. It's been this way since even 2.x: if you omitTF and later decide you want TF and positions, you need to re-index. That's why these are expert options.

asfimport commented 11 years ago

Tim Smith (migrated from JIRA)

It's been this way since even 2.x: if you omitTF and later decide you want TF and positions, you need to re-index.

Re-index is the key word here. Re-indexing is not something that can always be done, and it implies a massive cost. Changing a schema setting for one field should not require a full re-index. I'm afraid I'm in a world where re-index is a four-letter word and should only be done in the most extreme of circumstances. My whole point here is that migration should be possible via a pluggable policy.

Thats why these are expert options.

I know these are expert options, but there should also be a means to support migration to new settings (albeit an expert means to do so that may have some consequences for how old documents were indexed).

asfimport commented 11 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Then it sounds like, in your world where you don't know how text should be indexed, you shouldn't use such advanced options.

Instead, maybe you should index in a generic way, e.g. using the classic highlighting techniques. These index options are ways to customize what information the index should retain in situations where you know this.

There is no migration strategy at all if you choose the wrong options: you threw that information away, so you need to re-index.

asfimport commented 11 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

This filtering merge reader approach also fulfills other requirements I have for migrating old indexed content to use new features, so it would be a win-win for me to use it to ensure consistency and conformance with my schema.

I just have to repeat myself: we don't need to modify Lucene to make this "migration" work! See my last comment; it lists the whole workflow to do this "merge" on the directory index without the need for temporary index instances, completely on the fly. Your test case can be rewritten to do exactly what you want without any custom merge policy / SegmentMerger stuff.

asfimport commented 11 years ago

Tim Smith (migrated from JIRA)

A migration strategy does exist and is very simple. It is up to the implementer to determine how data will be migrated and to properly communicate that to the user base so expectations are set properly. All migrations will have pros and cons, and may require gradual reindexing of content to ensure consistency for old documents. But this is up to the implementer, and shouldn't be imposed by the Lucene APIs.

Let's analyze the highlighting case based on indexed offsets.

Assume documents were indexed with no offsets. Highlighting was being done for these documents using TokenStream-based highlighting over the stored field text.

Now the user switches to the more efficient offsets-based highlighting; new documents will be indexed with offsets.

Right now, assuming no merging was done, it is very easy to see whether a document has indexed offsets, and documents can be highlighted on a per-document basis according to what was indexed.

Then a merge happens. (Currently, this will force TokenStream-based highlighting for all documents, undoing the configuration setting.)

If a migration policy is applied, old documents can have 0,0 offsets written for them (this is the decision of the migration policy and is up to its implementer). Now, when highlighting is applied: if all positions for a document have a 0,0 offset, it can fall back to TokenStream-based highlighting; if positions have real offsets, it will use them to perform optimal, full-featured highlighting.

This will result in slightly slower highlighting for old documents. The user experience can then be improved by doing a gradual reindex of old documents, without requiring users to blast away their existing index.
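
A sketch of that per-document fallback check (hypothetical code; assumes a TermsEnum positioned on a query term and the 4.0 postings API with offset flags):

// 'termsEnum', 'liveDocs' and 'docId' come from the surrounding search code.
DocsAndPositionsEnum dpe =
    termsEnum.docsAndPositions(liveDocs, null, DocsAndPositionsEnum.FLAG_OFFSETS);
if (dpe != null && dpe.advance(docId) == docId) {
  boolean realOffsets = false;
  for (int i = 0; i < dpe.freq(); i++) {
    dpe.nextPosition();
    // migrated documents carry the 0,0 sentinel; genuine offsets differ
    if (dpe.startOffset() != 0 || dpe.endOffset() != 0) {
      realOffsets = true;
      break;
    }
  }
  if (realOffsets) {
    highlightFromPostings(docId);    // hypothetical fast, offsets-based path
  } else {
    highlightFromTokenStream(docId); // hypothetical slow TokenStream fallback
  }
}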

asfimport commented 11 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

Here is your test case rewritten to do the same without modifying a single line of code in Lucene!
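
The rewrite is attached as LUCENE-4560-simple.patch; its core presumably boils down to something like this sketch (using the RemoveFieldReader from the original patch; the asserted field name is illustrative):

public void testRemoveFieldViaAddIndexes() throws Exception {
  // ... index documents containing "removedField" (setup elided) ...
  DirectoryReader before = DirectoryReader.open(dir);
  List<AtomicReader> wrapped = new ArrayList<AtomicReader>();
  for (AtomicReaderContext ctx : before.leaves()) {
    wrapped.add(new RemoveFieldReader(ctx.reader())); // filter from the patch
  }
  writer.deleteAll();                                      // transactional until commit
  writer.addIndexes(wrapped.toArray(new AtomicReader[0]));
  writer.commit();
  before.close();

  DirectoryReader after = DirectoryReader.open(dir);
  assertNull(MultiFields.getTerms(after, "removedField")); // field is gone
  after.close();
}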

Please note that RemoveFieldReader is incomplete (it does not remove stored fields or doc values); the test should use FieldFilterAtomicReader instead (which removes fields completely).

asfimport commented 11 years ago

Tim Smith (migrated from JIRA)

Uwe,

I plan to investigate your suggestions, and it may turn out that no additional patching of Lucene is required. It'll be about a week before I can get to that, and I will post my results then. I still don't see the addIndexes() approach as viable, even as you suggest it, since it requires up-front migration steps instead of gradual migration. The merge policy approach you suggested will likely be more useful to me; however, this will be a nasty merge policy.

asfimport commented 11 years ago

Tim Smith (migrated from JIRA)

I found a 100% pure codec approach that provides all the functionality I require here, and more, requiring no patches.

If any committer has interest in pushing this ticket forward, I can clean up the patch, add suggestions, etc.; otherwise this ticket can be closed.

asfimport commented 11 years ago

Simon Willnauer (@s1monw) (migrated from JIRA)

Tim, is the codec approach you used generalizable such that we can benefit from it? Do you want to open another issue for that?

asfimport commented 11 years ago

Tim Smith (migrated from JIRA)

The codec approach I'm taking is pretty specific, incorporating my schema/configuration to allow migrating/enhancing options/features/indexing formats/etc. (still exploring all the possibilities).
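
A hook of that general shape might look roughly like this sketch (entirely hypothetical; MigratingPostingsFormat is an invented stand-in for the schema-specific filtering, and the FilterCodec plumbing follows the 4.x API):

import org.apache.lucene.codecs.Codec;
import org.apache.lucene.codecs.FilterCodec;
import org.apache.lucene.codecs.PostingsFormat;

// Hypothetical "pure codec" migration hook; not the actual implementation.
public class MigratingCodec extends FilterCodec {
  public MigratingCodec() {
    super("MigratingCodec", Codec.forName("Lucene40")); // delegate to the default codec
  }

  @Override
  public PostingsFormat postingsFormat() {
    // MigratingPostingsFormat (invented name) would rewrite/filter postings as
    // segments are written, including the segments produced by merges.
    return new MigratingPostingsFormat(delegate.postingsFormat());
  }
}
// Note: to be readable later, the codec must also be registered via SPI
// (META-INF/services/org.apache.lucene.codecs.Codec).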

There may be a few things that would reduce the overhead or ease the implementation. I will create new tickets with patches as I identify them.

NOTE: the codec API is very nice. Congrats to all involved in making that happen.