Allow to control how payloads are merged [LUCENE-1585]

asfimport commented 15 years ago

Lucene handles backwards-compatibility of its data structures by converting them from the old into the new formats during segment merging.

Payloads are simply byte arrays in which users can store arbitrary data. Applications that use payloads might want to convert the format of their payloads in a similar fashion. Otherwise it's not easily possible to ever change the encoding of a payload without reindexing.

So I propose to introduce a PayloadMerger class that the SegmentMerger invokes to merge the payloads from multiple segments. Users can then implement their own PayloadMerger to convert payloads from an old into a new format.

In the future we need this kind of flexibility also for column-stride fields (#2308) and flexible indexing codecs.

In addition to that it would be nice if users could store version information in the segments file. E.g. they could store "in segment _2 the term a:b uses payloads of format x.y".

Migrated from LUCENE-1585 by Michael Busch, resolved May 14 2010 Attachments: LUCENE-1585_3x.patch (versions: 5), LUCENE-1585_trunk.patch (versions: 3)

asfimport commented 15 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

I agree: extensibility to the SegmentInfo (just like FieldInfo) is needed.

I think, ideally, with #2532 there would be a "payloads codec" responsible for reading & writing & merging. All things (not just payloads) need this.

EG, because the payload is opaque to Lucene, we now must encode the length of the byte[] per term occurrence, but if [say] the codec knows it's always N bytes, or it's a function of term, or something, the codec would be free not to encode the number of bytes and derive it from other sources.

asfimport commented 15 years ago

Michael Busch (migrated from JIRA)

All things (not just payloads) need this.

I agree. But as you always say, Mike, "Progress, not perfection". I think it would be nice to get this in 2.9 for payloads only, and then in 3.x for codecs.

asfimport commented 15 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Agreed!

asfimport commented 14 years ago

Shai Erera (@shaie) (migrated from JIRA)

Michael, I would like to take a stab at it if you don't mind (unless you are working on it). In fact, I've investigated and was about to open an issue before I came across this one :).

W/ Flex, one can set a Codec, however that Codec will be used for regular segment merges. The problem (that I've run into and you describe here) is that during addIndexes*, one would need a different Codec for payloads, in order to rewrite the ones that come from the external indexes. I was thinking to add another variation of addIndexes* which take a PayloadConsumer as input, and in SM.appendPostings, after it reads the payload from the other index, invoke PayloadConsumer on that payload, and only after that write it to PositionConsumer.

That API would of course be marked as experimental.

Also, SM.appendPostings is called from two different code paths - addIndexes* and regular segment merges. For the regular merges, the PC should be null, but for the addIndexes it may not be. So we'll need to add that API to a bunch of classes in the call chain, but all of them are either private methods or package-private classes.

How's that sound? I can cons up a patch if that sounds reasonable.

asfimport commented 14 years ago

Michael Busch (migrated from JIRA)

Michael, I would like to take a stab at it if you don't mind (unless you are working on it).

Please go ahead! I'm not working on this currently.

asfimport commented 14 years ago

Shai Erera (@shaie) (migrated from JIRA)

Patch adds PayloadConsumer to 3x branch:

Adds variants to addIndexes* which take PayloadConsumer
Adds TestPayloadConsumer
Propagates PayloadConsumer down then chain to SegmentMerger, allowing null as a parameter.
Covers 'backwards' too

I will work on the trunk version now. Perhaps for trunk we can avoid adding variants to addIndexes?

asfimport commented 14 years ago

Shai Erera (@shaie) (migrated from JIRA)

Patch covers trunk changes, except for backwards since it's no longer in trunk. I'd appreciate a review on how I used the flex API and whether there is a better API I should have used (mostly in the tests I guess)

asfimport commented 14 years ago

Shai Erera (@shaie) (migrated from JIRA)

I went over the tests and realized I didn't write one which adds indexes into an already populated index. Ideally, the payloads in the existing index should not be re-processed b/c of the external ones that are added. But this doesn't happen, as addIndexes and addIndexesNoOpt don't distinguish well between local and external segments. It all boils down to IW,merge() which calls SM.merge() ... Then I figured a single PayloadConsumer "might not fit all" - e.g. there are cases where different PCs are needed for different indexes. The app can call addIndexes one at a time, but that's not efficient. So I think the entry-level API should be a PayloadConsumerProvider, which declares one getPayloadConsumer(Directory) method. It returns a PC corresponding to a Directory. It gives the app the freedom it needs to:

Always return the same PC for all Dirs.
Return different PCs for different Dirs.
Return null for some Dirs, so that their payloads are not re-processed.

Setting out to impl that, I've noticed addIndexes and addIndexesNoOpt behave differently. While addIndexes interacts w/ the SegmentMerger directly (and hence can easily pass it the PCP), NoOpt reads the SIs from the given Dirs, call maybeMerge(), which triggers SM.merge(), to merge local + external segments. We cannot pass PCP to maybeMerge since that won't help - the call chain hits MergeScheduler, which loops-back at us when it calls IW.merge() .. seems way too complicated. Additionally, there is no way to guarantee that PCP won't be invoked during addIndexesNoOpt on local segments (unless it does not provide a PC for the target Dir) ...

Therefore, I'd like to add PCP to IWC, for the following reasons:

As I said above, there's no way to guarantee it won't be invoked on local segments when *NoOpt is called.
There's no clean way to ensure NoOpt passes it on to SM, w/o passing PCP through MergeScheduler.
It might be useful for apps that want to rewrite their payloads only over time – sort of a mini app-level migration tool (of just payloads).
It cleans the API - does not affect 'backwards', no need to pass it on through several methods until it gets to SM – simplifies the solution.

This is an expert API. Therefore, apps that set it probably know what they're doing. Therefore I believe they will be able to understand how to not invoke their PCs on the target dir's segments.

What do you think?

asfimport commented 14 years ago

Shai Erera (@shaie) (migrated from JIRA)

Patch adds PayloadConsumerProvider to IWC, and more test cases to TestPayloadConsumerProvider. I'll update the trunk version after I get some feedback on the overall approach. The use pf PCP is much cleaner - touches no API except IWC.

asfimport commented 14 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

OK new approach sounds good Shai! The 3x patch looks good.

Should we maybe name it PayloadMergeProcessor or something? Transformer? Consumer isn't really what it is... really it transforms the payloads on during merging.

asfimport commented 14 years ago

Shai Erera (@shaie) (migrated from JIRA)

I'm fine w/ not naming it Consumer - I agree it does not really consume it. But if we go with PayloadMergeProcessor, we'll need PayloadMergeProcessorProvider and they become quite long names :). I was thinking PayloadProcessor and PayloadProcessorProvider (have cool acronyms to PP and PPP), but then people might get confused that it processes all payloads (maybe before they are even written the first time), while it is actually invoked only during segment merges.I was following the *Consumer pattern I saw all over the place w/ SegmentMerger, and thought that if someone ever reads SM code, it will swallow easily another *Consumer one ...

So between PC, PMP and PP - I prefer PP - the documentation should clarify what it does.

But I'm open for suggestions.

asfimport commented 14 years ago

Shai Erera (@shaie) (migrated from JIRA)

I see that in trunk, SegmentMerger does not get IW in its ctor, and there are a couple of places which call that ctor. So perhaps to not affect it, I'll add another ctor which takes a PC/PP (whatever name we decide on), and the current one will delegate to it w/ null? Or ... I can have SM's ctor accept IW and take whatever it needs from it. Comparing it to 3x, it will work exactly the same, only will obtain Directory, TermIndexInterval and CodecProvider from IW.

What do you think?

asfimport commented 14 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

I think we should somehow get "merge" into its name? PayloadMerger? (Though that's overstating it does – IW handles merging payloads, while this class just processes them prior to merging).

If we can't come up w/ anything better, I think PayloadProcessor* is acceptable even though it's overstating.

Or ... I can have SM's ctor accept IW and take whatever it needs from it.

I'd prefer explicit ctor that passes exactly what SM needs... the more "decoupled" we can keep these components, the better, I think?

And I'd just make the new PPP a required arg and fix places that call it to pass null? I don't like ctor proliferation :) And, this is a package private API...

asfimport commented 14 years ago

Shai Erera (@shaie) (migrated from JIRA)

I hate it when it happens, but better sooner than later - I realized the API must take into account the current Term. We cannot process all the payloads in the index the same way. So how about the following:

PayloadProcessorProvider will accept both a Directory and a Term, and will return a suitable PayloadProcessor for that Directory, and if needed, for the Directory+Term combination.
PayloadProcessor will continue to work as is and will expose the same API - a payload is still a payload. Its the responsibility of PPP to return the right PP instance for the given Dir+Term It does not make sense that the payloads of all the terms in the incoming indexes will need to be processed. Specifically, the scenario I have at hand needs to rewrite payloads of certain postings only, but the index contains payloads in other postings as well.

For 3x that's easy - SMI holds the current Term that is processed. But I don't see an equivalent in trunk, in PostingsConsumer. It receives a DocsEnum which does not tell you the term it works on, and MergeState which includes just FieldInfo, which can tell you the field name? Any ideas how I can get the Term this posting belongs to? (I know there is no Term, but field + BytesRef will do).

Mike - I'll add PP as a required arg to SM, np. I was only suggesting to pass IW so that we can avoid changing it in the future, but explicit args are fine by me.

asfimport commented 14 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

PayloadProcessorProvider will accept both a Directory and a Term, and will return a suitable PayloadProcessor for that Directory, and if needed, for the Directory+Term combination.

OK, though this is potentially rather costly – a huge number of terms are visited when merging. I guess PPP impls would reuse instances of PP? But then how will it handle threads...? (Since multiple threads may be merging at once). Maybe we need three tiers? PPP, PP, PPperTerm, such that the PP is used only by one thread in Lucene. Hmm... getting hairy.

Any ideas how I can get the Term this posting belongs to? (I know there is no Term, but field + BytesRef will do).

Maybe set current field & current term in MergeState?

asfimport commented 14 years ago

Shai Erera (@shaie) (migrated from JIRA)

How to handle that case is entirely up to the PPP impl. Some will return the same PP for all terms, but maybe different ones per directory, while others will only care about few terms, returning null for all the rest. In fact, I think the common case will be either handling all payloads by the same PP, or handle some select terms by either one or more PPs.

As for threading, this is also something the PPP can take care of. Strangely, flex allows stateless PPs mor easily b/c it uses BytesRef, while in 3x one needs to call both process and payloadLength() and hence concurrency is more a problem.

I believe the common use will be few PPs that handle few terms. Of course once this is out, people will find original uses for it :). But for now, I don't see a big perf hit...

about performance, we're checking for every position and doc whether the processor is not null. I guess it is better than having a no-op processor? Maybe I can factor that code out to two methods - one that always assumes there is a processor and one that doesn't?

asfimport commented 14 years ago

Shai Erera (@shaie) (migrated from JIRA)

I've been thinking about the multi-threading issue, and as far as I understand, it only concerns the local segment merging? PPP works w/ Directory+Term because the format of the payloads is per term for the entire Directory (not per segment). Therefore, I don't think there is multi-threading issues with the external Directories (the result of addIndexe*)?

For the local segments, I see what you mean - it is possible that several threads will ask a PP for the same Dir+Term. PPP implementations can still work well in such scenario (if they wish to process payloads of local Dir as well) by holding a ThreadLocal PP for Dir+Term combination? I think proper documentation should be enough in this case. The whole point of this issue is to allow better control when addIndexes* are used. Affecting local payloads is a nice bonus, and I think we should wait for a real scenario which takes advantage of that. If the threading documentation warnings won't help, we can discuss then how to solve it?

asfimport commented 14 years ago

Shai Erera (@shaie) (migrated from JIRA)

Patch includes Dir+Term combination in PPP, as well as proper jdocs about concurrency concerns. After we settle it for 3x, I'll update the patch for trunk

asfimport commented 14 years ago

Shai Erera (@shaie) (migrated from JIRA)

Small correction - the comment I've made about concurrency issues w/ just the local directory is wrong because you can add an index w/ multiple segments and if you use CMS and the MP returns several OneMerges, then concurrency issues will happen in that case too.

Anyway, the 3x tests pass w/ the patch. How do we proceed from here - can I port that patch to trunk or does anybody have more comments?

asfimport commented 14 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Make sure you fix the whitespace – some indents are now tabs or 8 spaces, but should be 2.

I believe the common use will be few PPs that handle few terms.

Or, maybe even more common will be per-Directory switching and ignoring the Term? EG if I changed my payload format (for all terms) at some point...

Though we don't have great support for versioning of payloads during searching... eg PayloadTermQuery doesn't make it simple to figure out which Dir you are now searching...

My only concern w/ this API is that it has a built-in unnecessary global perf/synchronization cost, by design: I'll have to use a sync'd map or a thread local to implement that method. Even if my app ignores the Term, I'll need to sync. This sync is global – all merges running concurrently, per Term, will share a single global lock.

But it's only the Dir lookup that requires sync.

So if, instead, the Dir lookup and the Term lookup were separate method calls, I'd only need sync on the Dir lookup (called very rarely often – once per segment on the start of the merge). The Term lookup, called far far more often, is guaranteed to be thread private so it'd need no sync.

I guess in practice the sync cost may not be such a big deal? So maybe we could commit w/ this approach (it is experimental), even with this limitation? It's just that I don't like adding APIs which make our concurrency worse... we are supposed to be moving in the other direction :)

asfimport commented 14 years ago

Shai Erera (@shaie) (migrated from JIRA)

Make sure you fix the whitespace - some indents are now tabs or 8 spaces, but should be 2.

That's weird. I'll check it again. Can you point me to a specific place where you've noticed that?

About concurrency - do you mean we should separate the Dir and Term APIs, to be PPP -> return DirPP -> return TermPP? Not necessarily these names but a chain of calls to get to the TermPP? It can work, but what guarantees that PPP -> DirPP is synced? The application will still need to sync the "per-Dir-PP" instance by a ThreadLocal or something, so what exactly do we gain here?

Perhaps I misunderstood your point - if so, can you please clarify?

asfimport commented 14 years ago

Shai Erera (@shaie) (migrated from JIRA)

Patch includes:

PayloadProcessorProvider which returns DirPayloadProcesor (given a Directory)
DirPayloadProcessor returns a PayloadProcessor (given a Term)
All classes are defined as static inner classes in PPP - to make it clear that all 3 are coupled together.
SegmentMergeInfo contains a DirPayloadProcessor field - which is set by SegmentMerger once, before all of the terms are processed.

asfimport commented 14 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Patch looks good Shai! I like the cascading of Dir/Term processors.

asfimport commented 14 years ago

Shai Erera (@shaie) (migrated from JIRA)

I've reviewed the patch again, and I think setPPP should move from IWC to IW. PPP is more of a temporary setting - if you only want to use it for addIndexes*, then you probably want to set it just before the call, and unset it afterwards. Otherwise, unnecessary getDirPP would be called, when you don't really care about them. So PPP is like InfoStream in a sense - usually it'll be a point-in-time operation. You can still set it right after IW is created, if you want to use it for other merges too.

Since IWC is a "write-once" object (documented), it doesn't make sense to set PPP whenever you create an IW, just because at some point you know addIndexes will be called. And also, it doesn't make sense to create a new IW instance for that purpose only. So I really feel it should be an IW setting and not IWC.

What do you think?

asfimport commented 14 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

OK, though this means any other merges that happen to be kicked off while your addIndexes is running, would also consult the PPP? So PPP impl would have to "know" this may happen and deal accordingly?

asfimport commented 14 years ago

Shai Erera (@shaie) (migrated from JIRA)

You're right. And apps that use it may want to handle it by aborting all merges first, or call sync() on CMS.

Still, I think that I'm most cases, PPP will be used for addIndexes calls and setting it I'm IWC will be a limitation - e.g. you may want to use different PPPs for different addIndexes calls and opening a new writer just for that seems too much?

I'll include a NOTE about it in the jdocs.

asfimport commented 14 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Still, I think that I'm most cases, PPP will be used for addIndexes calls and setting it I'm IWC will be a limitation - e.g. you may want to use different PPPs for different addIndexes calls and opening a new writer just for that seems too much?

Yeah I agree.

Can we go back to making it a parameter to addIndexes? What was the issue w/ that approach again? Was it just the difficulty of tracking internally which segments should be mapped and which shouldn't?

Or.. maybe we should leave it as globally settable on IW, since users may want to rewrite payloads for "ordinary" segment merges...

asfimport commented 14 years ago

Shai Erera (@shaie) (migrated from JIRA)

What was the issue w/ that approach again?

addIndexes had no problem. It was addIndexesNoOptimize which calls the MS, which calls back to IW.merge. The only way to pass it through was to get it through MS, which is not good.

Or.. maybe we should leave it as globally settable on IW, since users may want to rewrite payloads for "ordinary" segment merges...

Exactly ! I think that's a useful utility for e.g. index migration of existing indexes. You can set it on IW and then call optimize(). There are other use cases as well.

asfimport commented 14 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

OK so let's have it as a free setter on IW...

asfimport commented 14 years ago

Shai Erera (@shaie) (migrated from JIRA)

setPPP moved from IWC to IW. I'll go update the trunk one now.

asfimport commented 14 years ago

Shai Erera (@shaie) (migrated from JIRA)

Update trunk's patch.

All tests pass. I plan to commit this tomorrow.

asfimport commented 14 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Patches look good Shai.

One thing – in TermsConsumer, it'd be nice to not step through the for loop checking for non-null dirPP, if the IW had no PPP?

asfimport commented 14 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

Just an idea: Could we also use this for the term bytes itsself? E.g. when converting NumericFields in our 4.0 inde x converter to use the full 8bits? So we just process the old index and merge to the converted one? During that all terms are converted using the processor?

asfimport commented 14 years ago

Shai Erera (@shaie) (migrated from JIRA)

Good idea Mike !

I've added hasPayloadProcessorProvider to MergeState and use it in TermsConsumer.

asfimport commented 14 years ago

Shai Erera (@shaie) (migrated from JIRA)

Could we also use this for the term bytes itsself?

I think you'd want to use the same approach, yes. But I'm not sure I want to reuse the same classes for that purpose, for several reasons:

The classes have the word Payload all over the place - javadocs, names etc. And for a good reason IMO - that's what they do.
One is expected to include different PP for different Directory and Term, but to convert the NumericField terms, I don't think one will use different PPs at all? I.e. a single TermsConverter / NumericFieldsTermConverter will be good for whatever Directory + Term?
The sort of operation you suggest (converting terms) seems to be a one time op – when I migrate my indexes? PPP on the other hand (at least in my case) will be used whenever I call addIndexes* so that I can process and rewrite the payloads of the incoming indexes.

So while both will do byte[] conversion, I think those are two separate tools. Your should probably exist in a o.a.l.migration package or something, because it will be relevant to index migration only? Or did I misunderstood you?

asfimport commented 14 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Also, remapping the term bytes in general on the fly is tricky, since the remapping could alter their sort order.

asfimport commented 14 years ago

Shai Erera (@shaie) (migrated from JIRA)

Committed revision 944214 (3x). Committed revision 944220 (trunk).

Thanks Michael for starting this !

asfimport commented 13 years ago

Grant Ingersoll (@gsingers) (migrated from JIRA)

Bulk close for 3.1

apache / lucene

Allow to control how payloads are merged [LUCENE-1585] #2659