
Incremental Field Updates through Stacked Segments [LUCENE-4258] #5328

Open asfimport opened 12 years ago

asfimport commented 12 years ago

Shai and I would like to start working on the proposal for Incremental Field Updates outlined here: http://markmail.org/message/zhrdxxpfk6qvdaex.


Migrated from LUCENE-4258 by Sivan Yogev, 14 votes, updated Jul 01 2014 Attachments: IncrementalFieldUpdates.odp, LUCENE-4258.branch.1.patch, LUCENE-4258.branch.2.patch, LUCENE-4258.branch.4.patch, LUCENE-4258.branch.5.patch, LUCENE-4258.branch.6.patch (versions: 2), LUCENE-4258.branch3.patch, LUCENE-4258.r1410593.patch, LUCENE-4258.r1412262.patch, LUCENE-4258.r1416438.patch, LUCENE-4258.r1416617.patch, LUCENE-4258.r1422495.patch, LUCENE-4258.r1423010.patch, LUCENE-4258-API-changes.patch Linked issues:

asfimport commented 12 years ago

Robert Muir (@rmuir) (migrated from JIRA)

A few things I don't understand:

In truth, I think the term level is too fine-grained for updates in Lucene because of norms; only updating whole fields of a document will really work (as the norm can then simply be recomputed and replaced).

asfimport commented 12 years ago

Shai Erera (@shaie) (migrated from JIRA)

There is more to it than just the referenced email. I've had a couple of discussions about this in the past with various people (and it is my fault that I didn't write them down and share them with the rest of you) – I'll try to summarize a more detailed proposal below:

API: Add an updateFields method which takes a Constraint and an OP (eventually, it might replace today's updateDocument):
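
Since the exact signature isn't spelled out, here is a minimal sketch of what such a method could look like — the Operation enum, the Term-as-constraint choice, and the class name are all illustrative assumptions, not the actual proposal:

```java
import org.apache.lucene.index.IndexableField;
import org.apache.lucene.index.Term;

// Hypothetical sketch only -- names and signatures are illustrative, not the final API.
public abstract class UpdatingIndexWriterSketch {

  /** The update semantics: add terms to a field vs. replace the field's content. */
  public enum Operation { ADD_FIELDS, REPLACE_FIELDS }

  /**
   * Applies the given fields to every document matching the constraint term,
   * buffered like deletes are today and materialized as a stacked segment on flush.
   */
  public abstract void updateFields(Operation op, Term constraint, IndexableField... fields);
}
```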

Implementation: The idea is to create StackedSegments, which, well, stack on top of current segments. The inspiration came from deletes, which can be viewed as a segment stacked on an existing segment, that marks which documents are deleted.

Following those semantics, a segment could be comprised of these files:

Two options to encode the posting lists:

Ideally, the way incremental updates will be applied will follow how deletes are applied today:

Again, this is an internal detail; I'd appreciate it if someone could point us to where that happens in the code today (now with concurrent flushing). I remember PackedDeletes existed at some point; has that changed?

If it's a new Codec, then SegmentReader may not even need to change ...

The REPLACE_FIELD OP is tricky ... perhaps it's like how deletes are materialized on disk – as a sparse bit vector that marks the documents that are no longer associated with it ...

I also think that we should introduce this feature in steps:

  1. Support only fields that omit TFAP (i.e. DOCS_ONLY). This is very valuable for fields like ACL, TAGS, CATEGORIES etc. (see the usage sketch after this list)
    • Ideally, the app would just need to say "add/remove ACL:SHAI to/from document X", rather than passing the entire list of ACLs on every update operation.
    • This, I believe, is also the most common use case for incremental field updates.
  2. Support stored fields, whether as part of (1) or as a follow-on; adding TAG:LUCENE to the postings but not the stored fields is limiting ...
  3. Support terms with positions, but no norms. What I'm thinking about are terms that store stuff in the payload, but don't care about the positions themselves. An example is the category dimensions of the facet module, which stores category ordinals in the payload
    • Positions are tricky, and we'll need to do this carefully, I know. But I don't rule it out at this point.
  4. Then, support fields with norms. I get your concern, Robert, and I agree it's a challenge, which is why I leave it for last. The scenario I have in mind is: a search engine that lets you comment on a result or tag it, and the comment/tag should be added to the document's 'catchall' field for later searches. I think it's a valuable scenario, and this is something I'd like to support. If we cannot find a way to deal with it and the norms, then I see two options:
    1. Document a limitation on updating a field with norms: do it at your own risk.
    2. Enforce the REPLACE_FIELD OP on fields with norms.
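
To make the ACL case in step 1 concrete, a hypothetical usage example building on the illustrative updateFields/Operation names sketched earlier — the "folder"/"acl" field names are made up for illustration:

```java
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.Term;

class AclUpdateExample {
  // Hypothetical usage: grant user "shai" access to every document whose
  // "folder" field is "/projects", without reindexing those documents.
  static void grantAcl(UpdatingIndexWriterSketch writer) {
    writer.updateFields(UpdatingIndexWriterSketch.Operation.ADD_FIELDS,
        new Term("folder", "/projects"),
        new StringField("acl", "shai", Field.Store.NO));
  }
}
```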

I suggest we do this work in a dedicated branch of course. Ideally, we can port everything to 4.x at some point, as I think most of the changes are internal details ...

asfimport commented 12 years ago

Robert Muir (@rmuir) (migrated from JIRA)

I don't think I'm sold on introducing the feature in steps.

I think it's critical for something of this magnitude that we figure out the design totally, up-front, so it will work for the major use cases. I think it's fine to implement in steps if we need to, though.

Honestly, I think we should lay it all out on the table and get to the real problems that I think most people face today:

  1. For many document sizes and use cases (especially rapidly changing stuff): the real problem is not the speed of Lucene reindexing the document, it's that the user must rebuild the entire document. Solr solved this by providing an option where you just say "update field X" and internally it reindexes the document from stored fields (for that feature to work, the whole thing must be stored). We shouldn't discard the possibility of implementing cleaner support for a solution like this, which wouldn't complicate IndexWriter at all.
  2. A second problem (not solved by the above) is that many people are using scoring factors with a variety of signals, and these are changing often. I think unfortunately, people are often putting these in a normal indexed field and uninverting them on the FieldCache, requiring the whole document to be reindexed just because of how they implemented the scoring factor. People could instead solve this by putting their app's primary key into a docvalues field, allowing them to keep these scoring factors completely external to Lucene (e.g. their own array or whatever), indexed by their own primary key. But the problem is, I think, people want Lucene to manage this; they don't want to implement themselves what's necessary to make it consistent with commits etc.

So we can look at several approaches to solving this stuff. I feel like both of these problems could be solved via a contrib module, without modifying IndexWriter at all, for many use cases; maybe better still if we go for tighter integration. And with those simple approaches I describe above, searching doesn't get any slower.

But if we really feel like we need a "full incremental update API" (I know there are a few use cases where it can help, I'm not discarding that), then I feel like there are a few things I want:

I strongly feel that if we just add these incremental APIs to IndexWriter without being careful about these things, the end result could be that people use them without thinking and end up with slower search and worse relevance; that's why I am asking so many questions.

asfimport commented 12 years ago

Shai Erera (@shaie) (migrated from JIRA)

I think it's ok if we introduce IFU for DOCS_ONLY fields at first, throwing exceptions otherwise. E.g., UpdateField overrides setOmitNorms and such and throws UOE... at first.

Everything else will still work as it is today...

Codecs didn't handle all segment files first... stored fields and such were added later. I do agree though that we should keep in mind the full range of scenarios.

Sorry for the short response, JIRA isn't great on smartphones :-).

asfimport commented 12 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Codecs didn't handle all segment files first... stored fields and such were added later. I do agree though that we should keep in mind the full range of scenarios.

I don't think that's really comparable at all, for two reasons:

  1. Codecs can be considered a "rote" refactoring of the XXXWriters in 3.x. I'm not trying to diminish the value, but it's just an introduced abstraction layer. Something like this is different in that it's algorithmic.
  2. The fact that Codecs only handled postings at first wasn't easy to fix after they were introduced as postings-only: extending them beyond postings was a significant refactoring.

I'm not trying to pick on your proposal, I'm just saying there are things I don't like about the design.

asfimport commented 12 years ago

Shai Erera (@shaie) (migrated from JIRA)

...which is to update the contents of one field, without reindexing the entire document

I agree, but I distinguish between two operations:

  1. replacing the content of a field entirely with new content (or removing the field)
  2. updating the field's content by adding/removing individual terms

I think requiring no positions, no frequencies, and no norms makes it even more fringe. This means it's not really useful for any search purposes. And we are a search engine library.

I disagree. Where I come from, the most common use case where such an operation is useful is when a single change affects hundreds, and sometimes thousands, of documents. An example is a document-library-like application which manages folders with ACLs. You can add an ACL to a top-level folder and it affects all the documents and folders beneath it. That results in reindexing, sometimes, a huge number of documents.

I don't dismiss the use case of updating a field for scoring purposes, not at all. I'm just saying that starting by supporting one use case is better than supporting none.

Now, and this probably stems from my lack of understanding of the Lucene internals – I see "supporting terms that omit TFAP" as a starting point because that is the easiest case, and even that requires a lot of understanding of the internals. After we do that, I'll feel more comfortable discussing other types of updates for other field types ... at least, I'll feel that I have more intelligent things to say :).

Regarding your other concerns, I share them, and we of course need to benchmark everything. I don't know how much this affects search. But those updates will get merged away when segments are merged, so while I'm sure search will be affected, it's not for eternity - only until that segment is merged. And I think we need to add a capability to MergePolicy to findSegmentsForMergeUpdates, just like we have expungeDeletes.

If the first step means that in order to update a field used for scoring (i.e. w/ norms) you need to replace the content of the field entirely with new content, I'm ok with it. As one esteemed member of this community always says, "progress, not perfection" - I'm totally sold on that!

asfimport commented 12 years ago

Robert Muir (@rmuir) (migrated from JIRA)

I don't think it's progress if we add a design that can only work with omitTFAP and no norms, and can only update individual terms, but not fields.

It means that to support these things we would have to totally clear out what's there and then introduce a new design.

In fact, this issue shouldn't be called incremental field updates: it's not. It's "term updates" or something else entirely.

asfimport commented 12 years ago

Shai Erera (@shaie) (migrated from JIRA)

can only update individual terms, but not fields

Who said that? One of the update operations I listed is REPLACE_FIELD, which means replace the field's content entirely with the new content.

I don't think it's progress if we add a design that can only work with omitTFAP and no norms

I never said that would be the design. What I said is that in order to update a field at the term level, we'll start with such fields only. The rest of the fields (i.e. w/ norms, payloads and whatnot) will be updated through REPLACE_FIELD. The way I see it, we still address all the issues; only for some fields we require a whole-field replace rather than an optimized term-based update. That can be improved further along, or not.

In fact, this issue shouldn't be called incremental field updates: it's not. It's "term updates" or something else entirely.

That is my idea of incremental field updates, and I'm not sure that it's not your idea as well :). You seem to want to support only REPLACE_FIELD, while I say that for some field types we can also support UPDATE_FIELD (i.e. at the term level); that's it!

asfimport commented 12 years ago

Shai Erera (@shaie) (migrated from JIRA)

I had a chat about this with Robert a couple of days ago; we figured it would be easier to discuss the differences in approaches/opinions directly, rather than through back-and-forth JIRA comments. Our ideas of incremental field updates are not much different. Robert stressed that in his opinion we should first tackle the REPLACE_FIELD operation, which replaces the content of a field entirely with new content, because he believes that's the most common scenario (i.e., update the title field). I believe that term-based updates are very important too, at least in the scenarios that I face (i.e. adding/removing one ACL, one social tag, one category etc.).

We concluded that the design should take REPLACE_FIELD into consideration from the get-go. Whether we'll also implement UPDATE_FIELD (or UPDATE_TERMS as a better name?) depends on its complexity, because initially UPDATE_TERMS can be implemented through REPLACE_FIELD, so we don't lose functionality. UPDATE_TERMS can come later as an optimization.

Robert, if I misrepresented our conclusions, please correct me.

asfimport commented 12 years ago

Sivan Yogev (migrated from JIRA)

Seems like in any case we need a separation between fields given with UPDATE_FIELD and those given with REPLACE_FIELD. There are two ways I can think of to implement this separation.

The first is at the segment level, where we can have separate "update" and "replace" segments, where the semantic is that a field in an "update" segment is merged with fields in previous segments, while a field in a "replace" segment ignores previous segments.

The second option is to separate at the field level, choosing one type as the default behavior (maybe this can be configurable) and marking the fields of the non-default type by altering the field name or via some other mechanism.

I lean towards the segment-level separation, since it requires fewer conventions and will probably require less work for Codec implementations to handle.

asfimport commented 12 years ago

Sivan Yogev (migrated from JIRA)

BTW, since the new method is to handle multiple fields (as the name suggests), the operation descriptions should also be in plural: UPDATE_FIELDS and REPLACE_FIELDS.

asfimport commented 12 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

BTW, since the new method is to handle multiple fields (as the name suggests), the operation descriptions should also be in plural: UPDATE_FIELDS and REPLACE_FIELDS.

+1

I think this design sounds good! REPLACE_FIELDS should easily be able to update norms correctly, right? Because the full per-field stats are recomputed from scratch. So then scores should be identical: should be a nice simple testcase to create :)

I don't see how UPDATE_FIELDS can do so unless we somehow save the raw stats (FieldInvertState) in the index. It seems like UPDATE_FIELDS should forever be limited to DOCS_ONLY, no norms updating? Positions also seem hard to update, and if the only reason to do so is for payloads... seems like the app should be using doc values instead, and we should (eventually) make doc values updatable?

I do think this is a common use case (ACLs, filters, social tags)... though I'm not sure how bad it'd really be in practice for the app to simply REPLACE_FIELDS with the full set of tags. I guess if we build REPLACE_FIELDS first we can test that.

The implementation should be able to piggy-back on all the buffering/tracking we currently do for buffered deletes.

I think this change should live entirely above Codec? Ie Codec just thinks it's writing a segment, not knowing if that segment is the base segment, or one of the stacked ones. If the +postings and -postings are simply 2 terms then the Codec need not know...

Seems like only SegmentInfos needs to track how segments stack up, and then I guess we'd need a new StackedSegmentReader that is atomic, holds N SegmentReaders, and presents the merged codec APIs by merging down the stack on the fly? I suspect this (having to use a PQ to merge the docIDs in the postings) will be a huge search performance hit....

I think UnionDocsAndPositionsEnum (in MultiPhraseQuery.java) is already doing what we want? (Except it doesn't handle negative postings.)
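
For illustration, a minimal self-contained sketch of that kind of priority-queue union over per-segment postings, assuming a simplified iterator interface; the real UnionDocsAndPositionsEnum also merges positions, and negative postings would additionally need to mask docIDs contributed by older generations:

```java
import java.util.PriorityQueue;
import java.util.function.IntConsumer;

/** Minimal sketch of a priority-queue union of the docIDs coming from a base
 *  segment's postings and its stacked segments' postings (simplified iterator
 *  interface; positions and negative postings are ignored here). */
class StackedPostingsUnionSketch {
  interface DocIdIterator {
    int nextDoc(); // returns NO_MORE_DOCS when exhausted
  }
  static final int NO_MORE_DOCS = Integer.MAX_VALUE;

  static void union(DocIdIterator[] subs, IntConsumer collector) {
    // Entries are {currentDocID, subIndex}, ordered by current docID.
    PriorityQueue<int[]> pq = new PriorityQueue<>((a, b) -> Integer.compare(a[0], b[0]));
    for (int i = 0; i < subs.length; i++) {
      int doc = subs[i].nextDoc();
      if (doc != NO_MORE_DOCS) pq.add(new int[] {doc, i});
    }
    int lastEmitted = -1;
    while (!pq.isEmpty()) {
      int[] top = pq.poll();
      if (top[0] != lastEmitted) { // emit each docID once, even if several segments contain it
        collector.accept(top[0]);
        lastEmitted = top[0];
      }
      int next = subs[top[1]].nextDoc();
      if (next != NO_MORE_DOCS) pq.add(new int[] {next, top[1]});
    }
  }
}
```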

What about merging? Seems like the merge policy should know about stacking and should sometimes (aggressively?) merge a stack down?

asfimport commented 12 years ago

Erick Erickson (@ErickErickson) (migrated from JIRA)

How does this relate (if at all, I confess I just looked at the title) to Andrzej's proposal here? #4910

asfimport commented 12 years ago

Robert Muir (@rmuir) (migrated from JIRA)

I don't see how UPDATE_FIELDS can do so unless we somehow save the raw stats (FieldInvertState) in the index. It seems like UPDATE_FIELDS should forever be limited to DOCS_ONLY, no norms updating?

Actually, it's DOCS_ONLY plus OMIT_NORMS.

Anyway why not start with updating the entire contents of a field as I suggested? It seems to be the most general solution, and there is some discussion about how scoring can work correctly on #4910 (the stats, not just norms).

I do think this is a common use case (ACLs, filters, social tags)... though I'm not sure how bad it'd really be in practice for the app to simply REPLACE_FIELDS with the full set of tags. I guess if we build REPLACE_FIELDS first we can test that.

This is why we should do 'replace contents of a field' first. It's the most well-defined and general.

It's also still controversial; I'm not convinced myself that it will actually help most people who think they want it. I think it will just slow down searches.

asfimport commented 12 years ago

Shai Erera (@shaie) (migrated from JIRA)

BTW, since the new method is to handle multiple fields (as the name suggests), the operation descriptions should also be in plural: UPDATE_FIELDS and REPLACE_FIELDS.

OK. To avoid confusion though, I think we should call it UPDATE_TERMS (not FIELDS). Then someone can call updateFields() twice: once for all the fields they want to REPLACE, and a second time for the fields whose terms they just want to update.

What about merging?

I wrote about it above – MergePolicy will need to take care of these stacked segments, and we'll add something like merge/expungeFieldUpdates so the app can call it deliberately.

seems like the app should be using doc values instead, and we should (eventually) make doc values updatable?

I agree we should not UPDATE_TERMS fields that record norms. I'm not sure that every use case of storing info in the payload today can be translated to using DocValues, so I don't want to limit things. So, let's start with UPDATE_TERMS taking care of fields that omit norms. Whether we also handle payloads for a few use cases can become an optimization later on. In the meanwhile, apps will just need to replace the entire field.

Progress, not perfection! :)

asfimport commented 12 years ago

Sivan Yogev (migrated from JIRA)

How does this relate (if at all, I confess I just looked at the title) to Andrzej's proposal here?

The basic idea is the same. One major difference is that in Andrzej's proposal the stacked updates are added to a new index with different doc IDs, and then the SegmentReader needs to map to the original doc IDs. The plan in this proposal (Shai, correct me if I'm wrong) is for the stacked updates not to be stand-alone segments. Although they will have the structure of regular segments, they will be tightly coupled with the original segment, with doc IDs matching those of the original segment.

asfimport commented 12 years ago

Sivan Yogev (migrated from JIRA)

Working on the details, it seems that we need to add a new layer of information for stacked segments. For each field that was added with REPLACE_FIELDS, we need to hold the documents in which a replace took place, along with the number of the latest generation that had the replacement. Call this list the "generation vector". That way, the TermDocs provided by StackedSegmentReader for a certain term is a special merge of that term's TermDocs across all stacked segments. The "special" part is that we ignore occurrences from documents in which the term's field was replaced in a later generation.

An example: assume we have doc 1 with title "I love bananas" and doc 2 with title "I love oranges", and the segment is flushed. We will have the following base segment (ignoring positions):

bananas: doc 1
I: doc 1, doc 2
love: doc 1, doc 2
oranges: doc 2

Now we add to doc 1 an additional title field "I hate apples", and replace the title of doc 2 with "I love lemons", and flush. We will have the following segment for generation 1:

apples: doc 1
hate: doc 1
I: doc 1, doc 2
lemons: doc 2
love: doc 2
generation vector for field "title": (doc 2, generation 1)

TermDocs for a few terms:

I propose to initially use PackedInts for the generation vector, since we know how many generations the current segment has upon flushing. Later we might consider special treatment for sparse vectors.
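
A tiny sketch of that filtering rule applied to the example above — the method and array layout are illustrative, not the patch's actual encoding:

```java
/** Sketch of the generation-vector filter described above (illustrative,
 *  not the patch's actual data structure). genVector[doc] holds the latest
 *  generation in which the field was replaced for doc, or 0 if never. */
class GenerationVectorSketch {
  static boolean acceptPosting(int doc, int postingGeneration, int[] genVector) {
    // A posting from generation g survives only if the field was not
    // replaced for this doc in a strictly later generation.
    return postingGeneration >= genVector[doc];
  }

  public static void main(String[] args) {
    // "title" from the example: doc 2's title replaced at generation 1, doc 1's never.
    int[] titleGen = {0, 0, 1}; // index 0 unused; docs are 1 and 2
    System.out.println(acceptPosting(1, 0, titleGen)); // true:  "love" in doc 1 (base) survives
    System.out.println(acceptPosting(2, 0, titleGen)); // false: "oranges" in doc 2 (base) is masked
    System.out.println(acceptPosting(2, 1, titleGen)); // true:  "lemons" in doc 2 (gen 1) survives
  }
}
```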

asfimport commented 12 years ago

Sivan Yogev (migrated from JIRA)

Adding a design proposal presentation, and two patches following the proposal's concepts. The first patch includes the proposed API changes (it does not compile), and the other the internal changes, for those interested in the implementation details. The second patch contains a new test named TestFieldsUpdates which currently fails.

asfimport commented 12 years ago

Sivan Yogev (migrated from JIRA)

Forgot to mention that the implementation patch is still missing many components...

asfimport commented 12 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

On slide 4, one of the enumerated operations is field deletion, but I am not sure how to do it with the proposed API on slide 5?

It is just a thought, but your work plan only mentions Lucene fields. Wouldn't it be easier to start working with DocValues? I guess it would help us get started with document updates and would already solve most use-cases (I'm especially thinking of scoring factors).

asfimport commented 12 years ago

Shai Erera (@shaie) (migrated from JIRA)

We will take care of DocValues too, eventually. I think this can be handled in a separate issue though.

and would already solve most use-cases

I have a problem with that statement. Robert thinks that the most common use case is to replace a field's content entirely. In our world (Sivan's and mine), updating a field's terms (removing/adding single terms) is the most common use case. And perhaps in your world updating DocValues for scoring purposes is the most common use case.

Therefore I don't think that there is one common use case, and IMO we shouldn't aim at solving just one first. Personally, I think DocValues are relatively new (compared to posting lists and payloads), and therefore being able to update them should come second (just because I estimate they are not as widely used as the others). But that's just my opinion – obviously someone who relies solely on DocValues would state otherwise :).

The design currently doesn't cover DocValues at all. I think, in order to keep this issue focused, we should handle that in a separate issue after we land updateable fields.

On slide 4, one of the enumerated operations is field deletion, but I am not sure how to do it with the proposed API on slide 5?

Good point. Well, you could say that calling replaceField("f", null) would mean "delete 'f'". That should work, but perhaps we can come up with something more explicit.

asfimport commented 11 years ago

Sivan Yogev (migrated from JIRA)

New patch; a naive test of adding updates to a single-document segment, flushed before or after the update, is working. Working on more complex tests with multiple segments, documents and updates.

asfimport commented 11 years ago

Sivan Yogev (migrated from JIRA)

New patch implementing some previously missing parts, with preliminary code to enable field replacements.

asfimport commented 11 years ago

Sivan Yogev (migrated from JIRA)

New patch with additional testing and bug fixes.

Currently the term statistics do not take field replacements into account, and therefore term counts are wrong and CheckIndex fails.

I can think of two possible solutions for this. The first is for CheckIndex to identify updated segments and ignore term statistics - is there a similar mechanism for deletions?

The other solution is to pre-compute term statistics for updated segments. However, this will be costly - it requires going through the entire posting list for every term and counting non-replaced occurrences.

Any suggestions?

asfimport commented 11 years ago

Sivan Yogev (migrated from JIRA)

New patch. This patch contains some failing tests, some probably due to problems in my implementation of SegmentReader.getTermVectors(int), and others in the handling of stored fields.

Can anyone with knowledge of these two areas check what it is that I'm doing wrong? Thanks.

asfimport commented 11 years ago

Sivan Yogev (migrated from JIRA)

Solved the problem with stored fields: I realized that in order for the stored fields reader to have the right number of documents, the last document must have a stored field. Term vectors are still failing...

asfimport commented 11 years ago

Erick Erickson (@ErickErickson) (migrated from JIRA)

Guys:

It's great that you're tackling this. I wanted to encourage you to be patient about responses; there are just a few people in the universe who understand the details well enough to comment on the code (I'm sure not one of them!), so it might feel like you're shouting down a well....

asfimport commented 11 years ago

Sivan Yogev (migrated from JIRA)

Patch with stored fields bug fixed.

asfimport commented 11 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Trying to catch up here ... I just have a bunch of random questions (don't fully understand the patch yet):

Not sure why some files show as all deleted / all added lines, eg at least FrozenBufferedDeletes.java.

Patch also has tabs, which should be spaces... (eg IndexWriter.java).

Why do we have FieldsUpdate.Operation.ADD_DOCUMENT? It seems weird to pass that to IW.updateFields? Shouldn't apps just use IW.addDocument?

Why do we need SegmentInfoReader.readFilesList? It seems like it's only privately used inside the codec? I'm confused why the "normal" file tracking we have on write is insufficient... oh I see, a single SegmentInfo references all stacked segments too? But since they are written "later" their files won't be automatically tracked ... ok. I wonder if each stacked segment should get its own SegmentInfo, linked to the base segment...

It looks like merge policies don't yet know about / target stacked segments ...

It seems like we don't invert the document updates until the updates are applied? Ie, we just buffer the IndexableField provided by the user, and when it's time to apply updates, we then analyze/invert? How do we track RAM in this case? (Eg the field could be something arbitrary, eg pre-tokenized.) Another option is to do the inversion and buffer the resulting postings, and then later "replay" them (remapping docIDs) when it's time to apply.

Why does StoredFieldsReader.visitDocument need a Set for ignored fields?

asfimport commented 11 years ago

Sivan Yogev (migrated from JIRA)

Erick - thanks for the support!

Mike - thanks for the comments, here's an attempt to supply answers:

Regarding formatting and all deletes - I will check and try to fix those.

Why do we have FieldsUpdate.Operation.ADD_DOCUMENT? It seems weird to pass that to IW.updateFields? Shouldn't apps just use IW.addDocument?

We have ADD and REPLACE for FIELDS, and also REPLACE_DOCUMENTS, so having ADD_DOCUMENT would allow applications to work only with updateFields. There certainly are actions that can be performed in more than one way in this API; do you find this too confusing?

Why do we need SegmentInfoReader.readFilesList? ...

I considered the alternative you propose of having a SegmentInfo for each stacked segment, and it seemed more complex to manage than what is done with .del files, so I chose the .del-files approach. You are right about its privacy; I removed it from SegmentInfoReader and the actual readers have it privately.

It looks like merge policies don't yet know about / target stacked segments ...

I was planning to have it in another issue. Should I create it already?

It seems like we don't invert the document updates until the updates are applied? ...

I went for the simple solution, trying to introduce as few new concepts as possible (and still the patch size is >7000 lines). Your proposal should certainly be considered and maybe tested. I need to make sure I do the RAM calculations right; the added documents must be reflected in the RAM consumption of the deletions queue.

Why does StoredFieldsReader.visitDocument need a Set for ignored fields?

When fetching stored fields from a segment with replacements, it is possible that all the contents of a certain field in the base and first n stacked segments should be ignored. Therefore, the implementation starts the visiting from the most recent updates. If we encounter a field replacement at some stage, that field name is added to the Set of ignored fields, and the content of that field in the stacked segments we encounter later (which are older updates) is ignored.
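
For illustration, a sketch of that newest-to-oldest walk under simplified stand-in types (not the actual classes from the patch):

```java
import java.util.HashSet;
import java.util.Set;

/** Sketch of the newest-to-oldest stored-fields walk described above; the
 *  types are simplified stand-ins, not the actual classes from the patch. */
class StackedStoredFieldsSketch {
  interface StoredFieldVisitor {
    void stringField(String name, String value);
  }
  interface Layer {
    /** Visits doc's stored fields, skipping any field named in ignore; after
     *  visiting, adds to ignore the fields this layer REPLACED for doc. */
    void visitDocument(int doc, Set<String> ignore, StoredFieldVisitor visitor);
  }

  /** layers[0] is the newest stacked segment; the last entry is the base segment. */
  static void visitDocument(int doc, Layer[] newestToOldest, StoredFieldVisitor visitor) {
    Set<String> ignoredFields = new HashSet<>();
    for (Layer layer : newestToOldest) {
      layer.visitDocument(doc, ignoredFields, visitor);
    }
  }
}
```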

asfimport commented 11 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Why do we have FieldsUpdate.Operation.ADD_DOCUMENT? It seems weird to pass that to IW.updateFields? Shouldn't apps just use IW.addDocument?

We have ADD and REPLACE for FIELDS, and also REPLACE_DOCUMENTS, so having ADD_DOCUMENT would allow applications to work only with updateFields. There certainly are actions that can be performed in more than one way in this API; do you find this too confusing?

Well I just generally prefer that there is one [obvious] way to do something ... it can cause confusion otherwise, ie users will wonder what's the difference between addDocument and updateFields(Operation.ADD_DOCUMENT, ...)

Why do we need SegmentInfoReader.readFilesList? ...

I considered the alternative you propose of having a SegmentInfo for each stacked segment, and it seemed more complex to manage than what is done with .del files, so I chose the .del-files approach. You are right about its privacy; I removed it from SegmentInfoReader and the actual readers have it privately.

OK.

It looks like merge policies don't yet know about / target stacked segments ...

I was planning to have it in another issue. Should I create it already?

Another issue is a good idea! No need to create it yet ... but it seems like it will be important for real usage.

Do we have any sense of how performance degrades as the stack gets bigger? It's more on-the-fly merging at search-time...

I'm worried about that search-time merge cost ... I think it's usually better to pay a higher indexing cost in exchange for faster search time, which makes #5341 a compelling alternate approach...

It seems like we don't invert the document updates until the updates are applied? ...

I went for the simple solution, trying to introduce as few new concepts as possible (and still the patch size is >7000 lines). Your proposal should certainly be considered and maybe tested. I need to make sure I do the RAM calculations right; the added documents must be reflected in the RAM consumption of the deletions queue.

OK that makes sense; we should definitely do whatever's easiest/fastest to get to a dirt path.

We should think through the tradeoffs. I think it may confuse apps that the Field is not "consumed" after IW.updateFields returns, but rather cached and processed later. This means you cannot reuse fields, you have to be careful with pre-tokenized fields (can't reuse the TokenStream), etc.

It also means NRT reopen is unexpectedly costly, because only on flush will we invert & index the documents, and it's a single-threaded operation during reopen (vs per-thread if we invert up front).

Still it makes sense to do this for starters ... it's simpler.

Why does StoredFieldsReader.visitDocument need a Set for ignored fields?

When fetching stored fields from a segment with replacements, it is possible that all the contents of a certain field in the base and first n stacked segments should be ignored. Therefore, the implementation starts the visiting from the most recent updates. If we encounter a field replacement at some stage, that field name is added to the Set of ignored fields, and the content of that field in the stacked segments we encounter later (which are older updates) is ignored.

Ahhh right.

Are stored fields now sparse? Meaning if I have a segment w/ many docs, and I update stored fields on one doc, in that tiny stacked segment will the stored fields files also be tiny?

asfimport commented 11 years ago

Sivan Yogev (migrated from JIRA)

Let me start with the last question.

Are stored fields now sparse? Meaning if I have a segment w/ many docs, and I update stored fields on one doc, in that tiny stacked segment will the stored fields files also be tiny?

In such a case you will get the equivalent of a segment with multiple docs where only one of them contains stored fields. I assume the impls of stored fields handle these cases well and you will indeed get tiny stored fields.

Regarding the API - I did some cleanup, and also removed Operation.ADD_DOCUMENT. Now there is only one way to perform each operation, and updateFields only allows you to add or replace fields given a term.

This means you cannot reuse fields, you have to be careful with pre-tokenized fields (can't reuse the TokenStream), etc.

This is mentioned in the Javadoc of updateFields; let me know if there's a better way to address it.

As for the heavier questions: NRT support should be considered separately, but the guideline I followed was to keep things as close as possible to the way deletions are handled. Therefore, we need to add to SegmentReader a field named liveUpdates - an equivalent to liveDocs. I already put a TODO for this (SegmentReader line 131); implementing it won't be simple...

The performance tradeoff you are rightfully concerned about should be handled through merging. Once you merge an updated segment, all updates are "cleaned", and the new segment has no performance issues. Apps that perform updates should make sure (through MergePolicy) to avoid reader-side updates as much as possible.

asfimport commented 11 years ago

Sivan Yogev (migrated from JIRA)

Patch with some additional bug fixes and more elaborate tests, all working. Ready to commit?

asfimport commented 11 years ago

Sivan Yogev (migrated from JIRA)

Some existing tests fail with the latest patch, working on fixes.

asfimport commented 11 years ago

Sivan Yogev (migrated from JIRA)

New patch, concurrency bugs fixed. All tests pass.

asfimport commented 11 years ago

David Smiley (@dsmiley) (migrated from JIRA)

It's exciting to see progress here! I too live in the "world" that Shai speaks of – DOCS_ONLY (w/ no norms). I don't need to update a title field; I need to update ACLs and various categorical "tag" fields that will subsequently influence faceting or filtering.

Hey Rob, early on you made this excellent point:

A second problem (not solved by the above) is that many people are using scoring factors with a variety of signals, and these are changing often. I think unfortunately, people are often putting these in a normal indexed field and uninverting them on the FieldCache, requiring the whole document to be reindexed just because of how they implemented the scoring factor. People could instead solve this by putting their app's primary key into a docvalues field, allowing them to keep these scoring factors completely external to Lucene (e.g. their own array or whatever), indexed by their own primary key. But the problem is, I think, people want Lucene to manage this; they don't want to implement themselves what's necessary to make it consistent with commits etc.

So true. What if Lucene had more hooks to make it easier to manage commit-consistency with side-car data? I have no clue what's needed exactly, only that I don't dare do this without such hooks because I fear the complexity. With hooks and documentation, it can become clear how to maintain data alongside Lucene's index, and this opens doors: like making it easier to store data in something custom (e.g. a DB) instead of stored fields (you won't have to pay needless merge cost), or putting metrics that influence scoring somewhere, as you hinted at above.

asfimport commented 11 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Are stored fields now sparse? Meaning if I have a segment w/ many docs, and I update stored fields on one doc, in that tiny stacked segment will the stored fields files also be tiny?

In such a case you will get the equivalent of a segment with multiple docs where only one of them contains stored fields. I assume the impls of stored fields handle these cases well and you will indeed get tiny stored fields.

You're right, this is up to the codec ... hmm but the API isn't sparse (you have to .addDocument 1M times to "skip over" 1M docs right?), and I'm not sure how well our current default (Lucene41StoredFieldsFormat) handles it. Have you tested it?

Regarding the API - I did some cleanup, and also removed Operation.ADD_DOCUMENT. Now there is only one way to perform each operation, and updateFields only allows you to add or replace fields given a term.

OK thanks!

This means you cannot reuse fields, you have to be careful with pre-tokenized fields (can't reuse the TokenStream), etc.

This is mentioned in the Javadoc of updateFields; let me know if there's a better way to address it.

Maybe also state that one cannot reuse Field instances, since the Field may not be actually "consumed" until some later time (we should be vague since this really is an implementation detail).

As for the heavier questions: NRT support should be considered separately, but the guideline I followed was to keep things as close as possible to the way deletions are handled. Therefore, we need to add to SegmentReader a field named liveUpdates - an equivalent to liveDocs. I already put a TODO for this (SegmentReader line 131); implementing it won't be simple...

OK ... yeah it's not simple!

The performance tradeoff you are rightfully concerned about should be handled through merging. Once you merge an updated segment, all updates are "cleaned", and the new segment has no performance issues. Apps that perform updates should make sure (through MergePolicy) to avoid reader-side updates as much as possible.

Merging is very important. Hmm, are we able to just merge all updates down to a single update? Ie, without merging the base segment? We can't express that today from MergePolicy right? In an NRT setting this seems very important (ie it'd be best bang (= improved search performance) for the buck (= merge cost)).

I suspect we need to do something with merging before committing here.

Hmm, I see that StackedTerms.size()/getSumTotalTermFreq()/getSumDocFreq() pulls a TermsEnum and goes and counts/aggregates all terms ... which in general is horribly costly? EG these methods are called per-query to set up the Sim for scoring ... I think we need another solution here (not sure what). Also getDocCount() just returns -1 now ... maybe we should only allow updates against DOCS_ONLY/omitsNorms fields for now?

Have you done any performance tests on biggish indices?

I think we need a test that indexes a known (randomly generated) set of documents, randomly sometimes using add and sometimes using update/replace field, mixing in deletes (just like TestField.addDocuments()), for the first index, and for the second index only using addDocument on the "surviving" documents, and then we assertIndexEquals(...) in the end? Maybe we can factor out code from TestDuelingCodecs or TestStressIndexing2.

Where do we account for the RAM used by these buffered updates? I see BufferedUpdates.addTerm has some accounting the first time it sees a given term, but where do we actually add in the RAM used by the FieldsUpdate itself?

asfimport commented 11 years ago

Sivan Yogev (migrated from JIRA)

After rethinking the point-of-inversion issue, seems like the right time to do it is ASAP - not to hold the added fields and invert them later, but rather invert them immediately and save their inverted version. 3 reasons for that:

  1. Take out the constraint I inserted into the API, so update fields can be reused and contain a Reader/TokenStream,
  2. NRT support: we cannot search until we invert, and if we invert earlier NRT support will be less complicated, probably some variation on multi-reader to view uncommitted updates,
  3. You are correct that we currently do not account for the RAM usage of the FieldsUpdate, since I thought using RAMUsageEstimator would be too costly. It will probably be more efficient to calculate RAM usage of the inverted fields, maybe even during inversion?

So my question in that regard is: how can I invert a document and hold its inverted form to be used by NRT and later inserted into a stacked segment? Should I create a temporary Directory and invert into it? Is there another way to do this?

Merging is very important. Hmm, are we able to just merge all updates down to a single update? Ie, without merging the base segment? We can't express that today from MergePolicy right? In an NRT setting this seems very important (ie it'd be best bang (= improved search performance) for the buck (= merge cost)).

Shai is helping with the creation of a benchmark to test performance in various scenarios. I will start adding update aspects to the merge policy. I am not sure if merging just the updates of a segment is feasible. In what cases would it be better than collapsing all updates into the base segment?

I think we need a test that indexes a known (randomly generated) set of documents, randomly sometimes using add and sometimes using update/replace field, mixing in deletes (just like TestField.addDocuments()), for the first index, and for the second index only using addDocument on the "surviving" documents, and then we assertIndexEquals(...) in the end? Maybe we can factor out code from TestDuelingCodecs or TestStressIndexing2.

TestFieldReplacements already has a test which randomly adds documents, replaces documents, adds fields and replaces fields. I refactored it to enable using a seed, and created a "clean" version with only addDocument(...) calls. However, the FieldInfos of the "clean" version do not include things that the "full" version includes, because in the full version fields possessing certain traits were added and then deleted. I will look at the other suggestions.

asfimport commented 11 years ago

Shai Erera (@shaie) (migrated from JIRA)

I am not sure if merging just the updates of a segment is feasible. In what cases would it be better than collapsing all updates into the base segment?

Just like expungeDeletes, I think we should have a collapseFieldUpdates() which can be called explicitly by the app, but IW should also call MP.findSegmentsForFieldUpdates() (or some such name). And it should collapse all updates into the segment, which implies rewriting that segment. If we collapse all updates but keep the base segment + a single stacked segment, I don't think that we're gaining much. The purpose is to get rid of the updates entirely.
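
A rough sketch of the shape such a hook could take — both names are placeholders ("or some such name"), not an actual Lucene API:

```java
import java.util.List;

/** Hypothetical shape of the hooks described above; names are illustrative. */
abstract class UpdateAwareMergePolicySketch {
  /** Analogous to the expungeDeletes path: given the segments that currently
   *  carry stacked field updates, returns those whose updates should now be
   *  collapsed into their base segment (rewriting that segment entirely). */
  abstract List<String> findSegmentsForFieldUpdates(List<String> segmentsWithUpdates);
}
```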

Also, regarding statistics: I think that as a first step, we should not go out of our way to return correct statistics. Just as the stats today do not account for deleted documents, they need not account for the updates. I realize that it's not the same as deleted documents, but it certainly simplifies matters. Stats will be correct following collapseFieldUpdates or regular segment merges.

As a second step, we can try to return statistics including stacked segments more efficiently. I.e., if a term appears in both the base and stacked segment, we return the stats from base. But if it exists only in the stacked segment, we can return the stats from there? I'm not too worried about the stats though, because that's a temporary thing, which gets fixed once updates are collapsed.

And if the MergePolicy has separate settings for collapsing field updates (I think it should!), then the collapsing could occur more frequently than regular merges (and expunging of deleted documents). Also, it will give apps a way to control how often they want to get accurate statistics.

Can we leave statistics outside the scope of this issue? And for now change CheckIndex to detect that it's a segment with field updates, and therefore check stats from the base segment only? I think it does something like that with deleted documents already, no?

asfimport commented 11 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

After rethinking the point-of-inversion issue, seems like the right time to do it is ASAP - not to hold the added fields and invert them later, but rather invert them immediately and save their inverted version. 3 reasons for that: 1. Take out the constraint I inserted into the API, so update fields can be reused and contain a Reader/TokenStream, 2. NRT support: we cannot search until we invert, and if we invert earlier NRT support will be less complicated, probably some variation on multi-reader to view uncommitted updates, 3. You are correct that we currently do not account for the RAM usage of the FieldsUpdate, since I thought using RAMUsageEstimator would be too costly. It will probably be more efficient to calculate RAM usage of the inverted fields, maybe even during inversion?

+1

I would also add "4. Inversion of updates is single-threaded", ie once we move inversion into .updateFields it will be multi-threaded again.

So my question in that regard is: how can I invert a document and hold its inverted form to be used by NRT and later inserted into a stacked segment? Should I create a temporary Directory and invert into it? Is there another way to do this?

I think we should somehow re-use the existing code that inverts (eg FreqProxTermsWriter)? Ie, invert into an in-RAM segment, with "temporary" docIDs, and then when it's time to apply the updates, you need to rewrite the postings to disk with the re-mapped docIDs.
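
A small sketch of the docID-remapping step of that idea, under the assumption that the temporary-to-real mapping has already been resolved (illustrative only):

```java
import java.util.Arrays;

/** Sketch of the apply-time rewrite described above: updates are inverted up
 *  front into an in-RAM segment using temporary docIDs, and only when the
 *  updates are applied are the postings rewritten with the base segment's
 *  real docIDs (illustrative; real code would stream postings, not use arrays). */
class DocIdRemapSketch {
  static int[] remap(int[] tempDocIds, int[] tempToRealDocId) {
    int[] real = new int[tempDocIds.length];
    for (int i = 0; i < tempDocIds.length; i++) {
      // tempToRealDocId is resolved at apply time, e.g. by looking up the
      // update's constraint term in the base segment.
      real[i] = tempToRealDocId[tempDocIds[i]];
    }
    Arrays.sort(real); // postings must be written in increasing docID order
    return real;
  }
}
```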

I wouldn't do anything special for NRT for starters, meaning, from NRT's standpoint, it opens these stacked segments from disk as it would if a new non-NRT reader was being opened. So I would leave that TODO in SegmentReader as a TODO for now :) Later, we can optimize this and have updates carry in RAM like we do for deletes, but I wouldn't start with that ...

Merging is very important. Hmm, are we able to just merge all updates down to a single update? Ie, without merging the base segment? We can't express that today from MergePolicy right? In an NRT setting this seems very important (ie it'd be best bang (= improved search performance) for the buck (= merge cost)).

Shai is helping with the creation of a benchmark to test performance in various scenarios. I will start adding update aspects to the merge policy. I am not sure if merging just the updates of a segment is feasible. In what cases would it be better than collapsing all updates into the base segment?

Imagine a huge segment that's accumulating updates ... say it has 20 stacked segments. First off, those stacked segments are each tying up N file descriptors on open, right? (Well, only one if it's CFS.) But second off, I would expect search perf with 1 base + 20 stacked to be worse than with 1 base + 1 stacked? We need to test if that's true ... it's likely that most of the perf loss comes from going from no stacked segments to 1 stacked segment ... and then going from 1 to 20 stacked segments doesn't hurt "that much". We have to test and see.

Simply merging that big base segment with its 20 stacked segments is going to be too costly to do very often.

I think we need a test that indexes a known (randomly generated) set of documents, randomly sometimes using add and sometimes using update/replace field, mixing in deletes (just like TestField.addDocuments()), for the first index, and for the second index only using addDocument on the "surviving" documents, and then we assertIndexEquals(...) in the end? Maybe we can factor out code from TestDuelingCodecs or TestStressIndexing2.

TestFieldReplacements already has a test which randomly adds documents, replaces documents, adds fields and replaces fields. I refactored it to enable using a seed, and created a "clean" version with only addDocument(...) calls. However, the FieldInfos of the "clean" version do not include things that the "full" version includes, because in the full version fields possessing certain traits were added and then deleted. I will look at the other suggestions.

It should be fine if the FieldInfos don't match? Ie, when comparing the two indices we should not compare field numbers. We should be comparing only external things like fieldName, which ids we had indexed, etc.

asfimport commented 11 years ago

Shai Erera (@shaie) (migrated from JIRA)

Woops, resolved wrong issue :)

asfimport commented 11 years ago

Shai Erera (@shaie) (migrated from JIRA)

I branched https://svn.apache.org/repos/asf/lucene/dev/branches/lucene4258. The patch is really immense and I think it'll be easier to work on this feature in a separate branch. Committed rev 1425441.

asfimport commented 11 years ago

Sivan Yogev (migrated from JIRA)

Started switching to the invert-first approach following Mike's advice. My thought was to have a single directory for each fields update, and when flushing do something similar to IndexWriter.addIndexes(IndexReader...) to build the stacked segment. However, I encountered two problems with this approach:

  1. If a certain document is updated more than once in a certain generation, the two inverted documents should be merged into one,
  2. an extension of 1, where a field added in the first update is to be replaced in the second one.

What I will try to do in such cases is to move the later updates to a new update generation. This will increase the number of generations, but I think it's a fair price to pay in light of the benefits offered by the invert-first approach.

asfimport commented 11 years ago

Sivan Yogev (migrated from JIRA)

New patch over the issue branch. Updated fields are now inverted directly into a RAMDirectory when added, and updated segments are created by merging such directories. If more than one update is to be applied to the same document, the later update is moved to another updated segment.

Still missing:

  1. Implement RAM usage computation for updates,
  2. fix TestFieldReplacements.testIndexEquality().

asfimport commented 11 years ago

Shai Erera (@shaie) (migrated from JIRA)

I upgraded the branch to trunk (w/o the latest patch). TestFieldUpdate fails on a norms assertion. I'm not sure exactly what happened, but SegmentReader had many strange conflicts, so I hope I didn't screw something up when resolving them. Will try to get to the bottom of it later.

asfimport commented 11 years ago

Shai Erera (@shaie) (migrated from JIRA)

Sivan, I tried to apply the patch to the branch (after the upgrade to trunk) but it does not apply. Some files are missing, some have issues. Can you perhaps bring this patch up to the current branch's version and post another one?

asfimport commented 11 years ago

Sivan Yogev (migrated from JIRA)

Thanks Shai for trying to apply the patch; it will take me some time to make adjustments to the changes in trunk, including some fixes to the branch.

asfimport commented 11 years ago

Sivan Yogev (migrated from JIRA)

Updated patch over the branch after merging with trunk. New tests, still with bugs in the handling of replacements.

asfimport commented 11 years ago

Shai Erera (@shaie) (migrated from JIRA)

Committed the latest patch to the branch and upgraded branch to latest trunk.

asfimport commented 11 years ago

Sivan Yogev (migrated from JIRA)

Added tests and fixed bugs. The most thorough test is testIndexEquality(), which runs a complicated scenario of adding and replacing documents and fields and compares the resulting index to an index created using only addDocument. The IndexWriterConfig used can be created either with new IndexWriterConfig() or with newIndexWriterConfig(), which introduces randomness. When there is no randomness the test passes; with randomness and -Dtests.seed=FFC28997A6951FFB it fails.