
Incremental Field Updates through Stacked Segments [LUCENE-4258] #5328

Open asfimport opened 12 years ago

asfimport commented 12 years ago

Shai and I would like to start working on the proposal for Incremental Field Updates outlined here: http://markmail.org/message/zhrdxxpfk6qvdaex.


Migrated from LUCENE-4258 by Sivan Yogev, 14 votes, updated Jul 01 2014 Attachments: IncrementalFieldUpdates.odp, LUCENE-4258.branch.1.patch, LUCENE-4258.branch.2.patch, LUCENE-4258.branch.4.patch, LUCENE-4258.branch.5.patch, LUCENE-4258.branch.6.patch (versions: 2), LUCENE-4258.branch3.patch, LUCENE-4258.r1410593.patch, LUCENE-4258.r1412262.patch, LUCENE-4258.r1416438.patch, LUCENE-4258.r1416617.patch, LUCENE-4258.r1422495.patch, LUCENE-4258.r1423010.patch, LUCENE-4258-API-changes.patch Linked issues:

asfimport commented 12 years ago

Robert Muir (@rmuir) (migrated from JIRA)

A few things I don't understand:

In truth, I think the term level is too fine-grained for updates in Lucene because of norms; only updating whole fields of a document will really work (as the norm can then simply be recomputed and replaced).

asfimport commented 12 years ago

Shai Erera (@shaie) (migrated from JIRA)

There is more to it than just the referenced email. I've had a couple of discussions about this in the past with various people (and it is my fault that I didn't write them down and share them with the rest of you) – I'll try to summarize a more detailed proposal below:

API: Add an updateFields method which takes a Constraint and an OP (eventually, it might replace today's updateDocument):
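
Since the exact signature isn't spelled out, here is a minimal sketch of what such a method could look like — the Operation enum, the Term-as-constraint choice, and the class name are all illustrative assumptions, not the actual proposal:

```java
import org.apache.lucene.index.IndexableField;
import org.apache.lucene.index.Term;

// Hypothetical sketch only -- names and signatures are illustrative, not the final API.
public abstract class UpdatingIndexWriterSketch {

  /** The update semantics: add terms to a field vs. replace the field's content. */
  public enum Operation { ADD_FIELDS, REPLACE_FIELDS }

  /**
   * Applies the given fields to every document matching the constraint term,
   * buffered like deletes are today and materialized as a stacked segment on flush.
   */
  public abstract void updateFields(Operation op, Term constraint, IndexableField... fields);
}
```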

Implementation: The idea is to create StackedSegments, which, well, stack on top of current segments. The inspiration came from deletes, which can be viewed as a segment stacked on an existing segment, that marks which documents are deleted.

Following those semantics, a segment could be comprised of these files:

Two options to encode the posting lists:

Ideally, the way incremental updates will be applied will follow how deletes are applied today:

Again, this is an internal detail; I'd appreciate it if someone could point us to where that happens in the code today (now with concurrent flushing). I remember PackedDeletes existed at some point; has that changed?

If it's a new Codec, then SegmentReader may not even need to change ...

The REPLACE_FIELD OP is tricky ... perhaps it's like how deletes are materialized on disk – as a sparse bit vector that marks the documents that are no longer associated with it ...

I also think that we should introduce this feature in steps:

  1. Support only fields that omit TFAP (i.e. DOCS_ONLY). This is very valuable for fields like ACL, TAGS, CATEGORIES etc. (see the usage sketch after this list)
    • Ideally, the app would just need to say "add/remove ACL:SHAI to/from document X", rather than passing the entire list of ACLs on every update operation.
    • This, I believe, is also the most common use case for incremental field updates.
  2. Support stored fields, whether as part of (1) or as a follow-on; adding TAG:LUCENE to the postings but not the stored fields is limiting ...
  3. Support terms with positions, but no norms. What I'm thinking about are terms that store stuff in the payload, but don't care about the positions themselves. An example is the category dimensions of the facet module, which stores category ordinals in the payload
    • Positions are tricky, and we'll need to do this carefully, I know. But I don't rule it out at this point.
  4. Then, support fields with norms. I get your concern, Robert, and I agree it's a challenge, which is why I leave it for last. The scenario I have in mind is: a search engine that lets you comment on a result or tag it, and the comment/tag should be added to the document's 'catchall' field for later searches. I think it's a valuable scenario, and this is something I'd like to support. If we cannot find a way to deal with it and the norms, then I see two options:
    1. Document a limitation on updating a field with norms: do it at your own risk.
    2. Enforce the REPLACE_FIELD OP on fields with norms.
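
To make the ACL case in step 1 concrete, a hypothetical usage example building on the illustrative updateFields/Operation names sketched earlier — the "folder"/"acl" field names are made up for illustration:

```java
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.Term;

class AclUpdateExample {
  // Hypothetical usage: grant user "shai" access to every document whose
  // "folder" field is "/projects", without reindexing those documents.
  static void grantAcl(UpdatingIndexWriterSketch writer) {
    writer.updateFields(UpdatingIndexWriterSketch.Operation.ADD_FIELDS,
        new Term("folder", "/projects"),
        new StringField("acl", "shai", Field.Store.NO));
  }
}
```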

I suggest we do this work in a dedicated branch of course. Ideally, we can port everything to 4.x at some point, as I think most of the changes are internal details ...

asfimport commented 12 years ago

Robert Muir (@rmuir) (migrated from JIRA)

I don't think I'm sold on introducing the feature in steps.

I think it's critical for something of this magnitude that we figure out the design totally, up-front, so it will work for the major use cases. I think it's fine to implement in steps if we need to, though.

Honestly, I think we should lay it all out on the table and get to the real problems that I think most people face today:

  1. For many document sizes and use cases (especially rapidly changing stuff): the real problem is not the speed of Lucene reindexing the document, it's that the user must rebuild the entire document. Solr solved this by providing an option where you just say "update field X" and internally it reindexes the document from stored fields (for that feature to work, the whole thing must be stored). We shouldn't discard the possibility of implementing cleaner support for a solution like this, which wouldn't complicate IndexWriter at all.
  2. A second problem (not solved by the above) is that many people are using scoring factors with a variety of signals, and these are changing often. I think unfortunately, people are often putting these in a normal indexed field and uninverting them on the FieldCache, requiring the whole document to be reindexed just because of how they implemented the scoring factor. People could instead solve this by putting their app's primary key into a docvalues field, allowing them to keep these scoring factors completely external to Lucene (e.g. their own array or whatever), indexed by their own primary key. But the problem is, I think, people want Lucene to manage this; they don't want to implement themselves what's necessary to make it consistent with commits etc.

So we can look at several approaches to solving this stuff. I feel like both of these problems could be solved via a contrib module, without modifying IndexWriter at all, for many use cases; maybe better still if we go for tighter integration. And with those simple approaches I describe above, searching doesn't get any slower.

But if we really feel like we need a "full incremental update API" (I know there are a few use cases where it can help, I'm not discarding that), then I feel like there are a few things I want:

I strongly feel that if we just add these incremental APIs to IndexWriter without being careful about these things, the end result could be that people use them without thinking and end up with slower search and worse relevance; that's why I am asking so many questions.

asfimport commented 12 years ago

Shai Erera (@shaie) (migrated from JIRA)

I think it's ok if we introduce IFU for DOCS_ONLY fields at first, throwing exceptions otherwise. E.g., UpdateField overrides setOmitNorms and such and throws UOE... at first.

Everything else will still work as it is today...

Codecs didn't handle all segment files first... stored fields and such were added later. I do agree though that we should keep in mind the full range of scenarios.

Sorry for the short response, JIRA isn't great on smartphones :-).

asfimport commented 12 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Codecs didn't handle all segment files first... stored fields and such were added later. I do agree though that we should keep in mind the full range of scenarios.

I don't think that's really comparable at all, for two reasons:

  1. Codecs can be considered a "rote" refactoring of the XXXWriters in 3.x. I'm not trying to diminish the value, but it's just an introduced abstraction layer. Something like this is different in that it's algorithmic.
  2. The fact that Codecs only handled postings at first wasn't easy to fix after they were introduced as postings-only: extending them beyond postings was a significant refactoring.

I'm not trying to pick on your proposal, I'm just saying there are things I don't like about the design.

asfimport commented 12 years ago

Shai Erera (@shaie) (migrated from JIRA)

...which is to update the contents of one field, without reindexing the entire document

I agree, but I distinguish between two operations:

  1. replacing the content of a field entirely with new content (or removing the field)
  2. updating the field's content by adding/removing individual terms

I think requiring no positions, no frequencies, and no norms makes it even more fringe. This means it's not really useful for any search purposes. And we are a search engine library.

I disagree. Where I come from, the most common use case where such an operation is useful is when a single change affects hundreds, and sometimes thousands, of documents. An example is a document-library-like application which manages folders with ACLs. You can add an ACL to a top-level folder and it affects all the documents and folders beneath it. That results in reindexing, sometimes, a huge number of documents.

I don't dismiss the use case of updating a field for scoring purposes, not at all. I'm just saying that starting by supporting one use case is better than supporting none.

Now, and this probably stems from my lack of understanding of the Lucene internals – I see "supporting terms that omit TFAP" as a starting point because that is the easiest case, and even that requires a lot of understanding of the internals. After we do that, I'll feel more comfortable discussing other types of updates for other field types ... at least, I'll feel that I have more intelligent things to say :).

Regarding your other concerns, I share them, and we of course need to benchmark everything. I don't know how much this affects search. But those updates will get merged away when segments are merged, so while I'm sure search will be affected, it's not for eternity - only until that segment is merged. And I think we need to add a capability to MergePolicy to findSegmentsForMergeUpdates, just like we have expungeDeletes.

If the first step means that in order to update a field used for scoring (i.e. w/ norms) you need to replace the content of the field entirely with new content, I'm ok with it. As one esteemed member of this community always says, "progress, not perfection" - I'm totally sold on that!

asfimport commented 12 years ago

Robert Muir (@rmuir) (migrated from JIRA)

I don't think it's progress if we add a design that can only work with omitTFAP and no norms, and can only update individual terms, but not fields.

It means that to support these things we would have to totally clear out what's there and then introduce a new design.

In fact, this issue shouldn't be called incremental field updates: it's not. It's "term updates" or something else entirely.

asfimport commented 12 years ago

Shai Erera (@shaie) (migrated from JIRA)

can only update individual terms, but not fields

Who said that? One of the update operations I listed is REPLACE_FIELD, which means replace the field's content entirely with the new content.

I don't think it's progress if we add a design that can only work with omitTFAP and no norms

I never said that would be the design. What I said is that in order to update a field at the term level, we'll start with such fields only. The rest of the fields (i.e. w/ norms, payloads and whatnot) will be updated through REPLACE_FIELD. The way I see it, we still address all the issues; only for some fields we require a whole-field replace rather than an optimized term-based update. That can be improved further along, or not.

In fact, this issue shouldn't be called incremental field updates: it's not. It's "term updates" or something else entirely.

That is my idea of incremental field updates, and I'm not sure that it's not your idea as well :). You seem to want to support only REPLACE_FIELD, while I say that for some field types we can also support UPDATE_FIELD (i.e. at the term level); that's it!

asfimport commented 12 years ago

Shai Erera (@shaie) (migrated from JIRA)

I had a chat about this with Robert a couple of days ago; we figured it would be easier to discuss the differences in approaches/opinions directly, rather than through back-and-forth JIRA comments. Our ideas of incremental field updates are not much different. Robert stressed that in his opinion we should first tackle the REPLACE_FIELD operation, which replaces the content of a field entirely with new content, because he believes that's the most common scenario (i.e., update the title field). I believe that term-based updates are very important too, at least in the scenarios that I face (i.e. adding/removing one ACL, one social tag, one category etc.).

We concluded that the design should take REPLACE_FIELD into consideration from the get-go. Whether we'll also implement UPDATE_FIELD (or UPDATE_TERMS as a better name?) depends on its complexity, because initially UPDATE_TERMS can be implemented through REPLACE_FIELD, so we don't lose functionality. UPDATE_TERMS can come later as an optimization.

Robert, if I misrepresented our conclusions, please correct me.

asfimport commented 12 years ago

Sivan Yogev (migrated from JIRA)

Seems like in any case we need a separation between fields given with UPDATE_FIELD and those given with REPLACE_FIELD. There are two ways I can think of to implement this separation.

The first is at the segment level, where we can have separate "update" and "replace" segments, where the semantic is that a field in an "update" segment is merged with fields in previous segments, while a field in a "replace" segment ignores previous segments.

The second option is to separate at the field level, choosing one type as the default behavior (maybe this can be configurable) and marking the fields of the non-default type by altering the field name or via some other mechanism.

I lean towards the segment-level separation, since it requires fewer conventions and will probably require less work for Codec implementations to handle.

asfimport commented 12 years ago

Sivan Yogev (migrated from JIRA)

BTW, since the new method is to handle multiple fields (as the name suggests), the operation descriptions should also be in plural: UPDATE_FIELDS and REPLACE_FIELDS.

asfimport commented 12 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

BTW, since the new method is to handle multiple fields (as the name suggests), the operation descriptions should also be in plural: UPDATE_FIELDS and REPLACE_FIELDS.

+1

I think this design sounds good! REPLACE_FIELDS should easily be able to update norms correctly, right? Because the full per-field stats are recomputed from scratch. So then scores should be identical: should be a nice simple testcase to create :)

I don't see how UPDATE_FIELDS can do so unless we somehow save the raw stats (FieldInvertState) in the index. It seems like UPDATE_FIELDS should forever be limited to DOCS_ONLY, no norms updating? Positions also seem hard to update, and if the only reason to do so is for payloads... seems like the app should be using doc values instead, and we should (eventually) make doc values updatable?

I do think this is a common use case (ACLs, filters, social tags)... though I'm not sure how bad it'd really be in practice for the app to simply REPLACE_FIELDS with the full set of tags. I guess if we build REPLACE_FIELDS first we can test that.

The implementation should be able to piggy-back on all the buffering/tracking we currently do for buffered deletes.

I think this change should live entirely above Codec? Ie Codec just thinks it's writing a segment, not knowing if that segment is the base segment, or one of the stacked ones. If the +postings and -postings are simply 2 terms then the Codec need not know...

Seems like only SegmentInfos needs to track how segments stack up, and then I guess we'd need a new StackedSegmentReader that is atomic, holds N SegmentReaders, and presents the merged codec APIs by merging down the stack on the fly? I suspect this (having to use a PQ to merge the docIDs in the postings) will be a huge search performance hit....

I think UnionDocsAndPositionsEnum (in MultiPhraseQuery.java) is already doing what we want? (Except it doesn't handle negative postings.)
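
For illustration, a minimal self-contained sketch of that kind of priority-queue union over per-segment postings, assuming a simplified iterator interface; the real UnionDocsAndPositionsEnum also merges positions, and negative postings would additionally need to mask docIDs contributed by older generations:

```java
import java.util.PriorityQueue;
import java.util.function.IntConsumer;

/** Minimal sketch of a priority-queue union of the docIDs coming from a base
 *  segment's postings and its stacked segments' postings (simplified iterator
 *  interface; positions and negative postings are ignored here). */
class StackedPostingsUnionSketch {
  interface DocIdIterator {
    int nextDoc(); // returns NO_MORE_DOCS when exhausted
  }
  static final int NO_MORE_DOCS = Integer.MAX_VALUE;

  static void union(DocIdIterator[] subs, IntConsumer collector) {
    // Entries are {currentDocID, subIndex}, ordered by current docID.
    PriorityQueue<int[]> pq = new PriorityQueue<>((a, b) -> Integer.compare(a[0], b[0]));
    for (int i = 0; i < subs.length; i++) {
      int doc = subs[i].nextDoc();
      if (doc != NO_MORE_DOCS) pq.add(new int[] {doc, i});
    }
    int lastEmitted = -1;
    while (!pq.isEmpty()) {
      int[] top = pq.poll();
      if (top[0] != lastEmitted) { // emit each docID once, even if several segments contain it
        collector.accept(top[0]);
        lastEmitted = top[0];
      }
      int next = subs[top[1]].nextDoc();
      if (next != NO_MORE_DOCS) pq.add(new int[] {next, top[1]});
    }
  }
}
```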

What about merging? Seems like the merge policy should know about stacking and should sometimes (aggressively?) merge a stack down?

asfimport commented 12 years ago

Erick Erickson (@ErickErickson) (migrated from JIRA)

How does this relate (if at all, I confess I just looked at the title) to Andrzej's proposal here? #4910

asfimport commented 12 years ago

Robert Muir (@rmuir) (migrated from JIRA)

I don't see how UPDATE_FIELDS can do so unless we somehow save the raw stats (FieldInvertState) in the index. It seems like UPDATE_FIELDS should forever be limited to DOCS_ONLY, no norms updating?

Actually, it's DOCS_ONLY plus OMIT_NORMS.

Anyway why not start with updating the entire contents of a field as I suggested? It seems to be the most general solution, and there is some discussion about how scoring can work correctly on #4910 (the stats, not just norms).

I do think this is a common use case (ACLs, filters, social tags)... though I'm not sure how bad it'd really be in practice for the app to simply REPLACE_FIELDS with the full set of tags. I guess if we build REPLACE_FIELDS first we can test that.

This is why we should do 'replace contents of a field' first. It's the most well-defined and general.

It's also still controversial; I'm not convinced myself that it will actually help most people who think they want it. I think it will just slow down searches.

asfimport commented 12 years ago

Shai Erera (@shaie) (migrated from JIRA)

BTW, since the new method is to handle multiple fields (as the name suggests), the operation descriptions should also be in plural: UPDATE_FIELDS and REPLACE_FIELDS.

OK. To avoid confusion though, I think we should call it UPDATE_TERMS (not FIELDS). Then someone can call updateFields() twice: once for all the fields they want to REPLACE, and a second time for the fields whose terms they just want to update.

What about merging?

I wrote about it above – MergePolicy will need to take care of these stacked segments, and we'll add something like merge/expungeFieldUpdates so the app can call it deliberately.

seems like the app should be using doc values instead, and we should (eventually) make doc values updatable?

I agree we should not UPDATE_TERMS fields that record norms. I'm not sure that every use case of storing info in the payload today can be translated to using DocValues, so I don't want to limit things. So, let's start with UPDATE_TERMS taking care of fields that omit norms. Whether we also handle payloads for a few use cases can become an optimization later on. In the meanwhile, apps will just need to replace the entire field.

Progress, not perfection! :)

asfimport commented 12 years ago

Sivan Yogev (migrated from JIRA)

How does this relate (if at all, I confess I just looked at the title) to Andrzej's proposal here?

The basic idea is the same. One major difference is that in Andrzej's proposal the stacked updates are added to a new index with different doc IDs, and then the SegmentReader needs to map to the original doc IDs. The plan in this proposal (Shai, correct me if I'm wrong) is for the stacked updates not to be stand-alone segments. Although they will have the structure of regular segments, they will be tightly coupled with the original segment, with doc IDs matching those of the original segment.

asfimport commented 12 years ago

Sivan Yogev (migrated from JIRA)

Working on the details, it seems that we need to add a new layer of information for stacked segments. For each field that was added with REPLACE_FIELDS, we need to hold the documents in which a replace took place, along with the number of the latest generation that had the replacement. Call this list the "generation vector". That way, the TermDocs provided by StackedSegmentReader for a certain term is a special merge of that term's TermDocs across all stacked segments. The "special" part is that we ignore occurrences from documents in which the term's field was replaced in a later generation.

An example: assume we have doc 1 with title "I love bananas" and doc 2 with title "I love oranges", and the segment is flushed. We will have the following base segment (ignoring positions):

bananas: doc 1
I: doc 1, doc 2
love: doc 1, doc 2
oranges: doc 2

Now we add to doc 1 an additional title field "I hate apples", and replace the title of doc 2 with "I love lemons", and flush. We will have the following segment for generation 1:

apples: doc 1
hate: doc 1
I: doc 1, doc 2
lemons: doc 2
love: doc 2
generation vector for field "title": (doc 2, generation 1)

TermDocs for a few terms:

I propose to initially use PackedInts for the generation vector, since we know how many generations the current segment has upon flushing. Later we might consider special treatment for sparse vectors.
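
A tiny sketch of that filtering rule applied to the example above — the method and array layout are illustrative, not the patch's actual encoding:

```java
/** Sketch of the generation-vector filter described above (illustrative,
 *  not the patch's actual data structure). genVector[doc] holds the latest
 *  generation in which the field was replaced for doc, or 0 if never. */
class GenerationVectorSketch {
  static boolean acceptPosting(int doc, int postingGeneration, int[] genVector) {
    // A posting from generation g survives only if the field was not
    // replaced for this doc in a strictly later generation.
    return postingGeneration >= genVector[doc];
  }

  public static void main(String[] args) {
    // "title" from the example: doc 2's title replaced at generation 1, doc 1's never.
    int[] titleGen = {0, 0, 1}; // index 0 unused; docs are 1 and 2
    System.out.println(acceptPosting(1, 0, titleGen)); // true:  "love" in doc 1 (base) survives
    System.out.println(acceptPosting(2, 0, titleGen)); // false: "oranges" in doc 2 (base) is masked
    System.out.println(acceptPosting(2, 1, titleGen)); // true:  "lemons" in doc 2 (gen 1) survives
  }
}
```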

asfimport commented 12 years ago

Sivan Yogev (migrated from JIRA)

Adding a design proposal presentation, and two patches following the proposal's concepts. The first patch includes the proposed API changes (it does not compile), and the other the internal changes, for those interested in the implementation details. The second patch contains a new test named TestFieldsUpdates which currently fails.

asfimport commented 12 years ago

Sivan Yogev (migrated from JIRA)

Forgot to mention that the implementation patch is still missing many components...

asfimport commented 12 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

On slide 4, one of the enumerated operations is field deletion, but I am not sure how to do it with the proposed API on slide 5?

It is just a thought, but your work plan only mentions Lucene fields. Wouldn't it be easier to start working with DocValues? I guess it would help us get started with document updates and would already solve most use-cases (I'm especially thinking of scoring factors).

asfimport commented 12 years ago

Shai Erera (@shaie) (migrated from JIRA)

We will take care of DocValues too, eventually. I think this can be handled in a separate issue though.

and would already solve most use-cases

I have a problem with that statement. Robert thinks that the most common use case is to replace a field's content entirely. In our world (Sivan's and mine), updating a field's terms (removing/adding single terms) is the most common use case. And perhaps in your world updating DocValues for scoring purposes is the most common use case.

Therefore I don't think that there is one common use case, and IMO we shouldn't aim at solving just one first. Personally, I think DocValues are relatively new (compared to posting lists and payloads), and therefore being able to update them should come second (just because I estimate they are not as widely used as the others). But that's just my opinion – obviously someone who relies solely on DocValues would state otherwise :).

The design currently doesn't cover DocValues at all. I think, in order to keep this issue focused, we should handle that in a separate issue after we land updateable fields.

On slide 4, one of the enumerated operations is field deletion, but I am not sure how to do it with the proposed API on slide 5?

Good point. Well, you could say that calling replaceField("f", null) would mean "delete 'f'". That should work, but perhaps we can come up with something more explicit.

asfimport commented 11 years ago

Sivan Yogev (migrated from JIRA)

New patch; a naive test of adding updates to a single-document segment, flushed before or after the update, is working. Working on more complex tests with multiple segments, documents and updates.

asfimport commented 11 years ago

Sivan Yogev (migrated from JIRA)

New patch implementing some previously missing parts, with preliminary code to enable field replacements.

asfimport commented 11 years ago

Sivan Yogev (migrated from JIRA)

New patch with additional testing and bug fixes.

Currently the term statistics do not take field replacements into account, and therefore term counts are wrong and CheckIndex fails.

I can think of two possible solutions for this. The first is for CheckIndex to identify updated segments and ignore term statistics - is there a similar mechanism for deletions?

The other solution is to pre-compute term statistics for updated segments. However, this will be costly - it requires going through the entire posting list for every term and counting non-replaced occurrences.

Any suggestions?

asfimport commented 11 years ago

Sivan Yogev (migrated from JIRA)

New patch. This patch contains some failing tests, some probably due to problems in my implementation of SegmentReader.getTermVectors(int), and others in the handling of stored fields.

Can anyone with knowledge of these two areas check what it is that I'm doing wrong? Thanks.

asfimport commented 11 years ago

Sivan Yogev (migrated from JIRA)

Solved the problem with stored fields: I realized that in order for the stored fields reader to have the right number of documents, the last document must have a stored field. Term vectors are still failing...

asfimport commented 11 years ago

Erick Erickson (@ErickErickson) (migrated from JIRA)

Guys:

It's great that you're tackling this. I wanted to encourage you to be patient about responses; there are just a few people in the universe who understand the details well enough to comment on the code (I'm sure not one of them!), so it might feel like you're shouting down a well....

asfimport commented 11 years ago

Sivan Yogev (migrated from JIRA)

Patch with stored fields bug fixed.

asfimport commented 11 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Trying to catch up here ... I just have a bunch of random questions (don't fully understand the patch yet):

Not sure why some files show as all deleted / all added lines, eg at least FrozenBufferedDeletes.java.

Patch also has tabs, which should be spaces... (eg IndexWriter.java).

Why do we have FieldsUpdate.Operation.ADD_DOCUMENT? It seems weird to pass that to IW.updateFields? Shouldn't apps just use IW.addDocument?

Why do we need SegmentInfoReader.readFilesList? It seems like it's only privately used inside the codec? I'm confused why the "normal" file tracking we have on write is insufficient... oh I see, a single SegmentInfo references all stacked segments too? But since they are written "later" their files won't be automatically tracked ... ok. I wonder if each stacked segment should get its own SegmentInfo, linked to the base segment...

It looks like merge policies don't yet know about / target stacked segments ...

It seems like we don't invert the document updates until the updates are applied? Ie, we just buffer the IndexableField provided by the user, and when it's time to apply updates, we then analyze/invert? How do we track RAM in this case? (Eg the field could be something arbitrary, eg pre-tokenized.) Another option is to do the inversion and buffer the resulting postings, and then later "replay" them (remapping docIDs) when it's time to apply.

Why does StoredFieldsReader.visitDocument need a Set for ignored fields?

asfimport commented 11 years ago

Sivan Yogev (migrated from JIRA)

Erick - thanks for the support!

Mike - thanks for the comments, here's an attempt to supply answers:

Regarding formatting and all deletes - I will check and try to fix those.

Why do we have FieldsUpdate.Operation.ADD_DOCUMENT? It seems weird to pass that to IW.updateFields? Shouldn't apps just use IW.addDocument?

We have ADD and REPLACE for FIELDS, and also REPLACE_DOCUMENTS, so having ADD_DOCUMENT would allow applications to work only with updateFields. There certainly are actions that can be performed in more than one way in this API; do you find this too confusing?

Why do we need SegmentInfoReader.readFilesList? ...

I considered the alternative you propose of having a SegmentInfo for each stacked segment, and it seemed more complex to manage than what is done with .del files, so I chose the .del-files approach. You are right about its privacy; I removed it from SegmentInfoReader and the actual readers have it privately.

It looks like merge policies don't yet know about / target stacked segments ...

I was planning to have it in another issue. Should I create it already?

It seems like we don't invert the document updates until the updates are applied? ...

I went for the simple solution, trying to introduce as few new concepts as possible (and still the patch size is >7000 lines). Your proposal should certainly be considered and maybe tested. I need to make sure I do the RAM calculations right; the added documents must be reflected in the RAM consumption of the deletions queue.

Why does StoredFieldsReader.visitDocument need a Set for ignored fields?

When fetching stored fields from a segment with replacements, it is possible that all the contents of a certain field in the base and first n stacked segments should be ignored. Therefore, the implementation starts the visiting from the most recent updates. If we encounter a field replacement at some stage, that field name is added to the Set of ignored fields, and the content of that field in the stacked segments we encounter later (which are older updates) is ignored.
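
For illustration, a sketch of that newest-to-oldest walk under simplified stand-in types (not the actual classes from the patch):

```java
import java.util.HashSet;
import java.util.Set;

/** Sketch of the newest-to-oldest stored-fields walk described above; the
 *  types are simplified stand-ins, not the actual classes from the patch. */
class StackedStoredFieldsSketch {
  interface StoredFieldVisitor {
    void stringField(String name, String value);
  }
  interface Layer {
    /** Visits doc's stored fields, skipping any field named in ignore; after
     *  visiting, adds to ignore the fields this layer REPLACED for doc. */
    void visitDocument(int doc, Set<String> ignore, StoredFieldVisitor visitor);
  }

  /** layers[0] is the newest stacked segment; the last entry is the base segment. */
  static void visitDocument(int doc, Layer[] newestToOldest, StoredFieldVisitor visitor) {
    Set<String> ignoredFields = new HashSet<>();
    for (Layer layer : newestToOldest) {
      layer.visitDocument(doc, ignoredFields, visitor);
    }
  }
}
```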

asfimport commented 11 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Why do we have FieldsUpdate.Operation.ADD_DOCUMENT? It seems weird to pass that to IW.updateFields? Shouldn't apps just use IW.addDocument?

We have ADD and REPLACE for FIELDS, and also REPLACE_DOCUMENTS, so having ADD_DOCUMENT would allow applications to work only with updateFields. There certainly are actions that can be performed in more than one way in this API; do you find this too confusing?

Well I just generally prefer that there is one [obvious] way to do something ... it can cause confusion otherwise, ie users will wonder what's the difference between addDocument and updateFields(Operation.ADD_DOCUMENT, ...)

Why do we need SegmentInfoReader.readFilesList? ...

I considered the alternative you propose of having a SegmentInfo for each stacked segment, and it seemed more complex to manage than what is done with .del files, so I chose the .del-files approach. You are right about its privacy; I removed it from SegmentInfoReader and the actual readers have it privately.

OK.

It looks like merge policies don't yet know about / target stacked segments ...

I was planning to have it in another issue. Should I create it already?

Another issue is a good idea! No need to create it yet ... but it seems like it will be important for real usage.

Do we have any sense of how performance degrades as the stack gets bigger? It's more on-the-fly merging at search-time...

I'm worried about that search-time merge cost ... I think it's usually better to pay a higher indexing cost in exchange for faster search time, which makes #5341 a compelling alternate approach...

It seems like we don't invert the document updates until the updates are applied? ...

I went for the simple solution, trying to introduce as few new concepts as possible (and still the patch size is >7000 lines). Your proposal should certainly be considered and maybe tested. I need to make sure I do the RAM calculations right; the added documents must be reflected in the RAM consumption of the deletions queue.

OK that makes sense; we should definitely do whatever's easiest/fastest to get to a dirt path.

We should think through the tradeoffs. I think it may confuse apps that the Field is not "consumed" after IW.updateFields returns, but rather cached and processed later. This means you cannot reuse fields, you have to be careful with pre-tokenized fields (can't reuse the TokenStream), etc.

It also means NRT reopen is unexpectedly costly, because only on flush will we invert & index the documents, and it's a single-threaded operation during reopen (vs per-thread if we invert up front).

Still it makes sense to do this for starters ... it's simpler.

Why does StoredFieldsReader.visitDocument need a Set for ignored fields?

When fetching stored fields from a segment with replacements, it is possible that all the contents of a certain field in the base and first n stacked segments should be ignored. Therefore, the implementation starts the visiting from the most recent updates. If we encounter a field replacement at some stage, that field name is added to the Set of ignored fields, and the content of that field in the stacked segments we encounter later (which are older updates) is ignored.

Ahhh right.

Are stored fields now sparse? Meaning if I have a segment w/ many docs, and I update stored fields on one doc, in that tiny stacked segment will the stored fields files also be tiny?

asfimport commented 11 years ago

Sivan Yogev (migrated from JIRA)

Let me start with the last question.

Are stored fields now sparse? Meaning if I have a segment w/ many docs, and I update stored fields on one doc, in that tiny stacked segment will the stored fields files also be tiny?

In such a case you will get the equivalent of a segment with multiple docs where only one of them contains stored fields. I assume the impls of stored fields handle these cases well and you will indeed get tiny stored fields.

Regarding the API - I did some cleanup, and also removed Operation.ADD_DOCUMENT. Now there is only one way to perform each operation, and updateFields only allows you to add or replace fields given a term.

This means you cannot reuse fields, you have to be careful with pre-tokenized fields (can't reuse the TokenStream), etc.

This is mentioned in the Javadoc of updateFields; let me know if there's a better way to address it.

As for the heavier questions: NRT support should be considered separately, but the guideline I followed was to keep things as close as possible to the way deletions are handled. Therefore, we need to add to SegmentReader a field named liveUpdates - an equivalent to liveDocs. I already put a TODO for this (SegmentReader line 131); implementing it won't be simple...

The performance tradeoff you are rightfully concerned about should be handled through merging. Once you merge an updated segment, all updates are "cleaned", and the new segment has no performance issues. Apps that perform updates should make sure (through MergePolicy) to avoid reader-side updates as much as possible.

asfimport commented 11 years ago

Sivan Yogev (migrated from JIRA)

Patch with some additional bug fixes and more elaborate tests, all working. Ready to commit?

asfimport commented 11 years ago

Sivan Yogev (migrated from JIRA)

Some existing tests fail with the latest patch, working on fixes.

asfimport commented 11 years ago

Sivan Yogev (migrated from JIRA)

New patch, concurrency bugs fixed. All tests pass.

asfimport commented 11 years ago

David Smiley (@dsmiley) (migrated from JIRA)

It's exciting to see progress here! I too live in the "world" that Shai speaks of – DOCS_ONLY (w/ no norms). I don't need to update a title field; I need to update ACLs and various categorical "tag" fields that will subsequently influence faceting or filtering.

Hey Rob, early on you made this excellent point:

A second problem (not solved by the above) is that many people are using scoring factors with a variety of signals, and these are changing often. I think unfortunately, people are often putting these in a normal indexed field and uninverting them on the FieldCache, requiring the whole document to be reindexed just because of how they implemented the scoring factor. People could instead solve this by putting their app's primary key into a docvalues field, allowing them to keep these scoring factors completely external to Lucene (e.g. their own array or whatever), indexed by their own primary key. But the problem is, I think, people want Lucene to manage this; they don't want to implement themselves what's necessary to make it consistent with commits etc.

So true. What if Lucene had more hooks to make it easier to manage commit-consistency with side-car data? I have no clue what's needed exactly, only that I don't dare do this without such hooks because I fear the complexity. With hooks and documentation, it can become clear how to maintain data alongside Lucene's index, and this opens doors: like making it easier to store data in something custom (e.g. a DB) instead of stored fields (you won't have to pay needless merge cost), or putting metrics that influence scoring somewhere, as you hinted at above.

asfimport commented 11 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Are stored fields now sparse? Meaning if I have a segment w/ many docs, and I update stored fields on one doc, in that tiny stacked segment will the stored fields files also be tiny?

In such a case you will get the equivalent of a segment with multiple docs where only one of them contains stored fields. I assume the impls of stored fields handle these cases well and you will indeed get tiny stored fields.

You're right, this is up to the codec ... hmm but the API isn't sparse (you have to .addDocument 1M times to "skip over" 1M docs right?), and I'm not sure how well our current default (Lucene41StoredFieldsFormat) handles it. Have you tested it?

Regarding the API - I did some cleanup, and also removed Operation.ADD_DOCUMENT. Now there is only one way to perform each operation, and updateFields only allows you to add or replace fields given a term.

OK thanks!

This means you cannot reuse fields, you have to be careful with pre-tokenized fields (can't reuse the TokenStream), etc.

This is mentioned in the Javadoc of updateFields; let me know if there's a better way to address it.

Maybe also state that one cannot reuse Field instances, since the Field may not be actually "consumed" until some later time (we should be vague since this really is an implementation detail).

As for the heavier questions: NRT support should be considered separately, but the guideline I followed was to keep things as close as possible to the way deletions are handled. Therefore, we need to add to SegmentReader a field named liveUpdates - an equivalent to liveDocs. I already put a TODO for this (SegmentReader line 131); implementing it won't be simple...

OK ... yeah it's not simple!

The performance tradeoff you are rightfully concerned about should be handled through merging. Once you merge an updated segment, all updates are "cleaned", and the new segment has no performance issues. Apps that perform updates should make sure (through MergePolicy) to avoid reader-side updates as much as possible.

Merging is very important. Hmm, are we able to just merge all updates down to a single update? Ie, without merging the base segment? We can't express that today from MergePolicy right? In an NRT setting this seems very important (ie it'd be best bang (= improved search performance) for the buck (= merge cost)).

I suspect we need to do something with merging before committing here.

Hmm, I see that StackedTerms.size()/getSumTotalTermFreq()/getSumDocFreq() pulls a TermsEnum and goes and counts/aggregates all terms ... which in general is horribly costly? EG these methods are called per-query to set up the Sim for scoring ... I think we need another solution here (not sure what). Also getDocCount() just returns -1 now ... maybe we should only allow updates against DOCS_ONLY/omitsNorms fields for now?

Have you done any performance tests on biggish indices?

I think we need a test that indexes a known (randomly generated) set of documents, randomly sometimes using add and sometimes using update/replace field, mixing in deletes (just like TestField.addDocuments()), for the first index, and for the second index only using addDocument on the "surviving" documents, and then we assertIndexEquals(...) in the end? Maybe we can factor out code from TestDuelingCodecs or TestStressIndexing2.

Where do we account for the RAM used by these buffered updates? I see BufferedUpdates.addTerm has some accounting the first time it sees a given term, but where do we actually add in the RAM used by the FieldsUpdate itself?

asfimport commented 11 years ago

Sivan Yogev (migrated from JIRA)

After rethinking the point-of-inversion issue, seems like the right time to do it is ASAP - not to hold the added fields and invert them later, but rather invert them immediately and save their inverted version. 3 reasons for that:

  1. Take out the constraint I inserted into the API, so update fields can be reused and contain a Reader/TokenStream,
  2. NRT support: we cannot search until we invert, and if we invert earlier NRT support will be less complicated, probably some variation on multi-reader to view uncommitted updates,
  3. You are correct that we currently do not account for the RAM usage of the FieldsUpdate, since I thought using RAMUsageEstimator would be too costly. It will probably be more efficient to calculate RAM usage of the inverted fields, maybe even during inversion?

So my question in that regard is: how can I invert a document and hold its inverted form to be used by NRT and later inserted into a stacked segment? Should I create a temporary Directory and invert into it? Is there another way to do this?

Merging is very important. Hmm, are we able to just merge all updates down to a single update? Ie, without merging the base segment? We can't express that today from MergePolicy right? In an NRT setting this seems very important (ie it'd be best bang (= improved search performance) for the buck (= merge cost)).

Shai is helping with the creation of a benchmark to test performance in various scenarios. I will start adding update aspects to the merge policy. I am not sure if merging just the updates of a segment is feasible. In what cases would it be better than collapsing all updates into the base segment?

I think we need a test that indexes a known (randomly generated) set of documents, randomly sometimes using add and sometimes using update/replace field, mixing in deletes (just like TestField.addDocuments()), for the first index, and for the second index only using addDocument on the "surviving" documents, and then we assertIndexEquals(...) in the end? Maybe we can factor out code from TestDuelingCodecs or TestStressIndexing2.

TestFieldReplacements already has a test which randomly adds documents, replaces documents, adds fields and replaces fields. I refactored it to enable using a seed, and created a "clean" version with only addDocument(...) calls. However, the FieldInfos of the "clean" version do not include things that the "full" version includes, because in the full version fields possessing certain traits were added and then deleted. I will look at the other suggestions.

asfimport commented 11 years ago

Shai Erera (@shaie) (migrated from JIRA)

I am not sure if merging just the updates of a segment is feasible. In what cases would it be better than collapsing all updates into the base segment?

Just like expungeDeletes, I think we should have a collapseFieldUpdates() which can be called explicitly by the app, but IW should also call MP.findSegmentsForFieldUpdates() (or some such name). And it should collapse all updates into the segment, which implies rewriting that segment. If we collapse all updates but keep the base segment + a single stacked segment, I don't think that we're gaining much. The purpose is to get rid of the updates entirely.
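
A rough sketch of the shape such a hook could take — both names are placeholders ("or some such name"), not an actual Lucene API:

```java
import java.util.List;

/** Hypothetical shape of the hooks described above; names are illustrative. */
abstract class UpdateAwareMergePolicySketch {
  /** Analogous to the expungeDeletes path: given the segments that currently
   *  carry stacked field updates, returns those whose updates should now be
   *  collapsed into their base segment (rewriting that segment entirely). */
  abstract List<String> findSegmentsForFieldUpdates(List<String> segmentsWithUpdates);
}
```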

Also, regarding statistics: I think that as a first step, we should not go out of our way to return correct statistics. Just as the stats today do not account for deleted documents, they need not account for the updates. I realize that it's not the same as deleted documents, but it certainly simplifies matters. Stats will be correct following collapseFieldUpdates or regular segment merges.

As a second step, we can try to return statistics including stacked segments more efficiently. I.e., if a term appears in both the base and stacked segment, we return the stats from base. But if it exists only in the stacked segment, we can return the stats from there? I'm not too worried about the stats though, because that's a temporary thing, which gets fixed once updates are collapsed.

And if the MergePolicy has separate settings for collapsing field updates (I think it should!), then the collapsing could occur more frequently than regular merges (and expunging of deleted documents). Also, it will give apps a way to control how often they want to get accurate statistics.

Can we leave statistics outside the scope of this issue? And for now change CheckIndex to detect that it's a segment with field updates, and therefore check stats from the base segment only? I think it does something like that with deleted documents already, no?

asfimport commented 11 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

After rethinking the point-of-inversion issue, seems like the right time to do it is ASAP - not to hold the added fields and invert them later, but rather invert them immediately and save their inverted version. 3 reasons for that: 1. Take out the constraint I inserted into the API, so update fields can be reused and contain a Reader/TokenStream, 2. NRT support: we cannot search until we invert, and if we invert earlier NRT support will be less complicated, probably some variation on multi-reader to view uncommitted updates, 3. You are correct that we currently do not account for the RAM usage of the FieldsUpdate, since I thought using RAMUsageEstimator would be too costly. It will probably be more efficient to calculate RAM usage of the inverted fields, maybe even during inversion?

+1

I would also add "4. Inversion of updates is single-threaded", ie once we move inversion into .updateFields it will be multi-threaded again.

So my question in that regard is: how can I invert a document and hold its inverted form to be used by NRT and later inserted into a stacked segment? Should I create a temporary Directory and invert into it? Is there another way to do this?

I think we should somehow re-use the existing code that inverts (eg FreqProxTermsWriter)? Ie, invert into an in-RAM segment, with "temporary" docIDs, and then when it's time to apply the updates, you need to rewrite the postings to disk with the re-mapped docIDs.
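
A small sketch of the docID-remapping step of that idea, under the assumption that the temporary-to-real mapping has already been resolved (illustrative only):

```java
import java.util.Arrays;

/** Sketch of the apply-time rewrite described above: updates are inverted up
 *  front into an in-RAM segment using temporary docIDs, and only when the
 *  updates are applied are the postings rewritten with the base segment's
 *  real docIDs (illustrative; real code would stream postings, not use arrays). */
class DocIdRemapSketch {
  static int[] remap(int[] tempDocIds, int[] tempToRealDocId) {
    int[] real = new int[tempDocIds.length];
    for (int i = 0; i < tempDocIds.length; i++) {
      // tempToRealDocId is resolved at apply time, e.g. by looking up the
      // update's constraint term in the base segment.
      real[i] = tempToRealDocId[tempDocIds[i]];
    }
    Arrays.sort(real); // postings must be written in increasing docID order
    return real;
  }
}
```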

I wouldn't do anything special for NRT for starters, meaning, from NRT's standpoint, it opens these stacked segments from disk as it would if a new non-NRT reader was being opened. So I would leave that TODO in SegmentReader as a TODO for now :) Later, we can optimize this and have updates carry in RAM like we do for deletes, but I wouldn't start with that ...

Merging is very important. Hmm, are we able to just merge all updates down to a single update? Ie, without merging the base segment? We can't express that today from MergePolicy right? In an NRT setting this seems very important (ie it'd be best bang (= improved search performance) for the buck (= merge cost)).

Shai is helping with the creation of a benchmark to test performance in various scenarios. I will start adding update aspects to the merge policy. I am not sure if merging just the updates of a segment is feasible. In what cases would it be better than collapsing all updates into the base segment?

Imagine a huge segment that's accumulating updates ... say it has 20 stacked segments. First off, those stacked segments are each tying up N file descriptors on open, right? (Well, only one if it's CFS.) But second off, I would expect search perf with 1 base + 20 stacked to be worse than with 1 base + 1 stacked? We need to test if that's true ... it's likely that most of the perf loss comes from going from no stacked segments to 1 stacked segment ... and then going from 1 to 20 stacked segments doesn't hurt "that much". We have to test and see.

Simply merging that big base segment with its 20 stacked segments is going to be too costly to do very often.

I think we need a test that indexes a known (randomly generated) set of documents, randomly sometimes using add and sometimes using update/replace field, mixing in deletes (just like TestField.addDocuments()), for the first index, and for the second index only using addDocument on the "surviving" documents, and then we assertIndexEquals(...) in the end? Maybe we can factor out code from TestDuelingCodecs or TestStressIndexing2.

TestFieldReplacements already has a test which randomly adds documents, replaces documents, adds fields and replaces fields. I refactored it to enable using a seed, and created a "clean" version with only addDocument(...) calls. However, the FieldInfos of the "clean" version do not include things that the "full" version includes, because in the full version fields possessing certain traits were added and then deleted. I will look at the other suggestions.

It should be fine if the FieldInfos don't match? Ie, when comparing the two indices we should not compare field numbers. We should be comparing only external things like fieldName, which ids we had indexed, etc.

asfimport commented 11 years ago

Shai Erera (@shaie) (migrated from JIRA)

Woops, resolved wrong issue :)

asfimport commented 11 years ago

Shai Erera (@shaie) (migrated from JIRA)

I branched https://svn.apache.org/repos/asf/lucene/dev/branches/lucene4258. The patch is really immense and I think it'll be easier to work on this feature in a separate branch. Committed rev 1425441.

asfimport commented 11 years ago

Sivan Yogev (migrated from JIRA)

Started switching to the invert-first approach following Mike's advice. My thought was to have a single directory for each fields update, and when flushing do something similar to IndexWriter.addIndexes(IndexReader...) to build the stacked segment. However, I encountered two problems with this approach:

  1. If a certain document is updated more than once in a certain generation, the two inverted documents should be merged into one,
  2. an extension of 1, where a field added in the first update is to be replaced in the second one.

What I will try to do in such cases is to move the later updates to a new update generation. This will increase the number of generations, but I think it's a fair price to pay in light of the benefits offered by the invert-first approach.

asfimport commented 11 years ago

Sivan Yogev (migrated from JIRA)

New patch over the issue branch. Updated fields are now inverted directly into a RAMDirectory when added, and updated segments are created by merging such directories. If more than one update is to be applied to the same document, the later update is moved to another updated segment.

Still missing:

  1. Implement RAM usage computation for updates,
  2. fix TestFieldReplacements.testIndexEquality().

asfimport commented 11 years ago

Shai Erera (@shaie) (migrated from JIRA)

I upgraded the branch to trunk (w/o the latest patch). TestFieldUpdate fails on a norms assertion. I'm not sure exactly what happened, but SegmentReader had many strange conflicts, so I hope I didn't screw something up when resolving them. Will try to get to the bottom of it later.

asfimport commented 11 years ago

Shai Erera (@shaie) (migrated from JIRA)

Sivan, I tried to apply the patch to the branch (after the upgrade to trunk) but it does not apply. Some files are missing, some have issues. Can you perhaps bring this patch up to the current branch's version and post another one?

asfimport commented 11 years ago

Sivan Yogev (migrated from JIRA)

Thanks Shai for trying to apply the patch; it will take me some time to make adjustments to the changes in trunk, including some fixes to the branch.

asfimport commented 11 years ago

Sivan Yogev (migrated from JIRA)

Updated patch over the branch after merging with trunk. New tests, still with bugs in the handling of replacements.

asfimport commented 11 years ago

Shai Erera (@shaie) (migrated from JIRA)

Committed the latest patch to the branch and upgraded branch to latest trunk.

asfimport commented 11 years ago

Sivan Yogev (migrated from JIRA)

Added tests and fixed bugs. The most thorough test is testIndexEquality(), which runs a complicated scenario of adding and replacing documents and fields and compares the resulting index to an index created using only addDocument. The IndexWriterConfig used can be created either with new IndexWriterConfig() or with newIndexWriterConfig(), which introduces randomness. When there is no randomness the test passes; with randomness and -Dtests.seed=FFC28997A6951FFB it fails.