Numeric DocValues Updates [LUCENE-5189]

asfimport commented 11 years ago

In #5328 we started to work on incremental field updates, however the amount of changes are immense and hard to follow/consume. The reason is that we targeted postings, stored fields, DV etc., all from the get go.

I'd like to start afresh here, with numeric-dv-field updates only. There are a couple of reasons to that:

NumericDV fields should be easier to update, if e.g. we write all the values of all the documents in a segment for the updated field (similar to how livedocs work, and previously norms).
It's a fairly contained issue, attempting to handle just one data type to update, yet requires many changes to core code which will also be useful for updating other data types.
It has value in and on itself, and we don't need to allow updating all the data types in Lucene at once ... we can do that gradually.

I have some working patch already which I'll upload next, explaining the changes.

Migrated from LUCENE-5189 by Shai Erera (@shaie), 8 votes, resolved Nov 02 2013 Attachments: LUCENE-5189_process_events.patch (versions: 2), LUCENE-5189.patch (versions: 11), LUCENE-5189-4x.patch (versions: 2), LUCENE-5189-no-lost-updates.patch, LUCENE-5189-renames.patch, LUCENE-5189-segdv.patch, LUCENE-5189-updates-order.patch (versions: 2)

asfimport commented 11 years ago

Shai Erera (@shaie) (migrated from JIRA)

Patch adds numeric-dv field updates capabilities:

IndexWriter.updateNumericDocValue(term, field, value) updates the value of 'field' of all documents associated with 'term' to the new 'value'
When you update the value of field 'f' of few documents, a new pair of .dvd/.dvm files are created, with the values of all documents for that field.
- That way you can end up with e.g. _0_Lucene45_0_1.dvd and *.dvm for field 'f' and the _0.cfs for other fields which were not updated.
- SegmentInfoPerCommit tracks for each field in which 'gen' it's recorded, and SegmentCoreReaders uses that map to read the values of the field from the respective gen.
TestNumericDocValuesUpdates contains a dozen or so testcases which cover different angles, from simple updates, to unsetting values, merging segments, deletes etc. During development I ran into many interesting scenarios :).
ReaderAndLiveDocs.writeLiveDocs applies in addition to the deletes, the field updates too. BufferedDeletes tracks the updates, similar to how it tracks deletes.
SegmentCoreReaders no longer has a single DVConsumer it uses, but rather per field it uses the appropriate DVConsumer (depends on the 'gen').
- I put a nocommit to remove DVConsumers from SegCoreReaders into a RefCount'd object in SegmentReader so that we can keep SegCoreReaders manage the 'readers' that are shared between all SegReaders, and also make sure to reuse same DVConsumer by multiple SegReaders. I'll do that later.
Segment merging is supported in that when a segment with updates is merged, the correct values are written to the merged segment and the resulting segment has no 'gen' .dvd.
I put a nocommit in DVFormat.fieldsConsumer/Producer by adding another variant which takes fieldInfosGen. The default impl throws UnsupportedOpException, while Lucene45 implements it.
- I want to have only one variant of that method, thereby breaking the API. This is important IMO cause we need to ensure that whatever custom DVFormats out there pay attention to the new fieldInfosGen parameter, or otherwise they might overwrite previously created files.
- There is also a nocommit touching that with a suggestion to forbid createOutput call in TrackingDir if the file is already referenced by an IndexCommit.
- It is important that we break something here so that users/apps pay attention to the new feature – suggestions are welcome!

Few remarks:

For now, only updating by a single term is supported (simplicity).
You cannot add a new field through field update, only update existing fields. This is a 'schema' change, and there are other way to do it, e.g. through addIndexes and FilterAtomicReader. Attempting to support it means that we need to created gen'd FieldInfosFormat also, which complicates matters.
I dropped some nocommits about renaming classes/methods. I didn't want to do it yet, cause it creates an unnecessarily bloated patch. Feel free to comment, we can take care of the renames later.
I will probably create a branch for that feature cause there are some things that need to be take care of (add some tests, finish Codecs support etc.)
Also, I haven't yet benchmarked the effect of field updates on indexing/search ... I will get to it at some point, but if someone wants to help, I promise not to say no :).

I may have forgot to describe some changes, feel free to ask for clarification!

asfimport commented 11 years ago

Shai Erera (@shaie) (migrated from JIRA)

I forgot to mention – this work started from a simple patch Rob sent me few weeks, so thanks Rob! And also, thanks Mike for helping me get around Lucene core code! At times I felt like I'm literally hammering through the code to get the thing working ;).

asfimport commented 11 years ago

Robert Muir (@rmuir) (migrated from JIRA)

That way you can end up with e.g. _0_Lucene45_0_1.dvd and *.dvm for field 'f' ... I put a nocommit in DVFormat.fieldsConsumer/Producer by adding another variant which takes fieldInfosGen. ... I want to have only one variant of that method, thereby breaking the API. This is important IMO cause we need to ensure that whatever custom DVFormats out there pay attention to the new fieldInfosGen parameter, or otherwise they might overwrite previously created files.

Maybe we shouldnt pass this parameter to the codec at all. Instead IndexWriter can just put this into the segment suffix and the codec can be blissfully unaware?

SegmentCoreReaders no longer has a single DVConsumer it uses, but rather per field it uses the appropriate DVConsumer (depends on the 'gen').

I put a nocommit to remove DVConsumers from SegCoreReaders into a RefCount'd object in SegmentReader so that we can keep SegCoreReaders manage the 'readers' that are shared between all SegReaders, and also make sure to reuse same DVConsumer by multiple SegReaders. I'll do that later.

I hope we can do this in a cleaner way than 3.x did it for setNorm, that was really crazy :)

I put a nocommit in DVFormat.fieldsConsumer/Producer by adding another variant which takes fieldInfosGen.

Can we refer to this consistently as docValuesGen or something else (I see the patch already does this in some places, but other places its called fieldInfosGen). I dont think this should ever be referred to as fieldInfosGen because, its not a generation for the fieldinfos data, and that would be horribly scary!

asfimport commented 11 years ago

Shai Erera (@shaie) (migrated from JIRA)

Can we refer to this consistently as docValuesGen

Yes, I think that makes sense. At some point I supported this by gen'ing FieldInfos hence the name, but things have changed since. I'll rename.

Maybe we shouldnt pass this parameter to the codec at all. Instead IndexWriter can just put this into the segment suffix and the codec can be blissfully unaware?

Maybe for writing it can work, but the producer needs to know from which Directory to read the file, e.g. if it's CFS, the gen'd files are written outside. I have this code in Lucene45DVProducer:

    final Directory dir;
    if (fieldInfosGen != -1) {
      dir = state.segmentInfo.dir; // gen'd files are written outside CFS, so use SegInfo directory
    } else {
      dir = state.directory;
    }

I think that if we want to mask that away from the Codec entirely, we should somehow tell the Codec the segmentSuffix and the Directory from which to read the file. Would another Directory parameter be confusing (since we also have it in SegReadState)?

I hope we can do this in a cleaner way than 3.x did it for setNorm, that was really crazy

Well ... I don't really know how setNorm worked in 3.x, so I'll do what I think and you tell me if it's crazy or not? :)

asfimport commented 11 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Maybe for writing it can work, but the producer needs to know from which Directory to read the file, e.g. if it's CFS, the gen'd files are written outside.

Wait... we shouldnt design around this bug though (and it is an api bug). this problem you point out is definitely existing bogusness: I think we should fix this instead so a codec gets a single "directory" and doesnt need to know or care what its impl is, and whether its got TrackingDirectoryWrapper or CFS or whatever around it.

asfimport commented 11 years ago

Robert Muir (@rmuir) (migrated from JIRA)

by the way if we want to fix the state.directory vs state.segmentInfo.dir i would love to help with this, its bothered me forever.

asfimport commented 11 years ago

Shai Erera (@shaie) (migrated from JIRA)

Rename fieldInfosGen to docValuesGen. I think I got all places fixed.

asfimport commented 11 years ago

Shai Erera (@shaie) (migrated from JIRA)

The problem is that there are two Directories and the logic of where the file is read from depends if it's gen'd or not (so far it has been only livedocs). Maybe what we can do is have CFS revert to directory.openInput if file does not exist? We can do that in a separate issue? If we fix that, then I think we might really be able to "hide" the gen from the Codec cleanly. Actually, if the fix is that simple (CFD.openInput reverting to dir.openInput), I can do it as part of this issue, it's small enough?

asfimport commented 11 years ago

Robert Muir (@rmuir) (migrated from JIRA)

I think its true we can tackle this in a separate issue: but I think i'd rather have SR/SCR just pass the correct directory always in the segmentreadstate/segmentwritestate to the different codec components (e.g. segmentreadstate.dir is always the 'correct' directory the codec component should use, and even when CFS is enabled, livedocsformat always gets the inner one and so on).

Its ok if we want to have the 'inner dir' accessible in SegmentInfo for SR/SCR to do this: like we could make it package private and everything would then just work?

This would greatly reduce the possibility of mistakes. I think having CFS fall back on its inner directory on FileNotFoundException is less desirable.

asfimport commented 11 years ago

Shai Erera (@shaie) (migrated from JIRA)

Patch addresses Rob's idea:

ReaderAndLiveDocs and SegCoreReaders set segmentSuffix to docValuesGen and also set SegReadState.directory accordingly (CFS or si.info.dir if dvGen != -1).
All the changes to DVFormat were removed (including the 45Producer/Consumer). I had to fix a bug in PerFieldDVF which ignore state.segmentSuffix (and also resolved a TODO on the way, since it now respects it).
Removed the nocommit in ReaderAndLiveDocs regarding letting TrackingDirWrapper forbid createOutput on a file which is referenced by a commit, since now Codecs are not aware of dvGen at all. As long as they don't ignore segmentSuffix (which they better, otherwise they're broken), they can be upgraded safely to support DVUpdate. We can still do that though under a separate issue, as another safety mechanism.

I wanted to get rid of the nocommit in TestNumericDocValuesUpdates which sets the default Codec to Lucene45 since now presumably all Codecs should support dv-update. But when the test runs with Lucene40 (I haven't tried other codecs, it's the first one that failed), I hit an exception as if trying to write to the same CFS file. Looking at Lucene40DVF.fieldsProducer, I see that it defaults to CFS extension and also Lucene40DVWriter uses hard-coded segmentSuffix="dv" and ignore state.segmentSuffix. I guess that the actual Codec that was used is Lucene40RWDocValuesFormat, otherwise fieldsProducer should have hit an exception. I didn't know our tests pick "old" codecs at random too :). How can I avoid picking the "old" Codecs (40, 42)? I still want to test other codecs, such as Asserting, maybe MemoryDVF (if it's chosen at random).

asfimport commented 11 years ago

Robert Muir (@rmuir) (migrated from JIRA)

You can add \@SuppressCodecs({"Lucene40", "SomethingElse", ...}) annotation to the top of your test for this.

asfimport commented 11 years ago

Shai Erera (@shaie) (migrated from JIRA)

Thanks Rob. I forgot about SuppressCodecs :). I guess I was confused by why Lucene40 was picked in the first place, as I thought we don't test writing indexes with old Codecs?

asfimport commented 11 years ago

Robert Muir (@rmuir) (migrated from JIRA)

In the case of old codecs: what we do is pretty tricky for testing:

we make them read-only officially for the user (so that new segments are written in the latest format, but old segments can still be read).
this has the additional caveat they are not "purely" read-only, because actually we allow liveDocs updates (deletes) against the old formats. so they are "mostly read-only".
tests have read-write versions (like in branch4x: PreFlexRWCodec). These allow in tests for us to override the read-only-ness, and write like the old formats did and read them in transparently in tests.
Of course they cannot support the newest features with this "impersonator" testing we do, but in general we get a lot more test coverage than if we relied solely upon TestBackwardsCompatibility.

asfimport commented 11 years ago

Shai Erera (@shaie) (migrated from JIRA)

Patch adds some nocommits and tests that expose some problems:

Problem 1 If you run the test with -Dtests.method=testSegmentMerges -Dtests.seed=7651E2AEEBC55BDF, you'll hit an exception:

NOTE: reproduce with: ant test  -Dtestcase=TestNumericDocValuesUpdates -Dtests.method=testSegmentMerges -Dtests.seed=7651E2AEEBC55BDF -Dtests.locale=en_AU -Dtests.timezone=Etc/GMT+11 -Dtests.file.encoding=UTF-8
Aug 29, 2013 11:57:35 AM com.carrotsearch.randomizedtesting.ThreadLeakControl checkThreadLeaks
WARNING: Will linger awaiting termination of 1 leaked thread(s).
Aug 29, 2013 11:57:35 AM com.carrotsearch.randomizedtesting.RandomizedRunner$QueueUncaughtExceptionsHandler uncaughtException
WARNING: Uncaught exception in thread: Thread[Lucene Merge Thread #0,6,TGRP-TestNumericDocValuesUpdates]
org.apache.lucene.index.MergePolicy$MergeException: java.lang.AssertionError: formatName=Lucene45 prevValue=Memory
    at __randomizedtesting.SeedInfo.seed([7651E2AEEBC55BDF]:0)
    at org.apache.lucene.index.ConcurrentMergeScheduler.handleMergeException(ConcurrentMergeScheduler.java:545)
    at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:518)
Caused by: java.lang.AssertionError: formatName=Lucene45 prevValue=Memory
    at org.apache.lucene.codecs.perfield.PerFieldDocValuesFormat$FieldsWriter.getInstance(PerFieldDocValuesFormat.java:133)
    at org.apache.lucene.codecs.perfield.PerFieldDocValuesFormat$FieldsWriter.addNumericField(PerFieldDocValuesFormat.java:105)
    at org.apache.lucene.index.ReadersAndLiveDocs.writeLiveDocs(ReadersAndLiveDocs.java:389)
    at org.apache.lucene.index.ReadersAndLiveDocs.getReader(ReadersAndLiveDocs.java:178)
    at org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:3732)
    at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:3401)
    at org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:405)
    at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:482)

What happens is the test uses RandomCodec and picks MemoryDVF for writing that field. Later, when ReaderAndLiveDocs applies updates to that field, it uses SI.codec, which is not RandomCodec anymore, but Lucene45Codec (or in this case Facet45Codec - based on Codec.forName("Lucene45")), and its DVF returns for that field Lucene45DVF, because Lucene45Codec always returns that. The way it works during search is that PerFieldDVF.FieldsReader does not rely on the Codec at all, but rather looks up an attribute in FieldInfo which tells it the DVFormat.name and then it calls DVF.forName. But for writing, it relies on the Codec.

I am not sure how to resolve this. I don't think ReaderAndLiveDocs is doing anything wrong – per-field is not exposed on Codec API, therefore it shouldn't assume it should do any per-field stuff. But on the other hand, Lucene45Codec instances return per-field DVF based on what the instance says, and don't look at the FieldInfo attributes, as PerFieldDVF.FieldsReader does. Any ideas?

Problem 2 Robert thought of this usecase: if you have a sparse DocValue field 'f', such that say in segment 1 only doc1 has a value, but in segment 2 none of the documents have values, you cannot really update documents in segment 2, because the FieldInfos for that segment won't list the field as having DocValues at all. For now, I catch that case in ReaderAndLiveDocs and throw an exception. The workaround is to make sure you always have values for a field in a segment, by e.g. always setting some default value. But this is ugly and exposes internal stuff (e.g. segments) to users. Also, it's bad because e.g. if segments 1+2 are merged, you suddenly can update documents that were in segment2 before.

A way to solve it is to gen FieldInfos as well. That will allow us to additionally support adding new fields through field updates, though that's optional and we can still choose to forbid it. If we gen FieldInfos though, the changes I've done to SegmentInfos (recording per-field dvGen) need to be reverted. So it's important that we come to a resolution about this in this issue. This is somewhat of a corner case (sparse fields), but I don't like the fact that users can trip on exceptions that depend whether or not the segment was merged...

Problem 3 FieldInfos.Builder neglect to update globalFieldNumbers.docValuesType map, if it updates a FieldInfo's DocValueType. It's an easy fix, and I added a test to numeric updates. If someone has an idea how to reproduce this outside of numeric updates scope, I'll be happy handle this in a separate issue. The problem is that if you add two fields to a document with same name, one as a posting and one as DV, globalFieldNumbers does not record that field name in its docValuesType map. This map however is currently used by IW.updateNumericDVField and by an assert in FieldInfos.Builder, therefore I wasn's able to reproduce it outside of numeric updates scope.

asfimport commented 11 years ago

Shai Erera (@shaie) (migrated from JIRA)

Regarding problem 1, I don't know if it's a valid solution, but maybe if we recorded a per-field-format map for each SegInfo, Lucene45Codec could initialize its dvFormat accordingly? This is not generic though .. it's like we need to have a Codec.serialize() method which dumps stuff to SegInfo (or returns a BytesRef/String from which it can later initialize itself). We'd then not need the attributes on FieldInfo. We have to somehow employ the same logic as we do in PerFieldDVF.FieldsReader, in PerFieldDVF.FieldsWriter for updating existing segments. Whatever solution we'll do here, will help us when we come to implement field updates for postings.

asfimport commented 11 years ago

Shai Erera (@shaie) (migrated from JIRA)

BTW, this may generally not be a bad idea, to let the Codec serialize some stuff which is later given to it in Codec.init(BytesRef). E.g. if a Codec is initialized with some parameters that are also important during search (e.g FacetsCodec can be initialized with FacetIndexingParams, which get lost during search because the Codec is initialized with default ctor), this could be a way for it to serialize/deserialize itself. The name will be used for the newInstance(), the rest to initialize the Codec.

asfimport commented 11 years ago

Shai Erera (@shaie) (migrated from JIRA)

Regarding problem 3, Mike helped me construct a simple test which reproduces the bug - I opened #6256 to fix.

asfimport commented 11 years ago

Shai Erera (@shaie) (migrated from JIRA)

Regarding problem 1, I hardwired the test to use Lucene45Codec for now so that I'm not blocked. I thought about Codec.serlize/attributes and now I realize it's not a good idea since those attributes must be recorded per-segment, yet the Codec is single-instance for all segments. We can however record these in SegmentInfo.attributes(). The documentation suggests this is where the Codec should record stuff per-segment. Would it work if PerFieldDVF recorded the per-field-format in SegWriteStage.si.attributes() and read them later, instead of FieldInfo.attributes?

asfimport commented 11 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

Hi, I had an idea yesterday when thinking about this. Currently (like for deletes) we can update DocValues based on an ID term (by docid is not easily possible with IndexWriter). As the ID term can be anything, you could also use some (group) key that updates lots of documents (like you can delete all documents with a specific term). The current code updates the given field for all those documents to a fixed value. My two ideas are:

we could also support update by query (means like for deletes you provide a query that selects the documents to update)
we could make "modifications" possible: Instead of giving a value that is set for all selected documents, we could provide a "callback" interface that is used to modify the current docvalue (numeric or String) of the document to update and returns a changed value. This would be a one-method interface, so it could be used as closure in Java 8, like writer.updateDocValues(term, value -> value+1); (in Java 6/7 this would be writer.updateDocValues(term, new NumericDocValuesUpdater() \{ public long update(long value) \{ return value+1; \}\});). Servers like Solr or ElasticSearch could implement this interface/closure using e.g. javascript, so one could execute a docvalues update and pass a javascript function applied to every value. We just have to think about concurency: What happens if 2 threads are updating the same value at the same time - maybe this is already handled by the BufferedDeletesQueue!?

I just wanted to write this down in this issue, so we could think about allowing to implement this. Of course the current patch is more important to get the whole game running! The updateable by term/query is just one thing which is often requested by users. The typical example is a webapp where you can vote for a document. In that case one would execute the closure value -> value+1. If we implement this so low level, the whole concurreny should be easier than how it is currently impelemented e.g. in Solr or ES.

asfimport commented 11 years ago

Shai Erera (@shaie) (migrated from JIRA)

I definitely want to add update by query, but in a separate issue. And the callback idea is interesting. This callback would need to also get the docid I guess (it's missing in your API example)?

asfimport commented 11 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

This callback would need to also get the docid I guess (it's missing in your API example)?

Of course we could add this. Java 8 would also support this cool syntax, something like: writer.updateDocValues(term, (docid, value) -> value+1);

The Java 8 example here was just syntactic sugar: For all this its only important that it is an interface with only one method that gets as many parameters as needed and returns one value. We automatically get the cool java 8 syntax for users, if we design the callback interface to these guidelines. One common example is the "Comparator" interface in Java. Every Comparator<T> can be written in this cool syntax :-)

asfimport commented 11 years ago

Shai Erera (@shaie) (migrated from JIRA)

Patch adds testStressMultiThreading and testUpdateOldSegments to ensure updating segments written with older format than "Lucene45" is not supported.

asfimport commented 11 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Patch looks great! I review some of the remaining nocommits:

// nocommit no one calls this method, why do we have it? and if we need it, do we need one for docValuesGen too? public void setDelGen(long delGen) {

Nuke it! We only use .advanceNextWriteDelGen (and the patch adds this for DVs too).

// nocommit no one calls this, remove? void clearDelGen() {

Nuke it!

class ReadersAndLiveDocs { // nocommit (RENAME) to ReaderAndUpdates?

+1 for ReaderAndUpdates

// nocommit why do we do that, vs relying on TrackingDir.getCreatedFiles(), // like we do for updates?

That's a good question ... I'm not sure. We in fact already use TrackingDirWrapper (in ReadersAndLiveDocs.writeLiveDocs)... so we could in theory record those files in SIPC and remove LiveDocsFormat.files(). Maybe make this a TODO though?

// nocommit: review! final static int BYTES_PER_NUMERIC_UPDATE = BYTES_PER_DEL_TERM + 2*RamUsageEstimator.NUM_BYTES_OBJECT_REF + RamUsageEstimator.NUM_BYTES_INT + RamUsageEstimator.NUM_BYTES_LONG;

I think it makes sense to start from BYTES_PER_DEL_TERM, but then instead of mapping to value Integer we map to value Map<String,NumericUpdate> whose per-Term RAM usage is something like:

  PTR (for LinkedHashMap, since it must link each entry to the next?)

  Map
    HEADER
    PTR (to array)
    3 INT
    1 FLOAT

  for each occupied Entry<String,NumericUpdate>
    PTR (from Map's entries array) * 2 (overhead for load factor)
    HEADER
    PTR * 2 (key, value)

    String key
      HEADER
      INT
      PTR (to char[])

      ARRAY_HEADER + 2 * length-of-string (field)

    NumericUpdate value
      HEADER
      PTR (to Term; ram already accounted for)
      PTR (to String; ram already accounted for)
      PTR (to Long value) + HEADER + 8 (long)    
      INT

The thing is, this is so hairy ... that I think maybe we should instead use RamUsageEstimator to "calibrate" this? Ie, make a standalone test that keeps adding Term + fields into this structure and measures the RAM with RUE? Do this on 32 bit and on 64 bit JVM, and then conditionalize the constants. You'll still need to add in bytes according to field/term lengths...

+public class SegmentInfoPerCommit { // nocommit (RENAME) to SegmentCommit?

Not sure about that rename ... since this class is just the "metadata" about a commit, not an "actual" commit. Maybe SegmentCommitInfo?

asfimport commented 11 years ago

Shai Erera (@shaie) (migrated from JIRA)

Patch adds per-field support. I currently do that by adding a boolean 'isFieldUpdate' to SegWriteState which is set to true only by ReaderAndLiveDocs. PerFieldDVF then peeks into that boolean and if it's true, it reads the format name from FieldInfo.attributes() instead of relying on Codec.getPerFieldDVF(). If we'll eventually gen FieldInfos, there won't be a need for this boolean as PerFieldDVF will get that from FI.dvGen.

So far all Codecs work. I had to remove an assert from SimpleText which tested that all fields read from the file are in the state.fieldInfos. But it doesn't use that information, only an assert. And SegCoreReader now passes to each DVProducer only the fields it needs to read.

Added some tests too.

asfimport commented 11 years ago

Shai Erera (@shaie) (migrated from JIRA)

Thanks for the review Mike. I nuked the two unused methodsand I like SegmentCommitInfo, so changed the nocommit text.

I changed the nocomit in SIPC to a TODO. Don't think we need to tackle it in this issue.

I'm working on improving the RAM accounting. I want to add to NumericUpdate.sizeInBytes() and count it per-update that is actually added. Then, because it's a Map<Term,Map<String,NumericUpdate>> and the Term and String are both in NumericUdpate already (and will be accounted for in its calculation, only their PTR needs to be taken into account. Also, the sizeInBytes should grow by new entry to outer map only when one is actually added, same for inner map. Therefore I don't think we can have a single constant here, but instead maybe two: for every Entry<Term,Map> added to the outer map and every Entry<String,NumericUpdate> added to the inner map. I think, because I need to compute the shallow sizes only (since Term and String are accounted for in NumericUpdate), it's a single constant for Entry<Object,Object>?

asfimport commented 11 years ago

Shai Erera (@shaie) (migrated from JIRA)

Patch improves RAM accounting in BufferedDeletes and FrozenBD. I added NumericUpdate.sizeInBytes() so most of the logic is done there. BD adds two constants - for adding to outer Map (includes inner map OBJ_HEADER) and adding an actual update to inner map (excludes map's OBJ_HEADER). Only the pointers are taken into account.

asfimport commented 11 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Patch adds per-field support. I currently do that by adding a boolean 'isFieldUpdate' to SegWriteState which is set to true only by ReaderAndLiveDocs. PerFieldDVF then peeks into that boolean and if it's true, it reads the format name from FieldInfo.attributes() instead of relying on Codec.getPerFieldDVF(). If we'll eventually gen FieldInfos, there won't be a need for this boolean as PerFieldDVF will get that from FI.dvGen.

We can't move forward really with this boolean: it only attacks the symptom (puts a HACK in per-field) without fixing the disease (the codec API).

In general if a codec needs to write to and read from FieldInfos/SegmentInfos.attributes, it doesnt work here: this api needs to be fixed.

asfimport commented 11 years ago

Shai Erera (@shaie) (migrated from JIRA)

I don't understand the problem that you raise. Until then, I think that SWS.isFieldUpdate is fine. It works, it's simple, and most importantly, it allows me to move forward. Let's discuss how to improve it even further, but I don't think this is a blocker. We can always improve that later on.

asfimport commented 11 years ago

Robert Muir (@rmuir) (migrated from JIRA)

It really doesn't work: its definitely a blocker for me!

This leaves the general api (FieldInfo.attributes and SegmentInfo.attributes) broken for codecs, and only hacks a specific implementation that uses them.

With or without the current boolean, if a numeric docvalues impl puts something in FieldInfo.attributes during an update, it will go into a black hole, because FieldInfos is write-once per-segment (and not per-commit). Same goes with SegmentInfo.attributes.

asfimport commented 11 years ago

Robert Muir (@rmuir) (migrated from JIRA)

By the way: the "general" issue is that for updates, its unfortunately not enough to concern ourselves with data, we have to worry about metadata too:

I see at least 4 problems (and i have not thought about it completely):

FieldInfo.attributes: these "writes" by the NumericDocValues impl will be completely discarded during update, because its per-segment, not per-commit.
SegmentInfo.attributes: same as the above
Field doesnt exist in FieldInfo at all: (because the segment the update applies to happens to have no values for the field)
Field exists in FieldInfo, but is incomplete: (because the segment the update applies to, had say a stored-only or stored+indexed value for the field, but no dv one).

PerFieldDVF is just one implementation that happens to use #1. Fixing it is fixing the symptom, thats why I say we really need to instead fix the disease, or things will get very ugly.

The only reasons you dont see more problems with #1 and #2, is that currently its not used very much (only by PerField and back-compat). If we had more codecs exercising the APIs, you would be seeing these problems already.

A perfectly good solution would be to remove these APIs completely for public use (which would solve #1 and #2). PerField(PF/DVF) could write its own .per file instead. Back compat cruft could then use these now-internal-only-APIs (and it wont matter since they dont support updates), or we could implement their hacks in another way.

But this still leaves issues like #3 and #4.

Adding a boolean 'isFieldUpdate' doesn't really solve anything, and it totally breaks the whole concept of the codec being unaware of updates.

It is the wrong direction.

asfimport commented 11 years ago

Shai Erera (@shaie) (migrated from JIRA)

OK, so now I get your point. The problem is that we pass to Codec FI.attributes with say an attribute 'foo=bar'. The Codec, unaware that this is an update, looks at the given numericFields and decides to encode them using method "bar2", so it encodes into the attributes 'foo=bar2', but those attributes get lost because they're not rewritten to FIS. Do I understand correctly?

Of course, we could say that since the Codec has to peek into SWS.isFieldUpdate, thereby making it updates-aware, it should not encode stuff in a different format, but SWS.isFieldUpdate is not enough to enforce that.

I don't think that gen'ing FIS solves the problem of obtaining the right DVF in the first place. Sure, after we do that, the Codec can put whatever attributes that it wants, they will be recorded in the new FIS.gen.

But maybe we can solve these two problems by gen'ing FIS:

Add FieldInfo.dvGen. The Codec will receive the FieldInfos with their dvGen bumped up.
Codec can choose to look at FI.dvGen and pull the right DVF e.g. like PerField does.
- Or it can choose to completely ignore it, and always write udpates using the new format.
Codec is free to record whatever attributes it wants on this FI. Since we gen FIS, they will be recorded and used by the reader.

What do you think?

asfimport commented 11 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

We could simply document this as a limitation, today? Ie, that if it's an update, the DVFormat cannot use the attributes APIs. This would let us proceed (progress not perfection) and then later, we address it. Ie, I think the added boolean is a fair compromise.

Or, we can pursue gen'ing FIS on this patch, but this is going to add a lot of trickiness/complexity; I think it'd be better to explore it separately.

asfimport commented 11 years ago

Shai Erera (@shaie) (migrated from JIRA)

I think it's important to solve FIS.gen, either on this issue or a separate one, but before 4.5 is out. Because now SegmentInfos records per-field dvGen and if we gen FIS, this will be recorded by a new Lucene45FieldInfosFormat, and SIS will need to record fieldInfosGen. I actually don't mind to do it in this issue. It's work that's needed and affects NDV-updates (e.g. sparse fields which now hit a too late cryptic exception).

But I also don't mind moving forward with SWS.isFieldUpdate and remove it in a follow on issue ... as long as it's done before 4.5.

asfimport commented 11 years ago

Robert Muir (@rmuir) (migrated from JIRA)

But I also don't mind moving forward with SWS.isFieldUpdate and remove it in a follow on issue ... as long as it's done before 4.5.

I don't think that will be an issue at all.

if we want to iterate and leave the codec APIs broken, I won't object: but simple rule.

Trunk only.

We can't do this kind of stuff on the stable branch at all: Things that get backported there need to be "ready to ship".

asfimport commented 11 years ago

Shai Erera (@shaie) (migrated from JIRA)

Just so I understand, if we gen FieldInfos, does that solve the brokenness of the Codec APIs (in addition to the other things that it solves)? If not, in what way are they broken, and is this break a new thing that NDV updates cause/expose, or it's a break that exists in general? Can you list the breaks here (because I think that FIS.gen solves all the points you raised above).

asfimport commented 11 years ago

Robert Muir (@rmuir) (migrated from JIRA)

This would let us proceed (progress not perfection) and then later, we address it. Ie, I think the added boolean is a fair compromise.

Its not a fair compromise at all.

To me, as a search engine library, this is not progress. its going backwards. Yes: I'm looking at it solely from an API perspective. Yes: others look at things from only features/performance perspective and do not seem to care about APIs.

But as a library, the API is all that matters.

So I just want to make it clear: saying "progress not perfection" is not a good excuse for leaving broken APIs about the codebase and shoving in features as fast as possible: its not progress to me so I simply do not see it that way.

Frankly I am tired of hearing this phrase being used in this way, and when I see it in the future, it will encourage me to take a closer inspection of APIs and do pickier reviews.

asfimport commented 11 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Just so I understand, if we gen FieldInfos, does that solve the brokenness of the Codec APIs (in addition to the other things that it solves)? If not, in what way are they broken, and is this break a new thing that NDV updates cause/expose, or it's a break that exists in general? Can you list the breaks here (because I think that FIS.gen solves all the points you raised above).

It does not solve problem #2 (SegmentInfos.attributes). This API should removed, deprecated, made internal-only, or something like that. Another option is to move this stuff into the commit, but that might be overkill: today this stuff is only used as a backwards-compatibility crutch (i think) to read 3.x indexes: so it can possibly be just removed in trunk right now.

Gen'ing FieldInfos brings about its own set of questions as far as when/how/if any new fieldinfo information is merged and when/how its visible to the codec API. its very scary but I don't see any alternative at the moment.

asfimport commented 11 years ago

Shai Erera (@shaie) (migrated from JIRA)

It does not solve problem #2 (SegmentInfos.attributes)

Correct. So this API is broken today for LiveDocsFormat (since it's the only updateable thing), but field updates only broaden the broken-ness into other formats (now only DVF, but in the future others too). Correct?

I think that moving this API into the commit is not an overkill. I remember Mike and I once discussed if we can use that API to save per-segment facets "schema details". I don't remember how this ended, but maybe we shouldn't remove it? Alternatively, we could gen SIFormat too ... that may be an overkill though. Recording per-segment StringStringMap in SIS seems simple enough.

Regarding FIS.gen, I honestly thought to keep it simple by writing all FIS entirely in each gen and not complicate the code by writing parts of an FI in different gens and merging them by SR. This is what I plan to do in this issue.

asfimport commented 11 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Frankly I am tired of hearing this phrase being used in this way

Actually, I think this is a fair use of "progress not perfection". Either that or I don't understand what you're calling "broken APIs" in the current patch.

As I understand it, what's "broken" here is that you cannot set the attributes in SegmentInfo nor FieldInfo from your DocValuesFormat writer when it's an update being written: the changes won't be saved.

So, I proposed that we document this as a limitation of the SI/FI attributes API: when writing updates, any changes will be lost. For "normal" segment flushes, they work correctly. It'd be a documented limitation, and we can later fix it.

I think this situation is very similar to #6261, which I would also call "progress not perfection": we are adding a new API (SegmentReader.ramBytesUsed), with an initial implementation that we think might be improved by later cutting over to RamUsageEstimator. But I think we should commit the initial approach (it's useful, it should work well) and later improve the implementation.

asfimport commented 11 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

One option, to solve the "some segments might be missing the field entirely so you cannot update those" would be to have the FieldInfos accumulate across segments, i.e. a more global FieldInfos, maybe written to a separate global file (not per segment).

This way, if any doc in any segment has added the field, then the global FieldInfos would contain it.

Not saying this is an appealing option (there are tons of tradeoffs), but I think it would address that limitation.

asfimport commented 11 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Actually, that would also solve the other problems as well?

Ie, the global FieldInfos would be gen'd: on commit we'd write a new FIS file, which all segments in that commit point would use.

Any attribute changes to a FieldInfo would be saved, even on update; new fields could be created via update; any segments that have no documents with the field won't be an issue.

asfimport commented 11 years ago

Shai Erera (@shaie) (migrated from JIRA)

I think global FIS is an interesting idea, but per-segment FIS.gen is a lower hanging fruit. I did it once and it was quite straightforward (maybe someone will have reservations on how I did it though):

SIS tracks fieldInfosGen (in this patch, rename all dvGen in SIS to fisGen)
FI tracks dvGen
A new FIS45Format reads/writes each FI's dvGen
ReaderAndLiveDocs writes a new FIS gen, containing the entire FIS, so SR only reads the latest gen to load FIS

I think we should explore global FIS separately, because it brings its own issues, e.g. do we keep FISFormat or nuke it? Who invokes it (probably SIS)? It's also quite orthogonal to that issue, or at least, we can proceed with it and improve FIS gen'ing later with global FIS.

As for SI.attributes(), I think we can move them under SIS. We should open an issue to do that.

asfimport commented 11 years ago

Yonik Seeley (@yonik) (migrated from JIRA)

The problem that Mike highlights "some segments might be missing the field entirely so you cannot update those", is pretty bad though. Things work differently (i.e. your update may fail) depending on exactly how segment flushes and merges are done.

asfimport commented 11 years ago

Shai Erera (@shaie) (migrated from JIRA)

Correct, that's a problem that Rob identified few days ago and it can be solved if we gen FieldInfos, because ReaderAndLiveDocs will detect that case and add a new FieldInfo, as well as create a new gen for this segment's FIS.I have two tests in TestNumericDVUpdates which currently test that this is not supported – once we gen FIS, we'll change them to assert it is supported.

asfimport commented 11 years ago

Shai Erera (@shaie) (migrated from JIRA)

I've been busy hunting down concurrency bugs and making field updates work with index sorting (thought it will be an easy nocommit to handle... boy, I was wrong!). Patch adds field updates to TestSortingMP as well as improves TestNumericDVUpdates.testSegmentMerges to stress that more.

There are some nocommit (RENAME) which I'd love to get some feedback on. Also there are two standing out nocommits around FieldInfos.gen and an optimization to SegmentCoreReaders/SegmentReader around reusing DVProducers of update gens. I think these two can be handled in separate issues, just to get this issue focused.

I'd like to commit this to trunk, so that I can move on to the other issues. Therefore I'd appreciate some review on the patch.

asfimport commented 11 years ago

Shai Erera (@shaie) (migrated from JIRA)

Fixed NRT support – DVProducers moved from SegmentCoreReaders to SegmentReader, where they are being wrapped by RefCount which keeps track of ref counts (new class I added). When an SR is shared, SR inc-refs the DVProducers that it uses (according to SIPC.getDVGen(field)) and dec-ref on doClose().

All Lucene tests pass, including the new numeric DV updates ones. A review would be appreciated though.

The remaining nocommits are for renames and FieldInfos.gen. I think I'll leave the renames as TODOs, to handle prior to merging to 4x (after this one bakes in trunk), that way avoiding potential messy merges. I can handle FieldInfos.gen either as part of this issue, or a separate one. Preferences?

asfimport commented 11 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

New patch looks great! The ref counting is very clean. Maybe add a comment that gen is allowed to be -1, just means the field has no DV updates yet, in SegmentReader when building up the map? And then call doClose() in a finally clause if !success so on exception we don't leak open files.

asfimport commented 11 years ago

Shai Erera (@shaie) (migrated from JIRA)

Thanks Mike. Fixed the two ctors to call doClose if (!success), as well as added a comment about gen=-1.

I also fixed the two tests which ensure we don't allow updating a field in a segment where it doesn't exist (testUpdateSegmentWithPostingButNoDocValues() and testUpdateSegmentWithNoDocValues()) to set NoMergePolicy, otherwise the segments could merge and the update becomes legit. This exposes the weirdness of the exception – you may not hit it if segments are merged. Once we gen FIS, these tests will change though, to ensure these cases ARE allowed :).

asfimport commented 11 years ago

Shai Erera (@shaie) (migrated from JIRA)

Added some javadocs, converted all nocommits to TODOs. I think it's ready for trunk. I'd like to handle FIS.gen next.

asfimport commented 11 years ago

Robert Muir (@rmuir) (migrated from JIRA)

+1 to go to trunk. thanks Shai.

apache / lucene

Numeric DocValues Updates [LUCENE-5189] #6253