apache / lucene

Apache Lucene open-source search software
https://lucene.apache.org/
Apache License 2.0

Refactoring on PostingsWriterBase for delta-encoding [LUCENE-5179] #6243

Closed · asfimport closed this issue 11 years ago

asfimport commented 11 years ago

A further step from #6093.

The short story is that the previous API change introduces two problems:

And the long story...

With the change, the current PostingsBase API will look like this:

So after the API change, the PBF no longer has that annoying 'flushTermBlock'; instead, the term dict maintains the <term, metadata> list.

However, for each term we'll now write the long[] blob before the byte[], so the index format is not consistent with pre-4.5. For example, in Lucene41 the metadata could be written as longA,bytesA,longB, but now we have to write it as longA,longB,bytesA.

Another problem is that the pulsing codec cannot tell the wrapped PBF how the metadata is delta-encoded; after all, PulsingPostingsWriter is only a PBF.

For example, suppose we have terms=["a", "a1", "a2", "b", "b1", "b2"] and itemsInBlock=2, so theoretically we'll end up with three blocks in BTTR: ["a" "b"], ["a1" "a2"], ["b1" "b2"]. With this approach, the metadata of term "b" is delta-encoded based on the metadata of "a", but when the term dict tells the PBF to finishTerm("b"), it might naively delta-encode based on term "a2".
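To make the ordering problem concrete, here is a hedged sketch (plain Java with hypothetical file-pointer values, not Lucene code) of why the delta base matters: the terms dict groups terms into blocks, so "b" must be delta-encoded against its block neighbor "a", not against "a2", the previous term in finishTerm() call order.

```java
import java.util.Arrays;

// Hypothetical demo: delta-encoding metadata within a term block.
public class BlockDeltaDemo {

    // Delta-encode a block of absolute values; the first entry stays absolute.
    public static long[] deltaEncode(long[] absolute) {
        long[] deltas = new long[absolute.length];
        long prev = 0;
        for (int i = 0; i < absolute.length; i++) {
            deltas[i] = absolute[i] - prev;  // delta against previous term in the block
            prev = absolute[i];
        }
        return deltas;
    }

    public static void main(String[] args) {
        // hypothetical postings file pointers, in term (call) order:
        // a=100, a1=140, a2=170, b=260
        // Correct: block ["a", "b"] -> "b" is encoded as 260 - 100 = 160
        System.out.println(Arrays.toString(deltaEncode(new long[] {100, 260}))); // prints [100, 160]
        // Wrong: a PBF that only sees finishTerm() call order would encode "b"
        // against "a2" as 260 - 170 = 90, which the block reader cannot undo.
    }
}
```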

So I think maybe we can introduce a method 'encodeTerm(long[], DataOutput out, FieldInfo, TermState, boolean absolute)', so that during metadata flush we can control how the current term is written. The term dict will buffer TermState, which implicitly holds the metadata, like we do on the PBReader side.

For example, if we want to reproduce the old Lucene41 format, we can simply set longsSize==0; then the PBF writes the old format (longA,bytesA,longB) to DataOutput, and the compatibility issue is solved. For the pulsing codec, it will also be able to tell the lower level how to encode metadata.
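A rough sketch of the proposed contract might look as follows. This is a hedged illustration, not the actual Lucene patch: the method name and the 'absolute' flag follow the discussion above, but TermState here is a simplified stand-in holding only one hypothetical metadata value (a postings-file start pointer).

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;

public class EncodeTermSketch {

    public static class TermState {
        long docStartFP;                 // hypothetical metadata: postings start pointer
        public TermState(long fp) { docStartFP = fp; }
    }

    public static class DemoPostingsWriter {
        private long lastDocStartFP;

        // absolute=true  -> write the raw value and reset the delta base
        //                   (the term dict passes this for each block start);
        // absolute=false -> write a delta against the previous term.
        public long encodeTerm(TermState state, DataOutput out, boolean absolute)
                throws IOException {
            if (absolute) {
                lastDocStartFP = 0;
            }
            long delta = state.docStartFP - lastDocStartFP;
            lastDocStartFP = state.docStartFP;
            out.writeLong(delta);
            return delta;
        }
    }

    public static void main(String[] args) throws IOException {
        DemoPostingsWriter w = new DemoPostingsWriter();
        DataOutputStream out = new DataOutputStream(new ByteArrayOutputStream());
        // block ["a", "b"]: "a" absolute, "b" delta against "a"
        System.out.println(w.encodeTerm(new TermState(100), out, true));   // prints 100
        System.out.println(w.encodeTerm(new TermState(260), out, false));  // prints 160
        // next block ["a1", "a2"]: the term dict passes absolute=true again
        System.out.println(w.encodeTerm(new TermState(140), out, true));   // prints 140
    }
}
```

The key point is that the caller (the term dict), not the PBF, decides when the delta base resets, which is exactly what the call-order problem above requires.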


Migrated from LUCENE-5179 by Han Jiang (@sleepsort), resolved Aug 17 2013 Parent: #4142 Attachments: LUCENE-5179.patch

asfimport commented 11 years ago

Han Jiang (@sleepsort) (migrated from JIRA)

Patch for the lucene3069 branch; tests pass for all 'temp' postings formats.

asfimport commented 11 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

So, the idea with this patch is to go back to letting the PBF encode the metadata for the term? Just one term at a time, not the whole block as we have on trunk today.

And the reason for this is back-compat? Ie, so that in test-framework we can have writers for the old formats?

One thing that this change precludes is having the terms dict use different encodings than simple delta vInt to encode the long[] metadata, e.g. Simple9/16 or something? But that's OK ... we can explore those later.

It's sort of frustrating to have to compromise the design just for back-compat ... e.g. we could instead cheat a bit, and have the writers write the newer format. It's easy to make the readers read either format right?

But ... I don't understand how this change helps Pulsing, or rather why Pulsing would have trouble w/ the API we have today?

asfimport commented 11 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Is it for real back compat or for "impersonation" ?

asfimport commented 11 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

I believe it's for impersonation. Real back-compat (reader can read the old index format using the new APIs) should work fine, I think?

asfimport commented 11 years ago

Robert Muir (@rmuir) (migrated from JIRA)

We have had imperfect impersonation before (for example, PreFlexRWFieldInfosReader).

But the idea was to exercise the code to the best extent possible: e.g. if somehow we can make a Reader in the "RW" package (impersonator) that subclasses the "real" reader and overrides the term metadata piece, at least we are still testing the postings lists, term bytes, and so on.

And the real reader in lucene/core still gets some basic tests from TestBackwardsCompatibility.

asfimport commented 11 years ago

Han Jiang (@sleepsort) (migrated from JIRA)

> Is it for real back compat or for "impersonation" ?

> Real back-compat (reader can read the old index format using the new APIs) should work fine, I think?

Yes, this should be 'impersonation', but actually the back-compat I mentioned is a weak requirement. I'm not happy with this revert either, so let's see if we can do something to hack it! :)

The strong requirement is: if we need pulsing to work with the new API, there should be something to tell pulsing how to encode each term.

Ideally, pulsing should tell the term dict longsSize=0, while maintaining the wrapped PF's longsSize.

The calling chain is:

`termdict --finishTermA(long[0], byte[]...)--> pulsing --finishTermB(long[3], byte[]...)--> wrappedPF`

Take the terms=[ "a" "a1" ... ] example: when term "b" is finished, the wrapped PF will fill the long[] and byte[] with its metadata, and pulsing will instead fill the byte[] with its 'fake' metadata. When a term is not inlined, pulsing will have to encode the wrapped PF's long[] into byte[], but it's too early! Term "b" should be delta-encoded against term "a", and pulsing will never know this...

If we only need pulsing to work, there is a trade-off: pulsing returns the wrapped PF's longsSize, and the term dict does the buffering. But for Lucene41Pulsing with positions+payloads, we'd have to write 3 zero VLongs along with the pulsing byte[] for an inlined term... and it's not actually 'pulsing' then.
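A small hedged demo of why that trade-off hurts (the VLong byte-counting helper here is illustrative, not Lucene's implementation): if pulsing must report the wrapped PF's longsSize of 3, every inlined term pays for 3 metadata longs it never uses.

```java
// Hypothetical demo of the per-term overhead of the trade-off above.
public class PulsingOverheadDemo {

    // Bytes needed to VLong-encode a value (7 payload bits per byte).
    public static int vLongSize(long v) {
        int bytes = 1;
        while ((v >>>= 7) != 0) bytes++;
        return bytes;
    }

    public static void main(String[] args) {
        int longsSize = 3;                 // wrapped Lucene41 PF with positions+payloads
        int overhead = 0;
        for (int i = 0; i < longsSize; i++) {
            overhead += vLongSize(0L);     // an inlined term writes 3 zero VLongs
        }
        System.out.println(overhead + " wasted bytes per inlined term"); // prints "3 wasted bytes per inlined term"
    }
}
```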

asfimport commented 11 years ago

Han Jiang (@sleepsort) (migrated from JIRA)

By the way, Mike, I think this change doesn't preclude the Simple9/16 encoding you mentioned.

You can have a look at the changed TempFSTTermsWriter: here we always pass 'true' to encodeTerm, so the PBF will not do any delta encoding. Instead, the FST takes that responsibility.

When we need to block-encode the long[] for a whole term block, the term dict can simply buffer all the long[] returned by encodeTerm(..., true), then apply the compression algorithm.
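That buffering could be sketched as follows. This is a hypothetical illustration (class and method names are mine, not from the patch): the term dict collects the absolute long[] metadata per term and only decides on a block encoding at flush time.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch (not Lucene code): the term dict, not the PBF,
// owns the block encoding of buffered metadata.
public class TermDictBuffering {
    private final List<long[]> pendingLongs = new ArrayList<>();

    // Called once per finished term with the absolute long[] metadata
    // produced by encodeTerm(..., absolute=true).
    public void bufferTerm(long[] longs) {
        pendingLongs.add(longs.clone());
    }

    // Called at block flush: gather one metadata "column" across all
    // buffered terms; a real implementation could hand this array to
    // Simple9/16 or any other block compressor instead of plain VLong.
    public long[] flushColumn(int column) {
        long[] out = new long[pendingLongs.size()];
        for (int i = 0; i < out.length; i++) {
            out[i] = pendingLongs.get(i)[column];
        }
        return out;
    }

    public static void main(String[] args) {
        TermDictBuffering dict = new TermDictBuffering();
        dict.bufferTerm(new long[] {100});   // term "a"
        dict.bufferTerm(new long[] {140});   // term "a1"
        dict.bufferTerm(new long[] {170});   // term "a2"
        System.out.println(Arrays.toString(dict.flushColumn(0))); // prints [100, 140, 170]
    }
}
```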

Whether to do VLong/delta encoding is still decided by the term dict, not the PBF. 'encodeTerm' only performs the delta encoding, and provides the PBF the chance to know how the current term is flushed relative to other terms.

asfimport commented 11 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

> When we need to block-encode the long[] for a whole term block, the term dict can simply buffer all the long[] returned by encodeTerm(..., true), then apply the compression algorithm.

Oh, that's great!

OK, and I understand how Pulsing makes this tricky (it always does!), so I think the .encodeTerm approach is a good solution? Let's go with that?

I wonder if (later!) we could allow the PBF to determine how many longs are needed on a term by term basis ... e.g. encodeTerm could return an int saying how many longs it used. At decode time the PBF would have to look only at the existing stats for the term (docFreq, tTF) and then tell the terms dict how many longs to decode for that term. But let's not pursue this now ... top priority is getting the branch ready to merge back to trunk!

asfimport commented 11 years ago

Han Jiang (@sleepsort) (migrated from JIRA)

Thanks! I'll commit.

asfimport commented 11 years ago

ASF subversion and git services (migrated from JIRA)

Commit 1514978 from @sleepsort in branch 'dev/branches/lucene3069' https://svn.apache.org/r1514978

LUCENE-5179: refactoring on PostingsWriterBase for delta-encoding