apache / lucene

Apache Lucene open-source search software
https://lucene.apache.org/
Apache License 2.0
2.7k stars 1.04k forks source link

PerField(DocValues|Postings)Format do not call the per-field merge methods [LUCENE-7456] #8508

Closed asfimport closed 8 years ago

asfimport commented 8 years ago

While porting some old codec code from Lucene 4.3.1, I couldn't get the per-field formats to call upon the per-field merge methods; the default merge method was always being called.

I think this is a side-effect of #6956.

Attached is a patch with a test that reproduces the error and an associated fix that pass the unit tests.


Migrated from LUCENE-7456 by Julien Massenet, resolved Oct 04 2016 Attachments: LUCENE-7456.patch, LUCENE-7456-v2.patch Linked issues:

asfimport commented 8 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Wow, nice catch! Thank you.

Our default codec postings & doc values format doesn't override the merge method, so I guess the impact here on a default Lucene usage is minor. But if you customize your codec then it could matter if you have a better merge impl than the naive default.

I'll review the patch.

asfimport commented 8 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

It's a little spooky how sneaky this patch needs to be, temporarily overwriting MergeState members, adding a FilterFieldsProducer, FilterFieldsInfos.

Can we improve those latter two classes to e.g. reject a field that was not in the restricted set, if you call FilterFieldInfos.fieldInfo or FilterFieldsProducer.terms on an invalid field name?

The FilterFieldsInfos is also in a precarious state, having to include all incoming FieldInfo instances so the numbers are consistent, yet only overriding the iterator.

Methods like FilterFieldInfos.hasProx, etc., are also wrong, which can result in sneaky future bugs for codecs that rely on this.

I don't really like the complexity in this patch: I think this is a little too much sneakiness. Yet, I don't know of a cleaner way to fix the bug.

Stepping back a bit, can you describe the use case motivating allowing your custom codec to override the default merging for doc values / postings?

asfimport commented 8 years ago

Julien Massenet (migrated from JIRA)

I agree that being this sneaky is not ideal, but the only alternative I see would be change the Postings API.

In this patch, I tried to keep modifications constrained to the org.apache.lucene.codecs.perfield package, but I can give a shot at trying to a come up with a cleaner implementation that updates the other APIs if you want.

I've uploaded an updated version of the patch which does not remove the sneakiness but makes the PerFieldMergeState more robust:

As for why we're overriding merge() calls in our codec:

asfimport commented 8 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Thanks Julien Massenet, I think this patch is a good step forwards, and we can try to simplify the approach in future issues (progress not perfection!).

I'll fixup the minor failures from ant precommit and push.

asfimport commented 8 years ago

ASF subversion and git services (migrated from JIRA)

Commit a6a8032c7f079ea59daea0c95e48f69b2986d918 in lucene-solr's branch refs/heads/branch_6x from Mike McCandless https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=a6a8032

LUCENE-7456: PerFieldPostings/DocValuesFormat was failing to delegate the merge method

asfimport commented 8 years ago

ASF subversion and git services (migrated from JIRA)

Commit 796ed508f39683c626d4870a7ab583a222b2c64c in lucene-solr's branch refs/heads/master from Mike McCandless https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=796ed50

LUCENE-7456: PerFieldPostings/DocValuesFormat was failing to delegate the merge method

asfimport commented 8 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Thank you Julien Massenet!