apache / lucene

Apache Lucene open-source search software
https://lucene.apache.org/
Apache License 2.0
2.6k stars 1.01k forks source link

Sampling can break FacetResult labeling [LUCENE-5016] #6080

Closed asfimport closed 11 years ago

asfimport commented 11 years ago

When sampling FacetResults, the TopKInEachNodeHandler is used to get the FacetResults.

This is my case: A FacetResult is returned (which matches a FacetRequest) from the StandardFacetAccumulator. The facet has 0 results. The labelling of the root-node seems incorrect. I know, from the StandardFacetAccumulator, that the rootnode has a label, so I can use that one.

Currently the recursivelyLabel method uses the taxonomyReader.getPath() to retrieve the label. I think we can skip that for the rootNode when there are no children (and gain a little performance on the way too?)


Migrated from LUCENE-5016 by Rob Audenaerde, resolved May 30 2013 Attachments: LUCENE-5016.patch, test-labels.zip

asfimport commented 11 years ago

Shai Erera (@shaie) (migrated from JIRA)

Can you attach a simple testcase exposing the problem? Not sure that I follow what's wrong. About not labeling, I doubt it will gain us anything. Labeling is not very expensive, and labels are LRU-cached. Also, considering all the work that's done during search processing, the labeling part is less than marginal.

asfimport commented 11 years ago

Rob Audenaerde (migrated from JIRA)

Now that I wrote the tests, I realise that maybe the behaviour of the StandardFacetAccumulator is incorrect, as it labels a FacetResult of a Facet that does not exist in the taxonomy...

The behaviour of the SamplingAccumulator and the Standard differ.

For my use case, it is very helpful if all the FacetRequests return a FacetResult with the same label as the request, but I can imagine that this is not desired.

asfimport commented 11 years ago

Shai Erera (@shaie) (migrated from JIRA)

I am not near the code and actually read the test in Notepad :). It looks like you're indexing 100K docs with categories A/docnum and then ask to count the categories "A" and "B". If I understand correctly, the assert in the end fails?

Basically, the FacetRestult that you get back should have the same label as the request. If it's not like that (and I will validate that when I'm near the code), then it's probably a bug in SamplingAccumulator.

BTW the test actually indexed 200K docs while passing 'j' which is 0 for the first 100K and 1 for the second. But 'j' seems to be unused in addDocument. This shouldn't affect the test behavior but just FYI.

Thanks for reporting this, I'll take a deeper look later.

asfimport commented 11 years ago

Shai Erera (@shaie) (migrated from JIRA)

I checked that and indeed there is inconsistency here. StandardFacetsAccumulator and FacetsAccumulator return an empty result with the root node labeled, while the sampling accumulators return the root node not labeled. There isn't anything technically wrong here, because the category does not exist, but I think we should be consistent.

I was able to reproduce this behavior with an even simpler test Rob: index a single document with category "A" and ask to count category "B". The problem is as follows:

Perhaps at some point of the code lifecycle this additional labeling was needed, I'm not sure :). But I think we should either remove the call to label the results in SamplingAccumulator, or at least not call taxoReader.getPath if the node.label is not null. For instance, if you ask to count "A" (which does exist), then labeling happens twice, once by SFA.accumulate and second time by SamplingAccumulator, which is just a waste.

I'll attach later a short test which reproduces this and checks all existing accumulators.

asfimport commented 11 years ago

Shai Erera (@shaie) (migrated from JIRA)

Patch fixes the following:

I think it's ready - I intend to commit it tomorrow.

asfimport commented 11 years ago

Gilad Barkai (migrated from JIRA)

Patch looks good. +1 for commit

asfimport commented 11 years ago

Shai Erera (@shaie) (migrated from JIRA)

Committed to trunk and 4x. Thanks Rob for reporting this!

asfimport commented 11 years ago

Steven Rowe (@sarowe) (migrated from JIRA)

Bulk close resolved 4.4 issues