Support sparse faceting for heterogeneous indices [LUCENE-5333]

apache / lucene

Apache Lucene open-source search software

https://lucene.apache.org/

Apache License 2.0

2.64k stars 1.02k forks source link

Support sparse faceting for heterogeneous indices [LUCENE-5333] #6397

Open asfimport opened 10 years ago

asfimport commented 10 years ago

In some search apps, e.g. a large e-commerce site, the index can have a mix of wildly different product categories and facet dimensions, and the number of dimensions could be huge.

E.g. maybe the index has shirts, computer memory, hard drives, etc., and each of these many categories has different attributes.

In such an index, when someone searches for "so dimm", which should match a bunch of laptop memory modules, you can't (easily) know up front which facet dimensions will be important.

But, I think this is very easy for the facet module, since ords are stored "row stride" (each doc lists all facet labels it has), we could simply count all facets that the hits actually saw, and then in the end see which ones "got traction" and return facet results for these top dims.

I'm not sure what the API would look like, but conceptually this should work very well, because of how the facet module works. You shouldn't have to state up front exactly which facet dimensions to count...

Migrated from LUCENE-5333 by Michael McCandless (@mikemccand), 1 vote, updated Nov 11 2013 Attachments: LUCENE-5333.patch (versions: 3)

asfimport commented 10 years ago

Shai Erera (@shaie) (migrated from JIRA)

A simple way to achieve that is to write an AllFacetsAccumulator which:

Wraps a FacetsAccumulator and takes no FacetRequests
Creates FacetRequests for each child of ROOT (by asking for the children of ROOT)
Delegates .accumulate() to the wrapped FacetsAccumulator
In the end it goes of the List<FacetResult> and removes any FRes which has no children (meaning no interesting facets were returned).

Some drawbacks to this:

It's currently not clear how you can get the children of ROOT with SortedSetDocValues since it doesn't implement a TaxonomyReader. Maybe SSDVReaderState could have a getAllDims() method?
SortedSetDVAccumulator only does counting, but if you pass a TaxonomyFA, you might be interested to do other aggregations. I guess we could at first only do counting and think about other aggregation methods later on, but that means it needs to create CountingFR explicitly.
It may not be very efficient if e.g. you have 10s or 100s of dimensions, with a total number of categories that's huge, because what the method does is it will traverse the children of each dimension, cause it cannot tell up front if a dimension had or not any children. We could resolve that later.

What do you think?

asfimport commented 10 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Hmm if we wrap another FacetsAccumulator, is it the user's job to first create that accumulator (with no facet requests)? But then how do we then create another one, with all the facet requests we derived from ROOT? I guess we could just switch on the N types we have? But then maybe we should just add static methods to each to make this "All" accumulator for each?

asfimport commented 10 years ago

Shai Erera (@shaie) (migrated from JIRA)

Duh, good point! :).

I think then an AllFacetsAccumulator.create() with two variants - one that takes the members required to create TaxonomyFA and another to create SortedSetFA, and then return a FacetsAccumulator which extends either of the two, would work better ... but limited to CountingFacetRequest. I think, since it's so simple, we may not even need to bother with other aggregation functions, as this will be an example to how to achieve this functionality at all, and then an app could copy the code to create other FacetRequests (e.g. SumScoreFacetRequest)?

asfimport commented 10 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Initial patch, just adds sparse faceting to SortedSetDVAccumulator.

I didn't add any new AllFacetsAccumulator; I think this is sort of overkill? Instead I added another ctor to SortedSetDVA that takes no FacetSearchParams and just takes the int number of top labels to return per dim.

asfimport commented 10 years ago

Shai Erera (@shaie) (migrated from JIRA)

Why is it an overkill? Same AllFacetsAccumulator can be used for SortedSet as well as Taxonomy based faceting. We don't need to duplicate the logic between them. Also, I think that adding that directly to SSDVAccumulator or TaxonomyFacetsAccumulator kind of makes "sparse faceting" the norm rather than the outlier. I don't think it's that common usecase and having a dedicated accumulator seems better to me. It certainly keeps those accumulators' code focused on what they're supposed to do, not complicating their code (to the casual reader).

It should be a very simple accumulator, like what you did in .accumulate(), only will allow us to improve things in the future. We could even factor the logic into FA.create(), if it receives a null list of requests it allocates the proper AllFacetsAccumulator.

So I'm curious - did you try a dedicated class and ran into troubles?

About the code in the patch:

Is there a reason to not allocating the CFRs up front and setting them on the FSP? I mean, the only difference between that and what you do now is that the CFRs now are allocated transiently, but I don't think that should be an issue in general (like how many dims will those apps have?). Also, it might allow in the future to extend this support to sampling too.
You sort the FacetResult based on the FResNode.value (their root). Does SortedSet always assign a value to the root of a FacetResult.node?

If you don't feel like handling the Taxonomy case at the moment, that's fine. I still think we should add an AllFacetsAccumulator with a .create() which wraps an SSDV. We can add Taxonomy faceting to it later (though I hope it's just means another .create()).

asfimport commented 10 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Why is it an overkill?

Well, I think the facet module already has too many classes / abstractions: aggregators, accumulators, ordinal policies, search params, indexing params, cat paths, encoders, decoders, etc. I think this (huge API surface area) is a big impediment to users adopting it and devs contributing to it.

So, I really don't want to make this worse, by adding yet another Accumulator, that has static factory methods, to create yet other Accumulators that are subclasses of existing Accumulators. I think it's too much.

I also don't like separating concerns: I think that's a sign that something is wrong. I don't think a single class (AllFA) should be expected to handle both taxonomy based and SSDV based cases.

We already have classes that count facets using those two methods, so I think we should just add this capability to each of those classes.

And, if we add the enum facet method (and others), then the natural place to add sparse handling for it would be to its own class, I think.

So I'm curious - did you try a dedicated class and ran into troubles?

No, I haven't tried: I just didn't really like that approach... so I focused on the impl instead ...

Is there a reason to not allocating the CFRs up front and setting them on the FSP?

I really don't like the approach of "create CFR for every possible dim". I realize this is a simple way to implement it, but it seems wrong. And I especially don't want the API to expose that we are somehow doing this: it's an impl detail.

So I wanted to get "closer" to not creating all CFRs up-front, and doing it "transiently" seemed at least a bit better than bringing the entire list into existence.

But I think I can improve on the patch so that we don't even make a CFR until we see that any labels had non-zero count ... I'll work towards that.

You sort the FacetResult based on the FResNode.value (their root). Does SortedSet always assign a value to the root of a FacetResult.node?

Yes, it does, in the sparse case (I ignore the ord policy).

asfimport commented 10 years ago

Shai Erera (@shaie) (migrated from JIRA)

Well, I think the facet module already has too many classes

That's unrelated. It's like saying Lucene has many APIs: IndexWriter, IndexWriterConfig, Document, Field, MergePolicy, Query, QueryParser, Collector, IndexReader, IndexSearcher... just to name a few :). What's important here is FacetAccumulator and FacetRequest .. that's it. The rest are totally unrelated.

This scenario fits into another accumulator. Or else, we'll end up with facet code diverging left and right. Even now, for really no good reason, if you choose to index facets using SortedSetDV, you can only count them. Why? What prevents these ords from weighted by SumScore or a ValueSource? Nothing I think? So I'm worried that if you add this to only SortedSetDV, it will increase the difference between the two.

Rather, I prefer to pick the right API. We say that FacetsAccumulator is your entry point to accumulating facets. So far we've made FacetsAccumulator.create adhere to all existing FacetRequests and accumulators and return the proper one. I think that's a good API? And if all an AllFA needs to do is create dummy requests and filter out the not interesting ones, why complicate the code of all other accumulators (existing and future ones)? Won't it be simpler to add EnumFacetsAccumulator support to AllFA?

Look, this is not a rocket science feature. Besides that I don't think it's such an important or common feature, I think the app doesn't really need to go out of its way to support it – it can easily create all possible FRs using very simple API, and filter out FacetResults whose FRN.subResults is empty. Can we make a simple utility for these apps - I'm all for it! But I prefer that we don't complicate the code of existing FAs.

asfimport commented 10 years ago

Shai Erera (@shaie) (migrated from JIRA)

Here's the simple way I thought about - AllFacetsAccumulator takes no requests, has two ctors - one for SSDV and another for TaxoReader and initializes the proper FA underneath. To which it delegates the .accumulate, and later filters out any FRes with no children.

It's just a means for showing how I think it should be done. Still need to integrate it into FA.create, if we want to simplify an app's life even more, though I'd prefer to wait for some feedback from anyone that actually uses it first.

asfimport commented 10 years ago

Shai Erera (@shaie) (migrated from JIRA)

There's also a third option:

We add the getDimensions API to SSDVReaderState
We put an example under demo/ExploreFacetsExample (or better name)
- We basically demonstrate how to create a List<FR> for all available dimensions using either TaxoReader or SSDVReaderState
- And we show how to filter out the empty ones

If one day someone will ask how to do it and the example won't be enough, we can think about porting it as an FA or inlined into the other FAs. But until then, it's really a simple example.

asfimport commented 10 years ago

Shai Erera (@shaie) (migrated from JIRA)

I talked with Gilad about it and he suggested a nice solution, with some limitations – you can create whatever FacetRequest, e.g. CountFacetRequest over the ROOT category and set its depth to 2. That way, if we ask for numResults=10, you basically say "give me the top-10 dimensions (children of ROOT) and for each its top-10 children".

This isn't perfect as if you want to get all available dimensions you have to guess what numResults should be set to. And if you ask for a high number, e.g. 100, you ask for the top-100 children of ROOT, and for each its top-100 children. Still, you might not get all dimensions, but it's a very easy way to do this. No need for any custom code. Another limitation is that this is currently supported by TaxonomyFacetsAccumulator, but SortedSetDVAccumulator limits the depth to 1 for all given requests.

In that spirit, I can propose another solution - write a FacetResultsHandler which skips the first level of children and returns a FacetResult which has a tree structure, such that the first level are the dimensions and the second level are the actual children. That way, doing new CountFacetRequest(ROOT, 10).setDepth(2) will result in all available dimensions in the first level, but top-10 for each in the second level. To implement such FacetResultsHandler we'd need to iterate over ROOT's children and compute the top-K for each, using e.g. DepthOneFacetResultsHandler...

asfimport commented 10 years ago

Shai Erera (@shaie) (migrated from JIRA)

Patch add AllDimensionsFacetResultsHandler as a quick prototype to how this can be done. I also modified testTaxonomy to use it instead of AllFacetsAccumulator, and it passes.

If we want to proceed with this approach, we can do the following:

Add a new AllDimensionsFacetRequest which either:
- Extends CountFacetRequest, but then we limit it to counting only
- Wraps another FacetRequest so that you can do any aggregation that you want.
- It setDepth(2) internally.
Move FacetResultsHandler into FacetRequest, instead of TaxonomyFacetsAccumulator.createFacetResultsHandler. I'll admit that originally that's where it was (in FR), but I moved it to FA in order to simplify FR implementations. But perhaps it does belong w/ FR...

The only non-trivial part of this is that you get back a FacetResult, whose children are the actual results, so you cannot simply iterate on res.subResults, but need to realize you should iterate on each subResults.subResults. I don't know if this is considered as complicated or not (I didn't find it very complicating, but maybe I'm biased :)).

All-in-all, I think this is somewhat better than the accumulator approach, as it's more intuitive to define a FacetRequest, I think. In the faceted search module, FacetRequest == Query (in the content search jargon), and therefore more user-level than the underlying accumulator.

The downside is that it's not automatically supported by SortedSetDVAccumulator, since the latter doesn't respect any FacetRequest, only CountFacetRequest, and also does not let you specify your own FacetResultsHandler, but I think that that's solvable.

asfimport commented 10 years ago

Shai Erera (@shaie) (migrated from JIRA)

Actually, if we move FacetResultsHandler to FacetRequest and create the new AllDimensionsFR, it doesn't need to setDepth() at all, only override createFacetResultsHandler. And we can add a flattenResults() method to AllDimsFR which takes a FacetResult and returns a List<FacetResult>, to simplify app's life. Just an idea.