Make it easier to mix different kinds of FacetRequests [LUCENE-4985]

asfimport commented 11 years ago

Spinoff from #6044, where we added a strange class called RangeFacetsAccumulatorWrapper, which takes an incoming FSP, splits out the FacetRequests into range and non-range, delegates to two accumulators for each set, and then zips the results back together in order.

Somehow we should generalize this class and make it work with SortedSetDocValuesAccumulator as well.

Migrated from LUCENE-4985 by Michael McCandless (@mikemccand), resolved Jul 29 2013 Attachments: LUCENE-4985.patch (versions: 3)

asfimport commented 11 years ago

Steven Rowe (@sarowe) (migrated from JIRA)

Bulk move 4.4 issues to 4.5 and 5.0

asfimport commented 11 years ago

Shai Erera (@shaie) (migrated from JIRA)

I have been thinking about how to achieve that .. here's a proposal:

Make FacetsAccumulator abstract with following current impls:
- TaxonomyFacetsAccumulator, assumes that TaxoReader is needed, FacetArrays etc.
- SortedSetFacetsAccumulator, assumes that categories were indexed to a SortedSetDVField
- RangeFacetsAccumulator, for computing facet ranges on NumericDV
- MultiFacetsAccumulator allows chaining several ones (basically a generic version of RangeFacetsAccumulatorWrapper)
Add to FacetRequest.createFacetsAccumulator()
- CountFacetRequest, Association**FacetRequest return TaxoFacetsAccumulator
- SortedSetCountFacetRequest returns SortedSetFA (and also verify that the given CategoryPath was actually indexed in a SortedSetDVField)
- RangeFacetRequest returns RangeFacetsAccumulator

This pretty much divides the FacetRequests into the "source" from which they read the facets information. Now we need to handle the different aggregation functions currently supported by the TaxoFacetAcc variants: counting, associations. TaxoFacetAcc will let you specify the FacetsAggregator:

CountFacetRequest will set the aggregator to FastCounting (if possible) or just Counting.
**AssociationFacetRequest will set the aggregator to the matching one
Additional requests can set their own aggregator
FacetsAggregator will need to implement equals() and hashCode()

Then we'll have FacetsAccumulator.create(List<FacetRequest>) which creates the right accumulator:

Group all requests that use the same FacetsAccumulator, so that all RangeFRs are grouped together, all TaxoFacetAcc requests together etc.
For the TaxoFacetAcc requests, it groups them by their aggregator, so that:
- All CountingAggregators that read the same category list are grouped together, separate from ones that do counting yet on a different category list
- All AssociationAggregators are grouped together, by their function, list id etc.
It then creates either a single accumulator, or MultiFacetAccumulator which chains the accumulate call

What do we gain – it's easy for an app to create the right accumulator for a given list of requests. Today it needs to sort of do this logic on its own, which is sometimes impossible (e.g. if it's a component that doesn't know what it's given). Also, the requests are self-descriptive.

What do we lose – today if one wants to count A, B and C using CachedOrdsCountingFacetsAggregator, it needs to override FacetsAccumulator.getAggregator(), once. With this change, he will need to do that for every CountFacetRequest he creates .. I think that's an OK tradeoff, given the situation today which makes apps' life tougher.

I think we'll also need to create an Aggregator (old FacetsAggregator) wrapper. It is still needed by StandardFacetsAccumulator, until we finish the cleanup of sampling, complements counting etc. I'll look into that too, perhaps it can be done separately in a different issue.

Now need to hope I took all the parameters into account, and won't hit a brick wall when trying to implement it :).

asfimport commented 11 years ago

Shai Erera (@shaie) (migrated from JIRA)

Didn't expect I'd run into it so quickly, but there are issues with adding createFacetsAccumulator to FacetRequest. Here are some details:

RangeAccumulator only requires List<FacetRequest>
TaxonomyFacetsAccumulator requires: List<FacetRequest>, FacetIndexingParams, TaxonomyReader and IndexReader
SortedSetDVAccumulator requires: List<FacetRequest>, SortedSetDVReaderState

At first I made:

FacetsAccumulator.create(FacetSearchParams, TaxonomyReader, IndexReader)
- FacetSearchParams covers List<FacetRequest> and FacetIndexingParams
FacetRequest.createFacestAccumulator(FacetSearchParams, TxonomyReader, IndexReader).

But this is problematic on several fronts:

A FacetRequest does not need List<FacetRequest> in createFacetsAccumulator(), only FacetIndexingParams
Not all requests need FacetIndexingParams, TaxonomyReader and IndexReader
- While declaring parameters that aren't needed is not a big deal, it does mean that someone will need to pass e.g. taxoReader=null and hope for the best. Especially if someone uses only RangeAccumulator and SortedSetDVAccumulator.
Some requests may need more than that, e.g. SortedSetDVReaderState.

So what's the best solution? It would be best if FacetRequest is self-descriptive so that apps can really just call FacetsAccumulator.create().

If we make FacetRequest take all the parameters it needs in the ctor, we sort of get that. A request is fully self-descriptive and FR.createFacetsAccumulator does not need any arguments. However, each new CountFacetRequest() will now need to pass many parameters, which is annoying and makes the code less readable, IMO.

I was thinking perhaps we can extend FacetSearchParams to include additional parameters. So ctor takes only the facet requests, and you can optionally set FacetIndexingParams, IndexReader, TaxonomyReader, SortedSetDVReaderState. There are minor issues that I think can be resolved easily:

If in the future we'll create an accumulator which requires a different setting, we'll need to add that to FSP. That's easy.
An app may not know which parameters to set on FSP in the first place – sort of easy to fix (on app side) cause it will hit a hard exception, probably NPE or some sort, immediately.
- This does show a slight problem though, as you need to create your FRs and then "remember" to set the right parameters on FSP.

There is another problem with FR.createFacetsAccumulator – the accumulators already take a List<FacetRequest> when created, however at the single FR level, it doesn't know the "global picture" and cannot pass all the requests up front. I was thinking that perhaps it can return a FacetAccumulatorBuilder which has API for addFacetRequest() and build(). FacetAccumulator.create() will call FR.createFacetAccumulatorBuilder() and will group all the requests that return the same builder, and then call builder.build() to get the proper FacetAccumulator instance. This is an orthogonal problem BTW to the fact that this builder will also need to know e.g. the SortedSetDVReaderState up front...

So ... making FR completely self-descriptive by taking all the parameters needed to execute it solves most of the problems – you have to resolve the needed parameters at ctor-time. But, it makes the code uglier. Moving parameters to FacetSearchParams keeps the FR ctors clean, but might be trappy for an app. I'll think about it more, but if someone has an idea how to tackle this ... don't be shy :).

asfimport commented 11 years ago

Shai Erera (@shaie) (migrated from JIRA)

I discussed this with Gilad and perhaps the situation today is not that bad, and we can do away with few minor changes. Today one can already construct a FacetsAccumulator given the FacetRequests that he wishes to execute. RangeAccumulatorWrapper attempts to allow you to work with RangeAccumulator and FacetsAccumulator (and it has a TODO to work for SortedSet too). So maybe what we need is the following simple solution:

Make FacetsAccumulator abstract, as outlined above, with implementations for Taxonomy, Range, SortedSet and Multi.
Each of the accumulator impls take the needed parameters in the ctor:
- List<FacetRequest> – the requests that are relevant to this accumulator only.
- SortedSetReaderState for SortedSetDVAccumulator, TaxoReader/FacetIndexingParams etc. for TaxonomyFacetAccumulator.
If you want to combine different requests, you construct the relevant accumulators and wrap w/ MultiFacetsAccumulator.
Two choices w.r.t. taxonomy accumulators:
- Keep them as FacetsAggregator - don't change much code, but keeps some aggregators with irrelevant API, such as rollupValues which is not implemented by the associations aggregators (only relevant for hierarchical facets and counting)
- Convert all of them to TaxonomyFacetsAccumulator which implement some abstract method, so that TaxoFA drives the accumulation and implements the common logic of computing top-k over FacetArrays and labeling. This changes existing aggregators ...

Bottom line is, a FacetRequest documents which accumulator should be used, but does not declare a createAccumulator() method, which leads to all the issues I've described above. Its matching accumulator will also make sure that the list of requests passed to it are "OK" (much like is done today).

This has another advantage of keeping e.g. CountFacetRequest as-is, and support it by both a CountingTaxonomyFacetsAccumulator and SortedSetFacetsAccumulator.

I'll start with this route and see where it leads me.

asfimport commented 11 years ago

Shai Erera (@shaie) (migrated from JIRA)

And I think that in this process I'll eliminate FacetSearchParams which today define only FacetIndexingParams (relevant to TaxonomyFacetsAccumulator only for now) and List<FacetRequest> which will not be a single list anymore, but per accumulator. Really .. FacetsAccumulator is another form of FacetsCollector, only results are collected once and processed afterwords by each accumulator.

asfimport commented 11 years ago

Shai Erera (@shaie) (migrated from JIRA)

Patch addresses the following:

Added FacetRequest.createFacetsAggregator(FacetIndexingParams). All requests implement it except RangeFacetRequest which returns null. The method is abstract and documents that you are allowed return null.
TaxonomyFacetsAccumulator: if a FacetRequest returns null from createFacetsAggregator, it throws an exception. Otherwise, it groups the requests into category lists as well as ensures that categories are not over counted. It uses MultiFacetsAggregator (new) and PerCategoryListAggregator (existing) to achieve that.
- That allows passing a combination of requests, e.g. Count(A), Count(B), Count(C), SumScore(A), SumScore(F), SumIntAssociation(D)... and works correctly when e.g. A+B were indexed in the same category list, but C, D and F weren't.
Added FacetsAccumulator.create() variants which support RangeAccumulator and either TaxonomyFacetsAccumulator or SortedSetDocValuesAccumulator. Differences are in the methods signatures.
- Renamed RangeFacestAccumulatorWrapper to MultiFacetsAccumulator. Also, the FacetResults are returned in the order of the given accumulators.
- FacetsAccumulator.create documents that you may receive List<FacetResult> in a different order than you passed in, guaranteeing that all RangeFacetRequests come last.
Modified DrillSideways to take either TaxonomyReader or SortedSetDVReaderState because otherise it cannot be used with SortedSetDV facets. Mike, can you please review it?

These changes simplified e.g. the associations examples, as now FacetsAccumulator.create() takes care of them too, since they implement createFacetsAggregator. Also, any future FacetRequest which will support FacetsAggregator will be supported automatically.

asfimport commented 11 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Could you post a patch with --show-copies-as-adds? (The current patch isn't easily applied since there were "svn mv"s involved...). Thanks.

asfimport commented 11 years ago

Shai Erera (@shaie) (migrated from JIRA)

Patch with --show-copies-as-adds

asfimport commented 11 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

This is a nice cleanup!

It's still hard to mix all three kinds of facet requests? E.g. I think it's realistic for an app to use SSDV for flat fields (less RAM usage than taxo, important if there are lots of ords), range for "volatile" numeric fields (e.g. time delta based), and taxo for hierarchies.

It seems like we could have a FacetsAccumulator.create that took both SSDVReaderState and TaxoReader and created the right FacetsAccumulator ... and I guess we'd need a SSDVFacetRequest.

Or I guess I can just create the directly MultiFacetsAccumulator myself ... FA.create is just sugar.

This all can wait for a follow-on issue ... these improvements are already great.

Should we move MultiFacetsAccumulator somewhere else (out of .range package)? It's more generic now?

FacetsAccumulator.create documents that you may receive List<FacetResult> in a different order than you passed in, guaranteeing that all RangeFacetRequests come last.

Hmm, can we fix that? (So that the order of the results matches the order of the requests).

Modified DrillSideways to take either TaxonomyReader or SortedSetDVReaderState because otherise it cannot be used with SortedSetDV facets. Mike, can you please review it?

Those changes look good! I think we can now simplify TestDrillSideways (previously it had to @Override getDrillDown/SidewaysAccumulator to use sorted set)?

asfimport commented 11 years ago

Shai Erera (@shaie) (migrated from JIRA)

Adding State to .create() does not simplify life for an app I think, because someone (on the app side) will need to figure out if State should be null or not. I'm worried that users will end up creating State even if they don't need it?

And since MultiFacetAccumulator lets you wrap any accumulator yourself, I think it's fine that these are separate methods, as a first step.

I'm worried about adding SortedSetDVFacetRequest, because unlike Count/SumScore/SumIntAssociation, this request is solely about the underlying source? And it also implies only counting ...

Should we move MultiFacetsAccumulator somewhere else

You're right! It was left there by mistake because I renamed RangeAccumulatorWrapper. Will move.

Hmm, can we fix that? (So that the order of the results matches the order of the requests).

I don't know how important it is ... none of our tests depend on it, and it's not clear to me how to fix it at all. FA.create() is a factory method. If it returns a single Accumulator, then it happens already (order is maintained). MultiFacetAccum loses the order. Maybe if we passed it the list of facet requests it could re-order them after accumulation, but I don't know how important it is... an app can put the List<FacetResult> in a Map, and do lookups? Also, as a generic MultiFA, it's not easier to determine from which FA a source FacetRequest came?

I think we can now simplify TestDrillSideways

You're right. Done.

asfimport commented 11 years ago

Shai Erera (@shaie) (migrated from JIRA)

Patch with fixed comments.

asfimport commented 11 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

I don't know how important it is ... none of our tests depend on it, and it's not clear to me how to fix it at all. FA.create() is a factory method. If it returns a single Accumulator, then it happens already (order is maintained). MultiFacetAccum loses the order. Maybe if we passed it the list of facet requests it could re-order them after accumulation, but I don't know how important it is... an app can put the List<FacetResult> in a Map, and do lookups? Also, as a generic MultiFA, it's not easier to determine from which FA a source FacetRequest came?

OK ...

But, I think we should not document that "range facet requests come last"? Let's leave it defined as undefined? Maybe we should return Collection not List?

asfimport commented 11 years ago

Shai Erera (@shaie) (migrated from JIRA)

But, I think we should not document that "range facet requests come last"?

Ok I will remove that comment. As soon as we add more accumulators, this comment is not important anyway.

Maybe we should return Collection not List?

Why? I prefer that we don't change that since that will change tests. Many of the tests do results.get(idx). If we don't need to, let's not complicate the users? If an app does pass the requests in known order, it shouldn't suffer. It's only Multi that loses order.

asfimport commented 11 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

I just think it's a dangerous API if sometimes the order matches and sometimes it doesn't ... but we can pursue this separately ...

asfimport commented 11 years ago

ASF subversion and git services (migrated from JIRA)

Commit 1508043 from @shaie in branch 'dev/trunk' https://svn.apache.org/r1508043

LUCENE-4985: Make it easier to mix different kinds of FacetRequests

asfimport commented 11 years ago

ASF subversion and git services (migrated from JIRA)

Commit 1508046 from @shaie in branch 'dev/branches/branch_4x' https://svn.apache.org/r1508046

LUCENE-4985: Make it easier to mix different kinds of FacetRequests

asfimport commented 11 years ago

Shai Erera (@shaie) (migrated from JIRA)

Committed to trunk and 4x. I think we can change .accumulate to return a Map<FacetRequest,FacetResult>, but this affects many of the tests, so let's do that separately.

asfimport commented 11 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

4.5 release -> bulk close

apache / lucene

Make it easier to mix different kinds of FacetRequests [LUCENE-4985] #6049