apache / lucene

Apache Lucene open-source search software
https://lucene.apache.org/
Apache License 2.0
2.66k stars 1.03k forks source link

Make it easier to mix different kinds of FacetRequests [LUCENE-4985] #6049

Closed asfimport closed 11 years ago

asfimport commented 11 years ago

Spinoff from #6044, where we added a strange class called RangeFacetsAccumulatorWrapper, which takes an incoming FSP, splits out the FacetRequests into range and non-range, delegates to two accumulators for each set, and then zips the results back together in order.

Somehow we should generalize this class and make it work with SortedSetDocValuesAccumulator as well.


Migrated from LUCENE-4985 by Michael McCandless (@mikemccand), resolved Jul 29 2013 Attachments: LUCENE-4985.patch (versions: 3)

asfimport commented 11 years ago

Steven Rowe (@sarowe) (migrated from JIRA)

Bulk move 4.4 issues to 4.5 and 5.0

asfimport commented 11 years ago

Shai Erera (@shaie) (migrated from JIRA)

I have been thinking about how to achieve that .. here's a proposal:

This pretty much divides the FacetRequests into the "source" from which they read the facets information. Now we need to handle the different aggregation functions currently supported by the TaxoFacetAcc variants: counting, associations. TaxoFacetAcc will let you specify the FacetsAggregator:

Then we'll have FacetsAccumulator.create(List<FacetRequest>) which creates the right accumulator:

What do we gain – it's easy for an app to create the right accumulator for a given list of requests. Today it needs to sort of do this logic on its own, which is sometimes impossible (e.g. if it's a component that doesn't know what it's given). Also, the requests are self-descriptive.

What do we lose – today if one wants to count A, B and C using CachedOrdsCountingFacetsAggregator, it needs to override FacetsAccumulator.getAggregator(), once. With this change, he will need to do that for every CountFacetRequest he creates .. I think that's an OK tradeoff, given the situation today which makes apps' life tougher.

I think we'll also need to create an Aggregator (old FacetsAggregator) wrapper. It is still needed by StandardFacetsAccumulator, until we finish the cleanup of sampling, complements counting etc. I'll look into that too, perhaps it can be done separately in a different issue.

Now need to hope I took all the parameters into account, and won't hit a brick wall when trying to implement it :).

asfimport commented 11 years ago

Shai Erera (@shaie) (migrated from JIRA)

Didn't expect I'd run into it so quickly, but there are issues with adding createFacetsAccumulator to FacetRequest. Here are some details:

At first I made:

But this is problematic on several fronts:

So what's the best solution? It would be best if FacetRequest is self-descriptive so that apps can really just call FacetsAccumulator.create().

If we make FacetRequest take all the parameters it needs in the ctor, we sort of get that. A request is fully self-descriptive and FR.createFacetsAccumulator does not need any arguments. However, each new CountFacetRequest() will now need to pass many parameters, which is annoying and makes the code less readable, IMO.

I was thinking perhaps we can extend FacetSearchParams to include additional parameters. So ctor takes only the facet requests, and you can optionally set FacetIndexingParams, IndexReader, TaxonomyReader, SortedSetDVReaderState. There are minor issues that I think can be resolved easily:

There is another problem with FR.createFacetsAccumulator – the accumulators already take a List<FacetRequest> when created, however at the single FR level, it doesn't know the "global picture" and cannot pass all the requests up front. I was thinking that perhaps it can return a FacetAccumulatorBuilder which has API for addFacetRequest() and build(). FacetAccumulator.create() will call FR.createFacetAccumulatorBuilder() and will group all the requests that return the same builder, and then call builder.build() to get the proper FacetAccumulator instance. This is an orthogonal problem BTW to the fact that this builder will also need to know e.g. the SortedSetDVReaderState up front...

So ... making FR completely self-descriptive by taking all the parameters needed to execute it solves most of the problems – you have to resolve the needed parameters at ctor-time. But, it makes the code uglier. Moving parameters to FacetSearchParams keeps the FR ctors clean, but might be trappy for an app. I'll think about it more, but if someone has an idea how to tackle this ... don't be shy :).

asfimport commented 11 years ago

Shai Erera (@shaie) (migrated from JIRA)

I discussed this with Gilad and perhaps the situation today is not that bad, and we can do away with few minor changes. Today one can already construct a FacetsAccumulator given the FacetRequests that he wishes to execute. RangeAccumulatorWrapper attempts to allow you to work with RangeAccumulator and FacetsAccumulator (and it has a TODO to work for SortedSet too). So maybe what we need is the following simple solution:

Bottom line is, a FacetRequest documents which accumulator should be used, but does not declare a createAccumulator() method, which leads to all the issues I've described above. Its matching accumulator will also make sure that the list of requests passed to it are "OK" (much like is done today).

This has another advantage of keeping e.g. CountFacetRequest as-is, and support it by both a CountingTaxonomyFacetsAccumulator and SortedSetFacetsAccumulator.

I'll start with this route and see where it leads me.

asfimport commented 11 years ago

Shai Erera (@shaie) (migrated from JIRA)

And I think that in this process I'll eliminate FacetSearchParams which today define only FacetIndexingParams (relevant to TaxonomyFacetsAccumulator only for now) and List<FacetRequest> which will not be a single list anymore, but per accumulator. Really .. FacetsAccumulator is another form of FacetsCollector, only results are collected once and processed afterwords by each accumulator.

asfimport commented 11 years ago

Shai Erera (@shaie) (migrated from JIRA)

Patch addresses the following:

These changes simplified e.g. the associations examples, as now FacetsAccumulator.create() takes care of them too, since they implement createFacetsAggregator. Also, any future FacetRequest which will support FacetsAggregator will be supported automatically.

asfimport commented 11 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Could you post a patch with --show-copies-as-adds? (The current patch isn't easily applied since there were "svn mv"s involved...). Thanks.

asfimport commented 11 years ago

Shai Erera (@shaie) (migrated from JIRA)

Patch with --show-copies-as-adds

asfimport commented 11 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

This is a nice cleanup!

It's still hard to mix all three kinds of facet requests? E.g. I think it's realistic for an app to use SSDV for flat fields (less RAM usage than taxo, important if there are lots of ords), range for "volatile" numeric fields (e.g. time delta based), and taxo for hierarchies.

It seems like we could have a FacetsAccumulator.create that took both SSDVReaderState and TaxoReader and created the right FacetsAccumulator ... and I guess we'd need a SSDVFacetRequest.

Or I guess I can just create the directly MultiFacetsAccumulator myself ... FA.create is just sugar.

This all can wait for a follow-on issue ... these improvements are already great.

Should we move MultiFacetsAccumulator somewhere else (out of .range package)? It's more generic now?

FacetsAccumulator.create documents that you may receive List<FacetResult> in a different order than you passed in, guaranteeing that all RangeFacetRequests come last.

Hmm, can we fix that? (So that the order of the results matches the order of the requests).

Modified DrillSideways to take either TaxonomyReader or SortedSetDVReaderState because otherise it cannot be used with SortedSetDV facets. Mike, can you please review it?

Those changes look good! I think we can now simplify TestDrillSideways (previously it had to @Override getDrillDown/SidewaysAccumulator to use sorted set)?

asfimport commented 11 years ago

Shai Erera (@shaie) (migrated from JIRA)

Adding State to .create() does not simplify life for an app I think, because someone (on the app side) will need to figure out if State should be null or not. I'm worried that users will end up creating State even if they don't need it?

And since MultiFacetAccumulator lets you wrap any accumulator yourself, I think it's fine that these are separate methods, as a first step.

I'm worried about adding SortedSetDVFacetRequest, because unlike Count/SumScore/SumIntAssociation, this request is solely about the underlying source? And it also implies only counting ...

Should we move MultiFacetsAccumulator somewhere else

You're right! It was left there by mistake because I renamed RangeAccumulatorWrapper. Will move.

Hmm, can we fix that? (So that the order of the results matches the order of the requests).

I don't know how important it is ... none of our tests depend on it, and it's not clear to me how to fix it at all. FA.create() is a factory method. If it returns a single Accumulator, then it happens already (order is maintained). MultiFacetAccum loses the order. Maybe if we passed it the list of facet requests it could re-order them after accumulation, but I don't know how important it is... an app can put the List<FacetResult> in a Map, and do lookups? Also, as a generic MultiFA, it's not easier to determine from which FA a source FacetRequest came?

I think we can now simplify TestDrillSideways

You're right. Done.

asfimport commented 11 years ago

Shai Erera (@shaie) (migrated from JIRA)

Patch with fixed comments.

asfimport commented 11 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

I don't know how important it is ... none of our tests depend on it, and it's not clear to me how to fix it at all. FA.create() is a factory method. If it returns a single Accumulator, then it happens already (order is maintained). MultiFacetAccum loses the order. Maybe if we passed it the list of facet requests it could re-order them after accumulation, but I don't know how important it is... an app can put the List<FacetResult> in a Map, and do lookups? Also, as a generic MultiFA, it's not easier to determine from which FA a source FacetRequest came?

OK ...

But, I think we should not document that "range facet requests come last"? Let's leave it defined as undefined? Maybe we should return Collection not List?

asfimport commented 11 years ago

Shai Erera (@shaie) (migrated from JIRA)

But, I think we should not document that "range facet requests come last"?

Ok I will remove that comment. As soon as we add more accumulators, this comment is not important anyway.

Maybe we should return Collection not List?

Why? I prefer that we don't change that since that will change tests. Many of the tests do results.get(idx). If we don't need to, let's not complicate the users? If an app does pass the requests in known order, it shouldn't suffer. It's only Multi that loses order.

asfimport commented 11 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

I just think it's a dangerous API if sometimes the order matches and sometimes it doesn't ... but we can pursue this separately ...

asfimport commented 11 years ago

ASF subversion and git services (migrated from JIRA)

Commit 1508043 from @shaie in branch 'dev/trunk' https://svn.apache.org/r1508043

LUCENE-4985: Make it easier to mix different kinds of FacetRequests

asfimport commented 11 years ago

ASF subversion and git services (migrated from JIRA)

Commit 1508046 from @shaie in branch 'dev/branches/branch_4x' https://svn.apache.org/r1508046

LUCENE-4985: Make it easier to mix different kinds of FacetRequests

asfimport commented 11 years ago

Shai Erera (@shaie) (migrated from JIRA)

Committed to trunk and 4x. I think we can change .accumulate to return a Map<FacetRequest,FacetResult>, but this affects many of the tests, so let's do that separately.

asfimport commented 11 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

4.5 release -> bulk close