Open asfimport opened 10 years ago
David Smiley (@dsmiley) (migrated from JIRA)
Here's the patch, with randomized testing verification of results. The faceting code doesn't make any date assumptions, and so it should work once a non-date NumberRangePrefixTree subclass comes into existence.
I added a method to NumberRangePrefixTreeStrategy with this signature:
public Facets calcFacets(IndexReaderContext context, final Bits acceptDocs, Shape facetRange, int level)
acceptDocs is a filter of documents to count (null to count all docs). facetRange is a range that will limit the values counted to that provided range (e.g. a start date and end date span). Get one of those easily via toRangeShape(start,end) method. 'level' is the bottom-most prefix tree level to be counted. For example, the level corresponding to a 'day'. There are multiple ways of determining what the level is, namely by lookup given a Calendar field or taking it from an existing Calendar instance.
The response structure, 'Facets', looks like this:
public static class Facets {
//TODO consider a variable-level structure -- more general purpose.
public Facets(int detailLevel) {
this.detailLevel = detailLevel;
}
/** The bottom-most detail-level counted, as requested. */
public final int detailLevel;
/**
* The count of documents with ranges that completely spanned the parents of the detail level. In more technical
* terms, this is the count of leaf cells 2 up and higher from the bottom. Usually you only care about counts at
* detailLevel, and so you will add this number to all other counts below, including to omitted/implied children
* counts of 0. If there are no indexed ranges (just instances, i.e. fully specified dates) then this value will
* always be 0.
*/
public int topLeaves;
/** Holds all the {`@link` FacetParentVal} instances in order of the key. This is sparse; there won't be an
* instance if it's count and children are all 0. The keys are {`@link` LevelledValue} shapes, which can be
* converted back to the original Object (i.e. a Calendar) via {`@link` #toObject(LevelledValue)}. */
public final SortedMap<LevelledValue,FacetParentVal> parents = new TreeMap<>();
/** Holds a block of detailLevel counts aggregated to their parent level. */
public static class FacetParentVal {
/** The count of ranges that span all of the childCount. In more technical terms, this is the number of leaf
* cells found at this parent. Treat this like {`@link` Facets#topLeaves}. */
public int parentLeaves;// (parent leaf count) to be added to all descendants (children)
/** The length of {`@link` #childCounts}. If childCounts is not null then this is childCounts.length, otherwise it
* says how long it would have been if it weren't null. */
public int childCountsLen;
/** The detail level counts. It will be null if there are none, and thus they are assumed 0. Most apps, when
* presenting the information, will add {`@link` #topLeaves} and {`@link` #parentLeaves} to each count. */
public int[] childCounts;//null if there are none.
//assert childCountsLen == childCounts.length
}
}
I've got a toString() on it with concise output that I found nice to look at during debugging.
The patch has some small changes to related classes involved that are mostly little refactorings and/or making things visible that the facets code needs that were previously private. I'd like to make more refactoring/renaming happen around there to make this number/date spatial API a little more friendly. I'm sure it'll be a bit awkward to newcomers.
Shai Erera (@shaie) (migrated from JIRA)
Would it make sense to use the facet/ module Facets and FacetResult? I haven't looked at the patch, but if we can, it will be good since those facets could be used together with other facets (e.g. range faceting on numeric fields).
David Smiley (@dsmiley) (migrated from JIRA)
Hello @shaie. I did take a quick look even before your suggestion now and there isn't much congruency between both. For one thing, the facet module yields a LabelAndValue for each count yet for the faceting I did, I deliberately avoided computing the label String for each value and instead return a int[] for a higher level bucket. So if you facet by day, you'll get a month "parent" with a label and an int array of size 31 (or sometimes 30 or 28). Also, I don't know what to make of the "dim" and "path" stuff. So it doesn't seem like a good fit, and might erroneously suggest to users some sort of integration with the facets module that doesn't actually exist.
Shai Erera (@shaie) (migrated from JIRA)
OK in that case it sounds like it doesn't make much sense then. No need to force anything.
David Smiley (@dsmiley) (migrated from JIRA)
Updated patch to sync with refactors recently committed, and a minor test bug. I think I may commit this to trunk soonish to get Jenkins beating on it... though @ryantxu I'd appreciate a look.
ASF subversion and git services (migrated from JIRA)
Commit 1647727 from @dsmiley in branch 'dev/trunk' https://svn.apache.org/r1647727
LUCENE-5735: number/date range faceting on NumberRangePrefixTreeStrategy
David Smiley (@dsmiley) (migrated from JIRA)
This patch improves on the trunk committed code:
This patch depends on two recent small spatial refactorings, #7243 and #7244. I'll commit these issues in a bit and let Jenkins beat on it.
ASF subversion and git services (migrated from JIRA)
Commit 1652212 from @dsmiley in branch 'dev/trunk' https://svn.apache.org/r1652212
LUCENE-5735: spatial PrefixTreeFacetCounter (for faceting on SpatialPrefixTrees)
ASF subversion and git services (migrated from JIRA)
Commit 1652254 from @dsmiley in branch 'dev/trunk' https://svn.apache.org/r1652254
LUCENE-5735: move PrefixTreeFacetCounter up a package
ASF subversion and git services (migrated from JIRA)
Commit 1652270 from @rmuir in branch 'dev/trunk' https://svn.apache.org/r1652270
LUCENE-5735: don't run test 10,000 times. if this is needed please change back but add Nightly
David Smiley (@dsmiley) (migrated from JIRA)
The PrefixTreeFacetCounter utility is good; if it doesn't get committed to 5x as part of this issue first, it will for the heatmap one.
There's a bug in NumberRangePrefixTreeStrategy.calcFacets in which all cells above the parent are counted as topLeaves, when really that can only be done if the leaf cell contains the facet range. I have a fix in-progress in which I detect this and if the cell doesn't contain the facet range then I walk the sub-cells and increment the counters on the parent facet cells. There's a rare-ish bug I need to debug still. But thus far there are a few changes pending in my local check-out:
ASF subversion and git services (migrated from JIRA)
Commit 1658300 from @dsmiley in branch 'dev/branches/branch_5x' https://svn.apache.org/r1658300
LUCENE-5735: spatial PrefixTreeFacetCounter (for faceting on SpatialPrefixTrees)
David Smiley (@dsmiley) (migrated from JIRA)
After the 6x branch is cut, I'm going to delete this from the 6x branch but leave it in master. Even though the current test passes; it's too in-progress. I'll likely get some more time on this to finish it yet but not ATM.
Michael McCandless (@mikemccand) (migrated from JIRA)
@dsmiley you can remove this now from both 6.0.x and 6.x branches?
ASF subversion and git services (migrated from JIRA)
Commit 0b15fd86364dbafe08abc2939d54749e69bca23f in lucene-solr's branch refs/heads/branch_6x from @dsmiley https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=0b15fd8
LUCENE-5735: remove NumberRangePrefixTreeStrategy.calcFacets from 6x for now
ASF subversion and git services (migrated from JIRA)
Commit 239eb2400754ac5d31ee11f44e38567623ec8557 in lucene-solr's branch refs/heads/branch_6_0 from @dsmiley https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=239eb24
LUCENE-5735: remove NumberRangePrefixTreeStrategy.calcFacets from 6x for now (cherry picked from commit 0b15fd8)
Gavin McDonald (migrated from JIRA)
Move issue from deprecated 'In Progress' back to 'Open'
The newly added DateRangePrefixTree (DRPT) encodes terms in a fashion amenable to faceting by meaningful time buckets. The motivation for this feature is to efficiently populate a calendar bar chart or heat-map. It's not hard if you have date instances like many do but it's challenging for date ranges.
Internally this is going to iterate over the terms using seek/next with TermsEnum as appropriate. It should be quite efficient; it won't need any special caches. I should be able to re-use SPT traversal code in AbstractVisitingPrefixTreeFilter. If this goes especially well; the underlying implementation will be re-usable for geospatial heat-map faceting.
Migrated from LUCENE-5735 by David Smiley (@dsmiley), updated Aug 09 2018 Attachments: LUCENE-5735__PrefixTreeFacetCounter.patch, LUCENE-5735.patch (versions: 2) Linked issues:
6710
7253