apache / lucene

Apache Lucene open-source search software
https://lucene.apache.org/
Apache License 2.0
2.64k stars 1.02k forks source link

Faceting for DateRangePrefixTree [LUCENE-5735] #6797

Open asfimport opened 10 years ago

asfimport commented 10 years ago

The newly added DateRangePrefixTree (DRPT) encodes terms in a fashion amenable to faceting by meaningful time buckets. The motivation for this feature is to efficiently populate a calendar bar chart or heat-map. It's not hard if you have date instances like many do but it's challenging for date ranges.

Internally this is going to iterate over the terms using seek/next with TermsEnum as appropriate. It should be quite efficient; it won't need any special caches. I should be able to re-use SPT traversal code in AbstractVisitingPrefixTreeFilter. If this goes especially well; the underlying implementation will be re-usable for geospatial heat-map faceting.


Migrated from LUCENE-5735 by David Smiley (@dsmiley), updated Aug 09 2018 Attachments: LUCENE-5735__PrefixTreeFacetCounter.patch, LUCENE-5735.patch (versions: 2) Linked issues:

asfimport commented 9 years ago

David Smiley (@dsmiley) (migrated from JIRA)

Here's the patch, with randomized testing verification of results. The faceting code doesn't make any date assumptions, and so it should work once a non-date NumberRangePrefixTree subclass comes into existence.

I added a method to NumberRangePrefixTreeStrategy with this signature:

public Facets calcFacets(IndexReaderContext context, final Bits acceptDocs, Shape facetRange, int level)

acceptDocs is a filter of documents to count (null to count all docs). facetRange is a range that will limit the values counted to that provided range (e.g. a start date and end date span). Get one of those easily via toRangeShape(start,end) method. 'level' is the bottom-most prefix tree level to be counted. For example, the level corresponding to a 'day'. There are multiple ways of determining what the level is, namely by lookup given a Calendar field or taking it from an existing Calendar instance.

The response structure, 'Facets', looks like this:

public static class Facets {
    //TODO consider a variable-level structure -- more general purpose.

    public Facets(int detailLevel) {
      this.detailLevel = detailLevel;
    }

    /** The bottom-most detail-level counted, as requested. */
    public final int detailLevel;

    /**
     * The count of documents with ranges that completely spanned the parents of the detail level. In more technical
     * terms, this is the count of leaf cells 2 up and higher from the bottom. Usually you only care about counts at
     * detailLevel, and so you will add this number to all other counts below, including to omitted/implied children
     * counts of 0. If there are no indexed ranges (just instances, i.e. fully specified dates) then this value will
     * always be 0.
     */
    public int topLeaves;

    /** Holds all the {`@link` FacetParentVal} instances in order of the key. This is sparse; there won't be an
     * instance if it's count and children are all 0. The keys are {`@link` LevelledValue} shapes, which can be
     * converted back to the original Object (i.e. a Calendar) via {`@link` #toObject(LevelledValue)}. */
    public final SortedMap<LevelledValue,FacetParentVal> parents = new TreeMap<>();

    /** Holds a block of detailLevel counts aggregated to their parent level. */
    public static class FacetParentVal {

      /** The count of ranges that span all of the childCount.  In more technical terms, this is the number of leaf
       * cells found at this parent.  Treat this like {`@link` Facets#topLeaves}. */
      public int parentLeaves;// (parent leaf count) to be added to all descendants (children)

      /** The length of {`@link` #childCounts}. If childCounts is not null then this is childCounts.length, otherwise it
       * says how long it would have been if it weren't null. */
      public int childCountsLen;

      /** The detail level counts. It will be null if there are none, and thus they are assumed 0. Most apps, when
       * presenting the information, will add {`@link` #topLeaves} and {`@link` #parentLeaves} to each count. */
      public int[] childCounts;//null if there are none.
      //assert childCountsLen == childCounts.length
    }
  }

I've got a toString() on it with concise output that I found nice to look at during debugging.

The patch has some small changes to related classes involved that are mostly little refactorings and/or making things visible that the facets code needs that were previously private. I'd like to make more refactoring/renaming happen around there to make this number/date spatial API a little more friendly. I'm sure it'll be a bit awkward to newcomers.

asfimport commented 9 years ago

Shai Erera (@shaie) (migrated from JIRA)

Would it make sense to use the facet/ module Facets and FacetResult? I haven't looked at the patch, but if we can, it will be good since those facets could be used together with other facets (e.g. range faceting on numeric fields).

asfimport commented 9 years ago

David Smiley (@dsmiley) (migrated from JIRA)

Hello @shaie. I did take a quick look even before your suggestion now and there isn't much congruency between both. For one thing, the facet module yields a LabelAndValue for each count yet for the faceting I did, I deliberately avoided computing the label String for each value and instead return a int[] for a higher level bucket. So if you facet by day, you'll get a month "parent" with a label and an int array of size 31 (or sometimes 30 or 28). Also, I don't know what to make of the "dim" and "path" stuff. So it doesn't seem like a good fit, and might erroneously suggest to users some sort of integration with the facets module that doesn't actually exist.

asfimport commented 9 years ago

Shai Erera (@shaie) (migrated from JIRA)

OK in that case it sounds like it doesn't make much sense then. No need to force anything.

asfimport commented 9 years ago

David Smiley (@dsmiley) (migrated from JIRA)

Updated patch to sync with refactors recently committed, and a minor test bug. I think I may commit this to trunk soonish to get Jenkins beating on it... though @ryantxu I'd appreciate a look.

asfimport commented 9 years ago

ASF subversion and git services (migrated from JIRA)

Commit 1647727 from @dsmiley in branch 'dev/trunk' https://svn.apache.org/r1647727

LUCENE-5735: number/date range faceting on NumberRangePrefixTreeStrategy

asfimport commented 9 years ago

David Smiley (@dsmiley) (migrated from JIRA)

This patch improves on the trunk committed code:

This patch depends on two recent small spatial refactorings, #7243 and #7244. I'll commit these issues in a bit and let Jenkins beat on it.

asfimport commented 9 years ago

ASF subversion and git services (migrated from JIRA)

Commit 1652212 from @dsmiley in branch 'dev/trunk' https://svn.apache.org/r1652212

LUCENE-5735: spatial PrefixTreeFacetCounter (for faceting on SpatialPrefixTrees)

asfimport commented 9 years ago

ASF subversion and git services (migrated from JIRA)

Commit 1652254 from @dsmiley in branch 'dev/trunk' https://svn.apache.org/r1652254

LUCENE-5735: move PrefixTreeFacetCounter up a package

asfimport commented 9 years ago

ASF subversion and git services (migrated from JIRA)

Commit 1652270 from @rmuir in branch 'dev/trunk' https://svn.apache.org/r1652270

LUCENE-5735: don't run test 10,000 times. if this is needed please change back but add Nightly

asfimport commented 9 years ago

David Smiley (@dsmiley) (migrated from JIRA)

The PrefixTreeFacetCounter utility is good; if it doesn't get committed to 5x as part of this issue first, it will for the heatmap one.

There's a bug in NumberRangePrefixTreeStrategy.calcFacets in which all cells above the parent are counted as topLeaves, when really that can only be done if the leaf cell contains the facet range. I have a fix in-progress in which I detect this and if the cell doesn't contain the facet range then I walk the sub-cells and increment the counters on the parent facet cells. There's a rare-ish bug I need to debug still. But thus far there are a few changes pending in my local check-out:

asfimport commented 9 years ago

ASF subversion and git services (migrated from JIRA)

Commit 1658300 from @dsmiley in branch 'dev/branches/branch_5x' https://svn.apache.org/r1658300

LUCENE-5735: spatial PrefixTreeFacetCounter (for faceting on SpatialPrefixTrees)

asfimport commented 8 years ago

David Smiley (@dsmiley) (migrated from JIRA)

After the 6x branch is cut, I'm going to delete this from the 6x branch but leave it in master. Even though the current test passes; it's too in-progress. I'll likely get some more time on this to finish it yet but not ATM.

asfimport commented 8 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

@dsmiley you can remove this now from both 6.0.x and 6.x branches?

asfimport commented 8 years ago

ASF subversion and git services (migrated from JIRA)

Commit 0b15fd86364dbafe08abc2939d54749e69bca23f in lucene-solr's branch refs/heads/branch_6x from @dsmiley https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=0b15fd8

LUCENE-5735: remove NumberRangePrefixTreeStrategy.calcFacets from 6x for now

asfimport commented 8 years ago

ASF subversion and git services (migrated from JIRA)

Commit 239eb2400754ac5d31ee11f44e38567623ec8557 in lucene-solr's branch refs/heads/branch_6_0 from @dsmiley https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=239eb24

LUCENE-5735: remove NumberRangePrefixTreeStrategy.calcFacets from 6x for now (cherry picked from commit 0b15fd8)

asfimport commented 6 years ago

Gavin McDonald (migrated from JIRA)

Move issue from deprecated 'In Progress' back to 'Open'