Add hierarchical labels to SSDV facets [LUCENE-10250]

asfimport commented 2 years ago

Hi all,

I recently added a new benchmarking task to luceneutil to count facets on a random word chosen from each document which would give us a very high cardinality facet benchmarking compared to the faceting benchmarks we already had. After being merged, @mikemccand pointed out some interesting results in the nightly benchmarks where the BrowseRandomLabelSSDVFacets task was much faster than the BrowseRandomLabelTaxoFacets task.

I was thinking that using SSDV facets instead of taxonomy facets for our use case at Amazon Product Search could potentially lead to some increases in QPS and decreases in index size, but the issue is we use hierarchical labels, and as I understand it, SSDV faceting only supports a 2 level hierarchy as of today. This leads to my question of why is there a limitation like this on SSDV facets? Is hierarchical labels just a feature that hasn't been implemented in SSDV facets yet, or is there some more complex reason that we can't add hierarchical labels to SSDV facets?

Thanks!

Migrated from LUCENE-10250 by Marc D'Mello (@mdmarshmallow), resolved Jan 10 2022 Pull requests: https://github.com/apache/lucene/pull/509

asfimport commented 2 years ago

Greg Miller (@gsmiller) (migrated from JIRA)

I can't think of any reason off the top of my head that SSDV facet counting couldn't support hierarchical dimensions but here are a few placed I'd suggest digging into:

SortedSetDocValuesFacetField, which is used to add these fields at indexing time, appears to only support a single "flat" value, so that would need some thought (along with code in FacetsConfig that helps in the indexing).
I think DefaultSortedSetDocValuesReaderState has some baked in assumptions around "flat" data. I would poke around in there as a first stop to see if there's anything fundamentally preventing the extension to hierarchies.

I'd poked around this code a fair amount a few months back so I'll see if I can refresh my memory a bit more and will add some additional info here if I come up with something.

asfimport commented 2 years ago

Robert Muir (@rmuir) (migrated from JIRA)

I think it would be good to turn the problem around, e.g. why does/should SSDV facets do something special here, and if so, what is really needed?

Servers like solr and elasticsearch to my knowledge get away fine without such a feature, when users need it, sometimes they have to index terms/fields differently. For example: https://cwiki.apache.org/confluence/display/SOLR/HierarchicalFaceting

SSDV is structured behind the scenes to support a variety of stuff: faceting, sorting, grouping, runtime functions, etc. I'm currently not convinced we have to really modify it in a special way to do such stuff: particularly if the problems can be solved by just indexing data differently. We wouldn't want for a more esoteric use-case to add performance costs to everyone?

asfimport commented 2 years ago

Greg Miller (@gsmiller) (migrated from JIRA)

I think it would be good to turn the problem around, e.g. why does/should SSDV facets do something special here, and if so, what is really needed?

+1

SSDV is structured behind the scenes to support a variety of stuff: faceting, sorting, grouping, runtime functions, etc. I'm currently not convinced we have to really modify it in a special way to do such stuff: particularly if the problems can be solved by just indexing data differently. We wouldn't want for a more esoteric use-case to add performance costs to everyone?

Also +1. I don't think there's anything in SSDV itself that would prevent us from doing this (or any changes to propose there). I do believe there are assumptions in a few places in the faceting specific implementation on top of SSDV though that is assuming "flat" data (i.e., it assumes the string in the SSDV field is in the form "dim/value"). That's what I think we need to look at in a bit more detail.

asfimport commented 2 years ago

Marc D'Mello (@mdmarshmallow) (migrated from JIRA)

Thanks for the responses! So are you saying that instead of modifying the SSDVs themselves, it would be possible (and preferred) to modify the Lucene faceting code that uses the SSDVs so support these hierarchical labels? I guess that's what I was getting at in the original description of the issue, but my understanding of this code is still somewhat rudimentary so please let me know if I am misunderstanding something here.

asfimport commented 2 years ago

Robert Muir (@rmuir) (migrated from JIRA)

If you take the solr approach #1 from that page listed, you can see it has some faceting options such as prefix that I think SSDV doesn't support. But maybe a low-level option like prefix is all that is needed to power the whole feature (this kind of feature can be done efficiently, as prefixes in term space can be converted efficiently into an "ordinal range" up front, etc)

Anyway, i'd certainly like for us to consider exploring these simpler solutions that keep everything flat at the low-level before trying to do something more complicated. They should be easier to prototype and implement, as well.

asfimport commented 2 years ago

Robert Muir (@rmuir) (migrated from JIRA)

And yes, to be clear, i'm proposing modifying the SSDVFacets code with any such stuff like that, and keeping SSDV as is (flat), to at least have a simple baseline.

asfimport commented 2 years ago

Greg Miller (@gsmiller) (migrated from JIRA)

I took another look at the SSDV faceting code to try to jog my memory a bit more on where these "non hierarchical" assumptions are being applied and where they might get tricky, and I'd encourage you to have a look at what DefaultSortedSetDocValuesReaderState does at construction time. When that "state" object is constructed, it determines the ordinal range for every unique dimension, which is later used when getting counts for specific dimensions. We would need a general way to determine the ordinal range for any given path prefix in order to generalize to hierarchical data (to Robert's point). So when the user, for example, requests the "top-n" child values for a given hierarchical path, we can efficiently get the counts for all "child" ordinals.

But looking at the code in DefaultSortedSetDocValuesReaderState, it appears the question has been asked in the form of a TODO as to whether-or-not that mapping logic could be generalized to hierarchical paths. I don't see any reason why it can't, so I think there's a path forward there without even needed to open up more SSDV capabilities (like finding all ordinals for a given prefix).

asfimport commented 2 years ago

Robert Muir (@rmuir) (migrated from JIRA)

We would need a general way to determine the ordinal range for any given path prefix in order to generalize to hierarchical data (to Robert's point). So when the user, for example, requests the "top-n" child values for a given hierarchical path, we can efficiently get the counts for all "child" ordinals.

Right, because values are sorted, you can compute the ordinal range of some prefix (or startTerm/endTerm) using SDV/SDDV's lookupTerm(BytesRef):

https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/index/SortedSetDocValues.java#L73

You can look at the solr faceting code for exact details of how they do it for a prefix.

But high-level, yeah the idea here is that if you index terms in a certain way, due to the sorted order, we really do effectively have a tree already. and we can drill down specific paths of it.

asfimport commented 2 years ago

Robert Muir (@rmuir) (migrated from JIRA)

And in case you are curious, that default implementation of lookupTerm is optimized for the default codec. so don't be afraid of the binary search :) Sorting and other stuff uses it, too.

Default codec does binary search first on "blocks" and only decompresses only the one single block needed, then scans to the target: https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/codecs/lucene90/Lucene90DocValuesProducer.java#L1028 https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/codecs/lucene90/Lucene90DocValuesProducer.java#L1193

asfimport commented 2 years ago

Marc D'Mello (@mdmarshmallow) (migrated from JIRA)

I'll take a look at the code that you guys pointed to. Thanks for all the info/suggestions!

asfimport commented 2 years ago

Greg Miller (@gsmiller) (migrated from JIRA)

The more I've thought about this, the more I like it actually. I'd like to add a "bigger plus one" {}({}++1?) to exploring this idea. Taxonomy faceting is super useful but seems a little "clunky" to me in its current form (for lack of a better adjective). The whole idea of using a side-car Lucene index to store taxonomy information is neat/creative, but feels slightly "off". We've recently chased optimizations to how we encode information in this index (and in docs) (see: #10490, #11100, LUCENE-10122). We also rely on specific merge policies to ensure ordinal stability in the taxonomy index (see conversation here). While I recognize that there are a couple fundamental differences between taxonomy-based faceting and SSDV faceting, supporting hierarchical labels with SSDV faceting would go a long way towards closing the gap between the two implementations. If we could reach feature parity between these two approaches by supporting hierarchies of arbitrary depth, I think the only real difference becomes "ordinal mapping" penalties. So SSDV still needs to apply a global mapping while taxonomy-based faceting does some extra work when merging segments to avoid the need for global mapping at query time.

This is all a (fairly rambling) way of saying that I'm all in favor of trying to reach feature parity between taxonomy-based faceting and SSDV faceting in hopes that the two ideas could converge at some point into one implementation. Maybe that's overly optimistic, but I'll hold out hope. This would be a nice step in that direction.

asfimport commented 2 years ago

Robert Muir (@rmuir) (migrated from JIRA)

my suggestion would be, it seems to pay off to do this stuff benchmark driven. it has improved both of these faceting approaches a lot so far. We should keep pressing on it.

For this particular case, It seems a bit crazy we have to generate random data to make facet labels to benchmark

We are indexing wikipedia articles, no?

Wikipedia articles have "hierarchical categories" in the data already. For example, https://en.wikipedia.org/wiki/English_language has these categories:

Categories: English language, Analytic languages, English languages, Fusional languages, Germanic languages, Stress-timed languages, Subject–verb–object languages, Cultural globalization

And these have hierarchies: e.g. SVO-languages has sub-categories: https://en.wikipedia.org/wiki/Category:Subject%E2%80%93verb%E2%80%93object_languages

So I think, separately, it would be great to think about improving benchmarking with a more realistic use-case to drive decisions and tuning.

asfimport commented 2 years ago

Marc D'Mello (@mdmarshmallow) (migrated from JIRA)

I used random words as labels here because from my understanding of this discussion, it seems that we cannot generate new wiki line file docs, so I only had access to the info already in the enwiki-20120502-lines-1k.txt file as a source. Though I agree with your point, the hierarchical categories already in wikipedia would be a good way to test this change.

asfimport commented 2 years ago

Robert Muir (@rmuir) (migrated from JIRA)

yeah, i think you did the right thing to get some basic high-cardinality faceting tested with the main wikipedia.

perhaps another (easier) option for the future, would be to use another dataset such as geonames, and have a more simple standalone benchmark, more along the lines of http://people.apache.org/\~mikemccand/geobench.html ?

it could just do faceting, and really "target" faceting specifically, things like index speed, index size, faceting speed, RAM usage, or whatever. It wouldn't need to do other stuff like indexing title/text bodies that adds to noise.

I just mention geonames because it has obvious simple hierarchical facets too: country, admin1 (province/state), admin2 (municipality/county), etc. And we are using it in benchmarking already https://download.geonames.org/export/dump/readme.txt

asfimport commented 2 years ago

Robert Muir (@rmuir) (migrated from JIRA)

There's also another little mini-hierarchy here with feature class + feature code. Example: R.ST (Road -> Street), R.TRL (Road -> Trail), etc

asfimport commented 2 years ago

Greg Miller (@gsmiller) (migrated from JIRA)

I checked the nightly benchmarks again to compare taxo-based faceting to SSDV now that #11100 has been merged onto main, and the two approaches are much more similar in performance now (but SSDV still appears to be a little better). I still think we should move forward with this exploration, but the performance gap has shrunk considerably now that we no longer use a custom binary format in taxo faceting.

taxo-based: https://home.apache.org/\~mikemccand/lucenebench/BrowseRandomLabelTaxoFacets.html

SSDV-baed: https://home.apache.org/\~mikemccand/lucenebench/BrowseRandomLabelSSDVFacets.html

asfimport commented 2 years ago

Marc D'Mello (@mdmarshmallow) (migrated from JIRA)

I still think we should move forward with this exploration

Good to know, I have been working on a PR for this though progress has been a bit slow since I have been on vacation. I should be publishing it soon though, assuming everything goes smoothly.

asfimport commented 2 years ago

Marc D'Mello (@mdmarshmallow) (migrated from JIRA)

Made a PR here for this issue: https://github.com/apache/lucene/pull/509.

asfimport commented 2 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

So I think, separately, it would be great to think about improving benchmarking with a more realistic use-case to drive decisions and tuning.

+1, this is a great idea! I had not realized Wikipedia had this. I'll open a luceneutil issue.

asfimport commented 2 years ago

ASF subversion and git services (migrated from JIRA)

Commit b4e27f2c638d31473c48503d0def2259b5d5a408 in lucene's branch refs/heads/main from Marc D'mello https://gitbox.apache.org/repos/asf?p=lucene.git;h=b4e27f2

LUCENE-10250: Add support for arbitrary length hierarchical SSDV facets (#509)

asfimport commented 2 years ago

ASF subversion and git services (migrated from JIRA)

Commit eb0b1bf9f14bf760b178c6cf09107d6bb01e79e4 in lucene's branch refs/heads/main from Greg Miller https://gitbox.apache.org/repos/asf?p=lucene.git;h=eb0b1bf

Add CHANGES entry for LUCENE-10250

asfimport commented 2 years ago

ASF subversion and git services (migrated from JIRA)

Commit c7650cdec2cb4ed6b31c536a8eb2818c868dc42b in lucene's branch refs/heads/branch_9x from Marc D'mello https://gitbox.apache.org/repos/asf?p=lucene.git;h=c7650cd

LUCENE-10250: Add support for arbitrary length hierarchical SSDV facets (#509)

asfimport commented 2 years ago

Greg Miller (@gsmiller) (migrated from JIRA)

Merged this onto both main and branch_9x}. Thanks @mdmarshmallow and congrats on your first contribution!

asfimport commented 2 years ago

Marc D'Mello (@mdmarshmallow) (migrated from JIRA)

Thanks @gsmiller!

asfimport commented 2 years ago

Marc D'Mello (@mdmarshmallow) (migrated from JIRA)

I added BrowseDateSSDVFacets as a new task to luceneutil now that we support hierarchical SSDV facets. Here is a link to the nightly benchmark for this task:

https://home.apache.org/\~mikemccand/lucenebench/BrowseDateSSDVFacets.html

Here is the corresponding taxonomy task:

https://home.apache.org/\~mikemccand/lucenebench/BrowseDateTaxoFacets.html

It seems that these hierarchical SSDV facets are quite a bit slower than taxonomy facets, especially with recent (past 3 months or so) taxonomy improvements. I wonder if there is anything we can do to speed up SSDV faceting as well.

asfimport commented 2 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

Closing after the 9.1.0 release.

apache / lucene

Add hierarchical labels to SSDV facets [LUCENE-10250] #11286