asfimport commented 13 years ago

Faceting is a hugely important feature, available in Solr today but not [easily] usable by Lucene-only apps.

We should fix this, by creating a shared faceting module.

Ideally, we factor out Solr's faceting impl, and maybe poach/merge from other impls (eg Bobo browse).

Hoss describes some important challenges we'll face in doing this (http://markmail.org/message/5w35c2fr4zkiwsz6), copied here:

To look at "faceting" as a concrete example, there are big the reasons 
faceting works so well in Solr: Solr has total control over the 
index, knows exactly when the index has changed to rebuild caches, has a 
strict schema so it can make sense of field types and 
pick faceting algos accordingly, has multi-phase distributed search 
approach to get exact counts efficiently across multiple shards, etc...
(and there are still a lot of additional enhancements and improvements 
that can be made to take even more advantage of knowledge solr has because 
it "owns" the index that we no one has had time to tackle)

This is a great list of the things we face in refactoring. It's also important because, if Solr needed to be so deeply intertwined with caching, schema, etc., other apps that want to facet will have the same "needs" and so we really have to address them in creating the shared module.

I think we should get a basic faceting module started, but should not cut Solr over at first. We should iterate on the module, fold in improvements, etc., and then, once we can fully verify that cutting over doesn't hurt Solr (ie lose functionality or performance) we can later cutover.

Migrated from LUCENE-3079 by Michael McCandless (@mikemccand), 2 votes, resolved Jun 30 2011 Attachments: facet-userguide.pdf, LUCENE-3079_4x_broken.patch, LUCENE-3079_4x.patch, LUCENE-3079.patch (versions: 4), LUCENE-3079-dev-tools.patch, TestPerformanceHack.java Linked issues:

4170

asfimport commented 13 years ago

Stefan Trcek (migrated from JIRA)

This patch was generated by git and tested to apply with patch -p0 -i LUCENE-3079.patch --dry-run Be patient if anything went wrong.

Review starting points may be

FacetSearcherTest.testSimpleFacetWithIndexSearcher() or
FacetSearcher.facetCollectSearch()

Functions.java may be dismissed in favor of Guava. If you are willing to keep it I'll strip it down to the required parts.

The implementation relies on field cache only, no index scheme, no cached filters etc. It supports

single valued facets (Facet.java)
multi valued facets (Facet.MultiValued.java)
facet filters (see FacetSearcher.java)
evaluation of facet values that would dismiss due to other facet filters (Yonik says Solr calls this "multi-select faceting"). (realized by FacetSearcher.fillFacetsForGuiMode())

Let me explain the last point: For the user a facet query (color==green) AND (shape==circle OR shape==square) may look like

Facet color [ ] (3) red [x] (5) green [ ] (7) blue

Facet shape [x] (9) circle [ ] (4) line [x] (2) square

The red/blue/line facet values will display even though the corresponding documents are not in the result set. Also there is support for filtered facet values with zero results, so users understand why they do not get results.

asfimport commented 13 years ago

Yonik Seeley (@yonik) (migrated from JIRA)

if Solr needed to be so deeply intertwined with caching, schema, etc., other apps that want to facet will have the same "needs"

Sort of an aside, but not really.... specific applications are much easier. A lot more indirection is required in Solr and a schema is needed for pretty much everything. Without the schema, a client would specify "sort=foo desc" and Solr would have no idea how to do that. A specific application just does it because they have the knowledge of what all the fields are. It's why people have gotten along just fine without a schema in Lucene thus far. If you're building another Solr... yes, you need something like a schema.

asfimport commented 13 years ago

Jason Rutherglen (migrated from JIRA)

Schemas should probably be a module that makes use of embedding the field types per-segment, this is something the faceting module could/should use. I think is what #3384 is aiming for? Though I thought there was another Jira issue created by Simon for this as well.

asfimport commented 13 years ago

Chris Male (migrated from JIRA)

I don't think any Facet module needs to be concerned with Schemas. Instead the module can expose an API which asks for the information it needs to make the best choices. Solr can then provide that information based on its Schema, pure Lucene users can do it however they want.

asfimport commented 13 years ago

Jason Rutherglen (migrated from JIRA)

I don't think any Facet module needs to be concerned with Schemas

Right, they should be field type aware.

asfimport commented 13 years ago

Shai Erera (@shaie) (migrated from JIRA)

I would like to contribute IBM's faceted search package (been wanting to do that for quite a while). The package supports the following features/capabilities (at a high-level):

Taxonomy index – manages trees of 'categories'. You can view example of taxonomies at e.g. the Open Directory Project.
- It's a Lucene index managed alongside the content index.
- Builds the taxonomy on-the-fly (i.e. as categories are discovered).
- In general it maps a category hierarchy to ordinals (integers). For example, the category /date/2011/06/24 will create the following entry in the taxonomy index:
- /date, ordinal=1
- /date/2011, ordinal=2
- /date/2011/06, ordinal=3
- /date/2011/06/24, ordinal=4
FacetsDocumentBuilder which receives a list of categories that are associated w/ the document (can be of several dimensions) and:
- Fetches the ordinals of the category components from the taxonomy index (adding them to it on-the-fly).
- Indexes them in a (compressed) payload for the document (so for the above category example, 4 payloads will be indexed for the document).
- FDB can be used to augment a Document with other fields for indexing (it adds its own Field objects).
FacetsCollector receives a handle to the taxonomy, a list of facet 'roots' to count and returns the top-K categories for each requested facet:
- The root can denote any node in the category tree (e.g., 'count all facets under /date/2011')
- top-K can be returned for the top most K immediate children of root, or any top-K in the sub-tree of root.
Counting algorithm (at a high-level):
- Fetch the payload for every matching document.
- Increment by 1 the count of every ordinal that was encountered (even for facets that were not requested by the user)
- After all ordinals are counted, compute the top-K on the ones the user requested
- Label the result facets
Miscellaneous features:
- Sampling algorithm allows for more efficient facets counting/accumulation, while still returning exact counts for the top-K facets.
- Complements algorithm allows for more efficient facets counting/accumulation, when the number of results is > 50% of the docs in the index (we keep a total count of facets, count facets on the docs that did not match the query and subtract).
- Complements can be used to count facets that do not appear in any of the matching documents (of this result set). This does not exist in the package though ... yet.
- Facets partitioning – if the taxonomy is huge (i.e. millions of categories), it is better to partition them at indexing time, so that search time is faster and consumes less memory. Note that this is required because of the approach of counting all (allocating a count array) and then keeping only the results of interest.
- Category enhancements allow storing 'metadata' with categories in the index, so that more than simple counting can be implemented:
- weighted facets (built on top of enhancements) allows associating a weight w/ each category, and use smarter counting techniques at runtime. For example, if facets are generated by an analytics component, the confidence level can be set as the category's weight. If tags are indexed as facets (for e.g. generating a tag cloud for the result set), the number of times the document was tagged by the tag can be set as the tag's weight.
- That that facets are indexed in the payloads of documents allows managing very large taxonomies and indexes, without blowing up the RAM at runtime (but incur some search performance overhead). However, the payloads can be loaded up to RAM (like in FieldCache) in which case runtime becomes much faster.
- However facets are stored is abstracted though by a CountingList API, so we can definitely explore other means of storing the facet ordinals. Actually, the CountingList API allows us to read the ordinals from disk or RAM w/o affecting the rest of the algorithm at all.
- I did not want to dive too deep on the API here, but the runtime API is very extensible and allows one to use FacetsCollector for the simple cases, or lower-level API to get more control on the process. You can look at: FacetRequest, FacetSearchParams, FacetResult, FacetResultNode, FacetsCollector, FacetsAccumulator, FacetsAggregator for a more extensive set of API to use.
The package comes with example code which shows how to use the different features I've mentioned. There are also unit tests for ensuring the example code works :).
The package comes with a very extensive tests suite and is in use by many of our products for a long time, so I can state that it's very stable.
Some rough performance numbers:
- Collection of 1M documents, few hierarchical facets per document (we call it the 'Amazon' case)
- Query matches 49.8% of the docs (\~500K)
- No sampling / complements were used.
- Facets read from disk
- Takes 0.4 seconds to execute the query and count facets.

It will take me a few days to prepare a patch (lots of code) - will upload it as soon as it's ready.

asfimport commented 13 years ago

Jan Høydahl (@janhoy) (migrated from JIRA)

Bravo Shai & IBM!

asfimport commented 13 years ago

Ryan McKinley (@ryantxu) (migrated from JIRA)

Bravo Shai & IBM!

+1! This sounds awesome, and I hope will prove how modules will help lucene and solr

asfimport commented 13 years ago

Toke Eskildsen (@tokee) (migrated from JIRA)

This is quite another design than the quarter-baked one I've proposed with SOLR-2412 (which is really just a thin wrapper around LUCENE-2369). While maintaining a sidecar index makes the workflow more complicated, I would expect that it is beneficial for re-open speed and scalability.

Technical note: For hierarchical faceting, I find that it is possible to avoid storing all levels in the hierarchy. By maintaining two numbers for each tag, denoting the tag-level and the level for the previous tag that matches, only the relevant tags needs to be indexed (full explanation at https://sbdevel.wordpress.com/2010/10/05/fast-hierarchical-faceting/).

Kudos for contributing solid code. I am looking forward to seeing the patch.

asfimport commented 13 years ago

Shai Erera (@shaie) (migrated from JIRA)

Thanks Toke for the pointer. I think it's very interesting. We've actually explored in the past storing just the category/leaf, instead of the entire hierarchy, in the document. The search response time was much slower than what I reported above (nearly 2x slowdown). While storing the entire hierarchy indeed consumes more space, it is more performing at search time, and we figure that space today is cheap, and usually search apps are more interested in faster search response times and are willing to spend some more time at indexing and analysis stages.

Nevertheless, the link you provided proposes an interesting way to manage the hierarchy, and I think it's worth exploring at some point. Could be that it will perform better than how we managed it when we indexed just the leaf category for each document. We'd also need to see how to update the taxonomy on the go. For example, it describes that for A/B/C you know that its level is 3 (that's easy) and that the previous category/tag that matches (P) is A. But what if at some point A/B is added to a document? What happens to the data indexed for the doc w/ A/B/C, which now its previous matching category is A/B? It's not clear to me, but could be that I've missed the description in the proposal.

I am very close to uploading the patch. Hopefully I'll upload it by the end of my day.

asfimport commented 13 years ago

Toke Eskildsen (@tokee) (migrated from JIRA)

SOLR-2412/LUCENE-2369 were created with the trade-offs (relatively) long startup, low memory, high performance: When the index is (re)opened, the hierarchy is analyzed by iterating the terms (it could be offloaded to index-time, but it is still iterate-the-entire-term-list after each change). This does not play well with real-time, but should be a nice fit for large indexes with low update rate.

As for speed, my theory is that the sparser hierarchy (only the concrete paths) wins due to less counting, but without another solution to compare against it has so far remained a theory. There are some measurements at https://sbdevel.wordpress.com/2010/10/11/hierarchical-faceting/ but I find that for hierarchical faceting, small changes to test-setups can easily have vast implications on performance, so they are not comparable to your million-document test.

asfimport commented 13 years ago

Shai Erera (@shaie) (migrated from JIRA)

Attached patch includes the faceted search module. It currently compiles against 3x, so I've put it under lucene/contrib, but after we port it to trunk, it should be under modules/.

There isn't a short way to describe the content of the patch (as you can see, it's huge), so instead I'll give a brief overview of some key packages:

src/examples: contains code of different capabilities of the facets module. I'd start w/ examples/simple.
src/test: contains many tests, great place to start too
o.a.l.facet.taxonomy contains the taxonomy index management code. There are two interfaces TaxonomyWriter/Reader with a LuceneTW/TR impl
o.a.l.facet.index contains the indexing code of the different capabilities (e.g. simpl, enhancements etc.)
o.a.l.facet.search contains the respective search code

Few points:

I've put the ASL on all files.
Marked all code as @lucene.experimental.
After you apply the patch you can run 'ant eclipse' and it will build the facets code in eclipse (no maven integration yet - will need to look into it)
Under o.a.l and o.a.l.util there are several utility classes that are not specific to facetted search, however the facets code uses them. I've kept them there so that we can review and decide whether we want to move them to lucene-core at some point.

TODOs:

After it's on trunk, I think we should explore replacing the payloads w/ DocValues.
Leverage Lucene's superb random testing framework !
There are few TODOs in the code which I think can be addressed following this issue.

I will open follow-on issues for those.

Given the amount of code, I am wondering if perhaps we should commit it as-is, and do more thorough reviews afterwards. The code does not modify any existing code (aside from a tiny change to LuceneTestCase), so I think there's no risk in doing so. I am also not sure that it's sane to review that amount of code (nearly 40K lines) in patch form. What do you think?

asfimport commented 13 years ago

Chris Male (migrated from JIRA)

Great contribution Shai. What about putting it into a branch? I think it really does need a thorough review before we put it into trunk.

asfimport commented 13 years ago

Shai Erera (@shaie) (migrated from JIRA)

We can put it in a branch for trunk, in case we plan to refactor the code right away (at first I just thought to get it to compile against trunk). I thought that at first people would like to get hands on experience with it, before we discuss changes and refactoring. I mean, this code can really be released with Lucene's next 3x release. And since everything is @lucene.experimental, and is in its own separate contrib/module, I don't think a branch will ease off the review or refactoring process?

I guess what I'm aiming for is for our users to get this feature soon. And I'm afraid that putting it in a branch will only delay it.

asfimport commented 13 years ago

Robert Muir (@rmuir) (migrated from JIRA)

just some trivial test modifications so the tests work with an unmodified LuceneTestCase:

in some cases, if an assertion failed it would print the seed... but LTC does this.
in other tests, the test wanted to repeat a random sequence, but instead of exposing LTC internals, the test just grabs random.nextLong, makes a new Random from this, and then resets it with .setSeed.

asfimport commented 13 years ago

Robert Muir (@rmuir) (migrated from JIRA)

I guess what I'm aiming for is for our users to get this feature soon. And I'm afraid that putting it in a branch will only delay it.

+1

My suggestion:

commit to branch 3.x with @experimental.
next, do a "fast" port to trunk, this doesnt mean heavy refactoring to take advantage of things like docvalues, just get it working correctly on trunk's APIs.
finally, close this issue and do improvements as normal, backporting whichever ones are easy and make sense, like any other issue.

asfimport commented 13 years ago

Shai Erera (@shaie) (migrated from JIRA)

Thanks Robert for the fix. This indeed looks better than patching LTC !

Patch for dev-tools only, this time w/ Maven support too. I hope it works well :).

asfimport commented 13 years ago

Shai Erera (@shaie) (migrated from JIRA)

My suggestion:

1. commit to branch 3.x with @experimental. 2. next, do a "fast" port to trunk, this doesnt mean heavy refactoring to take advantage of things like docvalues, just get it working correctly on trunk's APIs. 3. finally, close this issue and do improvements as normal, backporting whichever ones are easy and make sense, like any other issue.

I agree. I'll give it a day or two before I commit, unless everyone agree it can be committed today, in which case I'll happily press the button :).

asfimport commented 13 years ago

Toke Eskildsen (@tokee) (migrated from JIRA)

The patch compiles neatly against a 3x stable checkout and the examples worked fine. Unfortunately I could not locate the million document test you mentioned above. Is it part of the patch?

asfimport commented 13 years ago

Shai Erera (@shaie) (migrated from JIRA)

Unfortunately I could not locate the million document test you mentioned above. Is it part of the patch?

It's not included. We wrote a proprietary (not extending Benchmark) test and would like to extend Lucene's benchmark for benchmarking facets. I think it will be interesting to define some facets on the Wikipedia collection, and benchmark it.

asfimport commented 13 years ago

Simon Willnauer (@s1monw) (migrated from JIRA)

Shai, MASSIVE PATCH! phew I don't want to review this entire thing in a patch really but at a first glance this looks very good. Lots of tests, documentation etc. and obviously this has been used in production so its seen some cpu cycles :) I think roberts proposed way is good!

+1

My suggestion:

commit to branch 3.x with @experimental. next, do a "fast" port to trunk, this doesnt mean heavy refactoring to take advantage of things like docvalues, just get it working correctly on trunk's APIs. finally, close this issue and do improvements as normal, backporting whichever ones are easy and make sense, like any other issue.

here is my +1

asfimport commented 13 years ago

Toke Eskildsen (@tokee) (migrated from JIRA)

Some preliminary performance testing: I hacked together a test where 1M documents were created with an average of 3.5 paths, down to a depth of 4 (resulting in 1.4M unique paths). A search that hit every other document was issued and the top 5 facets/tags was requested. Hopefully this is somewhat similar to your test.

LUCENE-3079	LUCENE-2369
Index build time	52 s	23 s
Memory required for indexing	192 MB	48 MB
First facet request	432 ms	12,000 ms
Best of 5 requests	228 ms	159 ms
Memory usage after faceting (after gc())	21 MB	22 MB

Upping the ante to 5M documents, 6.8 paths/docs, max depth 6 (22M unique paths) resulted in

LUCENE-3079	LUCENE-2369
Index build time	752 s	238 s
Memory required for indexing	2500 MB	128 MB
First facet request	3370 ms	147,000 ms
Best of 5 requests	2400 ms	2673 ms
Memory usage after faceting	435 MB	294 MB

Scaling down to 100K documents, 1.6 paths/doc, max depth 4 (63K unique paths) resulted in

LUCENE-3079	LUCENE-2369
Index build time	5317 ms	2563 ms
Memory required for indexing	48 MB	32 MB
First facet request	245 ms	1425 ms
Best of 5 requests	15 ms	8 ms
Memory usage after faceting	1 MB	2 MB

Some observations: It seems clear that some trade-offs are very different for the two solutions. LUCENE-3079 has brilliant startup time and slows analyzing time a bit through the whole indexing process. #3445 is dog slow at startup but does not impact indexing. They seem similar with regards to search-time speed and memory usage.

Now, #3445 patches some semi-random Lucene-4, so this is not a fair test. Likewise, the tests were just quick hacks; the disk cache was not flushed, the laptop was used for browsing etc. When LUCENE-3079 patches trunk, a proper comparison can be made.

I am a bit worried about the observed memory usage for index build. It seems that LUCENE-3079 uses a lot of heap there? I created the documents one at a time just before adding them to the index, so the memory usage is for the writers and a quick profile told me that it was mainly used for int-arrays. Does that sound right?

asfimport commented 13 years ago

Shai Erera (@shaie) (migrated from JIRA)

You write #4170 which is about "post group faceting", while this issue is LUCENE-3079. I assume you meant the latter, but want to confirm :). You also write #3385 which is about decoupling IW from Analyzers. Are you perhaps referring to a Solr issue, or a different Lucene issue? If so can you please let me know which one?

This is a great test, and it matches more or less the test we've been running. Is it in 'benchmark' form? Can you post it on this issue so I can try the same?

What do you mean by "top 5 facets/tags"? If I were to speak of dimensions, where a dimensions is like "tags", "authors", "date", then do you mean you've requested to count 5 dimensions, or you indexed just one dimension (i.e. one "root") and requested to fetch the top-5 results for it? I assume it's the latter, but again, confirming my understanding.

So assuming I understood correctly the terminology and test setup, you execute one query which matches 50% of the documents and ask to count the top-5 facets under a single "root"/"dimension", and record the time as 'first facet request'. And then you execute it 4-5 additional times, and record 'best of 5 requests'. Do I understand it correctly?

One difference between the two approaches, assuming you're referring to a faceting approach that uses the FieldCache is that by default, the faceting approach here reads everything from disk. So it would be interesting to run w/ the facets-in-memory feature.

I don't know how to relate to the memory usage – on the last test it consumed 50% less than the other approach, on the first it consumed nearly the same and on the second test it consumed 150% more. This is odd. Do you trust this measurement?

The 'first facet request' result is not surprising, because it takes time to warm up the FieldCache (assuming that's what you use).

I am interested in the memory observed for indexing because that too seems fluctuating? I.e., in the second test the difference is nearly x20 more, which is weird.

Also, the difference in indexing time is interesting too, as it too is not very consistent. And I find the x2 factor suspicious - would like to understand it better. Since trunk reports to improve indexing speed by a large factor (nearly 200%), I think it will be wise if we wait with this comparison until I bring the patch up w/ trunk.

I like it that you test the default behavior. I think it's very important that we have the greatest out-of-the-box experience. Since the two approaches read from disk/memory, I first would like to test the in-memory facets using this approach, so we can at least compare the same thing. I know that trunk plays some role here (definitely at indexing time), so we can focus on search time for now.

This is great stuff Toke !

asfimport commented 13 years ago

Shai Erera (@shaie) (migrated from JIRA)

Edited the issue title (it had an extra 'i' in 'faceting'), as well as added fix versions etc. I intend to commit the patch to the 3x branch either later today or tomorrow.

asfimport commented 13 years ago

Shai Erera (@shaie) (migrated from JIRA)

Created component modules/facet

asfimport commented 13 years ago

Toke Eskildsen (@tokee) (migrated from JIRA)

Shai: I completely messed up the JIRA-numbers, clearly I need to go home and cool my brain. It is fixed now. Sorry for the inconvenience.

Yes, I only created a single root (one dimension) and requested the top-5 facets. You understood the timing measurements correctly. I am sorry that my table was confusing with regards to memory. The first numbers was the Xmx required for index build (binary search until I got bored), while the second if what the JVM reported after the faceting calls were finished. For #3445 in the middle test, the faceted search required more memory (aka higher Xmx) than index build (which could probably have gotten by with even less).

I do not use field cance for #3445. It holds a compressed list of ordinals for tags for the documents in memory, with a few levels of indirections to handle doublettes. The startup time is basically due to doublette elimination.

Regarding the memory difference, #3445 does not operate at index-time. This means that is it plain Lucene indexing of terms like hierarchy:a/b/c/d. Actually I am surprised that it took 128MB for the larger test and I should probably re-run that with a lower allocation.

My guesstimage, based purely on observation, is that LUCENE-3079 requires heap relative to the taxonomy size at indexing time. At least with the (assumedly default) settings I used. Thus the 22M unique values in test #2 is the cause for the large memory requirement. Looking at the number of unique tags vs. index memory requirements for case #1 and #2, the factor seems nearly linear. It seems to fit your recommendation of splitting on large taxonomies?

I'll upload my test class for LUCENE-3079 now. I apologize for its hackish nature - this was just meant as explorative work.

asfimport commented 13 years ago

Toke Eskildsen (@tokee) (migrated from JIRA)

A quick hack to get an idea of performance characteristics for hierarchical faceting.

asfimport commented 13 years ago

Shai Erera (@shaie) (migrated from JIRA)

resulting in 1.4M unique paths

AND

LUCENE-3079 requires heap relative to the taxonomy size at indexing time

Ok now I see what's happening. The taxonomy index maintains a category cache, which maps a category to its ordinal. It's maintained at both indexing and search time. At indexing time, it's used to quickly determine if an incoming category already exists, and if so returns its ordinal. At search time its used to quickly label ordinals, as well as determining hierarchy information.

The default settings of LuceneTaxonomyWriter uses Cl2oTaxonomyWriterCache, which maintains a mapping of all categories to their ordinal, although the mapping is maintained compact. There is another cache LruTaxonomyWriterCache which as its javadocs states "good choice for huge taxonomies".

The taxonomy you create is HUGE by all standards (and I'm not speaking on the 22M case :)). The largest 'normal' taxonomy we've seen was in the order of few 100K nodes (when people names were maintained, and social tags), while the largest 'abnormal' taxonomy we've seen contained \~5M nodes, and that is considered a very extreme case for taxonomies.

Just to be clear, I'm not trying to make excuses. Your perf test is still great in that it tests an extreme case. But I'd expect an extreme case to require some mods to the defaults, and using LruTWC is one of them (w/ a cacheSize of let's say 100/200K). Another thing I've mentioned this package can do is the partitions – because at runtime we allocate an array the size of the taxonomy, in your case (1.4M nodes), we'll create an array that is \~6MB size for every query. While if you use partitions, and partition the categories into 10/100K-categories buckets, you'd allocate less RAM, but might incur some search performance overhead.

Out of curiosity, w/ 1.4M unique paths, and 1M docs, how many categories are assigned to each document and how many documents are associated w/ each category? When we ran our test, we used a Zipf distribution for categories. If in this test we end up associating only a couple of documents per category, then this is not a too realistic scenario. And while the package can handle it, by not using defaults, perhaps we should define a scenario that makes sense (a common one that is) and run with it?

I don't think there can be one "right" faceted search solution, but rather a collection of tools that can match different scenarios. And if it turns out that for one case one implementation is better than another, then our job will be to create a faceted search layer which allows the user to choose what's best for him, and leave the rest of his app code unmodified.

What do you think? Perhaps we should open a separate issue, let's call it "facet benchmarking", where we define some scenarios, work on extending the benchmark package (we have done some preliminary work there) and then compare few approaches?

asfimport commented 13 years ago

Shai Erera (@shaie) (migrated from JIRA)

About the TaxonomyWriterCache, looking at its API now, I think that an FST-based TWC might be a good fit here? FST is known for its performance and low RAM consumption, and TWC maps from a CategoryPath (or String) to an integer, which sounds like a typical usage for FST. So we can have both the keep-all-in-RAM-TWC and LRU use FST and consume less memory.

I'll open an issue for that.

One small correction - TWC is used only at indexing time, mapping from category->ordinal. For labeling ordinals you use TaxonomyReader which maintains its own int->String cache (LRU). Can FST aid in that case as well? I assume it will consume less space than an Integer->String hash map.

Back to performance – Toke, did you verify that you get the same top-5 categories from both implementations? Also, can you try running the test asking for top-5 categories of a node below the root? I.e., if the paths are in the form /a/b/c/d and you request to count "/a", then I'm interested in how this performs if you ask to count "/a/b".

asfimport commented 13 years ago

Shai Erera (@shaie) (migrated from JIRA)

Oh Toke ... I just reviewed your test and there is a problem in it:

    CountFacetRequest facetRequest = new CountFacetRequest(
        new CategoryPath(HIERARCHICAL), num);
    facetRequest.setDepth(5);

You create a CountFacetRequest, requesting to count HIERARCHICAL (which is the root), and fetch the top <num>, which is 5. BUT, you set the depth of the request to 5, which means it will compute the top-5 categories at each level !

This is a nice feature of the package, which lets you get not only the top-N child nodes of the immediate "root", but also the top-N of their child nodes and it's applied recursively until 'depth'. This is a nice feature, but not very performance friendly :).

Can you please rerun the test then, commenting out that line? I can run it on my laptop, but I don't have the env. setup w/ the patch from #3445, so I cannot compare.

In general, this looks like a very useful test. I think we can commit it too, but rename it so that it doesn't run regularly w/ our tests, but rather selectively.

asfimport commented 13 years ago

Toke Eskildsen (@tokee) (migrated from JIRA)

I must admit that my choices of tests were more aimed at probing edge cases than simulating real taxonomies. Nevertheless, it is good to hear that LUCENE-3079 can be tweaked to handle it. I fully support the idea of a facet benchmarking issue - perhaps with an associated wiki page?

As for the 1M test case, the number of tags/documents were random with the average being the stated 3.5 tags/document. Seen from the other side, there were an average of 1M*3.5/1.4M \~= 2.5 documents/tag.

I did verify the facet-results and they did fit my expectations. I will try requesting from further down the tree later - hopefully tomorrow. I am a bit confused about your protest on depth=5, but I suspect that we have different ideas of what is relevant when issuing a hierarchical request. The API states that specifying a depth of 5 will count all sub-tags until depth 5. I used the number 5 to effectively count all the way to the bottom (whoops! It should be 6 for the second case. That might explain why LUCENE-3079 was faster than #3445 in that one as #3445 counted to the bottom). The reason for the complete counting was that I implicitly found this to be the "correct" behavior, internally visioning a taxonomy of species or something similar, with the wish to get the number of unique elements at the finest level.

Thinking about this, I now have a better understanding of the duplication of data by indexing all levels of the paths. This speeds up shallow counting tremendously.

All this confusion supports the need for at coordinated effort to get some test cases with clear goals and realistic data.

asfimport commented 13 years ago

Shai Erera (@shaie) (migrated from JIRA)

I fully support the idea of a facet benchmarking issue - perhaps with an associated wiki page?

Yes. And on the Wiki page you should clearly describe the scenario that is tested, along with results for 'default config' + 'optimized config'. That way, a user coming to the page can pick the scenario that best matches his app and config the facets package (whatever we end up with) accordingly.

Seen from the other side, there were an average of 1M*3.5/1.4M \~= 2.5 documents/tag

That's indeed an extreme case. We've seen it when an analytics module extracted facets automatically from documents (such as places, people etc.), and in that case the taxonomy was very 'flat' and wide.

I am a bit confused about your protest on depth=5

I did not protest :). Actually, the common scenario is to count the immediate children and fetch the top-K. And I thought that that's what you do in #3445. But counting all the way down is a valid scenario - and shows another reason why we should have a benchmark page with clear description.

Thinking about this, I now have a better understanding of the duplication of data by indexing all levels of the paths. This speeds up shallow counting tremendously.

It actually speeds up counting overall. If you think about it, when we encounter category ordinals, we just increment the count by 1 in the respective location in the count array. No need to ask whether this is an ordinal the user asked to count at all. Later when we compute the top-K, we know more efficiently while root ordinal the user requested to count, and its children, so it's just a matter of putting everything into a heap and returning the top-K.

All this confusion supports the need for at coordinated effort to get some test cases with clear goals and realistic data.

Indeed. And we shouldn't pursue only 'realistic data', but edge cases too. As long as everything is clearly documented, it should be easy to interpret results.

I think that setting up facet benchmarking is more important than working on improving any implementation. Mostly because it will allow measuring the how much the improvements really improved. I'll open an issue for that.

asfimport commented 13 years ago

Toke Eskildsen (@tokee) (migrated from JIRA)

I re-ran the edge case 5M documents, 6.8 paths/docs, max depth 6 (22M unique paths) to drill down to level 6 (d6) and to level 1 (d1). A test for the path 'root/a/b' as starting point for the facet request was added. I also checked that indexing does indeed run with 48MB for #3445, but again this is just plain Lucene indexing.

LUCENE-3079 (d6)	LUCENE-3079 (d1)	LUCENE-3079 (d1+'deep/a/b')	LUCENE-2369 (d6)
Index build time	752 s	771 s	-	347 s
Memory required for indexing	2500 MB	2500 MB	2500 MB	48 MB
First facet request	3840 ms	1929 ms	1963 ms	147,000 ms
Best of 5 requests	2688 ms	1172 ms	1246 ms	2673 ms
Memory usage after faceting	435 MB	435 MB	435 MB	294 MB

Going to depth 1 helped a lot. I would have expected that requesting from 'deep/a/b' was even faster, but I guess I'll have to dig into the code to understand why it was not.

It actually speeds up counting overall. If you think about it, when we encounter category ordinals, we just increment the count by 1 in the respective location in the count array. No need to ask whether this is an ordinal the user asked to count at all. Later when we compute the top-K, we know more efficiently while root ordinal the user requested to count, and its children, so it's just a matter of putting everything into a heap and returning the top-K.

3445 uses exactly the same counting strategy (brute counting of everything). Not having explicit duplicates speeds this phase up and saves memory, but the numbers for d6 and d1 very clearly shows that it is faster overall to skip the drill-down in the extraction phase. At least for this test (and then we're back to creating a proper test suite).

Just for kicks, I tried guesstimating the layout of a corpus with addresses by upping the docs and lowering the paths: 100M documents, 0.84 paths/doc, max depth 4 (2.5M unique paths)

LUCENE-3079 (d4)	LUCENE-3079 (d1)	LUCENE-2369 (d6)
Index build time	28 min	-	17 min
Memory required for indexing	? MB	? MB	? MB
First facet request	13933 ms	13367 ms	46,000 ms
Best of 5 requests	11718 ms	11036 ms	2989 ms
Memory usage after faceting	240 MB	240 MB	475 MB

asfimport commented 13 years ago

Shai Erera (@shaie) (migrated from JIRA)

Committed revision 1141060 (3x). Will start the port to trunk.

I'm also attaching a userguide we wrote which should help newcomers get up to speed w/ the package. It is not meant to be an end-to-end cover of all the functionality and API, but rather as a complementary asset to the Javadocs, example code and source code itself.

I think it will be good if we check it in with the source, e.g. under contrib/facet/docs or something, in ODT (+PDF?) format, and that it will be included in the release binaries (i.e. along with the .jar).

What do you think?

asfimport commented 13 years ago

Robert Muir (@rmuir) (migrated from JIRA)

here's a hack patch, tried to quickly move this thing to trunk apis... 3 out of the 250 tests fail though, I'm 99% positive i jacked something up in the taxonomywriter (this is the most complicated one to convert), so maybe that one should just be started over.

but maybe some of the patch (except taxonomywriter, again i think its broken) would be useful in getting it ported to 4.x

asfimport commented 13 years ago

Robert Muir (@rmuir) (migrated from JIRA)

updated patch, one of the fails was caused by my use of seekExact, i think at least for MultiTermsEnum, if you seekExact, and even if it finds your term, its not positioned correctly, so if you then call next() its unsafe.

another one of the fails, is calling getPayload() before hasPayload() will not work with SepCodec.

thanks to mike for helping track some of these down.

i also added to build.xml the logic to depend on the analyzers module, now all tests pass (some of the time).

but i have at least one random fail:NOTE: reproduce with: ant test -Dtestcase=FacetsPayloadProcessorProviderTest -Dtestmethod=testTaxonomyMergeUtils -Dtests.seed=3732021887561370529:1102439953879128238

asfimport commented 13 years ago

Robert Muir (@rmuir) (migrated from JIRA)

the previous fail is somehow a bug in memorycodec (the seed randomly selected it): ant test -Dtestcase=FacetsPayloadProcessorProviderTest -Dtests.codec=Memory

asfimport commented 13 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

the previous fail is somehow a bug in memorycodec (the seed randomly selected it)

I just committed a fix for this; it was because .getPayload() in MemoryCodec was (incorrectly) assuming caller did not change the .bytes of the returned BytesRef between calls.

asfimport commented 13 years ago

Shai Erera (@shaie) (migrated from JIRA)

Thanks guys for doing this port so quickly. The patch looks good. I suggest that we change the 'nocommit' to TODO (Facet): and commit it (under modules/). Then we can iterate on the TODOs and resolve them one by one, in followup issues. Makes sense?

Robert, would you like to do the honors? :)

asfimport commented 13 years ago

Robert Muir (@rmuir) (migrated from JIRA)

updated patch: all tests pass.

I changed the nocommits to TODO (Facet):'s, and added verbage for the reason to each one.

we also have two TODOs for two bugs (The MTE.seekExact and SepCodec hasPayload bug) that we should fix, but currently we have workarounds in place (when we fix these bugs we can then remove the workarounds).

I'll svn move to modules, and doublecheck things like javadoc warnings, and commit later today.

asfimport commented 13 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Committed revision 1141246.

I think we should close this issue soon, and open followup issues? Maybe just start with a separate issue for the documentation guide?

asfimport commented 13 years ago

Shai Erera (@shaie) (migrated from JIRA)

I opened #4334 and #4335 to track userguide + benchmark issues. Will update more as we go along.

I think we should close this issue

+1.

We can now say Lucene has a faceting module ! Perhaps we should advertise it on the user-list?

Great job at porting to trunk Robert !

asfimport commented 13 years ago

Shai Erera (@shaie) (migrated from JIRA)

Faceting module in 3.x and trunk, tests pass, opened follow up issues. I think we can close this.

Thanks for everyone for helping get this in so quickly !

asfimport commented 11 years ago

Bill Bell (migrated from JIRA)

Has this been integrated into SOLR? Even if we just add it to SOLR as an option with hooks would be good. Or as an override in the lower levels?

asfimport commented 11 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Has this been integrated into SOLR?

Hi Bill,

No, not yet ... the separate taxonomy index makes it tricky. Although, the new facet method based on SortedSetDocValues is being used in Solr and is a patch (#5860) to add to the faceting module, so there's a small overlap there...

apache / lucene

Faceting module [LUCENE-3079] #4152

4170