apache / lucene

Apache Lucene open-source search software
https://lucene.apache.org/
Apache License 2.0

Use Suffix Arrays for fast search with leading asterisks [LUCENE-7639] #8690

Open asfimport opened 7 years ago

asfimport commented 7 years ago

If a query term starts with an asterisk, the FST has to check all words in the dictionary, so request processing speed drops sharply. This problem can be solved with the Suffix Array approach. Luckily, the Suffix Array can be constructed from the existing index after Lucene starts. Unfortunately, Suffix Arrays require a lot of RAM, so we use them only when a special flag is set:

-Dsolr.suffixArray.enable=true

It is possible to speed up Suffix Array initialization using several threads, so the number of threads can be controlled with

-Dsolr.suffixArray.initialization_threads_count=5

This system property can be omitted; the default value is 5.

The attached patch is the suggested implementation of SuffixArray support; it works for all terms starting with an asterisk that contain at least 3 consecutive non-wildcard characters. This patch does not change search results and affects only performance.
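The mechanism described above can be sketched in a few lines of plain Java (a hypothetical simplification for illustration, not the patch's actual code; class and method names are invented): every suffix of every dictionary term goes into one sorted array, and a query such as *abc or *abc* is answered by binary-searching for the literal fragment as a prefix of some suffix.

```java
import java.util.*;

// Hypothetical sketch of suffix-array lookup over dictionary terms (not the
// patch's code). Each suffix is stored as a (termId, offset) pair; sorting
// them lexicographically lets a binary search find all terms containing a
// given literal fragment, which is what a *abc query needs.
public class SuffixArrayDemo {

    static List<int[]> buildSuffixArray(String[] terms) {
        List<int[]> sa = new ArrayList<>();
        for (int t = 0; t < terms.length; t++)
            for (int off = 0; off < terms[t].length(); off++)
                sa.add(new int[] {t, off});
        sa.sort((a, b) -> terms[a[0]].substring(a[1])
                .compareTo(terms[b[0]].substring(b[1])));
        return sa;
    }

    // Ids of all terms containing `fragment` (covers *abc and *abc* queries).
    static Set<Integer> search(String[] terms, List<int[]> sa, String fragment) {
        int lo = 0, hi = sa.size();
        while (lo < hi) {                       // lower bound: first suffix >= fragment
            int mid = (lo + hi) >>> 1;
            int[] e = sa.get(mid);
            if (terms[e[0]].substring(e[1]).compareTo(fragment) < 0) lo = mid + 1;
            else hi = mid;
        }
        Set<Integer> hits = new TreeSet<>();
        for (int i = lo; i < sa.size(); i++) {  // matching suffixes are contiguous
            int[] e = sa.get(i);
            if (!terms[e[0]].substring(e[1]).startsWith(fragment)) break;
            hits.add(e[0]);
        }
        return hits;
    }
}
```

The RAM cost mentioned above is visible in the sketch: the structure holds one entry per character of the dictionary, which is why it needs so much more memory than the terms index itself.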

The updated suffix-array-2.patch is an improved version of the first patch; its system properties are the following:

lucene.suffixArray.enable - true if you want to enable Suffix Array support. Default value: false.

lucene.suffixArray.initializationThreadsCount - number of threads for Suffix Array initialization; if set to 0, no additional threads are used. Default value: 5.


Migrated from LUCENE-7639 by Yakov Sirotkin, 1 vote, updated Jun 03 2017. Attachments: suffix-array.patch, suffix-array-2.patch

asfimport commented 7 years ago

Dawid Weiss (@dweiss) (migrated from JIRA)

Wouldn't it be better to create an fst for all reversed forms of all terms and lookup the leading wildcard's matches by its concrete suffix instead (you'd need to reverse it too)? This wouldn't work for (single) infix wildcards, but these could then be implemented as an intersection between the two (terms matching the leading/trailing set).
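The suggestion above can be sketched with sorted sets standing in for the two FSTs (a hypothetical, self-contained illustration, not Lucene code: a TreeSet range scan plays the role of an FST prefix walk, and all names are invented). The set of reversed terms answers the leading-wildcard case, and an abc*def query is the intersection of the forward and reversed matches.

```java
import java.util.*;

// Hypothetical sketch of the "two automata" idea with NavigableSet range
// views in place of FSTs: the forward set answers abc*, the reversed set
// answers *abc, and abc*def is their intersection (with a length check so
// the prefix and suffix cannot overlap).
public class ReversedTermsDemo {
    static String reverse(String s) { return new StringBuilder(s).reverse().toString(); }

    // All elements of `sorted` that start with `prefix` (an FST prefix walk stand-in).
    static Set<String> prefixMatches(NavigableSet<String> sorted, String prefix) {
        return new TreeSet<>(sorted.subSet(prefix, true, prefix + Character.MAX_VALUE, false));
    }

    // Terms matching `prefix*suffix` by intersecting the two range scans.
    static Set<String> infixWildcard(NavigableSet<String> forward, String prefix, String suffix) {
        Set<String> heads = prefixMatches(forward, prefix);   // terms starting with prefix
        NavigableSet<String> reversed = new TreeSet<>();
        for (String t : forward) reversed.add(reverse(t));    // the "reversed FST"
        Set<String> hits = new TreeSet<>();
        for (String r : prefixMatches(reversed, reverse(suffix))) {
            String t = reverse(r);                            // terms ending with suffix
            if (heads.contains(t) && t.length() >= prefix.length() + suffix.length())
                hits.add(t);
        }
        return hits;
    }
}
```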

Also, there are better (as in: linear) ways of constructing suffix arrays compared to sorting. We used to implement a few of them, if you wanted to reuse the code (it would need to be optimized for strings/byte sequences, though) [1].

[1] https://github.com/carrotsearch/jsuffixarrays

asfimport commented 7 years ago

Yakov Sirotkin (migrated from JIRA)

Of course, an index of reversed terms is a popular solution for this problem, but it doesn't support cases with asterisks at both ends, and it requires a much bigger index. Imagine a situation where you have one read-only index storage used by 10 servers. You can enable suffix array support on only one server with a bigger RAM size and route all requests with asterisks to it; the additional cost of such a solution is pretty small.

Fast construction of the suffix array is the most challenging part of this approach, and I spent a lot of time trying various algorithms. Unfortunately, linear-time algorithms are too complex; their initialization usually takes longer than a full run of 3-way radix quicksort. Also, the performance of suffix array creation is not very important, because the current approach transparently falls back to the FST for segments where suffix array construction hasn't completed yet. And using multiple threads for suffix array construction reduces initialization time to acceptable values. Also, 3-way radix quicksort is perfect from a RAM usage point of view.
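For reference, the 3-way radix quicksort mentioned here can be sketched as follows (a generic textbook version over the suffixes of a single text, not the patch's implementation). It sorts the suffix-index array in place, which is what makes its RAM footprint attractive; note that the equal-characters partition recurses one character deeper each time.

```java
// Textbook 3-way radix quicksort over suffix start positions of one text
// (illustrative sketch, not the patch's code). Suffixes are compared lazily,
// one character per recursion level, and the permutation happens in place.
public class RadixSuffixSort {
    // Character `depth` positions into the suffix starting at `suffix`,
    // or -1 once we run past the end (shorter suffixes sort first).
    static int charAt(String text, int suffix, int depth) {
        int i = suffix + depth;
        return i < text.length() ? text.charAt(i) : -1;
    }

    static void sort(String text, int[] sa, int lo, int hi, int depth) {
        if (hi <= lo) return;
        int lt = lo, gt = hi, i = lo + 1;
        int pivot = charAt(text, sa[lo], depth);
        while (i <= gt) {                        // 3-way partition on one character
            int c = charAt(text, sa[i], depth);
            if (c < pivot) swap(sa, lt++, i++);
            else if (c > pivot) swap(sa, i, gt--);
            else i++;
        }
        sort(text, sa, lo, lt - 1, depth);       // smaller characters
        if (pivot >= 0) sort(text, sa, lt, gt, depth + 1);  // equal: go one char deeper
        sort(text, sa, gt + 1, hi, depth);       // larger characters
    }

    static void swap(int[] a, int i, int j) { int t = a[i]; a[i] = a[j]; a[j] = t; }

    static int[] suffixArray(String text) {
        int[] sa = new int[text.length()];
        for (int i = 0; i < sa.length; i++) sa[i] = i;
        sort(text, sa, 0, sa.length - 1, 0);
        return sa;
    }
}
```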

asfimport commented 7 years ago

Dawid Weiss (@dweiss) (migrated from JIRA)

[...] and it requires a much bigger index

Why is this the case? The inverted FST should be smaller than the suffix array, and intersecting both should be a viable option to get *abc* wildcard matches?

asfimport commented 7 years ago

Yakov Sirotkin (migrated from JIRA)

An inverted FST requires additional disk space and additional RAM, while Suffix Arrays require only additional RAM (though much more than an inverted FST), so the two approaches have different pros and cons. And I suppose that it is not possible to process *abc* terms fast with an inverted FST.

asfimport commented 7 years ago

Dawid Weiss (@dweiss) (migrated from JIRA)

Sure, I didn't mean to somehow diminish suffix arrays – they're super nice! I'd still be curious whether the thing can be done on a finite state automaton alone.

asfimport commented 7 years ago

Yakov Sirotkin (migrated from JIRA)

I'd still be curious whether the thing can be done on a finite state automaton alone.

That was my starting point: I tried to trace the FST to figure out what goes wrong with leading asterisks and how it could be improved. But I suppose the FST is so finely tuned that any attempt to improve performance for leading asterisks would decrease performance for all other requests. The leading asterisk is the FST's Achilles' heel: it iterates over every word in the index and says, "Oops, this is the wrong word, let's try the next one!"

asfimport commented 7 years ago

Dawid Weiss (@dweiss) (migrated from JIRA)

Nothing is wrong with the FST, really. A suffix array is functionally equivalent to a suffix tree, which you could build and encode as an FST. Then any infix matching would be done similarly to suffix array-based lookups. These are all interchangeable concepts. The way we use FSTs in Lucene is just one particular application to solve one particular problem (essentially implement an efficient partial prefix trie), hence my remark on the possibility of using two automata to make an intersection of wildcards possible.

As for your patch, I really don't know how much of a problem (in typical use) prefix and infix wildcards are. While leading wildcards are heavily used for many things (suggestions, etc.) I can't quite remember when I needed a prefix or infix wildcard.

In other words – I don't know whether there'll be much interest in maintaining this code if it went into Lucene, especially considering the implementation details you mentioned (the amount of memory it requires at runtime, the separate construction process, etc.). Let's see what others think, though.

asfimport commented 7 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

Hi, I am fine with adding those optimizations to the index file format (one could save the suffix array as part of the BlockTreeTermsdict?). As this is not needed by all users, the right way to do this would be to use another codec that adds this to the terms dict.

What is impossible in Lucene (which is a library, not only used by Solr):

asfimport commented 7 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

It would be nice to have a simple option to make heavy (infix, prefix) wildcard queries fast.

I think a custom codec, and likely a custom WildcardQuery impl that "sees" it is working with the custom codec and taps into the suffix array, is a good way to implement this. It should maybe be a straightforward conversion of the current patch into a custom codec, i.e. your codec's postings implementation would wrap the default codec and hold onto the WildcardHelper instance.

Separately, I am curious how @dweiss's idea (also index the reversed form of the field, then do two prefix searches and intersect the resulting terms) compares in performance (index time, index size, search heap, query cost) to the suffix array.

The patch falls back to Java's Pattern for checking each term in the more complex cases, but couldn't you just use the CompiledAutomaton's run method to check instead?
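For readers unfamiliar with the suggestion: a compiled run automaton checks a term with one table lookup per character and no backtracking, unlike java.util.regex.Pattern. The sketch below is a self-contained stand-in, not Lucene's CompiledAutomaton API (the class itself is invented here, and it assumes ASCII terms): it compiles the infix test for a *abc*-style query into a small KMP-style DFA and then runs each candidate through the transition table.

```java
// Hypothetical stand-in for a compiled run automaton (not Lucene's API).
// The substring-matching DFA for `infix` is built once with the classic
// KMP construction; run() then needs exactly one table lookup per character.
public class RunAutomatonDemo {
    final int[][] dfa;    // dfa[state][char] -> next state
    final int accept;

    RunAutomatonDemo(String infix) {
        int m = infix.length(), R = 128;        // ASCII alphabet for the sketch
        accept = m;
        dfa = new int[m + 1][R];
        dfa[0][infix.charAt(0)] = 1;
        for (int x = 0, j = 1; j < m; j++) {
            for (int c = 0; c < R; c++) dfa[j][c] = dfa[x][c];  // mismatch: fall back
            dfa[j][infix.charAt(j)] = j + 1;                    // match: advance
            x = dfa[x][infix.charAt(j)];                        // restart state
        }
        for (int c = 0; c < R; c++) dfa[m][c] = m;              // accept state absorbs
    }

    // Does the term contain the infix? No backtracking, no regex engine.
    boolean run(String term) {
        int state = 0;
        for (int i = 0; i < term.length() && state != accept; i++)
            state = dfa[state][term.charAt(i)];
        return state == accept;
    }
}
```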

I ran Lucene's tests with this patch, after 1) hard-wiring the property check to true and 2) making the init synchronous (not using the thread pool), and Lucene's WildcardQuery tests hit some failures, e.g.:

   [junit4]   2> NOTE: reproduce with: ant test  -Dtestcase=TestTerms -Dtests.method=testTermMinMaxRandom -Dtests.seed=11FDF2AE5B77A883 -Dtests.locale=es-GT -Dtests.timezone=CET -Dtests.asserts=true -Dtests.file.encoding=UTF-8
   [junit4] FAILURE 0.04s J0 | TestTerms.testTermMinMaxRandom <<<
   [junit4]    > Throwable #1: java.lang.AssertionError
   [junit4]    >    at __randomizedtesting.SeedInfo.seed([11FDF2AE5B77A883:5D8E7DDA37EB73E6]:0)
   [junit4]    >    at org.apache.lucene.util.UnicodeUtil.UTF8toUTF16(UnicodeUtil.java:593)
   [junit4]    >    at org.apache.lucene.util.BytesRef.utf8ToString(BytesRef.java:152)
   [junit4]    >    at org.apache.lucene.codecs.blocktree.WildcardHelper.<init>(WildcardHelper.java:106)
   [junit4]    >    at org.apache.lucene.codecs.blocktree.FieldReader.<init>(FieldReader.java:106)
   [junit4]    >    at org.apache.lucene.codecs.blocktree.BlockTreeTermsReader.<init>(BlockTreeTermsReader.java:234)
   [junit4]    >    at org.apache.lucene.codecs.lucene50.Lucene50PostingsFormat.fieldsProducer(Lucene50PostingsFormat.java:445)
   [junit4]    >    at org.apache.lucene.index.SegmentCoreReaders.<init>(SegmentCoreReaders.java:109)
   [junit4]    >    at org.apache.lucene.index.SegmentReader.<init>(SegmentReader.java:74)
   [junit4]    >    at org.apache.lucene.index.ReadersAndUpdates.getReader(ReadersAndUpdates.java:143)
   [junit4]    >    at org.apache.lucene.index.ReadersAndUpdates.getReadOnlyClone(ReadersAndUpdates.java:195)
   [junit4]    >    at org.apache.lucene.index.StandardDirectoryReader.open(StandardDirectoryReader.java:103)
   [junit4]    >    at org.apache.lucene.index.IndexWriter.getReader(IndexWriter.java:473)
   [junit4]    >    at org.apache.lucene.index.RandomIndexWriter.getReader(RandomIndexWriter.java:376)
   [junit4]    >    at org.apache.lucene.index.RandomIndexWriter.getReader(RandomIndexWriter.java:313)
   [junit4]    >    at org.apache.lucene.index.TestTerms.testTermMinMaxRandom(TestTerms.java:76)
   [junit4]    >    at java.lang.Thread.run(Thread.java:745)

But those failures are possibly harmless, because those tests send non-UTF-8 data into Lucene, whereas this change (the property) would only be enabled on fields that are UTF-8.

It also hit a stack overflow w/ a long term:

   [junit4] Suite: org.apache.lucene.index.TestIndexWriter
   [junit4]   2> NOTE: reproduce with: ant test  -Dtestcase=TestIndexWriter -Dtests.method=testWickedLongTerm -Dtests.seed=524558B2F180613F -Dtests.locale=ru-RU -Dtests.timezone=Atlantic/Faroe -Dtests.asserts=true -Dtests.file.encoding=UTF-8
   [junit4] ERROR   2.47s J1 | TestIndexWriter.testWickedLongTerm <<<
   [junit4]    > Throwable #1: java.lang.StackOverflowError
   [junit4]    >    at __randomizedtesting.SeedInfo.seed([524558B2F180613F:120594CA57A7BADF]:0)
   [junit4]    >    at org.apache.lucene.codecs.blocktree.SuffixArrayBytes.sort(SuffixArrayBytes.java:82)
   [junit4]    >    at org.apache.lucene.codecs.blocktree.SuffixArrayBytes.sort(SuffixArrayBytes.java:84)
   [junit4]    >    at org.apache.lucene.codecs.blocktree.SuffixArrayBytes.sort(SuffixArrayBytes.java:84)
   [junit4]    >    at org.apache.lucene.codecs.blocktree.SuffixArrayBytes.sort(SuffixArrayBytes.java:84)

I guess your radix sort implementation consumes one Java stack frame per character in the term. Maybe in practice this is OK too.
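One standard fix for this kind of overflow is to drive the partitioning from an explicit heap-allocated work stack, so the depth no longer depends on term length. A simplified sketch (invented code, not the patch):

```java
import java.util.ArrayDeque;

// Recursion-free 3-way radix quicksort over suffix start positions
// (hypothetical sketch): pending (lo, hi, depth) ranges go onto an explicit
// stack, so a very long term can no longer overflow the Java call stack.
public class IterativeRadixSort {
    static void sort(String text, int[] sa) {
        ArrayDeque<int[]> work = new ArrayDeque<>();
        work.push(new int[] {0, sa.length - 1, 0});
        while (!work.isEmpty()) {
            int[] range = work.pop();
            int lo = range[0], hi = range[1], depth = range[2];
            if (hi <= lo) continue;
            int lt = lo, gt = hi, i = lo + 1;
            int pivot = charAt(text, sa[lo], depth);
            while (i <= gt) {                    // 3-way partition on one character
                int c = charAt(text, sa[i], depth);
                if (c < pivot) swap(sa, lt++, i++);
                else if (c > pivot) swap(sa, i, gt--);
                else i++;
            }
            work.push(new int[] {lo, lt - 1, depth});           // smaller characters
            if (pivot >= 0) work.push(new int[] {lt, gt, depth + 1});  // equal: deeper
            work.push(new int[] {gt + 1, hi, depth});           // larger characters
        }
    }

    static int charAt(String text, int suffix, int depth) {
        int i = suffix + depth;
        return i < text.length() ? text.charAt(i) : -1;  // -1 = past end of suffix
    }

    static void swap(int[] a, int i, int j) { int t = a[i]; a[i] = a[j]; a[j] = t; }
}
```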

asfimport commented 7 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

The inverted FST should be smaller than the suffix array and intersecting both should be a viable option to get *abc* wildcard matches?

Hmm @dweiss can you explain more how the infix case (*abc*) could be handled by the forward and reversed FSTs?

asfimport commented 7 years ago

Dawid Weiss (@dweiss) (migrated from JIRA)

On vacation, so short. Now that I thought about it for longer I think my reasoning here was wrong – the reversed fst would work for prefix wildcards, but for infixes you'd still need a full suffix tree (or an automaton created for all suffixes and leading to term ID):

A suffix array is functionally equivalent to a suffix tree, which you could build and encode as an FST. Then any infix matching would be done similarly to suffix array-based lookups.

asfimport commented 7 years ago

Yakov Sirotkin (migrated from JIRA)

System properties for suffix-array-2.patch:

lucene.suffixArray.enable - true if you want to enable Suffix Array support. Default value: false.

lucene.suffixArray.initializationThreadsCount - number of threads for Suffix Array initialization; if set to 0, no additional threads are used. Default value: 5.

asfimport commented 7 years ago

Yakov Sirotkin (migrated from JIRA)

Many thanks to all for the feedback; here is the list of changes in suffix-array-2.patch:

  1. Suffix Array construction is implemented without recursion, which fixes the major bug discovered by the TestIndexWriter.testWickedLongTerm test.
  2. Sort wordIds instead of words - the words are already sorted in the index.
  3. SegmentTermsEnum is used inside ListTermsEnum.
  4. The entire Suffix Array construction is moved to a dedicated thread to avoid startup delays.
  5. Properties renamed to lucene.suffixArray.enable and lucene.suffixArray.initializationThreadsCount.
  6. If lucene.suffixArray.initializationThreadsCount is set to 0, initialization is synchronous and no additional ExecutorService is created.
  7. CompiledAutomaton is used instead of Java's Pattern.
  8. An additional flag lucene.suffixArray.optimizeForUTF with default value true was added. If it is set to false, we assume that the index can contain arbitrary bytes, not necessarily representing UTF-8 characters. In that case the code passes some additional tests, but for a real application it doubles memory consumption and reduces performance.

asfimport commented 7 years ago

Yakov Sirotkin (migrated from JIRA)

Maybe I have an explanation for why search with a leading asterisk is not easy. Let's assume you have a traditional paper address book and you are looking for someone with the compound surname Zeta-Jones. If you forget the second part, you can search by Zeta without any problems. But if you forget the first part, you need to check the whole address book looking for Jones; in fact, the index is useless in that case.