apache / lucene

Apache Lucene open-source search software
https://lucene.apache.org/
Apache License 2.0

Add TrieRangeFilter to contrib [LUCENE-1470] #2544

Closed asfimport closed 15 years ago

asfimport commented 15 years ago

Following the threads on java-dev (http://www.gossamer-threads.com/lists/lucene/java-dev/67807 and http://www.gossamer-threads.com/lists/lucene/java-dev/67839), I would like to contribute my fast numerical range query implementation to Lucene contrib-queries.

I implemented (based on RangeFilter) another approach for faster RangeQueries, based on longs stored in the index in a special format.

The idea is to store the longs at different precisions in the index and to partition the query range so that the outer boundaries are searched using terms of the highest precision, while the center of the range is searched with lower-precision terms. The implementation stores the longs in 8 different precisions (using a class called TrieUtils). It also supports doubles, using the IEEE 754 floating-point "double format" bit layout with some bit mappings to make them binary sortable. The approach is already used in rather big indexes; even on low-performance desktop computers, query times are well under 100 ms for very big ranges on indexes with 500000 docs.
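To make the two ingredients above concrete, here is a minimal Java sketch. It is not the TrieUtils code from the attached patches: the class and method names are hypothetical, the term encoding used in the index is not shown, and the bit mapping for doubles is one common way to make IEEE 754 bit patterns sortable, assumed here for illustration.

```java
final class TriePrecisionSketch {
    /** Maps a double to a long whose signed ordering matches the doubles' numeric ordering. */
    public static long doubleToSortableLong(double value) {
        long bits = Double.doubleToLongBits(value);
        // Negative doubles have the sign bit set and sort in reverse as raw bits;
        // flipping their 63 lower bits restores ascending order.
        return bits ^ ((bits >> 63) & 0x7fffffffffffffffL);
    }

    /** Returns the value at each precision level; level i has its lowest i * precisionStep bits cleared. */
    public static long[] precisionLevels(long value, int precisionStep) {
        if (precisionStep < 1 || precisionStep > 64) {
            throw new IllegalArgumentException("precisionStep must be in 1..64");
        }
        int levels = (64 + precisionStep - 1) / precisionStep;
        long[] result = new long[levels];
        for (int i = 0; i < levels; i++) {
            int shift = i * precisionStep; // always below 64
            result[i] = (value >>> shift) << shift;
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical example value; a step of 8 bits gives the 8 precisions mentioned above.
        for (long level : precisionLevels(0x1234L, 8)) {
            System.out.println(Long.toHexString(level));
        }
    }
}
```

With a precision step of 8, a 64-bit value yields 8 levels, which matches the "8 different precisions" in the description. Each level would be indexed as its own term, so a query can match a whole block of values with a single low-precision term.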

I called this RangeQuery variant and format "TrieRangeRange" query because the idea resembles the well-known trie structures (it is not identical to a real trie, but the algorithms are related).


Migrated from LUCENE-1470 by Uwe Schindler (@uschindler), resolved Feb 13 2009 Attachments: fixbuild-LUCENE-1470.patch (versions: 2), LUCENE-1470.patch (versions: 7), LUCENE-1470-apichange.patch, LUCENE-1470-readme.patch, LUCENE-1470-revamp.patch (versions: 3), trie.zip, TrieRangeFilter.java, TrieUtils.java (versions: 5) Linked issues:

asfimport commented 15 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

Hi Ning, thanks for the suggestion. I was thinking about that, too. In general, one idea would be to use 32-bit integers or floats if you do not need that much accuracy; in that case the number of terms is reduced, too. But it may be a good idea to add an option so that values are indexed with the highest possible precision and additionally with lower-precision values. The precision step could then be dynamic, for example: a) the precision step gets bigger for lower precisions; b) after a precision of XX bits, no more lower precisions are generated and queried. This could be implemented with, e.g., an array of precision step values that gives the splitting of the whole long/int into different precisions (like 2-2-2-2-8-8-8-8-8-16, so precise values use a 2-bit precision step, e.g. from shift 0 to 2, but from shift 48 to 64 a step value of 16 is used).
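As a rough illustration of the step-array idea, the following Java sketch derives the starting shift of each precision level from such an array. The class and method names are hypothetical and not part of any patch on this issue; it only shows the arithmetic behind the 2-2-2-2-8-8-8-8-8-16 example.

```java
import java.util.Arrays;

final class PrecisionStepLayoutSketch {
    /** Returns the shift at which each precision level starts; the steps must cover totalBits exactly. */
    public static int[] shiftsFor(int[] steps, int totalBits) {
        int[] shifts = new int[steps.length];
        int shift = 0;
        for (int i = 0; i < steps.length; i++) {
            if (steps[i] < 1) {
                throw new IllegalArgumentException("precision steps must be positive");
            }
            shifts[i] = shift;
            shift += steps[i];
        }
        if (shift != totalBits) {
            throw new IllegalArgumentException("steps cover " + shift + " bits, expected " + totalBits);
        }
        return shifts;
    }

    public static void main(String[] args) {
        int[] steps = {2, 2, 2, 2, 8, 8, 8, 8, 8, 16};
        System.out.println(Arrays.toString(shiftsFor(steps, 64)));
    }
}
```

For the example array the last level starts at shift 48 and spans 16 bits, which agrees with "from shift 48 to 64 a step value of 16 is used".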

Uwe

asfimport commented 15 years ago

Ning Li (migrated from JIRA)

Hi Uwe,

I had something similar in mind when I said we can "make things more flexible". Do you think it'll be too complex for users to specify? On the other hand, this is for experts so let experts have all the flexibility. :) We can open a different JIRA issue if we decide to go for it.

asfimport commented 15 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

In principle it could work the same way as the field names in the API: an array of precision step values as a parameter where now only a single precisionStep is used. The current API would remain as a shortcut that internally passes a one-item array containing the single precisionStep (if the array is shorter, the same Math.min logic applies as for the field array). The simple-user API has, like the current one, only one field name and one precision step; the full-featured API has an array of field names for each step and an array of precision step values. The problem with all this is that the API gets more and more complex, so the simple shortcuts should also be provided and should be the recommended ones.
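A small Java sketch of how the shortcut and the Math.min fallback could fit together. This is only an assumption about the shape of such an API: the names are hypothetical, and clamping the last step to the remaining bits is a choice made here so that the example always terminates with full coverage.

```java
import java.util.ArrayList;
import java.util.List;

final class PrecisionStepFallbackSketch {
    /** Expands a possibly short step array to cover totalBits, reusing the last entry via Math.min. */
    public static int[] expandSteps(int[] steps, int totalBits) {
        if (steps.length == 0) {
            throw new IllegalArgumentException("at least one precision step is required");
        }
        List<Integer> expanded = new ArrayList<>();
        int covered = 0;
        for (int level = 0; covered < totalBits; level++) {
            // If the array is shorter than the number of levels, keep using its last entry.
            int step = steps[Math.min(level, steps.length - 1)];
            if (step < 1) {
                throw new IllegalArgumentException("precision steps must be positive");
            }
            step = Math.min(step, totalBits - covered); // assumption: clamp the final step
            expanded.add(step);
            covered += step;
        }
        int[] result = new int[expanded.size()];
        for (int i = 0; i < result.length; i++) {
            result[i] = expanded.get(i);
        }
        return result;
    }

    /** Simple-user shortcut: a single precisionStep is passed on as a one-item array. */
    public static int[] expandSteps(int precisionStep, int totalBits) {
        return expandSteps(new int[] {precisionStep}, totalBits);
    }

    public static void main(String[] args) {
        System.out.println(java.util.Arrays.toString(expandSteps(8, 64)));
    }
}
```

The single-value overload keeps the simple case simple, while the array overload serves the expert case without a second code path.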

asfimport commented 15 years ago

Ning Li (migrated from JIRA)

Agree. Do you want to open a new issue? If you want, I can take a crack at it, but probably sometime next week.

asfimport commented 15 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

A new issue would be good; can you open one? The idea for the patch is almost finished, and I can attach a patch shortly. There are some minor things to solve and think about, but it's not a big thing.

asfimport commented 15 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

Removed the splitRange recursion and replaced it with a simple loop. Committed rev #745533
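For readers who want to see how a range split can be written as a loop, here is a minimal Java sketch. It is not the code committed in that revision. It is one iterative way to partition a range so that the edges are matched at the highest precision and the center at lower precision, as described at the top of this issue. The class name, the `{min, max, shift}` result format, and the restriction to non-negative bounds are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;

final class SplitRangeSketch {
    /** Splits [min, max] into sub-ranges {min, max, shift}; bounds are inclusive. */
    public static List<long[]> splitRange(long min, long max, int precisionStep) {
        if (precisionStep < 1 || precisionStep > 63 || min < 0 || max < min) {
            throw new IllegalArgumentException("invalid range or precisionStep");
        }
        List<long[]> out = new ArrayList<>();
        for (int shift = 0; ; shift += precisionStep) {
            long diff = 1L << (shift + precisionStep);          // size of one block at the next level
            long mask = ((1L << precisionStep) - 1L) << shift;  // bits handled at this level
            boolean hasLower = (min & mask) != 0L;   // min is not aligned to the next level
            boolean hasUpper = (max & mask) != mask; // max does not end a block of the next level
            long nextMin = (hasLower ? min + diff : min) & ~mask;
            long nextMax = (hasUpper ? max - diff : max) & ~mask;
            boolean wrapped = nextMin < min || nextMax > max;
            if (shift + precisionStep >= 64 || nextMin > nextMax || wrapped) {
                add(out, min, max, shift); // nothing left for coarser levels: emit the center
                break;
            }
            if (hasLower) add(out, min, min | mask, shift);  // lower edge at this precision
            if (hasUpper) add(out, max & ~mask, max, shift); // upper edge at this precision
            min = nextMin;
            max = nextMax;
        }
        return out;
    }

    private static void add(List<long[]> out, long min, long max, int shift) {
        // Terms at this level ignore the lowest `shift` bits, so the range extends over them.
        out.add(new long[] {min, max | ((1L << shift) - 1L), shift});
    }

    public static void main(String[] args) {
        for (long[] r : splitRange(0x13L, 0x5AL, 4)) {
            System.out.println(Long.toHexString(r[0]) + ".." + Long.toHexString(r[1]) + " @shift " + r[2]);
        }
    }
}
```

Each pass handles one precision level and then narrows the bounds to the next coarser level, so the loop needs no call stack and terminates after at most 64 / precisionStep (rounded up) iterations.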

asfimport commented 15 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

A change for a more universal RangeBuilder API (as preparation for LUCENE-1541). This patch also includes the last commit that removes the recursion (for completeness).

asfimport commented 15 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

Committed revision 746790.