Add TrieRangeFilter to contrib [LUCENE-1470]

asfimport commented 15 years ago

According to the thread in java-dev (http://www.gossamer-threads.com/lists/lucene/java-dev/67807 and http://www.gossamer-threads.com/lists/lucene/java-dev/67839), I want to include my fast numerical range query implementation into lucene contrib-queries.

I implemented (based on RangeFilter) another approach for faster RangeQueries, based on longs stored in index in a special format.

The idea behind this is to store the longs at different precisions in the index and to partition the query range in such a way that the outer boundaries are searched using terms from the highest precision, while the center of the search range uses lower precision. The implementation stores the longs in 8 different precisions (using a class called TrieUtils). It also has support for Doubles, using the IEEE 754 floating-point "double format" bit layout with some bit mappings to make them binary sortable. The approach is used in rather big indexes; even on low-performance desktop computers, query times are well below 100 ms for very big ranges on indexes with 500000 docs.
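The "bit mappings to make them binary sortable" can be illustrated with a small sketch. This is not the code from the patch: the class name is hypothetical, and the mapping shown (flip the lower 63 bits of negative values) is one common way to turn the IEEE 754 bit layout into a long whose signed order matches the numeric order of the doubles.

```java
/** Illustrative sketch only; the class name is hypothetical and not part of the patch. */
final class SortableDoubleSketch {

    /** Maps a double to a long whose signed order matches the numeric order of the doubles. */
    static long doubleToSortableLong(double value) {
        long bits = Double.doubleToLongBits(value);
        // Negative doubles have the sign bit set and their raw bits grow with magnitude,
        // so they sort in reverse. Flipping the lower 63 bits restores ascending order.
        if (bits < 0) {
            bits ^= 0x7fffffffffffffffL;
        }
        return bits;
    }

    /** Inverse mapping: applying the same XOR again restores the original bits. */
    static double sortableLongToDouble(long sortable) {
        if (sortable < 0) {
            sortable ^= 0x7fffffffffffffffL;
        }
        return Double.longBitsToDouble(sortable);
    }
}
```

With such a mapping, a double field can be indexed and range-searched with exactly the same machinery as a long field.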

I called this RangeQuery variant and format "TrieRangeRange" query because the idea looks like the well-known Trie structures (but it is not identical to real tries, but algorithms are related to it).

Migrated from LUCENE-1470 by Uwe Schindler (@uschindler), resolved Feb 13 2009

Attachments: fixbuild-LUCENE-1470.patch (versions: 2), LUCENE-1470.patch (versions: 7), LUCENE-1470-apichange.patch, LUCENE-1470-readme.patch, LUCENE-1470-revamp.patch (versions: 3), trie.zip, TrieRangeFilter.java, TrieUtils.java (versions: 5)

Linked issues:

- SOLR-940
- 2446
- 2535

asfimport commented 15 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

Sorry for yet another patch: when looking into the test again, I noticed that I had missed a test for the automatic encoding detection by string length (TrieUtils.trieCodedToXxxAuto()). The attached patch fixes the Hudson build and adds this test.

asfimport commented 15 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Hmm – I would prefer that contrib tests subclass LiaTestCase. We must be missing a dependency in the ant build files.

OK this seems to fix it:

Index: contrib/contrib-build.xml
===================================================================
--- contrib/contrib-build.xml   (revision 723145)
+++ contrib/contrib-build.xml   (working copy)
@@ -61,7 +61,7 @@
   </target>

-  <target name="init" depends="common.init,build-lucene"/>
+  <target name="init" depends="common.init,build-lucene,build-lucene-tests"/>
   <target name="compile-test" depends="init" if="contrib.has.tests">
     <antcall target="common.compile-test" inheritRefs="true" />
   </target>

I'll commit that, and the fix to the test case. Thanks Uwe!

asfimport commented 15 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

> Hmm - I would prefer that contrib tests subclass LiaTestCase

Woops, I meant LuceneTestCase ;) Time sharing not working very well in my brain this morning...

asfimport commented 15 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Committed revision 723287.

asfimport commented 15 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

> I think, this cannot work. The Cache is keyed by FieldCacheImpl.Entry containing the parser to use.

Sigh, you are correct. How would you fix FieldCache?

I guess the workaround is to also index the original value (unencoded by TrieUtils) as an additional field, for sorting.

asfimport commented 15 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

Thanks, then I would also change TestTrieRangeQuery to also use LuceneTestCase, just for completeness.

> Sigh, you are correct. How would you fix FieldCache?

I would fix FieldCache by giving SortField the possibility to supply a parser instance. So you create a SortField using a new constructor SortField(String field, int type, Object parser, boolean reverse). The parser is "Object" because the parsers have no common super-interface. The ideal solution would be to have:

SortField(String field, int type, FieldCache.Parser parser, boolean reverse)

and FieldCache.Parser is a super-interface (just empty, more like a marker-interface) of all other parsers (like LongParser...)

> I guess the workaround is to also index the original value (unencoded by TrieUtils) as an additional field, for sorting.

The problem with the extra field would be that it works well for longs or doubles (with some extra work), but Dates still keep as String, or you use Date.getTime() as long. This is not very elegant and needs more fields and terms. I prefer a clean solution.

asfimport commented 15 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

> Thanks, then I would also change TestTrieRangeQuery to also use LuceneTestCase, just for completeness.

OK done.

> I would fix FieldCache by giving SortField the possibility to supply a parser instance. So you create a SortField using a new constructor SortField(String field, int type, Object parser, boolean reverse). The parser is "Object" because the parsers have no common super-interface.

This seems OK for now? Can you open an issue? Retro-fitting a super-interface would break back-compat for (admittedly very advanced) existing Parser instances external to Lucene, right?

> but Dates still keep as String, or you use Date.getTime() as long

Yeah. But if we open the new issue (to allow external FieldCache parsers to be used when sorting) then one could parse to long directly from a TrieUtil encoded Date field, right?

asfimport commented 15 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

Yes, I will open an issue! Maybe I will create a first patch after looking into the problem.

> This seems OK for now? Can you open an issue? Retro-fitting a super-interface would break back-compat for (admittedly very advanced) existing Parser instances external to Lucene, right?

I am not sure, but I think it's better to leave it as it is now. On the other hand, if we just have a "marker" super-interface, it should be backwards compatible, because the super-interface is new and existing code would only use the existing interfaces. The super-interface adds no new methods, so code would stay source and binary compatible (as it only references the existing interfaces). I think we had this discussion some time in the past in another issue (Fieldable???), but this was another problem.
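The marker-interface argument can be sketched in a few lines. All types below are stand-ins written for this illustration (they are not the Lucene classes): an empty `Parser` super-interface is retrofitted onto an existing parser interface, and an "external" implementation written against the old interface keeps working unchanged.

```java
/** Illustrative sketch only; all nested types are hypothetical stand-ins, not Lucene classes. */
final class ParserMarkerSketch {

    /** Proposed empty marker super-interface: it declares no methods. */
    interface Parser {
    }

    /** Stand-in for an existing parser interface; only the "extends Parser" part is new. */
    interface LongParser extends Parser {
        long parseLong(String value);
    }

    /** Stand-in for a SortField with the proposed typed constructor. */
    static final class SortFieldSketch {
        final String field;
        final Parser parser;
        final boolean reverse;

        SortFieldSketch(String field, Parser parser, boolean reverse) {
            this.field = field;
            this.parser = parser;
            this.reverse = reverse;
        }
    }

    /** "External" code written against LongParser only; it never mentions Parser. */
    static final class DecimalLongParser implements LongParser {
        @Override
        public long parseLong(String value) {
            return Long.parseLong(value);
        }
    }
}
```

Because `DecimalLongParser` only references `LongParser`, it needs no change, yet it can now be passed wherever a `Parser` is expected.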

> Yeah. But if we open the new issue (to allow external FieldCache parsers to be used when sorting) then one could parse to long directly from a TrieUtil encoded Date field, right?

Correct. As soon as this works, I would simply add as "extra bonus" o.a.l.search.trie.TrieSortField, which automatically supplies a correct parser for easy usage. Date, Double and Long trie fields can always be sorted as longs without knowing the correct meaning (because the trie format was designed that way).

Currently my code would just sort the trie encoded fields using SortField.STRING, but this is resource-expensive (but I have no example currently running, as it was not needed for panFMP/PANGAEA and other projects).

asfimport commented 15 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

Hi Mike,

I opened issue #2552 and attached a first patch.

About the current issue: I have seen that TrieRangeQuery is missing in /lucene/java/trunk/contrib/queries/README.txt. Can you add it there or should I write a small patch? I think it should at least be mentioned there for what it is for, but the JavaDocs are much more informative and the corresponding paper / code credits are cited there.

Thank you very much for helping to get this into Lucene!

asfimport commented 15 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

> Thank you very much for helping to get this into Lucene!

You're welcome! But, that was the easy part ;) Thank you for creating it & getting it into Lucene!

> About the current issue: I have seen that TrieRangeQuery is missing in /lucene/java/trunk/contrib/queries/README.txt.

I agree – can you create a patch? Thanks.

asfimport commented 15 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

Here are the README changes.

asfimport commented 15 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Committed revision 723701.

Thanks Uwe!

asfimport commented 15 years ago

Yonik Seeley (@yonik) (migrated from JIRA)

Attaching completely untested prototype TrieUtils.java

some discussion: http://www.lucidimagination.com/search/document/d62c0fd21d88f880

Features:

- same encode/decode code works for any variant... no 2,4,8 bit specific instances
- decouples "slicing" of the value into different precisions and encoding of the slice to a String, allowing for the most efficient String encoding to be used for every precision variant.
- 7 bit char encoding to optimize for UTF8 index storage
- right justified to allow lucene to prefix compress efficiently
- separates creation of sortableBits from trie encoding of those bits to avoid so many methods
- allows indexing into multiple fields, or all in the same field
- much smaller code should be much easier to understand
- left out "Date" support - the average Java developer understands how to go from a Date to a long (unlike double, etc).
- relatively trivial to add 32 bit (int/float) support and reuse code like addIndexedFields (which is just an agnostic helper method).
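To make the "slicing" and 7 bit encoding more concrete, here is a minimal sketch of how a long could be turned into a prefix-coded term for a given shift. It is not the attached TrieUtils.java: the class name, the method name and the `SHIFT_START_LONG` value are assumptions made for this example. The first char records the shift (the precision), the remaining chars hold the sortable bits right-justified in 7 bit groups.

```java
/** Illustrative sketch only; names and the SHIFT_START_LONG value are assumptions. */
final class PrefixCodeSketch {

    /** Assumed offset for the first char; any offset that keeps all chars below 0x80 would do. */
    static final char SHIFT_START_LONG = 0x20;

    /** Encodes the bits of value above the given shift as a String of 7 bit chars. */
    static String longToPrefixCoded(long value, int shift) {
        if (shift < 0 || shift > 63) {
            throw new IllegalArgumentException("shift must be in 0..63");
        }
        int nChars = (63 - shift) / 7 + 1;              // 7 bit chars needed for the remaining bits
        char[] buffer = new char[nChars + 1];
        buffer[0] = (char) (SHIFT_START_LONG + shift);  // first char identifies the precision
        // Flip the sign bit so that unsigned bit order equals signed numeric order,
        // then drop the low bits that this precision ignores.
        long sortableBits = (value ^ 0x8000000000000000L) >>> shift;
        for (int i = nChars; i >= 1; i--) {             // fill right to left, 7 bits per char
            buffer[i] = (char) (sortableBits & 0x7f);
            sortableBits >>>= 7;
        }
        return new String(buffer);
    }
}
```

Terms of the same shift all have the same length, so plain String comparison orders them like the underlying longs, and values that differ only in the dropped low bits collapse into the same term.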

asfimport commented 15 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

Thanks for the code fragments, I can take this as basis and will implement the TrieRangeFilter counterpart. I have some more ideas (e.g. for the beginners API there is missing the possibility to store the full-precision value). And norms should be disabled per default (in beginners API). 32bit support is also simple, as you noted.

The problem is now how to make TrieRangeFilter generic enough to support all possible encodings and field names (with your code, it would also be possible to put each precision in a separate field). I am working on that; I want to have a clean API on the TrieRangeFilter!

I would name both addIndexedFields() and addField with the same name (just overload), but different options. I prepare that!

I agree, date support is not needed. And if I would again add Date support, Calendars should also be possible etc. No need for that - Date should be deprecated by Sun!

One question: The encoding is now different from NumberUtils again - NumberUtils tries to get the most out of each char vs. this tries to not affect UTF-8 encoding and use ASCII only? Would Solr use this encoding in future (7bit chars) for numeric values or should be both separate? These TrieUtils now also make it possible to not use trie coding, but encode doubles/longs for other use (and use the currently missing LongParser/SortField generators).

asfimport commented 15 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

Adding that as comment:

> On Sat, Feb 7, 2009 at 12:29 PM, Uwe Schindler <uwe@thetaphi.de> wrote:
> > This is only a minimal optimization, suitable for very large indexes. The problem is: if you have many terms in highest precision (a lot of different double values), seeking is more costly if you jump from higher to lower precisions.
>
> That's my point... in very large indexes this should not result in any difference at all on average because the terms would be nowhere near each other.

OK.

I prepare a new TrieRangeFilter implementation, just taking the String[] fieldnames and the sortableLong and the precisionStep.

And I think you are right. We could completely remove the "storing" API. If one wants to add stored fields, he could use NumberUtils and do this separately (add stored, not indexed fields). For TrieRangeFilter it is not needed.

> As an example: in a very big index, one wants to independently collect all documents that match "apple" and all documents that match "zebra", which term you seek to first should not matter.

OK, I agree :)

asfimport commented 15 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

Reopening this issue. I will prepare a patch later (I have no time now, it's Saturday evening here in Germany), completely changing the current API and encoding, thanks Yonik! I think the TrieRangeFilter will get smaller, too (ok, the c'tors may stay for beginners). The Javadocs need to be updated too.

I will also add missing methods for LongParser and SortField, as the encoded fields can be stored in ExtendedFieldCache (but there as real Longs, not sortableLongs) - as before.

I will also add 32 bit API.

Thanks, Yonik!

asfimport commented 15 years ago

Yonik Seeley (@yonik) (migrated from JIRA)

> for the beginners API there is missing the possibility to store the full-precision value

They could simply store it in a different field, in whatever format they desire, right? It seems like TrieRange should be about range matching, not the format of stored fields.

> NumberUtils tries to get the most out of each char vs. this tries to not affect UTF-8 encoding and use ASCII only?

NumberUtils in Solr was developed a long time ago, before Parser support in the FieldCache, etc (Lucene 1.4). I chose 14 bit numbers to minimize size in FieldCache using a StringIndex, and because I didn't understand Lucene prefix compression at the time :-)

If there are to be many in-memory representations, then using 14 bit chars might be better. Otherwise it seems like 7 bit might be preferable (better prefix compression, more predictable branches in the UTF8 encoder/decoder). Of course it's a trivial switch, so perhaps we should just try and benchmark it when everything else is done.

As for TrieRangeFilter, I guess the most generic constructor would look like:

TrieRangeFilter(int precisionStep, String[] fields, long lowerSortableBits, long upperSortableBits, boolean includeLower, boolean includeUpper)

asfimport commented 15 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

> NumberUtils in Solr was developed a long time ago, before Parser support in the FieldCache, etc (Lucene 1.4). I chose 14 bit numbers to minimize size in FieldCache using a StringIndex, and because I didn't understand Lucene prefix compression at the time

The same here :). By the way, the new SortField constructors taking a LongParser (#2552) as parameter make this very simple. You could so sort by the trie encoded field (and do not need to index them separately). Just use the LongParser supplied by TrieUtils to sort. No String sorting needed. I only wanted to supply static parsers based on TrieUtils for that.

asfimport commented 15 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

For TrieRangeFilter I now also need something similar to the current increment/decrementTrieCoded that adds 1 to a prefix coded value (in principle, just add 1<<shift to the long and use it, something like that). This is needed for the matching of the parts of the range. More later.

asfimport commented 15 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

I modified the proposed TrieUtils a little bit. Maybe this class could become the new NumberUtils, with the extra option to trie encode the values.

- Added 32bit support
- Merged the unsigned int/long handling into prefixCode methods. For TrieRangeFilter the raw bits will not be needed (I need comparable signed ints/longs).
- The conversion from doubles and floats was renamed and returns the standard signed long/int: doubleToSortableLong and floatToSortableInt. Date is removed (as just Date.getTime() can be used, as everybody knows).
- Still missing are Long/IntParser for FieldCache and a SortField factory.

It's still untested!

Tomorrow (now it's time to go to bed) I will implement the TrieRangeFilter in two variants (one for 32 bit ints, another for 64 bit longs). The min/max values are the ints/longs. The trie coding is also done using the shift value and bit magic. The results of the range split are then encoded using TrieUtils.xxxToPrefixCode().

asfimport commented 15 years ago

Yonik Seeley (@yonik) (migrated from JIRA)

> Merged the unsigned int/long handling into prefixCode methods. For TrieRangeFilter the raw bits will not be needed (I need comparable signed ints/longs).

Nice Uwe... we're really on the same wavelength now - I came to the exact same conclusion and made the exact same changes! It's much nicer having ints and longs as the exposed "interface" so if a<b in java then a will come before b in the term index.

asfimport commented 15 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

A fixed TrieUtils (a small bug in the encoding routine caused endless loop), the increment in array was missing :-)

asfimport commented 15 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

A first test version of TrieRangeFilter. Much of the functionality is missing (no hashCode, some dead parts, ...). But: it works. I modified the tests and got it running. Both variants (old and new) visit the same number of terms and get exact results.

To do further testing it is now important, to generate Ranges with negative long. It seems to work, but I need more tests. It was a hell to get it running, I hate Sun for not having unsigned ints/longs in Java :-(

TrieRangeQuery is obsolete now. You can wrap the Filter using a ConstantScore query or use TrieRangeFilter.asQuery() (which will return a nicer toString() output).

I will now clean up everything and create a patch!

asfimport commented 15 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

I forgot to mention: I modified TrieUtils to have a different SHIFT_START for ints and longs. This way a NumberFormatException can be thrown early if you try to decode a long using the int method or vice versa. I added these checks to fail early when old indexes using the different encoding are used, or when sorting uses the wrong encoding. TrieUtils still needs the FieldCache parser, but this is a trivial addition.
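A rough sketch of such a fail-early check, assuming the first char of a prefix-coded term carries the shift plus a type-specific offset. The constant values and method names below are assumptions for illustration; the only constraint taken from this thread is that all chars should stay below 0x80.

```java
/** Illustrative sketch only; constants and method names are assumptions. */
final class ShiftStartSketch {

    static final char SHIFT_START_LONG = 0x20; // assumed: long terms start with 0x20..0x5f
    static final char SHIFT_START_INT = 0x60;  // assumed: int terms start with 0x60..0x7f

    /** Returns the shift of a long term, or fails if the term was not encoded as a long. */
    static int shiftOfLongTerm(String prefixCoded) {
        int shift = prefixCoded.charAt(0) - SHIFT_START_LONG;
        if (shift < 0 || shift > 63) {
            throw new NumberFormatException("Invalid shift in prefix-coded term (is it really a long?)");
        }
        return shift;
    }

    /** Returns the shift of an int term, or fails if the term was not encoded as an int. */
    static int shiftOfIntTerm(String prefixCoded) {
        int shift = prefixCoded.charAt(0) - SHIFT_START_INT;
        if (shift < 0 || shift > 31) {
            throw new NumberFormatException("Invalid shift in prefix-coded term (is it really an int?)");
        }
        return shift;
    }
}
```

Because the two offset ranges do not overlap, decoding with the wrong method is detected from the first char alone.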

asfimport commented 15 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

Yonik: How would you handle a TrieRangeFilter for ints? I could create a separate class that looks identical (but uses the other TrieUtils methods and ints instead of longs), or combine both in one class (which is hard). The problem is, you cannot have the same codebase, because masks, shifts, 31 vs. 63 are everywhere. And the other datatype. If I do it, I will have two classes: TrieRangeFilter32, TrieRangeFilter64, alternatively TrieRangeFilterInt, TrieRangeFilterLong. What's your opinion?

asfimport commented 15 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

Now the next step: A corrected TrieUtils again.

- contains the FieldCache parsers for easy usage and sort field factories.
- changes the SHIFT for ints (0x80 was too much, UTF-8 optimization needs chars <0x80)

Sorting tests now also work.

asfimport commented 15 years ago

Yonik Seeley (@yonik) (migrated from JIRA)

Uploading a new version of TrieUtils.

I've implemented my own range splitting logic (it will be interesting to see how it compares to yours, and if mine actually works! so many edge cases...)

It separates the range splitting logic from the execution of anything, so we can use the same logic to generate queries that match the same range (no building an OpenBitSet first), or other things like printable queries for debugging.

After reflection, the RangeBuilder interface could be simplified to just have a single addRange() method... with the implicit assumption that all ranges are ORd together.

asfimport commented 15 years ago

Yonik Seeley (@yonik) (migrated from JIRA)

Since filters are a symbolic representation (like Query), it makes sense to have separate TrieRangeFilter32 and TrieRangeFilter64 classes. Hopefully most of the logic that is in common can be shared somehow.

asfimport commented 15 years ago

Yonik Seeley (@yonik) (migrated from JIRA)

I think the range splitting code I wrote is generic enough to also automatically handle 32 bits too. I think it's just a matter of starting with a different shift?

So for longs we have:

  public static Object buildRange(RangeBuilder builder, long ll, long uu, int precisionStep) {
    // figure out where to start shift at
    int shift = ((64-1)/precisionStep) * precisionStep;
    return buildRange(builder, ll, uu, shift, precisionStep);
  }

And for ints, it would simply (hopefully) be:

  public static Object buildRange(RangeBuilder builder, int ll, int uu, int precisionStep) {
    // figure out where to start shift at
    int shift = ((32-1)/precisionStep) * precisionStep;
    return buildRange(builder, (long)ll, (long)uu, shift, precisionStep);
  }

Does that sound right?
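The starting shift in both snippets is simply the largest multiple of precisionStep that is still below the value size. A tiny sketch (the class and method names are made up for this example) makes the arithmetic easy to check for a few step sizes:

```java
/** Illustrative sketch only; class and method names are hypothetical. */
final class StartShiftSketch {

    /** Largest multiple of precisionStep that is smaller than valSize (64 for longs, 32 for ints). */
    static int startShift(int valSize, int precisionStep) {
        if (precisionStep < 1) {
            throw new IllegalArgumentException("precisionStep must be >= 1");
        }
        // Integer division truncates, so this rounds (valSize - 1) down to a multiple of precisionStep.
        return ((valSize - 1) / precisionStep) * precisionStep;
    }
}
```

For longs this gives 56 for a step of 8, 60 for a step of 4, 62 for a step of 2 and 63 for a step of 1; for ints with a step of 8 it gives 24.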

asfimport commented 15 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

I am still "reading" your code, but the main parts look identical (also the break condition, if no inner range is available -> this condition is very important and must be min>=max). The only differences between the two are in the generation of the inner range bounds. I just rewrote my old code, did you do it completely from scratch?

I would change RangeBuilder to be only an interface/abstract class with no Object return code that has an addRange method. The ranges will always be or'ed (anding makes no sense). The same idea came to me when I tried to unify the 32 bit and 64 bit variants in my TrieRangeFilter code.

For performance reasons, it is better to use the same TermEnum and TermDocs and only one OpenBitSet when executing the range. But this can be easily handled using the interface (RangeBuilder initializes the OpenBitSet, TermDocs and TermEnum, like in my code. When the range has been built, the OpenBitSet is retrieved).

asfimport commented 15 years ago

Yonik Seeley (@yonik) (migrated from JIRA)

> did you do it completely from scratch?

Yep, took me a while... An interesting exercise and good news if it's so similar to yours. Hard code like this is actually fun too :-)

> I would change RangeBuilder to be only an interface/abstract class with no Object return code that has an addRange method.

Sounds fine - that means that RangeBuilder instances will always need to be stateful, but that seems flexible enough and simpler to understand than passing around Objects and casting.

asfimport commented 15 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

I like your idea, I will also use a generic Range splitter (as described before), but use my algorithm for it. But I am really interested how both compare. You can test this, if you use mine and enable the System.out in setBits (which is identical to your addRange()), that returns hexadecimals for each range.

For nicer code, I would supply two interfaces to build ranges (int, long), because TrieUtils consistently uses the correct int/long type and the user may be confused if he gets longs (even if the API is very special, for experts). But it can e.g. be used to create a BooleanQuery using classic RangeQueries or'ed.

asfimport commented 15 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

There may be one issue with signedness in your code and mine for the special case of a precisionStep of 1. In this case the outermost/innermost calculation of the range bounds can fail, but I am not sure. I wanted to write a test for it.

I am currently testing only 8, 4, 2 - as the results for that must be identical to my old code (I also printed a lot of System.outs in my old code to generate an exact match of everything). There are some special cases in range splitting that are very sensitive. It also took me a very long time to reimplement it. But the usage of longs/ints is really nicer and cleaner than incrementing/decrementing string values, as I did before.

It is now also possible to use a completely other encoding, without changing the range splitting (e.g. to compare 7bit chars to 13/14 bit chars from NumberUtils).

asfimport commented 15 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

Hi Yonik,

I implemented the generic interface. I changed it a little bit (it throws now Exception, because without that the building of a range then needs wrapping the IOExceptions of IndexReader with a RuntimeException, not so nice). There are now 2 times the same implementation for 32 and 64 bit and two interfaces (long, int).

The TrieRangeFilter is now an implementation using the LongRangeBuilder. The test I used is also there (just replace the files in contrib). The TrieUtilsTest is currently not working, TrieRangeQuery is obsolete.

I think I will create a helper, that does the TermEnum/TermDocs iteration and reuse it in both LongTrieRangeFilter and IntTrieRangeFilter (maybe they get both the same superclass containing important methods useable for both).

asfimport commented 15 years ago

Yonik Seeley (@yonik) (migrated from JIRA)

Oh I see.... one difference between our code is that you start at the lowest shift and I start at the highest. Counting up has a nice effect of getting rid of the calculation of what shift to start at... I just had a harder time thinking about the recursion in that direction.

Anyway, it's all looking great!

Do we have test code that tests that the most efficient precision is used (as opposed to just the right bits matching)? I.e., for a precisionStep of 4, 0x300-0x4ff could be matched with 3-4 with a shift of 8, or 30-4f with a shift of 4, or 300-4ff with a shift of 0.
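The cost difference between the three alternatives in this example is easy to quantify: the number of distinct terms that must be visited at a given shift is the size of the shifted range. A small sketch (hypothetical helper, valid for non-negative bounds that are aligned to the shift, as in the example):

```java
/** Illustrative sketch only; class and method names are hypothetical. */
final class TermCountSketch {

    /** Number of distinct terms at the given shift needed to cover [lower, upper]. */
    static long termsAtShift(long lower, long upper, int shift) {
        // Dropping the low 'shift' bits maps every value to its term at that precision.
        return (upper >>> shift) - (lower >>> shift) + 1;
    }
}
```

For 0x300-0x4ff this yields 2 terms at shift 8, 32 terms at shift 4 and 512 terms at shift 0, which is why picking the coarsest exact precision matters.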

asfimport commented 15 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

I spent the whole day preparing the final version, including all javadoc, but then I overwrote the wrong file and the whole work of today (regarding trie) is gone. Give me one more day, and I will redo everything. I changed a lot in the trie code since yesterday. Your code was a little bit better when the range bound of one precision was exactly on the range's start or end (in this case the precision could be left out; in your code these are the booleans needUpper and needLower). I implemented this similarly.

I also extended the interface a little bit, but this is work I have to redo. So it takes longer now. Most work is writing documentation and javadocs. If everything had worked ok (and I had not overwritten/updated svn in the wrong way), I would be finished now :(

> Do we have test code that tests that the most efficient precision is used (as opposed to just the right bits matching)? I.e., for a precisionStep of 4, 0x300-0x4ff could be matched with 3-4 with a shift of 8, or 30-4f with a shift of 4, or 300-4ff with a shift of 0.

Finding the most efficient precision is sometimes hard, but the optimization above with needUpper/needLower is really good sometimes (depends on the range). I will think about it.

Uwe

asfimport commented 15 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

> Do we have test code that tests that the most efficient precision is used (as opposed to just the right bits matching)? I.e., for a precisionStep of 4, 0x300-0x4ff could be matched with 3-4 with a shift of 8, or 30-4f with a shift of 4, or 300-4ff with a shift of 0.

I misunderstood your note. My last optimizations (now they are again restored in my svn) exactly handle this. There are two cases:

- if the left range of a precision has ((min & mask) == 0L) [starts exactly at the beginning of the next less precise range], I leave out the left range for the current precision and directly use the next lower precision
- if the right range of a precision has ((max & mask) == mask) [ends exactly at the end of the next less precise range], I do the same for the right one.
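The two cases above can be sketched as a complete range splitter. This is a reconstruction for illustration, not the code from the patch: class, interface and method names are hypothetical, the loop counts the shift upwards from 0, and the bounds are signed longs. A sub-range is reported as (min, max, shift), meaning that all terms at that shift between the two bounds match.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative sketch only; all names are hypothetical, not the API of the patch. */
final class SplitRangeSketch {

    /** Receives each sub-range; all sub-ranges are implicitly OR'ed together. */
    interface RangeBuilder {
        void addRange(long min, long max, int shift);
    }

    /** Splits the inclusive range [min, max], counting the shift upwards from 0. */
    static void splitRange(RangeBuilder builder, int valSize, int precisionStep, long min, long max) {
        if (precisionStep < 1 || precisionStep > valSize) {
            throw new IllegalArgumentException("precisionStep must be in 1..valSize");
        }
        if (min > max) {
            return; // empty range, nothing to match
        }
        for (int shift = 0; ; shift += precisionStep) {
            // One unit of the next coarser precision. If shift + precisionStep reaches 64 the
            // value is meaningless, but then the first stop condition below ends the loop.
            long diff = 1L << (shift + precisionStep);
            long mask = ((1L << precisionStep) - 1L) << shift; // bits of the current precision
            boolean hasLower = (min & mask) != 0L;   // false: min starts exactly on a coarser block
            boolean hasUpper = (max & mask) != mask; // false: max ends exactly on a coarser block
            long nextMin = (hasLower ? min + diff : min) & ~mask;
            long nextMax = (hasUpper ? max - diff : max) & ~mask;
            boolean wrapped = nextMin < min || nextMax > max; // overflow near the value limits
            if (shift + precisionStep >= valSize || nextMin > nextMax || wrapped) {
                emit(builder, min, max, shift); // no inner range left: finish at this precision
                return;
            }
            if (hasLower) {
                emit(builder, min, min | mask, shift);
            }
            if (hasUpper) {
                emit(builder, max & ~mask, max, shift);
            }
            min = nextMin;
            max = nextMax;
        }
    }

    /** Fills the low bits of max so the reported bound is the real inclusive upper value. */
    private static void emit(RangeBuilder builder, long min, long max, int shift) {
        builder.addRange(min, max | ((1L << shift) - 1L), shift);
    }

    /** Convenience for tests: collects the sub-ranges as {min, max, shift} triples. */
    static List<long[]> split(int valSize, int precisionStep, long min, long max) {
        List<long[]> ranges = new ArrayList<>();
        splitRange((lo, hi, shift) -> ranges.add(new long[] {lo, hi, shift}), valSize, precisionStep, min, max);
        return ranges;
    }
}
```

With a precisionStep of 4, the range 0x300-0x4ff from the earlier comment collapses into a single sub-range at shift 8, because both bounds sit exactly on block boundaries at shifts 0 and 4, so those precisions are skipped.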

My new code is now cleaner and easier to understand (there were some other unneeded extra shifts/ands in it). I also merged the 64 bit and 32 bit range splitting and wrap the RangeBuilder classes accordingly (now abstract with two different range collecting possibilities).

The old code was not aware of this, sometimes leading to left/right precisions that are not needed. I will add test code in TestTrieUtils that tests the range split (without an index). The problem here is how to test this effectively. I could generate some examples and test for the resulting range bounds using a custom XxxRangeBuilder in the test that collects the ranges into a List and compares this list. Do you have another possibility to test this without a lot of manually checked example ranges?

I have now restored all my changes (and did an extra backup). I will now write again the new Javadocs of TrieUtils and package.html. After that I will post a patch with the final API. The extra and new tests will be added later. First I want to fixate and document the API.

asfimport commented 15 years ago

Yonik Seeley (@yonik) (migrated from JIRA)

> I could generate some examples and test for the resulting range bounds using a custom XxxRangeBuilder in the test that collects the ranges into a List and compares this list.

+1

I think that, in conjunction with some random testing should be sufficient.

asfimport commented 15 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

One of the trie filter tests also checks that the range is completely matched (when using an index with incrementing values). An additional test in TestTrieUtils can test for not having overlapping ranges. This can easily be tested using an OpenBitSet in which a RangeBuilder sets the bits. If it hits a bit two times, the test fails. I wrote this down here so as not to forget it later, when writing this test.
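The overlap check described here can be sketched without an index. The example below uses java.util.BitSet as a stand-in for OpenBitSet and takes the sub-ranges as plain {min, max} pairs; the class and method names are hypothetical, and the approach is only meant for small test ranges.

```java
import java.util.BitSet;

/** Illustrative sketch only; uses java.util.BitSet as a stand-in for OpenBitSet. */
final class RangeCoverageSketch {

    /** Returns true if the inclusive sub-ranges cover every value in [lower, upper] exactly once. */
    static boolean coversExactlyOnce(long[][] ranges, int lower, int upper) {
        BitSet seen = new BitSet(upper - lower + 1);
        for (long[] range : ranges) {
            for (long v = range[0]; v <= range[1]; v++) {
                if (v < lower || v > upper) {
                    return false; // sub-range leaks outside the requested range
                }
                int index = (int) (v - lower);
                if (seen.get(index)) {
                    return false; // bit hit twice: two sub-ranges overlap
                }
                seen.set(index);
            }
        }
        return seen.cardinality() == upper - lower + 1; // no gaps either
    }
}
```

A test would feed it the sub-ranges collected by a RangeBuilder and assert that the result is true.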

asfimport commented 15 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

This is the first patch for the revamp of TrieRange. It contains the complete API, a complete API documentation and some tests.

Still missing is:

- TestIntTrieRangeFilter
- TestTrieUtils (the old one is still there but commented out)
- Tests as described in previous comments (in TestTrieUtils): No range overlap, most optimized rangeSplit

Yonik: Can you look over it and say, if this is, what you would like to have for full flexibility and Solr (which needs this full flexibility)?

All others: Is something missing in the API (like shortcuts), any comments?

If everything is OK, I will commit the patch after adding the important tests.

asfimport commented 15 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

Typos and wrong English in the docs can be fixed after the commit; merging of additional patches is simpler without all those small changes. The important things are API and functionality.

Uwe

asfimport commented 15 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

This is (hopefully) my last patch. It contains all tests and the final API (with some modifications).

The split range test is a bit ugly, but it just tests whether the algorithm works exactly like it should (it currently accepts no other order of addRange() calls when splitting the range) - and it tests only one "reference" range.

Please give some comments, Yonik!

asfimport commented 15 years ago

Yonik Seeley (@yonik) (migrated from JIRA)

Looking forward to the latest incarnation Uwe, but I'm traveling through the rest of the week... I'll definitely check it out at some point, but I liked the previous ones so you should go ahead and commit if you feel it's ready.

asfimport commented 15 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

Thanks for the answer,

My time is limited at the moment, too. I will commit Friday or the weekend!

Uwe

asfimport commented 15 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

New patch with:

- some optimizations in the splitRange (now it works even better) and more understandable again
- split range now also works correctly for precisionStep=1
- new tests for splitRange (special cases are also tested, e.g. Long.MIN_VALUE..Long.MAX_VALUE can easily be done with only one range on the lowest precision)

I will commit soon.

asfimport commented 15 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

Committed revision #744207

asfimport commented 15 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

I forgot: Thanks Yonik for the good ideas and discussions about API and help with coding this new trie implementation!

asfimport commented 15 years ago

Yonik Seeley (@yonik) (migrated from JIRA)

OK, I got a chance to further check things out and do some manual testing to ensure that the most efficient forms are always used. Everything looks good!

asfimport commented 15 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

Cool. So you like the new code? Are you happy with the API and how it works, is it good to use in Solr? If yes, it would be great and I am happy, that it is of use :-)

There seem to be no problems, I converted my own code to use the new TrieRange API and I like it. No problems. 8bit precisionStep works good for me, my index with 13 numeric trie fields and 600,000 docs works fine, no performance differences, queries run amazingly fast. Index size is almost identical.

I hope I will get some synthetic performance numbers in the next days. Do you have some code for the performance contrib to check performance? (I am not so familiar with the performance code, I will check it out.)

Uwe

asfimport commented 15 years ago

Ning Li (migrated from JIRA)

Good stuff!

Is it worth also having an option to specify the number of precisions used to index a value?

With a large precision step (say 8), a value is indexed in fewer terms (8) but the number of terms for a range can be large. With a small precision step (say 2), the number of terms for a range is smaller but a value is indexed in more terms (32). With precision step 2 and number of precisions set to 24, the number of terms for a range is still quite small but a value is indexed in 24 terms instead of 32. For applications that usually query small ranges, the number of precisions can be further reduced.
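The term counts in this comparison follow directly from the value size and the precision step. A minimal sketch of the arithmetic, with hypothetical names and a hypothetical `maxPrecisions` parameter standing in for the proposed option:

```java
/** Illustrative sketch only; names and the maxPrecisions option are hypothetical. */
final class TermsPerValueSketch {

    /** Number of terms indexed per value: one per precision, i.e. ceil(valSize / precisionStep). */
    static int termsPerValue(int valSize, int precisionStep) {
        if (precisionStep < 1) {
            throw new IllegalArgumentException("precisionStep must be >= 1");
        }
        return (valSize + precisionStep - 1) / precisionStep; // ceiling division
    }

    /** Same, but capped by the proposed "number of precisions" option. */
    static int termsPerValue(int valSize, int precisionStep, int maxPrecisions) {
        return Math.min(maxPrecisions, termsPerValue(valSize, precisionStep));
    }
}
```

For 64 bit values this gives 8 terms for a step of 8 and 32 terms for a step of 2; capping the step-2 case at 24 precisions yields the 24 terms mentioned above.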

We can provide more options to make things more flexible. But we probably want a balance of flexibility vs. the complexity of user options. Does this number of precisions look like a good one?

Add TrieRangeFilter to contrib [LUCENE-1470] #2544
