apache / lucene

Apache Lucene open-source search software
https://lucene.apache.org/
Apache License 2.0

Add TrieRangeFilter to contrib [LUCENE-1470] #2544

Closed. asfimport closed this issue 15 years ago.

asfimport commented 15 years ago

Following the thread on java-dev (http://www.gossamer-threads.com/lists/lucene/java-dev/67807 and http://www.gossamer-threads.com/lists/lucene/java-dev/67839), I want to include my fast numerical range query implementation in Lucene contrib-queries.

I implemented (based on RangeFilter) another approach for faster range queries, based on longs stored in the index in a special format.

The idea behind this is to store the longs in the index at different precisions and to partition the query range in such a way that the outer boundaries are searched using terms of the highest precision, while the center of the search range uses lower precisions. The implementation stores the longs at 8 different precisions (using a class called TrieUtils). It also supports doubles, using the IEEE 754 floating-point "double format" bit layout with some bit mappings to make them binary sortable. The approach is used in rather big indexes; query times are well below 100 ms, even on low-performance desktop computers, for very big ranges on indexes with 500,000 docs.
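
For illustration, here is a minimal sketch (hypothetical method names, and not necessarily the exact mapping used in TrieUtils) of one common way to make IEEE 754 doubles binary sortable as longs, so that the trie code can treat them like any other long value:

  // Positive doubles already sort correctly by their raw bits; for negative
  // doubles all non-sign bits are flipped so that values closer to zero get
  // larger bit patterns. The sign bit is kept, so negatives stay below positives.
  public static long doubleToSortableLong(double value) {
    long bits = Double.doubleToLongBits(value);
    if (bits < 0) {                      // sign bit set: a negative double
      bits ^= 0x7FFFFFFFFFFFFFFFL;
    }
    return bits;
  }

  public static double sortableLongToDouble(long bits) {
    if (bits < 0) {
      bits ^= 0x7FFFFFFFFFFFFFFFL;
    }
    return Double.longBitsToDouble(bits);
  }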

I called this RangeQuery variant and format "TrieRange" query because the idea resembles the well-known trie structures (it is not identical to real tries, but the algorithms are related).


Migrated from LUCENE-1470 by Uwe Schindler (@uschindler), resolved Feb 13 2009 Attachments: fixbuild-LUCENE-1470.patch (versions: 2), LUCENE-1470.patch (versions: 7), LUCENE-1470-apichange.patch, LUCENE-1470-readme.patch, LUCENE-1470-revamp.patch (versions: 3), trie.zip, TrieRangeFilter.java, TrieUtils.java (versions: 5) Linked issues:

asfimport commented 15 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

First version of a patch for contrib-queries. It includes my classes for the package o.a.l.search.trie. The TrieRangeFilter was refactored to be a separate class (in contrast to my original implementation). The class is Java 1.4 compatible (like contrib-queries). JavaDocs were updated and a general information page for the package was added.

A first test case for the trie-encoded values was added. The filter and query tests still need to be written.

asfimport commented 15 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

See also the discussion in #2535, as it inspired me to give my code to contrib.

asfimport commented 15 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

fixed patch

asfimport commented 15 years ago

Earwin Burrfoot (migrated from JIRA)

Lucene (at least version 2.3.2) cannot index Strings containing 0xFFFF characters, as that character is used internally as an EOS marker. I've implemented a binary encoding similar to yours, but after hitting this issue with some converted Date, I switched to a base-2^15 encoding, leaving the high bit always zero.

  // Encodes a long into 5 chars of 15 bits each (the high bit of every char
  // stays zero), so the resulting strings sort in numeric order under
  // String.compareTo() and never contain the 0xFFFF EOS marker.
  public String toLucene(Long n) {
    if (n == null) {
      return null;
    }

    // shift the signed range into unsigned order (Long.MIN_VALUE maps to 0)
    long u = n + Long.MAX_VALUE + 1;

    return new String(new char[]{
        (char) ((u & 0xFFFE000000000000L) >> 49),
        (char) ((u & 0x0001FFFC00000000L) >> 34),
        (char) ((u & 0x00000003FFF80000L) >> 19),
        (char) ((u & 0x000000000007FFF0L) >> 4),
        (char) ((u & 0x000000000000000FL) << 11),
    });
  }

  // Decodes the 5-char string back into the original long.
  public Long fromLucene(String string) {
    if (string == null) {
      return null;
    }

    return (((long) string.charAt(0)) << 49 |
            ((long) string.charAt(1)) << 34 |
            ((long) string.charAt(2)) << 19 |
            ((long) string.charAt(3)) << 4 |
            ((long) string.charAt(4)) >> 11
    ) - Long.MAX_VALUE - 1;
  }

asfimport commented 15 years ago

Earwin Burrfoot (migrated from JIRA)

I read your code more closely. Silly me, I assumed you went with base-2^16.

asfimport commented 15 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

No problem. I chose the two-chars-per-byte approach for two reasons:

a) It is simpler to generate lower-precision terms: for 1-8 bytes of precision you just have to remove 2 chars per byte. In base 2^15, you only have 4 precisions and some more bits. The number of terms per precision in my special TrieRangeFilter would then not be 256 values, it would be 32768, making it slower. Another possibility would be to cast each byte to one char (+offset), but see the next point:

b) Java sometimes has strange string comparisons, and I did not want to walk into incompatibilities with String.compareTo()

But one other point you noted in the other JIRA issue hurts me: I have not tried to sort results on my combined, prefixed field for a long time, and you are right, it is not possible if all the different precisions are in the same field. An earlier version of TrieRangeQuery used a suffix after the field name, automatically added by TrieUtils.addXXXXTrieDocumentField. I removed this about two years ago and have not tried to sort a numerical field since then. I think I have to write a bug report for my own project :) The question is: if this is included into contrib and I add a dependency on this contrib package to my project, the index format of panFMP may change, which is not good.

Just one question to other developers: Why is it not possible to sort by a field with more than one term/doc, if you would restrict this to only use the first added term to the document as sort key in FieldCache?

asfimport commented 15 years ago

Earwin Burrfoot (migrated from JIRA)

in base 2^15, you only have 4 precisions and some more bits

We have slightly different approaches. Yours is universal, mine requires hand-tuning. I have an abstract FastRangeFilter, which I extend by specifying functions that lower the precision before encoding; that way I get any required number of type/field-dependent precision levels. For further ease of use, that one is extended by FastDateRangeFilter, which accepts an array of date parts, like {HOUR_OF_DAY, DAY_OF_MONTH}. That allows me to exploit known statistical properties of my requests/data, for example most date ranges being rolling day/week/month/3-month windows, or salaries, which tend to cluster around certain values.

Java sometimes has strange string comparisons, and I did not want to walk into incompatibilities with String.compareTo()

The fact that Java strings are UCS-2 (UTF-16 minus the special non-16-bit characters) is written into the Java language specification, so you can trust String.compareTo() - anything that blows up there is not Java(tm). Problems usually come from within libraries.

I have not tried to sort results on my combined, prefixed field for a long time, and you are right, it is not possible if all the different precisions are in the same field.

Actually, right now I'm borrowing (stealing?) your idea of storing various precisions in the same field. I have my own custom field cache, so the broken sort can be fixed easily. Having everything in the same field allows you to do exactly one pass through TermEnum/TermDocs, and that is what I'm going to exploit :)

asfimport commented 15 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

You are right, my intention was to make it universal by casting everything to an unsigned long. On the sorting problem: with standard Lucene you can only sort when you have one term per field per doc. I am currently testing my code after modifying it to use a more compact version of the encoding:

a) I no longer encode half bytes; I encode a full byte per char. I still have 8 precisions, but the code gets simpler and terms have a length of 8 chars after encoding. After reading the docs of String.compareTo() again, I understood the "Java way" of binary sort order and verified it with my test cases.

b) The full-precision value is stored in the user-given document field name; all 7 lower precisions are prefixed (as before) and put into a "helper" field with the suffix "#trie" after the name. This field is only indexed, not stored or anything else, and is only used by TrieRangeFilter for the trie algorithm.
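
To make the resulting field layout concrete, here is a hedged sketch of how a document could be assembled under this scheme with the plain Lucene Field API (longToTrieCoded() and lowerPrecision() are hypothetical helper names used only for illustration, not the patch's actual methods):

  // Full precision goes into the user field (stored + indexed), the 7 lower
  // precisions, each carrying a precision prefix, go into an indexed-only
  // helper field named "<field>#trie".
  Document doc = new Document();
  String encoded = TrieUtils.longToTrieCoded(value);        // hypothetical encoder name
  doc.add(new Field("timestamp", encoded, Field.Store.YES, Field.Index.UN_TOKENIZED));
  for (int precision = 1; precision <= 7; precision++) {
    String shortened = lowerPrecision(encoded, precision);  // hypothetical helper
    doc.add(new Field("timestamp#trie", shortened, Field.Store.NO, Field.Index.UN_TOKENIZED));
  }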

For my project panFMP it is not such a big problem, as long as you inform the "few" users (I know all of them) and ask them to rebuild the index (a tool for that is available). But the encoding format of this contrib package should be fixed and discussed before it is released!

asfimport commented 15 years ago

Earwin Burrfoot (migrated from JIRA)

But the encoding format of this contrib package should be fixed and discussed before it is released!

Make it pluggable? I have my encoding separate from range filters and precision loss, you can specify fields to use, etc. Anyway, people can always copy your code and tweak it to their specific needs. For contribs, imho, the idea is much more valuable than the actual implementation details.

asfimport commented 15 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

> Why is it not possible to sort by a field with more than one term/doc, if you would restrict this to only use the first added term to the document as sort key in FieldCache?

This was discussed, but not resolved, in #2446.

If we wanted to allow "use the first term" as a policy, we'd have to fix how FieldCacheImpl computes the StringIndex: currently it gets a TermEnum, walks through all terms, for each term walks through all docs, and records docID->termNumber as well as termNumber->termText arrays, un-inverting the field. I guess we could get a TermPositions, instead, and only pay attention to position=0 terms, and then only increment t if there were any docs with position=0 occurrences of the term.
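
For readers not familiar with that code, here is a rough sketch of the un-inversion described above (old Lucene 2.x TermEnum/TermDocs API; imports and the termNumber->termText array are omitted, and it assumes at most one term per document for the field, which is exactly the restriction under discussion):

  static int[] uninvert(IndexReader reader, String field) throws IOException {
    int[] order = new int[reader.maxDoc()];     // docID -> term number (0 = no term)
    TermDocs termDocs = reader.termDocs();
    TermEnum termEnum = reader.terms(new Term(field, ""));
    int t = 0;
    try {
      do {
        Term term = termEnum.term();
        if (term == null || !term.field().equals(field)) break;
        t++;                                    // terms are enumerated in sorted order
        termDocs.seek(termEnum);
        while (termDocs.next()) {
          order[termDocs.doc()] = t;
        }
      } while (termEnum.next());
    } finally {
      termDocs.close();
      termEnum.close();
    }
    // Mike's suggestion: use reader.termPositions() instead and only record
    // occurrences whose first position is 0, i.e. the first term added per doc.
    return order;
  }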

asfimport commented 15 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

It seems like there are two things you might want to make "pluggable":

So we could provide good defaults for e.g. Date. The labels would ideally be things like 2007, 2008 and "Jan 2008", "Feb 2008", etc., I think? Hmm, except for the need to sort them "properly".

We really should change Lucene's terms so that they are opaque objects (as KS/Lucy is doing). This way you could store a Term like "Feb 2008" in the index, along with details like "this is a generated range datetime term" and the numeric date details, i.e. 02/2008, and you'd just have to export a compareTo to Lucene.

asfimport commented 15 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

An updated version of the patch. The format of encoded terms is incompatible with the previous patch.

Changed:

I will say something about Mike's suggestions tomorrow, now it is too late!

asfimport commented 15 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

I wanted to go to bed, but it is the same every time: I cannot sleep :-)

The discussions about TrieRangeQuery are the same as two years ago, when I prepared the original paper cited in the documentation with my mentor/colleague Michael Diepenbroek (the second author on the paper). Here are my comments on the suggestions by Mike McCandless and Earwin, after looking into my notes:

When looking at the algorithm to split the range for the first time, the ideas of Mike and Earwin are very natural: you are both interested in dates, and both of you have the same idea of how to divide them into parts of different "precisions". For dates it is rather simple: if you have a range between March 1999 and August 2006, the two precisions "full years" and "months" are very easy to understand when creating the ranges. You would have three range queries: Mar'1999 ... Dec'1999, 2000 ... 2005, Jan'2006 ... Aug'2006 - very easy. You have to visit 10+6+8=24 terms. Simple. The maximum number of terms would be 12+12+numberOfTotalPossibleYears. This is easy to understand and very effective.

My approach in principle does the same, but the values are simply divided into bytes. There is no problem with storing the number of milliseconds since 1970-01-01 at different precisions. The total range of this long would be very, very large (I do not want to calculate now how many years BC the value -2^63 ms would be). If you have a range like the one above, you have e.g. longs like 0x0000323113422134 TO 0x0000325113422174. You see the front is identical; the center of the range maybe only goes down to a precision of 12 digits. I said the long is divided into 8 bytes, so you would use only the ends of the range at the highest precision, and the center at lower precision. As the lowest precision does not change (it would only change if start and end were thousands of years apart), my algorithm does not use it; it only goes down to the precisions that actually differ. The if-clause "(length==1 || minShort.compareTo(maxShort)>=0)" is responsible for that: if the lower and upper ends of the next lower-precision range are empty or inverted, it is not calculated further.

Because of this shortcut, it does not make a difference whether the term values in the index are spread far apart or all in one place (the algorithm works just as well whether your index contains prices between 1$ and 100$, or doubles giving ages of millions of years on which you want to do a range select). Independent of the selected range, the number of terms to be visited is limited by my algorithm to 3825: a maximum of 255 terms in the lower boundary range plus 255 terms in the upper boundary range for each of the 7 higher precisions, plus a maximum of 255 terms in the center of the range at the lowest precision (2*7*255 + 255 = 3825). This is a hard limit, and it can only be reached if you had an index with 2^64-2 documents having 2^64-2 different long values, and a range that selects all of them (only in this case do you have to visit 3825 terms!).
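
As an illustration only (this is not the code from the patch), here is a small self-contained sketch of such a recursive split for the 8-bit trie factor; addRange() is a placeholder that just reports which terms each sub-range would enumerate at which precision:

  // Sketch of the range split for an 8-bit trie factor on non-negative longs
  // (the unsigned mapping and overflow handling are left out for brevity).
  // The boundaries of the range stay at the current (higher) precision, the
  // middle of the range is handed down to the next lower precision.
  static void splitRange(long min, long max, int shift) {
    long mask = 0xFFL << shift;               // the byte handled at this level
    long diff = 1L << (shift + 8);            // size of one lower-precision block
    boolean hasLower = (min & mask) != 0L;    // min not aligned to a full block
    boolean hasUpper = (max & mask) != mask;  // max not aligned to a full block
    long nextMin = (hasLower ? min + diff : min) & ~mask;
    long nextMax = (hasUpper ? max - diff : max) | mask;
    if (shift + 8 >= 64 || nextMin > nextMax) {
      // lowest precision reached, or the remaining range is too small to reduce
      addRange(min, max, shift);
      return;
    }
    if (hasLower) addRange(min, min | mask, shift);   // lower boundary, high precision
    if (hasUpper) addRange(max & ~mask, max, shift);  // upper boundary, high precision
    splitRange(nextMin, nextMax, shift + 8);          // middle, one precision lower
  }

  // Placeholder: report the terms (value >>> shift) this sub-range would visit.
  static void addRange(long min, long max, int shift) {
    System.out.println("shift=" + shift + ": terms "
        + (min >>> shift) + " .. " + (max >>> shift));
  }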

You can experiment a little with the new test case in the latest patch (change some numbers) and uncomment the System.out.println() for the number of terms seen. You will see that, even with large indexes, the number of terms is mostly around 400. Only very seldom are more terms visited (in our 500,000-doc index I have seen only one case with 750 terms, which can happen if you have a non-homogeneous distribution of values with a local maximum near one of the range bounds). Important: TrieRangeQuery only makes sense if your index has a minimum of about 500 docs; with fewer, you mostly have to visit all terms anyway. Just play with the test case and do some statistics!

Because of all this, I see no real need to give the user a customized way to convert the values into terms and to calculate lower precisions of them. The above code performs well (you tested it yesterday with our index) even for doubles, which are very seldom distributed homogeneously through the whole long range when stored as an IEEE 754 bitmap. So TrieRangeQuery in the current version is effective for longs, doubles and Dates (as they are longs). The only problem would be fixed-point numbers with many digits (e.g. BigDecimals); they cannot simply be cast to longs. For these types of numbers, TrieUtils could be extended to generate longer strings.

On space usage: you are right, if you only want ranges on year/month, a long is too much. But keep in mind that the length of terms is not so important for index size; what matters is the number of different terms in the TermEnum. So I see no problem with storing integers, doubles and dates as 64-bit values in the index - it is only a little extra term length with "unused bits".

asfimport commented 15 years ago

Paul Elschot (migrated from JIRA)

Independent of the selected range, the number of terms to be visited is limited by my algorithm to 3825: a maximum of 255 terms in the lower boundary range plus 255 terms in the upper boundary range for each of the 7 higher precisions, plus a maximum of 255 terms in the center of the range at the lowest precision (2*7*255 + 255 = 3825).

The code uses a trie factor of 256, i.e. 8 bits of a 64-bit long. Would it be possible to use lower values for this trie factor, like 16 (4 bits) or even 4 (2 bits)? In such cases the (rough) maximum number of terms for a single-ended range becomes smaller:

(256-1) * (64/8) = 255 * 8 = 2040
(16-1) * (64/4) = 15 * 16 = 240
(4-1) * (64/2) = 3 * 32 = 96

This reduction comes at the cost of doubling or quadrupling the number of indexed terms in the lower-precision field.

The number of characters in the lower precision terms is not really relevant in the term index, because terms are indexed with common prefixes. Therefore in these cases one could just use a character to encode the 4 bits or 2 bits.

So the question is would it be possible to specify the trie factor when building and using the index?

asfimport commented 15 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

I think this would easily be possible. With some refactoring, specialized implementations of the encoder and of incrementTrieCoded/decrementTrieCoded would be possible. One possibility to make this configurable would be to use an instance of TrieUtils that takes the trie factor as a constructor argument. The user would then encode/decode all his values using that instance. TrieRangeFilter would also get an instance of the encoder and use it to calculate the prefix terms and so on.

The number of characters in the lower precision terms is not really relevant in the term index, because terms are indexed with common prefixes. Therefore in these cases one could just use a character to encode the 4 bits or 2 bits.

So the question is would it be possible to specify the trie factor when building and using the index?

Yes, that is good for jumping to the correct term to start the range query on. The problem with smaller trie factors is that for each precision step (e.g. 4 bits, 2 bits) you need one full char in the encoded form. Although length is not a problem for terms, I think the common prefixes cannot be used as effectively (a lot of terms with two low-cardinality chars at the beginning).

asfimport commented 15 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

In such cases the (rough) maximum number of terms for a single-ended range becomes smaller:

Are you sure? My algorithm is more effective on single-ended ranges (the number of terms is half that of a two-ended range). When the lower bound is open, it is enough to use the lowest precision in the lower half of the range (this special case is handled in splitRange()). Higher-precision terms only occur at the upper bound. This is why I use 255 and not 256 in my formula for the maximum number of terms (a range of all 256 terms at a higher precision can easily be reduced to the lower precision).

asfimport commented 15 years ago

Paul Elschot (migrated from JIRA)

I simply used the maximum number of terms at each level times the number of levels. For a trie factor of 2^b (using b bits) and a long of 64 bits, that is: (2^b - 1) * (64/b). The actual values for b = 8, b = 4 and b = 2 are shown above.

asfimport commented 15 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

You are right, that's correct! I misunderstood your answer. The number of terms for single-ended ranges is about half that for a two-ended range.

asfimport commented 15 years ago

Paul Elschot (migrated from JIRA)

So going from 8 bits to 4 bits will reduce the number of filter terms by a factor of 8.5, at the cost of doubling the number of indexed terms. I'd like to see actual performance figures, but on the face of it I'd rather use 4 bits.

asfimport commented 15 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

Just an update of the patch before I implement Paul's suggestions on the trie factor.

The current patch has some performance improvements from avoiding seeking back and forth in the IndexReader's TermEnum; the TermEnum may also make better use of caching this way. The parts of the range that use terms appearing earlier in the TermEnum are now processed first. The order is now:

When looking into SegmentTermEnum's code, I realized that IndexReader.terms(Term t) is faster than only getting the complete TermEnum and then seekTo(Term). Why this difference? I first wanted to change my code to use only one TermEnum instance (like with the TermDocs) through the whole range-split process and seek the enum for each range, but I dropped that change.

Further improvements in this patch are more comments and cleaner (and more elegant) code in TrieUtils (without using a StringBuffer).

asfimport commented 15 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Sheesh, doesn't anyone sleep anymore?

When looking into SegmentTermEnum's code, I realized that IndexReader.terms(Term t) is faster than only getting the complete TermEnum and then seekTo(Term).

Did you mean TermEnum.skipTo? See #2511.

asfimport commented 15 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

My approach in principle does the same, but the values are simply divided into bytes.

I do very much like that your approach requires no "application or type-specific knowledge" on what might make good ranges. Ie, it's generic, and so any type that can be cast into long and back w/o losing any information, can make use of this.

asfimport commented 15 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

Did you mean TermEnum.skipTo? See #2511.

Yes, exactly that: I wrote seekTo(), but meant skipTo(). The patch in that issue is only a faster scanning approach; the best would be to make skipTo() use the same logic as the terms(Term) call, which seeks on disk and so on, but that comment would be better placed in the above issue.

I am fine with calling IndexReader.terms(Term) to use the cache and faster seeking. The cost of creating new TermEnum instances is less than doing disk reads.

asfimport commented 15 years ago

Paul Elschot (migrated from JIRA)

When the trie factor can be chosen, it might make sense to encode it in the trie field name, something like:

fieldName#trieNN

with NN the trie factor or its number of bits.

asfimport commented 15 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

Here is a new version of the patch, supporting 8-bit, 4-bit and 2-bit trie variants. The API is similar, but instead of using TrieUtils as a static class, you can choose between 3 singleton converters, e.g. TrieUtils.VARIANT_8BIT.convertSomething(). TrieRangeQuery and TrieRangeFilter both accept a parameter for choosing the variant. A default can be set with the static TrieRangeFilter.setDefaultTrieVariant(...) and read with the corresponding getter.

On Paul's suggestion: you could put the trie variant into the field name, but the main field name, where the full-precision value is indexed/stored/whatever, has no suffix. This would make it inconsistent.

In my opinion, one should choose the same trie variant when indexing values and when doing range queries. It is the same as choosing a different analyzer for indexing and searching. The static method for setting the default trie variant may be better hosted in TrieUtils, but this is room for discussion.
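
A hedged usage sketch of how this API might look from user code (the TrieRangeQuery constructor arguments shown here are an assumption based on this description, not the committed signature, and VARIANT_4BIT is assumed to be named analogously to VARIANT_8BIT):

  // Optionally change the default variant used when none is given explicitly.
  TrieRangeFilter.setDefaultTrieVariant(TrieUtils.VARIANT_4BIT);

  // Assumed constructor: field name, lower bound, upper bound, trie variant.
  Query q = new TrieRangeQuery("latitude", new Double(-20.0), new Double(40.0),
      TrieUtils.VARIANT_4BIT);
  TopDocs hits = searcher.search(q, null, 10);  // searcher assumed to exist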

Now it is time to do some performance tests with really big indexes, comparing the trie variants :-) Does anybody have a synthetic benchmarker that can easily be used for this? The test case is not so interesting (it uses RAMDirectory). I could reindex our PANGAEA index for performance testing in the real world (doubles, dates).

The test case (if you uncomment the printout of the number of terms in TrieRangeFilter) clearly shows the lower number of terms visited for 4-bit or 2-bit. Formulas are in the package description.

In my opinion, 4-bit is a good alternative (about 3 times the space requirement for about 8.5x fewer terms), but the benefit of 2-bit is too small for about 6 times the space.

asfimport commented 15 years ago

Mark Harwood (@markharwood) (migrated from JIRA)

A note of caution: I noticed when moving from Lucene 2.3 to 2.4 that my similar scheme for encoding information meant I could no longer encode information in byte arrays using bytes with values > 216. The changes (I think in LUCENE-510) introduced some code that modified the way the bytes were written/read and corrupted my encoding.

Not sure if your proposed approach is prone to this or if anyone can cast further light on these encoding issues.

Good to see this making its way into Lucene, Uwe.

asfimport commented 15 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

I do not know exactly what you did in your code. Did you directly convert the byte[] arrays to a String, e.g. by using new String(byte[], charset), or without a charset (in which case Java uses the platform default, which may be UTF-8)? Then the byte 216 (0xd8) would be interpreted by new String() as part of a UTF-8/16 sequence or whatever and map to some unknown char. UnicodeUtils, on the other hand, encodes the Java chars to UTF-8 when storing them in the index, but does not like chars >0xd800 and replaces them with the replacement char for unknown chars. And this is a modification, as a char >0xd800 is not valid there (see the source of UnicodeUtils) - and 0xd800 looks like 0xd8 (== 216).

My code does not transform a byte[] directly into a String; it creates a standard 16-bit Java char from each byte, with some offset. As the trie code only produces bytes between 0 and 255 and adds an offset of 0x30, the range of chars is 0x30..0x12f. This range is Unicode-safe and can easily be encoded as UTF-8. But I will add an explicit test case for that tomorrow!
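
For clarity, a minimal sketch of that byte-to-char mapping (illustrative, not the exact TrieUtils code):

  // Each encoded byte becomes one Java char shifted by 0x30, so all produced
  // chars lie in 0x30..0x12F - far below the surrogate range starting at
  // 0xD800 - and survive the UTF-8 round trip in the index unchanged.
  static char encodeByte(int b) {   // b in 0..255
    return (char) (b + 0x30);
  }

  static int decodeByte(char c) {
    return c - 0x30;
  }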

Maybe your problem was the mentioned misuse of directly generating strings from byte arrays using an unknown/incorrect charset. Keep me informed!

By the way, Earwin Burrfoot: are you sure your 15-bit encoding works with the latest Lucene version? Maybe this affects you, too.

asfimport commented 15 years ago

Mark Harwood (@markharwood) (migrated from JIRA)

Yep, I was simply using new String(byte[])

asfimport commented 15 years ago

Nadav Har'El (migrated from JIRA)

Hi, I just wanted to comment that a similar method was proposed in 2005 by several researchers in IBM (all of them have moved to Yahoo by now, actually). See "Inverted Index Support for Numeric Search", by M Fontoura, R Lempel, R Qi, J Zien http://webcourse.cs.technion.ac.il/236620/Winter2006-2007/hw/WCFiles/paramsearch.pdf The paper contains some discussion on the tradeoffs between the number of layers, number of terms, index size, and so on, so it might contain some valuable ideas even if you don't use their scheme exactly.

asfimport commented 15 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

Thank you, I did not know this paper. It is often rather complicated to find all papers about such subjects, very nice work!

asfimport commented 15 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Uwe, is this ready to be committed? It looks good, and the tests pass. Thanks!

asfimport commented 15 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

It is almost complete. In my opinion the only remaining change is the setting of defaults; I wanted to move this into TrieUtils directly. Let me iterate one more patch and then we could commit. As I currently have no contributor status, it is simpler to iterate the patch enough before committing. I was waiting for any additional comments before releasing a new patch version.

I have not had time to do benchmarks yet, so testing the three trie variants for speed/disk IO/indexing speed is not yet done. My current benchmarks cover only the performance of my "old" trie code, not the new one using the more binary encoding. I asked for a good "benchmarking framework", is contrib/benchmark useful for that? For benchmarking you need to create rather big indexes, maybe containing random numeric values, so the benchmark may also need a lot of disk space (1 GB?); you can leave out stored fields and additional full-text fields, but the benchmark should at least also have a normal tokenized field, for measuring the performance of combining trie queries with ordinary term queries (as in the paper given by Nadav).

Do you think it is good to include it directly into contrib-search? In my opinion it is, but maybe others think differently.

asfimport commented 15 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

New version of the patch:

Just one question: is it ok to have a static initializer in the test to prevent the test from generating the rather large index for each test?

asfimport commented 15 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

From my point of view: the patch is now ready to commit, all tests pass.

The last patch also improved the javadocs. Maybe a native speaker could please look over my javadocs; they are "Denglish" (as it is called in Germany) :-)

asfimport commented 15 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

I asked for a good "benchmarking framework", is contrib/benchmark useful for that?

I think it is good. You should probably make your own DocMaker to create random number fields, or, make a big line file and use LineDocMaker.

And then also create your own QueryMaker to create TrieRangeQuery's.

One nice test to add, to assert correctness, would be to iterate N times, randomly picking lower/upper bounds, and compare the builtin RangeFilter vs the TrieRangeFilter, asserting that they pull the same docs?
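
A rough sketch of such a randomized check (JUnit 3 style as used in contrib at the time; the index setup is omitted, and both the TrieRangeQuery constructor and the longToTrieCoded() encoder name are assumptions for illustration):

  public void testRandomRangesAgainstClassicRange() throws Exception {
    IndexSearcher searcher = new IndexSearcher(directory);  // index built elsewhere
    Random rnd = new Random(42);
    for (int i = 0; i < 50; i++) {
      long a = rnd.nextLong() % 100000L, b = rnd.nextLong() % 100000L;
      long lower = Math.min(a, b), upper = Math.max(a, b);

      // classic range over the full-precision trie-encoded field
      Query classic = new ConstantScoreRangeQuery("value",
          TrieUtils.longToTrieCoded(lower), TrieUtils.longToTrieCoded(upper), true, true);
      // trie range over the same bounds (assumed constructor)
      Query trie = new TrieRangeQuery("value", new Long(lower), new Long(upper));

      // comparing hit counts; comparing the actual doc id sets would be stricter
      assertEquals("hit count for range [" + lower + "," + upper + "]",
          searcher.search(classic, null, 1).totalHits,
          searcher.search(trie, null, 1).totalHits);
    }
  }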

is it ok to have a static initializer in the test to prevent the test from generating the rather large index for each test?

I think that's fine.

asfimport commented 15 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

I have some first "benchmarks" of the three trie implementations. They were done using three indexes with the same content, but encoded with 8-bit, 4-bit and 2-bit, containing real-world data in 575,000 documents of the PANGAEA repository (see www.pangaea.de, as mentioned before). The same range queries were executed after some warmup, and time was measured until TopDocs returned the first 10 documents.

The indexes each contain 13 numeric, trie-encoded fields (doubles and Dates). Index size (including the "normal" fields) was:

So the additional 13*8*575,000 trie terms (with longer term sizes) took 300 MB going from 8-bit to 4-bit. Going from 4-bit to 2-bit, the additional 13*16*575,000 terms took 600 MB (on top of the step from 8 to 4 bits, but it is linear in the number of additional terms).

The performance impact was negligible (or rather, the standard deviation was bigger than the performance gain), so I still think 8-bit is a good idea, and its index size is the smallest.

My idea why this is so: using fewer bits increases the number of terms in the index (I do not know how much this decreases performance) and needs more TermEnum seeks (for each trie precision, 2 seek operations are needed). These TermEnum seeks are faster for the 8-bit trie variant, because the 2-char prefix can be used extensively (first char = precision, second char = first byte of the numeric value). With 4-bit and 2-bit you only get 4 or 2 bits in addition to the precision marker, so seeks in the term enum are slower.

In comparison, a ConstantScoreRangeQuery on the full-precision trie-coded values took about 10-20 times longer. Here are the numbers: for PANGAEA, all queries are half-open (we have the problem that the data sets themselves have bounding boxes - minlatitude, maxlatitude, minlongitude, maxlongitude - and the user searches for bounding boxes; hits are datasets that intersect the search bounding box). So for a complete lat/lon constraint you have 4 half-open ranges (with the current trie implementation this is not a problem, because two ANDed filters just visit half of the number of terms needed for a full range; the only performance impact is ANDing the two OpenBitSets). So 4 such half-open ranges ANDed together take the following times on the given index after some warmup, incl. fetching the first 10 documents:

ConstantScoreRange: 1500-4000 ms
TrieRangeQuery 8bit: 40-100 ms
TrieRangeQuery 4bit: 30-80 ms
TrieRangeQuery 2bit: 20-80 ms

The old RangeQuery was not tested; it hits the MaxClauseCount constraint, and if that is raised to Integer.MAX_VALUE, it takes approx. 1 minute to finish. The numbers are for pure range queries; if you additionally add some search terms, the time goes up a little. Only TrieRangeQuery was used (so rewriting to constant score); no docs were filtered the classical way (term queries + filtering of results).

More benchmark results will follow when I finish the benchmarker, but these are real-world examples. The search engine machine was a Sun X4600 with 16 Opteron cores, 64-bit Java 1.5, Sun Java System Web Server, -Xmx8192M (our current configuration). On lower-cost machines like desktop PCs, the ConstantScoreRangeQuery takes much more time, but TrieRangeQuery takes approximately the same time, as disk IO seeking is more of a limiting factor. Processor speed and the number of processors are not the limiting factor (if only one query runs in parallel).

asfimport commented 15 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

Just one note: the queries in the previous comment were run on the same index with typical "user queries"; the hit counts and ranges were rather large (e.g. queries with about 120,000 total hits, the typical case of water around Africa, like Mike McCandless did in his first tests). They were not synthetic; they were run on the real, productive system (I only have two X4600 Sun machines..., so tests are mostly done on the live system, a bit like open-heart surgery).

asfimport commented 15 years ago

Paul Elschot (migrated from JIRA)

Uwe, thanks very much for the size and performance info. The space/time tradeoff between 8 and 4 bits is clear, and it's great to have that flexibility.

asfimport commented 15 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

New patch; I think this is really ready to commit. It includes Mike's suggestion to compare the result count of RangeQuery with TrieRangeQuery using random values. It also contains a further test of the result count of ranges on an index containing values with distance=1. This is to detect errors where the code generating the split range fails to attach the range parts correctly to each other.

While changing my own project panFMP to use the new contrib package, I needed a function to read stored, trie-encoded values for re-indexing with another trie variant. The new static methods in TrieUtils choose the variant for decoding based on the encoded string length. These static methods can be used for easy decoding of stored fields that use the trie encoding, without knowing the variant. For range queries the encoding cannot be autodetected (which is clear).
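
A hedged illustration of what such length-based autodetection might look like (the variant names other than VARIANT_8BIT, and the assumed encoded lengths of 8/16/32 chars per long, are inferences from this thread, not the committed code):

  static long trieCodedToLongAuto(String s) {
    switch (s.length()) {
      case 8:  return TrieUtils.VARIANT_8BIT.trieCodedToLong(s);   // one char per byte
      case 16: return TrieUtils.VARIANT_4BIT.trieCodedToLong(s);   // one char per nibble
      case 32: return TrieUtils.VARIANT_2BIT.trieCodedToLong(s);   // one char per 2 bits
      default: throw new NumberFormatException("Invalid trie-coded length: " + s.length());
    }
  }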

Further work could be an optimized sort algorithm for the trie-encoded fields. The current FieldCache cannot handle them as longs, only as Strings, which is memory intensive. Having a FieldCache implementation for longs that supports this encoding would be good. I do not want to do this with the current inflexible FieldCache implementation, so I am waiting for #1906. Does anybody have an idea how to plug the correct FieldCache impl for this encoding into current Lucene, respecting the TrieUtils variant?

asfimport commented 15 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Those results are nice, thanks Uwe! The open-heart surgery part was spooky though ;)

Index size (including the "normal" fields) was:

As a baseline, how large is an index with just the normal fields?

I'll commit shortly. This is a great addition!

asfimport commented 15 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

I am sorry, there is one JavaDoc bug in the new static method: it has one @link too many after @throws.

asfimport commented 15 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

As a baseline, how large is an index with just the normal fields?

This is not so simple to check; I do not have such an index available, and it takes a long time to reindex :) The 3 variants were generated last night on the X4600 monster machine, which took 2 hours there (in parallel).

But as the step from 8 to 4 bits was about 300 MB more space and doubled the number of terms, the index without the numeric fields (if it's still linear) should be about 4.5 GiB.

Sometimes, when changing program code of "live" systems that users are working on (sometimes you cannot avoid it), I really feel like I'm doing open-heart surgery :-)

asfimport commented 15 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

But as the step from 8 to 4 bits was about 300 MB more space and doubled the number of terms, the index without the numeric fields (if it's still linear) should be about 4.5 GiB

OK thanks!

asfimport commented 15 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Committed revision 723031.

Thanks Uwe!

asfimport commented 15 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Uwe, can't you use ExtendedFieldCache.getLongs, passing in your own LongParser?

asfimport commented 15 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

The question is how to plug the whole FieldCache code in from contrib. I see there is only one default ExtendedFieldCache, named ExtendedFieldCacheImpl, but how do I implement a custom field cache and tell the searchers to use it whenever they see a trie-encoded field (e.g. when Sort.AUTO is used)? This is the part that I do not understand how to do.

A very simple LongParser may be:

  // A parser that decodes the trie-encoded full-precision value back to a long:
  new LongParser() {
      public long parseLong(String value) {
        return TrieUtils.VARIANT_8BIT.trieCodedToLong(value);
      }
  };

But what to do with it for sorting search results...

asfimport commented 15 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

I think if you call ExtendedFieldCache.EXT_DEFAULT.getLongs, first, with your parser, and then sort by that field as a long, it will just use what's already populated into the cache? (It is sort of messy though; ideally there would be some way to record the LongParser you'd like used for a given field).
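
For illustration, a sketch of what that might look like (hedged: reader, searcher and query are assumed to exist, exception handling is omitted, and whether the later sort really reuses the cache entry is exactly the question raised below):

  // Sketch only: populate the cache with trie-aware longs first, then sort as LONG.
  long[] values = ExtendedFieldCache.EXT_DEFAULT.getLongs(reader, "timestamp",
      new ExtendedFieldCache.LongParser() {
        public long parseLong(String value) {
          return TrieUtils.VARIANT_8BIT.trieCodedToLong(value);
        }
      });

  Sort sort = new Sort(new SortField("timestamp", SortField.LONG));
  TopDocs docs = searcher.search(query, null, 10, sort);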

asfimport commented 15 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

This is messy and not easy for a beginner to understand. I was thinking of a possibility to supply a class in the contrib package o.a.l.search.trie that extends SortField or Sort or whatever else, to mark a field explicitly as "trie-encoded" (which could also use the automatic trie-variant detection in TrieUtils.trieCodedToLongAuto()), and that can be passed to Searcher.search(). And if TrieUtils were not in contrib, the AUTO parser could be extended to also try TrieUtils for decoding the field value (as it also throws NumberFormatException, it would fit in wonderfully).

Another idea was to override ExtendedFieldCacheImpl and set it in the (non-final) static member ExtendedFieldCache.EXT_DEFAULT (which would be possible), but ExtendedFieldCacheImpl is package-protected. The whole sort thing is a bit messy; this is why I was hoping that #1906 may fix this problem in the future. Maybe we should move this discussion to that issue.

For a start, I could simply supply a static field parser instance in TrieUtils, e.g. a LongParser named TrieUtils.TRIE_FIELD_CACHE_PARSER, that uses autodetection or the concrete trie variant.

asfimport commented 15 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

I think if you call ExtendedFieldCache.EXT_DEFAULT.getLongs, first, with your parser, and then sort by that field as a long, it will just use what's already populated into the cache?

I think this cannot work. The cache is keyed by FieldCacheImpl.Entry, which contains the parser to use. If the search code later calls getLongs() with the default parser, it will get a new array of longs, not the one previously parsed with the specific parser, as there is no cache hit (because the Entry is not equal). Am I missing something here?

asfimport commented 15 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

Hi Mike, the last Hudson build failed. It seems it did not like that I used LuceneTestCase instead of the standard JUnit TestCase in one of the tests. I copied the source of this test from some other class. Maybe I cannot use LuceneTestCase in contrib. But this is not a problem; it works identically with the default JUnit class.