apache / lucene

Apache Lucene open-source search software
https://lucene.apache.org/
Apache License 2.0

Support for index/search large numeric field [LUCENE-5596] #6658

Closed. asfimport closed this issue 8 years ago

asfimport commented 10 years ago

Currently, if a number is larger than Long.MAX_VALUE, we can't index or search it in Lucene as a number. For example, an IPv6 address is a 128-bit number, so we can't index it as a numeric field and run numeric range queries on it, etc.

It would be good to support BigInteger / BigDecimal

I've tried using BigInteger for IPv6 in Elasticsearch and that works fine, but there is still a lot left to do: https://github.com/elasticsearch/elasticsearch/pull/5758
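
For illustration (the helper name and address literal below are examples, not part of the patch), an IPv6 address is a 128-bit quantity, well beyond what a long can hold:

```java
import java.math.BigInteger;
import java.net.InetAddress;
import java.net.UnknownHostException;

// An IPv6 address is a 128-bit quantity, so it cannot be represented as a long.
static BigInteger ipv6AsBigInteger(String literal) throws UnknownHostException {
  byte[] bytes = InetAddress.getByName(literal).getAddress();  // 16 bytes
  return new BigInteger(1, bytes);                             // interpret as unsigned
}
```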


Migrated from LUCENE-5596 by Kevin Wang, 3 votes, resolved Mar 09 2016. Attachments: LUCENE-5596.patch (versions: 2)

asfimport commented 10 years ago

Kevin Wang (migrated from JIRA)

Initial patch to support BigInteger. I've copied and modified some of the tests for the long field type (e.g. TestNumericUtils, TestNumericTokenStream, TestNumericRangeQuery, TestSortDocValues) to cover BigInteger, and they all pass.

asfimport commented 10 years ago

Kevin Wang (migrated from JIRA)

Updated patch to store BigInteger as a byte array instead of a string for the stored field, and to add BigInteger support to the query parser.

asfimport commented 10 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Hey, looks like a pretty thorough job here. Do you think BigNumericTokenStream should be combined with NumericTokenStream or kept separate? Maybe instead of NumericTokenStream working with longs in its setter (only for the AttributeImpl to convert to bytes), it should just do that conversion up front so it can work with BigInteger too?

Do you think the field cache impl should instead just act like a filtered BinaryDocValues (BytesRef API) rather than a BigInteger[]? That would allow it to use a large paged byte[] instead of holding millions of BigInteger objects. It would work fine for sorting too, right (just compare bytes)?

Maybe @uschindler can take a look when he gets some time.

asfimport commented 10 years ago

Kevin Wang (migrated from JIRA)

Hi @rmuir, for NumericTokenStream, getRawValue() and init() both handle only long values, so I don't want to change the existing behaviour. I'll change the field cache to bytes and see if it works.

asfimport commented 10 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

Hi, I agree with Robert. To me it looks like the code duplication comes from the original patch for Elasticsearch.

Hi Robert Muir, for NumericTokenStream, getRawValue() and init() both handle only long values, so I don't want to change the existing behaviour. I'll change the field cache to bytes and see if it works.

In any case: the already existing attributes in NumericTokenStream should be reused; they are duplicated in both classes, although the attributes are public! The one attribute that handles the value itself can be duplicated, or we move to BigInteger altogether (but that has some overhead).

BigNumericUtils can be merged with NumericUtils. Please also sort the methods next to each other, so they are grouped by purpose, with the variant for each type following one after the other.

FYI: There is already another issue in ES about BigDecimals and BigIntegers: https://github.com/elasticsearch/elasticsearch/pull/5683 and https://github.com/elasticsearch/elasticsearch/pull/5491 - it would be worth combining the approaches. That issue is not only about the JSON encoding; they also talk about faceting and other operations on these fields.

Finally, we should also add BigDecimal support. This just requires something like double2SortableLong() for BigDecimal -> BigInteger.
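
A minimal sketch of what such a conversion could look like (the helper name and fixed scale are assumptions, not part of the patch; the point is that, for a fixed scale, the unscaled value preserves numeric order):

```java
import java.math.BigDecimal;
import java.math.BigInteger;
import java.math.RoundingMode;

// Hypothetical helper: if every value of a field is normalized to the same
// scale, the unscaled BigInteger preserves numeric ordering, analogous to
// what doubleToSortableLong() does for doubles.
static BigInteger bigDecimalToSortableBigInteger(BigDecimal value, int scale) {
  return value.setScale(scale, RoundingMode.HALF_UP).unscaledValue();
}
```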

asfimport commented 10 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

We can maybe have a common abstract base class for NumericTokenStream, defining the common attributes and the algorithm, with subclasses per type that initialize the attributes. ...just throwing in ideas...

asfimport commented 10 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

There is already another issue in ES about BigDecimals and BigIntegers [...] it would be worth combining the approaches.

Maybe we should support variable-length bytes in general, so that this could also be applied to String fields? In the same way that the edge n-gram and shingle filters make it possible to pre-index prefix and phrase queries, I think it would also be nice to pre-index range queries.

asfimport commented 10 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

There is one problem with the patch: Lucene currently encodes the shift value in the indexed tokens (xxxToPrefixCoded) with an offset that acts as a "type marker" (SHIFT_START_*; see e.g. http://grepcode.com/file/repo1.maven.org/maven2/org.apache.lucene/lucene-core/4.6.0/org/apache/lucene/util/NumericUtils.java#149). This ensures that you hit no documents if you index a field as int but query it as long: the differently encoded shift means the term is never found in the dictionary, so no documents are returned. With your patch, one can index a BigInteger and get random hits when querying it as a long or int range. Unfortunately the current "shift encoding" only reserves extra markers for fictional "short" and "byte" types, which are never used. A second problem: you are currently limited to a maximum shift of 127 (or 255 if you correctly mask the shift byte), otherwise the encoding overflows.
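
To illustrate the marker: in the 4.x prefix coding, the first char of every term encodes both the type and the shift (the constants below are those of Lucene 4.x NumericUtils; the helper itself is illustrative, not actual Lucene code):

```java
// Illustrative only: how the first char of a prefix-coded term encodes type + shift.
// SHIFT_START_LONG = 0x20 (shift 0..63), SHIFT_START_INT = 0x60 (shift 0..31),
// so an int-encoded term can never equal a long-encoded query term.
static char prefixMarker(boolean isLong, int shift) {
  return (char) ((isLong ? 0x20 : 0x60) + shift);
}
// A BigInteger needing shifts in the hundreds would overflow this single marker byte.
```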

I am not sure how to handle this. The main problem is Lucene's schemaless design (the index does not know the type of a field, except for stored fields); the "shift encoding" with the type marker bits is just a hack around that, to avoid producing incorrect results.

Because of that, we should really do some investigation before starting to push these changes in. Maybe we should only make it work on Lucene trunk and change the index encoding completely.

asfimport commented 10 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

Maybe we should support variable-length bytes in general, so that this could also be applied to String fields? In the same way that the edge n-gram and shingle filters make it possible to pre-index prefix and phrase queries, I think it would also be nice to pre-index range queries.

This is a good idea. We should keep the primitive types, which support precisionStep at the bit level, but types like BigDecimal or BigInteger should maybe only use precisionSteps that are multiples of 8. In fact this would only require a TokenStream that explodes every term into prefix terms with shift markers; with that we would not need any numeric-specific handling for those anymore.
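
A purely illustrative, char-based sketch of such a prefix-exploding filter (the class name and the single-char shift marker are assumptions, not an existing Lucene class):

```java
import java.io.IOException;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;

// Purely illustrative: for every incoming term, also emit truncated prefix terms
// at the same position, prefixing each emitted term with a marker char that
// records how much was shifted off (0 = full precision).
public final class PrefixTermsTokenFilter extends TokenFilter {
  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
  private final PositionIncrementAttribute posIncAtt = addAttribute(PositionIncrementAttribute.class);
  private final int step;   // how many trailing chars to drop per prefix level
  private String current;   // full-precision term currently being exploded
  private int shift;        // number of chars already shifted off

  public PrefixTermsTokenFilter(TokenStream input, int step) {
    super(input);
    this.step = step;
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (current != null && shift + step < current.length()) {
      // emit the next, shorter prefix at the same position as the original term
      shift += step;
      termAtt.setEmpty().append((char) shift).append(current, 0, current.length() - shift);
      posIncAtt.setPositionIncrement(0);
      return true;
    }
    if (!input.incrementToken()) {
      return false;
    }
    // new input term: emit it at full precision with shift marker 0
    current = termAtt.toString();
    shift = 0;
    termAtt.setEmpty().append((char) 0).append(current);
    return true;
  }

  @Override
  public void reset() throws IOException {
    super.reset();
    current = null;
    shift = 0;
  }
}
```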

asfimport commented 10 years ago

Robert Muir (@rmuir) (migrated from JIRA)

It would be great if we did that and somehow made NumericTokenStream more "ordinary" in the analysis chain, such that e.g. the query parser's getRangeQuery could just do the right thing out of the box...

asfimport commented 10 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

but types like BigDecimal or BigInteger should maybe only use precisionSteps that are multiples of 8

+1 Anything below would certainly be wasteful.

PrefixTermsTokenFilter / MultiTermQuery like NumericUtils.splitTerms()

This sounds good!

asfimport commented 10 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

In that case I would not do the BigInteger stuff in Lucene at all. Elasticsearch would just provide some byte[] to index, and we take care of the encoding with a custom TokenFilter and provide a ByteRangeQuery. For DocValues and stored fields no changes are needed at all (just use the byte[] type already there for doc values and stored fields).

We can implement this in the queries module, I don't think this needs to be in core.

asfimport commented 10 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

In that case I would not do the BigInteger stuff in Lucene at all.

Would it make sense to just have utility methods to convert BigDecimals and BigIntegers to sortable bytes?

asfimport commented 10 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

Would it make sense to just have utility methods to convert BigDecimals and BigIntegers to sortable bytes?

I think so.
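
A hedged sketch of such a helper (the method name is illustrative; Lucene 6 later shipped a similar NumericUtils.bigIntToSortableBytes): fixed-width big-endian two's complement with the sign bit flipped compares correctly as unsigned bytes.

```java
import java.math.BigInteger;
import java.util.Arrays;

// Hedged sketch: encode into a fixed width so that unsigned byte-wise
// comparison (e.g. BytesRef.compareTo) matches numeric order.
static byte[] toSortableBytes(BigInteger value, int numBytes) {
  byte[] twosComplement = value.toByteArray();            // big-endian, minimal length
  if (twosComplement.length > numBytes) {
    throw new IllegalArgumentException("value does not fit into " + numBytes + " bytes");
  }
  byte[] result = new byte[numBytes];
  byte pad = (byte) (value.signum() < 0 ? 0xFF : 0x00);   // sign-extend to fixed width
  Arrays.fill(result, 0, numBytes - twosComplement.length, pad);
  System.arraycopy(twosComplement, 0, result, numBytes - twosComplement.length, twosComplement.length);
  result[0] ^= 0x80;                                      // flip sign bit: negatives sort first
  return result;
}
```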

asfimport commented 10 years ago

Kevin Wang (migrated from JIRA)

+1 for ByteRangeQuery, I'll work on a patch for that first

asfimport commented 10 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

Hi, I would start with a TokenStream only, like we do for NumericTokenStream. The solution with a TokenFilter is currently not doable because of the coupling between the TermToBytesRefAttribute and CharTermAttribute, with only an AttributeFactory as a workaround to decouple them. I am thinking about a better solution for Lucene 5.0; this is one of the big issues preventing people from indexing pure byte[] terms.

About the ES issue: if we have some byte[] support, there is no need to use BigInteger for IPv6 addresses! Just use http://docs.oracle.com/javase/7/docs/api/java/net/Inet6Address.html#getAddress() to get the bytes of the address in network byte order (big endian); there is no need to convert to BigInteger first and deal with the stupid signedness issues. This also works for IPv4 addresses, which return a byte[4] from the same method (see the base class http://docs.oracle.com/javase/7/docs/api/java/net/InetAddress.html#getAddress()).
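
For illustration (the literal passed in is just an example; getByName parses IP literals directly but may do a DNS lookup for host names):

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

// Network byte order (big endian) bytes of the address: 16 bytes for an IPv6
// literal, 4 bytes for IPv4 - ready to index as a byte[] term, no BigInteger
// and no signedness handling needed.
static byte[] addressBytes(String literal) throws UnknownHostException {
  return InetAddress.getByName(literal).getAddress();
}
```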

asfimport commented 10 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

+1 for ByteRangeQuery, I'll work on a patch for that first

Maybe we should discuss the API first; I don't like people doing lots of code refactoring without knowing what it should look like. I have more ideas: maybe we should name it "BinaryField", which would then support everything: indexing, storing, doc values. Currently binary only works for stored fields and so on. So we should carefully think about that.

The precisionStep (in that case not with "bit" as the unit, but whole "bytes") would just be an option for those binary terms. We can then add a BinaryRangeQuery that can optionally use a precision step.

Under the hood, we could also add precisionStep support for TermRangeQuery (just an idea)...

asfimport commented 10 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

Maybe we should discuss the API first

I'm also interested in how we plan to encode terms. NumericUtils currently encodes shifts on a single byte but if we aim at supporting variable-length data, I guess we either need to enforce a maximum length of 255 on terms or use a different encoding.

asfimport commented 10 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

I'm also interested in how we plan to encode terms. NumericUtils currently encodes shifts on a single byte but if we aim at supporting variable-length data, I guess we either need to enforce a maximum length of 255 on terms or use a different encoding.

I agree! There are 2 problems:

asfimport commented 10 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

About the API: We should not add new types at all. We need 2 things:

asfimport commented 10 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

Longer terms are only prefixed up to a maximum length; the remaining part is only stored at full precision. In fact we only apply prefix terms to the first n bytes of the term, and everything longer gets stored at full precision only. I think it makes no sense to have prefixes longer than maybe 8 bytes in the index.

+1 I think it makes sense to have a maximum prefix length (hopefully configurable) and to enforce that this maximum is less than 255.

This makes me wonder whether it would be nice to have something that adapts itself to the data and only indexes prefixes that match more than X terms (maybe similarly to the way the block tree terms dict tries to share prefixes). But this looks significantly harder to implement! Maybe #6485 could be of interest as well: for example, if the sets of terms that have "XY" and "XYZ" as prefixes are the same, they could point to the same postings list. (Just wild ideas, feel free to ignore them :))

asfimport commented 10 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

(parameter like precisionStep or FieldType - I think FieldType is better to drive the range query, so it automatically knows how to enumerate terms). By default, if no FieldType is given, it just assumes non-prefix-encoded terms.

I opened #6667 about passing FieldType to NumericRangeQuery, too.

asfimport commented 10 years ago

David Smiley (@dsmiley) (migrated from JIRA)

This is going to be made largely obsolete by #6941. All that will remain to do is encode the IPv6 address into a single term, probably with the 16-byte representation, and that's it. Alternatively you might use half of each byte, and thus 32 bytes, which could result in even faster range queries... but I probably wouldn't bother.

asfimport commented 10 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

Agreed, I think we can close this issue.

asfimport commented 10 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

I think even if we somehow get #6941 in, there is still work to do under this issue, e.g. to add sugar to translate the IP address to the right byte[], to accept range queries in the query parser, ...

asfimport commented 8 years ago

David Smiley (@dsmiley) (migrated from JIRA)

Closing; it's already been done via BinaryPoint in Lucene 6.0.
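
For reference, a minimal sketch of what that looks like with BinaryPoint in Lucene 6.0+ (the field name "ip" and the addresses are just examples; IndexWriter/IndexSearcher boilerplate omitted):

```java
import java.net.InetAddress;

import org.apache.lucene.document.BinaryPoint;
import org.apache.lucene.document.Document;
import org.apache.lucene.search.Query;

public class Ipv6PointExample {
  public static void main(String[] args) throws Exception {
    // index the 16 raw address bytes as a single-dimension binary point
    byte[] addr = InetAddress.getByName("2001:db8::1").getAddress();
    Document doc = new Document();
    doc.add(new BinaryPoint("ip", addr));

    // byte-wise inclusive range query over the same 16-byte representation
    Query range = BinaryPoint.newRangeQuery("ip",
        InetAddress.getByName("2001:db8::").getAddress(),
        InetAddress.getByName("2001:db8::ffff").getAddress());
  }
}
```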