apache / lucene

Apache Lucene open-source search software
https://lucene.apache.org/
Apache License 2.0

Multi-Value Support for Binary DocValues [LUCENE-10666] #11702

Open asfimport opened 2 years ago

asfimport commented 2 years ago

#11690 introduces a binary doc value format for shapes. Since the geometries are decomposed into triangles, the binary doc value encoding can technically support multi-shapes and geometry collections in a single doc value format. However, it feels like this is hacking around the lack of multi-value support for binary doc values. With multi-value binary, individual geometries can be stored in their own binary doc value, with multiple binary values per doc. I'd like to open this issue to explore adding multi-value support to binary doc values. Are there concerns, limitations, traps?


Migrated from LUCENE-10666 by Nick Knize (@nknize), 2 votes, updated Jul 29 2022

asfimport commented 2 years ago

Navneet Verma (migrated from JIRA)

Hi,

I was following the PR and would like to work on the Multi-Value Support for Binary DocValues issue (LUCENE-10666). I was wondering if there are any concerns.

msokolov commented 2 years ago

I haven't seen any objections, and it makes sense to me that we may want to have multiple values here, analogous to other doc values types.

jpountz commented 2 years ago

The historical objection against multi-value binary support is that it could be easily implemented on top of binary doc values. So multi-value binary support would add API surface and push more complexity on codecs and IndexingChain while mostly providing syntactic sugar for users.

We've been following this approach of encoding multi-valued fields in a single BinaryDocValuesField in Elasticsearch for some less common field types like range and geo-shape fields, and it's not always easy in practice: you need to collect all values before being able to encode all values into a BinaryDocValuesField that can be added to a Document, which is a bit more involved than adding new Field instances to a Document as we see them.

I wonder if there are intermediate options worth exploring, like adding tooling to make it easier to encode multiple values into a single BinaryDocValuesField on the write side, and creating wrappers around BinaryDocValues instances on the read side that expose multiple values. This would be similar to how FeatureField didn't add more surface to codec APIs and instead encodes floats as term frequencies of postings.
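To make the "tooling" idea concrete, here is a minimal sketch of what a write-side packer and a read-side unpacker could look like. It uses only the JDK; `MultiBinaryPacker`, `pack`, and `unpack` are hypothetical names invented for illustration and are not Lucene APIs, and the layout (a count followed by fixed 4-byte length prefixes) is just one assumed choice.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not a Lucene API): packs several binary values into one byte[]. */
final class MultiBinaryPacker {

    /** Write side: layout is [count:int] followed by [length:int][bytes] for each value. */
    static byte[] pack(List<byte[]> values) {
        int size = Integer.BYTES;
        for (byte[] v : values) {
            size += Integer.BYTES + v.length;
        }
        ByteBuffer buf = ByteBuffer.allocate(size);
        buf.putInt(values.size());
        for (byte[] v : values) {
            buf.putInt(v.length);
            buf.put(v);
        }
        return buf.array();
    }

    /** Read side: turns the single packed payload back into the individual values. */
    static List<byte[]> unpack(byte[] packed) {
        ByteBuffer buf = ByteBuffer.wrap(packed);
        int count = buf.getInt();
        List<byte[]> values = new ArrayList<>(count);
        for (int i = 0; i < count; i++) {
            byte[] v = new byte[buf.getInt()];
            buf.get(v);
            values.add(v);
        }
        return values;
    }

    public static void main(String[] args) {
        List<byte[]> values = List.of(
            "POINT(1 2)".getBytes(StandardCharsets.UTF_8),
            "POINT(3 4)".getBytes(StandardCharsets.UTF_8));
        // All values must be collected before packing, which is the inconvenience noted above.
        byte[] packed = pack(values);
        // With Lucene on the classpath, `packed` would be wrapped in a BytesRef
        // and added to the Document as one BinaryDocValuesField.
        for (byte[] v : unpack(packed)) {
            System.out.println(new String(v, StandardCharsets.UTF_8));
        }
    }
}
```

A read-side wrapper in this spirit would call something like `unpack` on the bytes returned for each document, so the codec still sees a single binary value per document.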

rmuir commented 2 years ago

The use-case here is also not great, talking about a doc having multiple locations. It's a pet peeve of mine, I don't think we should add a new major docvalues type for such crap :)

nknize commented 1 year ago

My muscle memory (and gmail filters) is stuck at jira :) So I missed these.

> The use-case here is also not great, talking about a doc having multiple locations

Curious why you think a document can't have multiple locations? Why else would the geo specifications (WKT, JSON, WKB, protobuf) have Multi geometry types? They do because multiple locations can exist for a single document. It happens all the time, especially in data science applications where multiple observations are collected concurrently in a single document scan (RADAR, Multi Hypothesis Tracking). I should be able to have a multi-value doc value for running facets, aggregations, and Spark jobs over the data stored in the Lucene segment, instead of trying to hack together a single encoding that stores all of the observations at once and then post-filtering after decoding that entire binary value.

> while mostly providing syntactic sugar for users.

Except in the case of shape doc values this isn't syntactic sugar. The centroid of a multi-value shape is computed from the weighted area of the individual geometries, and the way it is currently computed and stored in a shape doc value for a multi-shape geometry is a hack because of this limitation. Assuming this is syntactic sugar just feels like a lazy way to avoid supporting a multi-value companion for binary doc values. :/
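For readers unfamiliar with the centroid point, the following is a rough sketch of a generic area-weighted centroid: each part's centroid is weighted by that part's area and the sum is divided by the total area. It is an illustration under stated assumptions, not the code Lucene uses for shape doc values; `AreaWeightedCentroid` and its parameters are hypothetical names.

```java
/** Hypothetical illustration (not Lucene code) of an area-weighted centroid. */
final class AreaWeightedCentroid {

    /**
     * @param partCentroids one {x, y} pair per component geometry
     * @param partAreas     the area of each component, used as its weight
     * @return {x, y} of the combined centroid
     */
    static double[] centroid(double[][] partCentroids, double[] partAreas) {
        double sumX = 0;
        double sumY = 0;
        double totalArea = 0;
        for (int i = 0; i < partAreas.length; i++) {
            sumX += partAreas[i] * partCentroids[i][0];
            sumY += partAreas[i] * partCentroids[i][1];
            totalArea += partAreas[i];
        }
        if (totalArea <= 0) {
            throw new IllegalArgumentException("total area must be positive");
        }
        return new double[] {sumX / totalArea, sumY / totalArea};
    }
}
```

The combined result depends on every part's individual area, so the per-geometry weights have to be available wherever the centroid is computed.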

scampi commented 1 year ago

I was involved in a previous issue that is related to this one. The problem was a drop in performance when scanning SortedSetDocValues (i.e., the keyword field in Elasticsearch). The solution was rightfully to use BinaryDocValues for this kind of access pattern (i.e., full scan of the column).

Therefore, we created a prototype that implements multi-valued binary docvalues, which works well. However, having some support for this use case directly in Lucene is preferable, be it a new docvalues type or some tooling as proposed by @jpountz.

Performance issues when scanning multi-valued binary data are probably something that would affect other use cases, e.g., the ESQL query language/engine proposed by Elastic.

jpountz commented 1 year ago

> e.g., the ESQL query language/engine proposed by Elastic.

I don't think ESQL is going to be different from existing faceting support: it will still want to use ordinals when it makes sense such as grouping by term. It will still be up to users to configure their mappings correctly for the sort of aggregation that they plan to run: SORTED(_SET) for sorting, grouping, and unique counts, BINARY for operations that require actually looking at the data and can't work on ordinals.
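As a rough illustration of the distinction (a sketch using plain JDK types, not Lucene's API): with SORTED-style doc values each document stores an ordinal into a sorted term dictionary, so grouping only increments a counter per ordinal, while BINARY-style grouping has to read and compare every raw value. `OrdinalVsBinaryGrouping` and its methods are hypothetical names.

```java
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical illustration; class and method names are not Lucene APIs. */
final class OrdinalVsBinaryGrouping {

    /** SORTED-style grouping: each doc carries an ordinal into a sorted term dictionary. */
    static int[] countByOrdinal(int[] docOrdinals, int dictionarySize) {
        int[] counts = new int[dictionarySize];
        for (int ord : docOrdinals) {
            counts[ord]++; // no term bytes are read or compared here
        }
        return counts;
    }

    /** BINARY-style grouping: every document's raw value is read and compared. */
    static Map<String, Integer> countByValue(String[] docValues) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String value : docValues) {
            counts.merge(value, 1, Integer::sum);
        }
        return counts;
    }
}
```

Both methods produce the same group counts for the same data; the difference is how much per-document work is needed to get there.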

rmuir commented 1 year ago

> Curious why you think a document can't have multiple locations? Why else would the geo specifications (WKT, JSON, WKB, protobuf) have Multi geometry types? They do because multiple locations can exist for a single document. It happens all the time, especially in data science applications where multiple observations are collected concurrently in a single document scan (RADAR, Multi Hypothesis Tracking). I should be able to have a multi-value doc value for running facets, aggregations, and Spark jobs over the data stored in the Lucene segment, instead of trying to hack together a single encoding that stores all of the observations at once and then post-filtering after decoding that entire binary value.

Because in the real world objects can only exist in one place at a time. That's an actual fact. And the way it works in the search engine, doing things like sorting by distance, really only makes sense with single-valued fields.

This is why I hate all multi-valued docvalues, because it's always so ambiguous. If I have 3 locations for the doc and I'm sorting by distance, which one should I use? etc etc.

If someone wants to encode multiple values into a binary docvalues, nothing is stopping them. They can encode integer/byte length up front, do a vint-like encoding, whatever they want.
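A minimal sketch of the "vint-like" length-prefix approach mentioned here, using only the JDK. Each value is preceded by its length written 7 bits per byte, with the high bit meaning "more bytes follow". `VIntLengthPrefix` and its methods are hypothetical names for illustration and not a Lucene API.

```java
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch: concatenates values, each preceded by a variable-length size. */
final class VIntLengthPrefix {

    /** Writes a non-negative int, 7 bits per byte; the high bit marks a continuation. */
    static void writeVInt(ByteArrayOutputStream out, int value) {
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        out.write(value);
    }

    /** Reads a vint starting at pos[0] and advances pos[0] past it. */
    static int readVInt(byte[] buf, int[] pos) {
        int value = 0;
        int shift = 0;
        byte b;
        do {
            b = buf[pos[0]++];
            value |= (b & 0x7F) << shift;
            shift += 7;
        } while ((b & 0x80) != 0);
        return value;
    }

    /** Encodes all values into a single payload: [vint length][bytes] repeated. */
    static byte[] encode(List<byte[]> values) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (byte[] v : values) {
            writeVInt(out, v.length);
            out.write(v, 0, v.length);
        }
        return out.toByteArray();
    }

    /** Decodes the payload by reading length/value pairs until the buffer is exhausted. */
    static List<byte[]> decode(byte[] buf) {
        List<byte[]> values = new ArrayList<>();
        int[] pos = {0};
        while (pos[0] < buf.length) {
            int len = readVInt(buf, pos);
            values.add(Arrays.copyOfRange(buf, pos[0], pos[0] + len));
            pos[0] += len;
        }
        return values;
    }
}
```

Short values cost a single length byte with this scheme, at the price of having to decode the payload sequentially to reach the n-th value.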

nknize commented 1 year ago

> because in the real world objects can only exist in one place at a time.

Except in geo search / analysis this depends on spatial resolution of the source data; real world geo data is not precise and often ends up with multiple documents in the same location. Analysis mechanisms (e.g., aggregations) help to dedup or further analyze and score these documents. Clearly (not through a hack) supporting these use cases along with supporting coverage areas as a multi geometry shape only makes Lucene stronger. I don't think there's anything wrong w/ supporting standards like RFC 7946 in our encoding.

> This is why I hate all multi-valued docvalues, because it's always so ambiguous. If I have 3 locations for the doc and I'm sorting by distance, which one should I use? etc etc.

It depends on spatial resolution. Besides, we support these use cases already; what we don't support is the multi-shape use case above without an unnecessarily bloated, hacky encoding.

> If someone wants to encode multiple values into a binary docvalues, nothing is stopping them. They can encode integer/byte length up front, do a vint-like encoding, whatever they want.

It's software, yes. Nothing is also stopping anyone from encoding the bible and calling it a geo_shape.

nknize commented 1 year ago

> Therefore, we created a prototype that implements multi-valued binary docvalues, which works well. However, having some support for this use case directly in Lucene is preferable, be it a new docvalues type or some tooling as proposed by @jpountz.

@scampi Would you be willing to contribute your multi-valued binary doc value implementation here? I think having multi-value parity with other doc value types is good to support multiple use cases like this. Per concerns already raised it would be good to slap warnings in the API doc that communicate potential trappy performance issues.

rendel commented 1 year ago

> I don't think ESQL is going to be different from existing faceting support: it will still want to use ordinals when it makes sense such as grouping by term.

@jpountz This may be correct for aggregate operations. However, if you wish to support join operations in ESQL at some point, then you'll need to scan the binary values rather than the ordinal values (as ordinals are not compatible with a join operation).

> Would you be willing to contribute your multi-valued binary doc value implementation here? I think having multi-value parity with other doc value types is good to support multiple use cases like this.

@nknize We do not see any problem in sharing the code, but our implementation is based on the Elasticsearch framework (on the BinaryFieldMapper.CustomBinaryDocValuesField to be exact), not Lucene, so it will likely not be relevant for this thread.

msokolov commented 1 year ago

just want to point out that objects exist in many different places in space-time

nknize commented 1 year ago

> our implementation is based on the Elasticsearch framework (on the BinaryFieldMapper.CustomBinaryDocValuesField to be exact)

@rendel I haven't looked at that implementation since Elastic relicensed so my knowledge is dated. I presume you still use the ALv2 version? In other words, does it need anything other than the byte array list and corresponding binaryValue implementation? Could you use OpenSearch's implementation?

rendel commented 1 year ago

> Could you use OpenSearch's implementation?

@nknize Yes, that is similar to this implementation.

navneet1v commented 1 year ago

@rendel so what is the final outcome here? Should we start working on Multi-Value Support for Binary DocValues?

@nknize