elastic / elasticsearch

Free and Open Source, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch

Optimize exists queries on object fields #64837

Closed keenangraham closed 2 years ago

keenangraham commented 4 years ago

Thanks for making an awesome search tool.

We are in the process of upgrading our production Elasticsearch version from 5.6 to 7.9.3 and have noticed a serious degradation in response time for exists queries on object fields. We make heavy use of exists queries in query contexts as well as in filter aggregations in order to provide flexible, faceted search over multiple document types containing complex metadata (e.g. https://www.encodeproject.org/search/?type=Experiment). By complex I mean objects, or arrays of objects nested inside other objects, and so on.

Our API allows users to filter documents based on the existence of arbitrary fields/subfields, including fields that are object type. In 5.6 an exists query on an object type field was converted into what seemed like a relatively quick ConstantScore(_field_names:foo.bar) query. In 7.9.3 it is recursively expanded into a DocValuesFieldExistsQuery on every child path. In our case this results in a boolean query with thousands of DocValuesFieldExistsQuery terms, recalculated for every filter aggregation that backs a facet. What was previously a quick query now runs into the default Python client timeout of 30 seconds.
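To make the facet pattern concrete, here is a minimal Python sketch of the shape of request we send (field names like `assay_title` are hypothetical, and this is a simplification of our real queries, not the exact production code):

```python
def faceted_search_body(object_field, facets):
    """Build a search body where each facet is a filter aggregation that
    repeats an exists filter on an object field. The expanded exists
    query is then evaluated once per facet, which is why the cost
    scales with the number of facets rendered."""
    exists_filter = {"exists": {"field": object_field}}
    return {
        "size": 0,
        "query": {"bool": {"filter": [exists_filter]}},
        "aggs": {
            name: {
                "filter": exists_filter,  # repeated for every facet
                "aggs": {name: {"terms": {"field": keyword_field}}},
            }
            for name, keyword_field in facets.items()
        },
    }

# One facet over a hypothetical keyword field, while filtering on
# existence of the "experiments" object:
body = faceted_search_body("experiments", {"assay": "assay_title"})
```

Each additional entry in `facets` adds another copy of the exists filter, so any per-evaluation cost of the expanded query is multiplied by the number of facets.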

From what I can tell, the work in https://github.com/elastic/elasticsearch/pull/26930 by @colings86 and @jpountz to move the implementation of the exists query out to the concrete mapping types makes sense and saves the indexing overhead of _field_names.

However, since object isn't a concrete type (it has no corresponding Lucene type?), it doesn't have its own exists query implementation (https://github.com/elastic/elasticsearch/blob/master/server/src/main/java/org/elasticsearch/index/mapper/ObjectMapper.java) and has to be handled specially in the query builder: https://github.com/elastic/elasticsearch/blob/c4e3fc45f0a781d528c198866ee3de178f562991/server/src/main/java/org/elasticsearch/index/query/ExistsQueryBuilder.java#L171-L175

And since internally documents are actually flattened:

[Screenshot: the internal, flattened representation of the indexed document fields]

The only way to run an exists query now on an object field is to expand it out to all of its concrete children paths: https://github.com/elastic/elasticsearch/blob/c4e3fc45f0a781d528c198866ee3de178f562991/server/src/main/java/org/elasticsearch/index/query/ExistsQueryBuilder.java#L182
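To illustrate the behavior (this is a Python sketch of what the expansion effectively does, not Elasticsearch's actual Java implementation), an exists query on an object field turns into a disjunction of exists queries over every leaf under it:

```python
def leaf_paths(properties, prefix=""):
    """Recursively collect the concrete (leaf) field paths under a mapping."""
    paths = []
    for name, spec in properties.items():
        if "properties" in spec:  # object field: recurse into children
            paths.extend(leaf_paths(spec["properties"], f"{prefix}{name}."))
        else:
            paths.append(f"{prefix}{name}")
    return paths

def exists_query(field, mapping):
    """Approximate the expansion: an exists query on an object field
    becomes a disjunction of exists queries on all of its leaves,
    while a concrete field stays a single exists query."""
    spec = mapping
    for part in field.split("."):
        spec = spec["properties"][part]
    if "properties" not in spec:  # concrete field: no expansion needed
        return {"exists": {"field": field}}
    return {
        "bool": {
            "should": [
                {"exists": {"field": path}}
                for path in leaf_paths(spec["properties"], f"{field}.")
            ],
            "minimum_should_match": 1,
        }
    }
```

With the labs mapping from the reproduction below, `exists_query("experiments", mapping)` yields three clauses (experiments.name and the two antibody subfields); in our production mappings the same expansion yields thousands.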

I'm wondering if there is some way to optimize this query, either by short-circuiting the expansion or by bringing back some form of the _field_names functionality for object types. From an external perspective we only care whether the internal object at that path exists, not whether any of the expanded leaf fields exist, and we would gladly trade more indexing space to maintain performance. I understand this is tricky from an implementation perspective.

Here is a simplified reproduction, though you probably won't notice the performance degradation until running at scale:

Elasticsearch version (bin/elasticsearch --version):

$ curl localhost:9201
{
  "version" : {
    "number" : "7.9.3",
    "lucene_version" : "8.6.2",
  }
}

JVM version (java -version):

$ java -version
java version "11.0.3" 2019-04-16 LTS
Java(TM) SE Runtime Environment 18.9 (build 11.0.3+12-LTS)
Java HotSpot(TM) 64-Bit Server VM 18.9 (build 11.0.3+12-LTS, mixed mode)

OS version (uname -a if on a Unix-like system):

$ uname -a
18.04.1-Ubuntu

On 7.9.3 create mapping for complex document:

curl -X PUT "localhost:9201/labs?pretty" -H 'Content-Type: application/json' -d'
{
    "mappings": {
        "properties": {
            "name": {
                "type": "keyword"
            },
            "address": {
                "type": "keyword"
            },
            "pi": {
                "properties": {
                    "email": {
                        "type": "keyword"
                    }
                }
            },
            "experiments" : {
                "properties" : {
                    "name": {
                        "type" : "keyword"
                    },
                    "antibody" : {
                        "properties": {
                            "antigen_description" : {
                                "type" : "keyword"
                            },
                            "antigen_sequence" : {
                                "type" : "keyword"
                            }
                        }
                    }
                }
            }
        }
    }
}
'

Add document:

curl -X PUT "localhost:9201/labs/_doc/1?pretty" -H 'Content-Type: application/json' -d'
{
    "name": "foo",
    "address": "bar",
    "pi": {
        "email": "foo@bar.com"
    },
    "experiments": [
        {
            "name": "abc",
            "antibody": {
                "antigen_description": "123",
                "antigen_sequence": "xyz"
            }
        },
        {
            "name": "def",
            "antibody": {
                "antigen_description": "456",
                "antigen_sequence": "zyx"
            }
        }
    ]
}
'

Look at the exists query profile:

curl -X GET "localhost:9201/labs/_search?pretty" -H 'Content-Type: application/json' -d'
{
    "query": {
        "exists": {
            "field": "experiments"
        }
    },
    "profile": true
}
'

The profile contains all expanded child paths:

ConstantScore(
    DocValuesFieldExistsQuery [field=experiments.antibody.antigen_sequence]
    DocValuesFieldExistsQuery [field=experiments.antibody.antigen_description]
    DocValuesFieldExistsQuery [field=experiments.name]
)

On 5.6 the profile only contains a single term query on _field_names:

ConstantScore(_field_names:experiments)

Ideally there would be some way to rewrite the new exists query to achieve response times similar to before. I know this isn't exactly a bug, since it works as intended, but the performance degradation is an issue that's currently preventing us from releasing 7.9.3 to production.

Any suggestions or improvements here would be great. Thanks!

elasticmachine commented 4 years ago

Pinging @elastic/es-search (:Search/Mapping)

jimczi commented 3 years ago

We discussed this internally, and while there are things we could do to optimize exists queries on object fields, the workaround to speed up the request is to pick a leaf sub-field that appears in all documents. It's also unclear why you're seeing a regression: is it because you have a lot of sub-fields?

keenangraham commented 3 years ago

is it because you have a lot of sub-fields

Right, some of our documents definitely have a lot of subfields, but exactly as many as on 5.6.

the workaround to speed up the request is to pick a leaf sub-field that appears in all documents

Internally we have discussed having a workaround like this: expanding an exists query on an object field to a known subfield that appears in all documents. In our case this is the @id field, which all documents and inner objects should have, though we don't strictly enforce it. The problem with implementing this at our application layer is that we allow users to search across all of our document types (experiments, files, libraries, etc.) for documents where a certain field exists, and in one index that field could be an object type while in another it could be a keyword type. In other words, we can have fields with the same name but different mapping types, so appending .@id whenever an object type field is encountered is a little heavy-handed.
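A sketch of what that application-layer rewrite could look like, assuming you can consult each index's mapping first (note that @id is a subfield specific to our data model, not anything Elasticsearch itself defines):

```python
def rewrite_exists_field(field, mapping):
    """If `field` resolves to an object in this index's mapping, target
    a known leaf instead by appending '.@id' (a subfield our data model
    puts on every inner object). Otherwise leave the field untouched."""
    spec = mapping
    for part in field.split("."):
        props = spec.get("properties", {})
        if part not in props:
            return field  # unknown path in this index: leave as-is
        spec = props[part]
    # Object fields carry "properties"; concrete fields carry "type".
    return f"{field}.@id" if "properties" in spec else field
```

Because the decision is made per index, the same field name can be rewritten for one document type and left alone for another, which is exactly the behavior we need but awkward to bolt on at the application layer.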

I've actually experimented with spinning up a custom ES version where this line is modified from .* -> .@id: https://github.com/elastic/elasticsearch/blob/c4e3fc45f0a781d528c198866ee3de178f562991/server/src/main/java/org/elasticsearch/index/query/ExistsQueryBuilder.java#L182

This actually works okay in terms of performance and handles different mapping types appropriately. For example, a search on embedded.replicates gets expanded to embedded.replicates.@id when embedded.replicates is an object type (experiment), but is left unchanged when it is not (library).

$ curl -X GET "localhost:9201/_search?pretty" -H 'Content-Type: application/json' -d' { "query": { "exists": { "field": "embedded.replicates" } }, "profile": true, "size": 0 } '
...
[experiment] ConstantScore(DocValuesFieldExistsQuery [field=embedded.replicates.@id])
[library] ConstantScore(DocValuesFieldExistsQuery [field=embedded.replicates])

The downside is that we're now leaking our application internals into ES and hardcoding some possibly brittle assumptions. It also seems risky to put this into production and expect to ever be able to upgrade ES again. And it's a pain to force anyone who forks our repo to use our custom ES version.

Ideally there would be a more robust solution with official support. It's not clear to me exactly how a term query on _field_names worked with object type fields (ConstantScore(_field_names:foo.bar)), or why it was so much faster. Do you know whether only the full leaf path names were stored in the _field_names field, or the object type paths as well? I guess it would be tricky to make a _field_names solution work for only object type fields while keeping DocValuesFieldExistsQuery around for concrete fields.

In any case thanks for giving this some thought!

jimczi commented 3 years ago

In previous versions, the paths of object fields were also added to the _field_names field, so a single term query sufficed. However, indexing the _field_names field has a non-negligible cost that we tried to reduce with https://github.com/elastic/elasticsearch/pull/26930, and the consequence is that exists queries on objects with lots of sub-fields are slower. That's a trade-off we find acceptable, since the number of fields under an object should be manageable.

You mention that your application layer uses a number of document types; have you considered using one index per type? Mixing multiple types in the same index should be avoided and doesn't provide any advantage over an index per type.

keenangraham commented 3 years ago

Right, to clarify: we already use one index per type. And while we usually limit the indices we search over, we do support the use case of searching across all indices for the existence of a field or a certain value.

keenangraham commented 3 years ago

That's a trade off that we find acceptable since the number of fields under an object should be manageable.

I'll also mention that we observe nearly linear degradation in response time with the number of facets we render for a search using the new exists query (e.g. one facet takes 0.676s, two take 1.307s, three take 2.842s). Underneath, these are filter aggregations with exists queries in the filter context. So even if the fields under any one object are manageable, the cost adds up when rendering a faceted search.

As a concrete example we can search over all of our Experiment objects (type=Experiment) where replicates field equals something (replicates=*): https://www.encodeproject.org/search/?type=Experiment&replicates=*

We render a number of facets to allow the user to refine their search:

[Screenshot: the facet sidebar rendered for the Experiment search]

You can see the raw ES query we build up from URL params by adding debug=true&format=json: https://www.encodeproject.org/search/?type=Experiment&replicates=*&debug=true&format=json

[Screenshot: the raw Elasticsearch query generated from the URL params]

Thanks!

willronchetti commented 3 years ago

Just want to pop in here and let you know that our system (https://github.com/4dn-dcic/fourfront, https://data.4dnucleome.org/) experiences exactly the regression Keenan is describing. We run ES 6.8 on the AWS Elasticsearch Service in production. Our system, like theirs, lets you define documents with object fields and fields that link to other objects (and potentially embed fields from those objects; one item type per index). Two optimizations, which build on each other, make this issue less crippling for us and acceptable to run in production (we are also at smaller scale):

  1. When running an exists query on an object field that is a link to another item in our system, our application translates it into an exists query on a field known to exist on all (linkTo) items. This is exactly the solution suggested above; it works, but it's an application-side hack that I don't think should be necessary. It is non-trivial to apply this generally to non-linkTo object fields in our system, so if a query-string URL happens to translate into an exists query on a sufficiently large object field, the query is anywhere from 20-50x slower than usual (if you're lucky) and often results in a 504 on the application side. If the object field is large enough, the query fails to generate altogether (without raising max_clause_count).
  2. When deciding which fields to embed in the source document, we are selective: every embedded field must be explicitly specified. This greatly reduces the size of our object fields, as some of our documents are quite large. (It doesn't matter for us here, since we already optimize the query as described above.) In Keenan's case they embed all fields by default, so you can imagine it is common for them to search on a large object field.

To emphasize: the performance degradation is extreme. With optimization 1 applied, we get single-digit-millisecond response times as expected, since we are doing a single exists query on a string field known to be present on all objects. Without it, queries become very slow. I have not verified precise linearity, but generally the larger the object field, the slower the query. We have some object fields that are large (but do not trigger max_clause_count errors) and that consistently query ~500x slower due to the abundance of exists sub-queries. You can see why by looking at the profile, as Keenan did, which shows DocValuesFieldExistsQuery [field=<every_single_child_path>] for every leaf. This is especially punishing when doing aggregations to support faceted search.

I think the point @keenangraham and I are making is that Elasticsearch should perform this optimization internally for queries of this type. One might also consider updating the documentation so users are aware of this behavior and can make these optimizations themselves without having to fork, which obviously isn't even a possible solution for my org.

javanna commented 2 years ago

I have been catching up on this issue, as exists queries against objects came up in a recent discussion. I am sorry about the regression that users have been seeing for a long time now.

I find it a bit surprising that users rely on exists queries against objects, and I have been wondering what their use case is. Mostly I worry that this behaviour can be trappy: effectively, an exists query against a non-leaf field is equivalent to an exists query against field.*, which has a couple of implications.

All in all, I wonder whether, instead of supporting exists queries against objects by expanding the provided pattern, we should have required users to explicitly provide the wildcard pattern in the first place. The current behaviour does not strictly ensure that an object exists, but rather that some field is defined in the document within the namespace of the provided field name, which is much better represented by the field.* pattern.
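A toy illustration of that equivalence, using the leaf fields of the labs mapping from earlier in this thread (fnmatch stands in for Elasticsearch's field-name resolution; this is not how the server actually implements it):

```python
import fnmatch

# Leaf fields indexed for the "labs" example earlier in this thread.
leaf_fields = [
    "name",
    "address",
    "pi.email",
    "experiments.name",
    "experiments.antibody.antigen_description",
    "experiments.antibody.antigen_sequence",
]

def matches_exists(pattern, populated_fields):
    """A document matches when any of its populated leaves falls under
    the pattern. exists on the "experiments" object behaves the same as
    exists on "experiments.*": both are satisfied by ANY leaf in that
    namespace, not by the object "itself"."""
    return any(
        field in populated_fields
        for field in fnmatch.filter(leaf_fields, pattern)
    )
```

So a document whose only populated field under experiments is experiments.name still matches, which is the trappy part: the query checks the namespace, not the object.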

This is part of a broader discussion around whether we even need objects in our mappings in the first place, given that they are a purely artificial construct with no corresponding representation in Lucene, where the data is actually stored. We don't need objects in the mappings to support the current exists query behaviour against them.

I would say that the current expansion of exists queries against objects is technically correct, and we have no other way, given that we lean on the corresponding exists query in Lucene whenever leaf fields have doc_values enabled. One alternative that would lead to different behaviour would be to look at the _source (this can be done through a runtime field too), but that would certainly not be faster than the current approach.

With that said, I am sorry to report that I don't see the team focusing on optimizing exists queries against object fields for the time being, so I am closing this issue. In hindsight we would have exposed this functionality differently, but we are not keen on breaking existing behaviour at this point either, so we will leave things as they are.