Closed: keenangraham closed this issue 2 years ago.
Pinging @elastic/es-search (:Search/Mapping)
We discussed internally, and while there are things that we could do internally to optimize the `exists` query for object fields, the workaround to speed up the request is to pick a leaf sub-field that appears in all documents. It's also unclear why you're seeing a regression; is it because you have a lot of sub-fields?
> is it because you have a lot of sub-fields
Right, some of our documents definitely have a lot of subfields, but exactly the same amount as on 5.6.
> the workaround to speed up the request is to pick a leaf sub-field that appears in all documents
Internally we have discussed having a workaround like this, expanding an `exists` query on an `object` field to a known subfield that appears in all documents. In our case this is the `@id` field, which all documents and inner objects should have but we don't strictly enforce. The problem with implementing this at our application layer is that we allow users to search across all of our document types (`experiments`, `files`, `libraries`, etc.) where a certain field exists, and in one case the field could be an `object` type and in another it could be a `keyword` type. In other words we can have fields with the same name but different mapping types, so appending `.@id` whenever an `object` type field is encountered is a little heavy-handed.
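The application-side rewrite discussed above can be sketched as follows. This is a minimal illustration under stated assumptions, not our actual code: the mapping shapes are hypothetical, the `@id` convention is the one described in this thread, and a per-index mapping lookup is assumed (which is exactly why a blanket rewrite is heavy-handed when the same field name has different types across indices):

```python
def rewrite_exists_field(field, mapping):
    """If `field` resolves to an object in this index's mapping,
    rewrite it to the '@id' leaf expected on every inner object;
    leave concrete (leaf) fields untouched."""
    node = None
    current = mapping.get("properties", {})
    for part in field.split("."):
        node = current.get(part)
        if node is None:
            return field  # unknown field: leave as-is
        current = node.get("properties", {})
    # Object fields in an ES mapping have 'properties' and no concrete 'type'.
    if "properties" in node and "type" not in node:
        return field + ".@id"
    return field

# Hypothetical per-index mappings mirroring the experiment/library example below:
experiment_mapping = {"properties": {"embedded": {"properties": {
    "replicates": {"properties": {"@id": {"type": "keyword"},
                                  "status": {"type": "keyword"}}}}}}}
library_mapping = {"properties": {"embedded": {"properties": {
    "replicates": {"type": "keyword"}}}}}

rewrite_exists_field("embedded.replicates", experiment_mapping)  # 'embedded.replicates.@id'
rewrite_exists_field("embedded.replicates", library_mapping)     # 'embedded.replicates'
```

The catch, as noted above, is that this rewrite must be decided per index, since the same field name can be an `object` in one index and a `keyword` in another.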
I've actually experimented with spinning up a custom ES version where this line is modified from `.*` -> `.@id`: https://github.com/elastic/elasticsearch/blob/c4e3fc45f0a781d528c198866ee3de178f562991/server/src/main/java/org/elasticsearch/index/query/ExistsQueryBuilder.java#L182
This actually works okay in terms of performance and handling different mapping types appropriately. For example searching for `embedded.replicates` gets expanded to `embedded.replicates.@id` when the `embedded.replicates` field is an `object` type (experiment), but is left the same when it is not (library).
```
$ curl -X GET "localhost:9201/_search?pretty" -H 'Content-Type: application/json' -d'
{
  "query": {
    "exists": {
      "field": "embedded.replicates"
    }
  },
  "profile": true,
  "size": 0
}
'
...
[experiment] ConstantScore(DocValuesFieldExistsQuery [field=embedded.replicates.@id])
[library] ConstantScore(DocValuesFieldExistsQuery [field=embedded.replicates])
```
The downside is that now we're leaking our application internals into ES and hardcoding some possibly brittle assumptions. It also seems risky to put this into production and expect we'll ever be able to upgrade ES again. It's also a pain to force anyone who forks our repo to use our custom ES version.
Ideally there would be a more robust solution with official support. It's not clear to me exactly how a `terms` query on `_field_names` worked with `object` type fields (`ConstantScore(_field_names:foo.bar)`), and why it was so much faster. Do you know if only the full path names were stored in the `_field_names` field, or also `object` type paths? I guess it would be tricky to make a `_field_names` solution work for only `object` type fields, and keep `DocValuesFieldExistsQuery` around for concrete fields.
In any case thanks for giving this some thought!
In previous versions, the path of the object fields was also added to the `_field_names` field, so a single term query was needed. However indexing the `_field_names` field has a non-negligible cost that we tried to reduce with https://github.com/elastic/elasticsearch/pull/26930, and the consequence is that `exists` queries on objects with lots of sub-fields are slower. That's a trade-off that we find acceptable since the number of fields under an object should be manageable.

You're mentioning that your application layer is using a number of document types; have you considered using one index per type? Mixing multiple types in the same index should be avoided and doesn't provide any advantages compared to an index per type.
Right, to clarify, we do use one index per type already. And while we usually limit the indices we search over, we do support the use case of searching over all indices for the existence of a field or a certain value, etc.
> That's a trade off that we find acceptable since the number of fields under an object should be manageable.
I'll also just mention that we observe a nearly linear degradation in response time depending on the number of facets we render for a search with the new `exists` query (i.e. one facet takes 0.676s, two facets take 1.307s, three facets take 2.842s, etc.). Underneath these are `filter` aggregations with `exists` queries in the filter context. So even if your fields are manageable, it seems like these could add up if you are trying to render some kind of faceted search.
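To make the scaling concrete, here is a sketch of how such a faceted request body could be assembled (hypothetical field names, not our actual encodeproject.org query): each facet adds another `filter` aggregation that repeats the `exists` filter, so the expanded `exists` query is evaluated once per facet.

```python
def facet_aggs(exists_field, facet_fields):
    """Build one filter aggregation per facet. Each facet repeats the
    exists filter from the main query, so the (expanded) exists query
    is evaluated once per facet -- which is why response time grows
    roughly linearly with the number of facets rendered."""
    return {
        facet: {
            "filter": {"exists": {"field": exists_field}},
            "aggs": {"values": {"terms": {"field": facet}}},
        }
        for facet in facet_fields
    }

body = {
    "size": 0,
    "aggs": facet_aggs("embedded.replicates",
                       ["embedded.status", "embedded.award.project"]),
}
```

With N facets, the expensive expanded exists query runs N times per request, matching the near-linear degradation described above.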
As a concrete example we can search over all of our `Experiment` objects (`type=Experiment`) where the `replicates` field equals something (`replicates=*`): https://www.encodeproject.org/search/?type=Experiment&replicates=*
We render a number of facets to allow the user to refine their search:
You can see the raw ES query we build up from URL params by adding `debug=true&format=json`: https://www.encodeproject.org/search/?type=Experiment&replicates=*&debug=true&format=json
Thanks!
Just want to pop in here and let you guys know that our system (https://github.com/4dn-dcic/fourfront, https://data.4dnucleome.org/) experiences exactly the same regression Keenan is describing. We are running ES 6.8 in AWS Elasticsearch Service in production. Our system, like theirs, allows you to define documents with object fields and fields that link to other objects (and potentially embed fields from that object, 1 item type per index). There are two optimizations we made that build on each other that make this issue less crippling for us, which is why it is acceptable to run in production (we are also at smaller scale).

For an `exists` query on an object field that is a link to another item in our system, instead of doing an `exists` query on the object, our application translates this into an `exists` query on a field known to exist on all (linkTo) items (this is exactly the solution suggested, which, while it works, is an application-side hack that I don't think should be necessary). It is non-trivial to apply this generally to non-linkTo object fields in our system, so if on our application you happen to specify a query string URL that translates to an `exists` query on a sufficiently large object field, the query will be anywhere from 20-50x slower than usual (if you're lucky) and often results in a 504 application side. In fact if the object field is large enough the query will fail to generate altogether (without raising `max_clause_count`). To emphasize, the performance degradation is extreme. If optimization 1 is applied, we get single-digit ms response times as expected, since we are doing a single `exists` query on a string field we know is present on all objects. However if it is not, the queries become very slow. I have not verified precise linearity, but generally it seems the larger the object field, the slower the query. We have some object fields that are large (but do not trigger `max_clause_count` errors) and query consistently ~500x slower due to the abundant `exists` sub-queries. You can see why this is by looking at the profile, as Keenan did, which shows a `DocValuesFieldExistsQuery [field=<every_single_child_path>]` for every child path. This is especially punishing when doing aggregations to support faceted search.
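The expansion cost can be mimicked offline. A sketch (hypothetical mapping; the real expansion happens in Elasticsearch's `ExistsQueryBuilder`): an `exists` query on an object field turns into one sub-query per concrete leaf path beneath it, so cost grows with the number of sub-fields.

```python
def leaf_paths(node, prefix=""):
    """Recursively collect concrete (leaf) field paths under a mapping node."""
    paths = []
    for name, sub in node.get("properties", {}).items():
        if "properties" in sub:
            paths.extend(leaf_paths(sub, f"{prefix}{name}."))
        else:
            paths.append(f"{prefix}{name}")
    return paths

# A tiny hypothetical object field; real ones can have thousands of leaves.
replicates = {"properties": {
    "@id": {"type": "keyword"},
    "library": {"properties": {"accession": {"type": "keyword"},
                               "nucleic_acid_term": {"type": "keyword"}}},
}}

paths = leaf_paths(replicates, "replicates.")
# One exists sub-query per leaf path, OR'ed together in a boolean query:
expanded = {"bool": {"should": [{"exists": {"field": p}} for p in paths],
                     "minimum_should_match": 1}}
```

Three leaves here means three sub-queries; an object with thousands of leaves means thousands of sub-queries, re-evaluated for every facet aggregation that repeats the filter.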
I think the point that @keenangraham and I are trying to make here is that this should be done internally by Elasticsearch to optimize queries of this type. One might also consider updating the documentation so users are aware of this behavior and can make these internal optimizations themselves without having to fork, which obviously isn't even a possible solution for my org.
I have been catching up on this issue, as `exists` queries against objects are a topic that came up in a recent discussion. I am sorry about the regression that users have been seeing for a long time now.

I find it a bit surprising that users are relying on `exists` queries against objects, and have been wondering what their use case is. Mostly I worry that their behaviour can be trappy. Effectively an `exists` query against a non-leaf field is equivalent to an `exists` query against `field.*`, which brings a couple of implications:
All in all I wonder if, instead of supporting `exists` queries against objects by expanding the provided pattern, we should have requested that users explicitly provide the wildcard pattern in the first place. The current behaviour is not strictly to ensure that an object exists, but more to ensure that some field is defined in the document within the namespace of the provided field name, which is much better represented by the `field.*` pattern.
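That equivalence can be checked mechanically: matching a set of (hypothetical) leaf paths against the `field.*` wildcard selects exactly the leaves that the object expansion targets, which is why the two behave the same.

```python
import fnmatch

# Hypothetical leaf field paths in an index:
leaf_fields = ["embedded.replicates.@id",
               "embedded.replicates.library.accession",
               "embedded.accession"]

# exists on the object "embedded.replicates" expands to every leaf under it...
object_expansion = [p for p in leaf_fields if p.startswith("embedded.replicates.")]
# ...which is the same set the explicit wildcard pattern would match:
pattern_match = fnmatch.filter(leaf_fields, "embedded.replicates.*")

assert object_expansion == pattern_match
```

In other words, the query never tests "does this object exist", only "is any field defined under this name's namespace", which the wildcard spelling makes explicit.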
This is part of a broader discussion around whether we even need objects in our mappings in the first place, given that they are a purely artificial construct with no corresponding representation in Lucene, where the data is effectively stored. We actually don't need objects in the mappings to support the current `exists` query behaviour against objects.
I would say that the current expansion of the `exists` query against objects is technically correct, and we have no other way given that we lean on the corresponding exists query in Lucene whenever leaf fields have doc_values enabled. One alternative that would lead to a different behaviour could be to look at the `_source` (can be done through a runtime field too), but that would definitely not be faster than the current way.
With this said, I am sorry to report that I don't see the team focusing on optimizing `exists` queries against object fields for the time being, hence I am closing this issue. Going back, we would have exposed this functionality differently, but we are not keen on breaking existing behaviour either at this point, so we will leave things as they are.
Thanks for making an awesome search tool.
We are in the process of upgrading our production Elasticsearch version from `5.6` to `7.9.3` and have noticed a serious degradation in response time for `exists` queries on `object` fields. We make heavy use of `exists` queries in query contexts as well as in filter aggregations in order to provide flexible and faceted search over multiple document types containing complex metadata (e.g. https://www.encodeproject.org/search/?type=Experiment). By complex I mean objects or arrays of objects inside objects, etc.

Our API allows users to filter documents based on the existence of arbitrary fields/subfields, including fields that are `object` type. In 5.6 an `exists` query on an `object` type field was converted into what seemed like a relatively quick `ConstantScore(_field_names:foo.bar)` query. In 7.9.3 it is recursively expanded to a `DocValuesFieldExistsQuery` on all children paths. In our case this results in a `boolean` query with thousands of `DocValuesFieldExistsQuery` terms, calculated over again for every filter aggregation that's used for a facet. What was previously a quick query now runs into the default Python client timeout of 30 seconds.

From what I can tell by the work on https://github.com/elastic/elasticsearch/pull/26930 by @colings86 and @jpountz, moving the implementation of the `exists` query out to concrete mapping types makes sense and saves the indexing overhead of `_field_names`.

However since `object` type isn't a concrete type (no corresponding Lucene type?) it doesn't have its own `exists` query implementation (https://github.com/elastic/elasticsearch/blob/master/server/src/main/java/org/elasticsearch/index/mapper/ObjectMapper.java) and has to be dealt with specially in the query builder: https://github.com/elastic/elasticsearch/blob/c4e3fc45f0a781d528c198866ee3de178f562991/server/src/main/java/org/elasticsearch/index/query/ExistsQueryBuilder.java#L171-L175

And since internally documents are actually flattened:
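For illustration (a made-up document, not the thread's original example), flattening turns nested objects into dotted leaf paths, which is why only leaf fields exist at the Lucene level and an object has nothing of its own to test for existence:

```python
def flatten(doc, prefix=""):
    """Flatten nested objects into dotted leaf paths, the way documents
    are effectively represented at the Lucene level: only leaves remain."""
    flat = {}
    for key, value in doc.items():
        path = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, path + "."))
        else:
            flat[path] = value
    return flat

flatten({"replicates": {"@id": "/replicates/1/",
                        "library": {"accession": "ENCLB001"}}})
# {'replicates.@id': '/replicates/1/', 'replicates.library.accession': 'ENCLB001'}
```

After flattening there is no field named `replicates` at all, only its leaves, so an `exists` check on the object has to consult every leaf path beneath it.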
The only way to run an `exists` query now on an `object` field is to expand it out to all of its concrete children paths: https://github.com/elastic/elasticsearch/blob/c4e3fc45f0a781d528c198866ee3de178f562991/server/src/main/java/org/elasticsearch/index/query/ExistsQueryBuilder.java#L182

I'm wondering if there is some way to optimize this query, either by short-circuiting the expansion or bringing back some type of `_field_names` functionality for `object` types? From an external perspective we just care whether or not the internal object at that path exists, not whether any of the expanded nested fields exist, and we would gladly trade more indexing space to maintain performance. I get this is tricky from an implementation perspective.

Here is a simplified reproduction, though you probably wouldn't notice performance degradation until running at scale:
Elasticsearch version (`bin/elasticsearch --version`):

JVM version (`java -version`):

OS version (`uname -a` if on a Unix-like system):

On `7.9.3` create mapping for complex document:

Add document:

Look at `exists` query profile:

Profile contains all expanded children paths:
On 5.6 the profile only contains a query on `_field_names`:

Ideally there would be some way to rewrite the new `exists` query to give similar response times as before. I know this isn't exactly a bug since it works as intended, but the degradation in performance is an issue that's currently preventing us from releasing `7.9.3` in production.

Any suggestions or improvements here would be great. Thanks!