Closed jtibshirani closed 5 years ago
Pinging @elastic/es-search-aggs
The initial example on #25312 suggests that object fields would be indexed like text since it indexes city: "New York" as [ "city:new", "city:york" ] but your plan here doesn't mention analysis and gives examples with fields that are usually indexed as keywords like referer and content-type. Which route do we want to follow? Keyword-like indexing could be achieved with a keyword analyzer, but I feel like there is more ask for actual keywords which might enable support for aggregations in the future like you mentioned.
Additionally, tokens are created for each value alone: text/html, https://google.com
Why do we plan to index individual tokens alone? I suspect most users won't want/need to search the entire object, meaning that this feature will double the size of the inverted index for a feature they don't need? Since you already mentioned having some sort of copy_to support, it could be a way that users who need this feature could copy the content of all fields to another catch-all field?
As a first pass, the following query types will be allowed: term, terms, terms_set, range (without special support for numerics), prefix, match family (insofar as they work for keyword fields), common, query_string, simple_query_string, exists. Highlighting will be supported.
If we go with keyword-style indexing, then we should probably skip highlighting, which is only useful with text fields? (matched queries are typically used instead for structured content)
Explore adding support for aggregations + sorting. This idea needs a lot more research, but could maybe be accomplished by creating additional 'doc value fields', then adding a filtering layer when fetching doc values that checks for the field prefix.
+1 I suspect it will be quite easy actually.
Create benchmarks for searching based on object keys.
That would be nice of course, but I'm not worried about it being slow since term queries on an indexed object would translate to a term query at the lucene level.
Thanks @jpountz for your thoughts.
I feel like there is more ask for actual keywords which might enable support for aggregations in the future like you mentioned.
I agree — in the potential use cases we’ve seen, the data is better modelled as keywords than text. The most critical feature is the ability to filter by an exact match on key-value pairs, and performing aggregations and sorting on the values would also be nice. A couple examples of these use cases:
My sense is that we should focus on keywords for now, but in the future we could consider support for some simple analysis/ normalization, pending feedback on the feature.
Why do we plan to index individual tokens alone?
It could be nice if users were able to search an entire object field (e.g. {"headers": "en.wikipedia.org"}), since the object's keys might be unknown or non-standardized. This is admittedly quite speculative, and partially just based on our thoughts on the original issue — I’ll try to collect more feedback here.
I'm not worried about it being slow since term queries on an indexed object would translate to a term query at the lucene level.
Right, that makes sense! I’ll just plan on a sanity check here.
I also wanted to clarify a point that was a bit fuzzy to me until I did a prototype. Under the proposed implementation, we are planning to index the entire JSON blob as a lucene field, and apply a special analyzer to create tokens that resemble keywords. This is in contrast to an approach where we create a new field mapping for each key in the object (which I think would be messy and negate some of the benefit of the feature).
Taking this approach means that in kibana and other clients, the field will be displayed as a single block of JSON. I created a quick example using the prototype implementation:
id, name, and image mapped as keywords, and labels as an 'indexed object'
{"term": {"docker.container.labels.io.kubernetes.pod.name": "kafka-0"}}
As these JSON blobs can be quite large, highlighting seemed useful in showing where the match actually occurred. I also wonder if we should support highlighting for consistency with keyword fields, as some clients (like kibana) do depend on highlighting for displaying these matches?
I'm curious how this will work in practice as mixing up pre/post tags with a JSON structure sounds challenging? The other thing that worries me a bit is that if we want to support any highlighter that is able to use indexed offsets or term vectors, then when we extract the JSON object from the source document at search time, we must make sure that it produces exactly the same string as what was passed to the object mapper so that offsets are comparable, and any new/removed spaces and line breaks or reordering of keys would break highlighting?
Now that I read my comment again, the latter doesn't make sense, as we will not allow enabling term vectors or indexing offsets, so the highlighter will have to recompute the matched offsets anyway.
I don't think I've dug enough into the details of highlighting to understand the concerns, but my takeaway is that it may be tricky to find a robust approach (and we should be open to punting on highlighting for v1).
As for the question around non-prefixed tokens, what do you think about this plan? Whether or not to index non-prefixed tokens can be controlled through a flag, to give users the opportunity to try it out without forcing them to double their inverted index size. From my initial experiment, indexing the raw tokens doesn't add much more work/ complexity. We can mark the feature 'experimental' at first, to allow time to collect feedback about this flag, and also about analysis, a copy_to mechanism, etc. Different people I've spoken with have had different intuitions on this point, and it's been hard to come to a good decision a priori.
Whether or not to index non-prefixed tokens can be controlled through a flag, to give users the opportunity to try it out without forcing them to double their inverted index size. From my initial experiment, indexing the raw tokens doesn't add much more work/ complexity.
I agree that complexity of the implementation is fine, I'm more concerned about the API as we should strive to have as few switches as possible, especially for a v1. To me the question of this switch boils down to the problem that we are trying to solve: either we want to allow users to actually index objects, in which case indexing raw values makes sense, or we want to allow users to avoid the overhead of mappings and Lucene fields when indexing keywords and then it makes less sense?
That is a nice way to frame it! I am thinking of it as the former (providing a true 'object' field). I think it fits better with the use cases/ data we’ve seen, which center on indexing opaque JSON objects (metric beats, user-provided blobs of data, etc.). To me, the most compelling use for this feature is in being able to work with object data that is difficult to model otherwise, and not just saving on indexing cost when working with keywords. I will try to get some more consensus/ clarity on this point, and then loop back.
We had a discussion offline, and came to the following conclusions:
object mapping. It will be important to allow for certain keys to be 'promoted' into their own dedicated fields, and we need to find a good mechanism to do so. One approach which we liked was to extend copy_to to work on entire objects, so that the same JSON blob could be added both as a 'queryable object' field, and also as normal object with explicit subfield definitions.
header.*.

@colings86 @romseygeek I’ve given some thought to naming and have laid out some options. It would be great to get your opinions as well.
Options I don’t think are very strong:
- object: clashes with the way we refer to traditional object mappings, and is actively causing confusion (see https://github.com/elastic/elasticsearch/pull/35063#issuecomment-434303654 for example).
- indexed_object: seems verbose, and for me is not very intuitive. Could also cause confusion around how it relates to the current object type.
- key_value: this doesn’t accurately describe the input data, since it is a JSON blob and not a list of key-value pairs. At first I thought this name could fit, because it could refer to how we choose to model the data in Lucene. But it doesn’t really work, since we index 'keyless' leaf values and also create a single stored field for the whole JSON blob. One piece of evidence is that the names RootKeyValueFieldType and KeyedKeyValueFieldType are quite awkward.
- map: this doesn’t describe the input data that accurately either, since it could imply a flat key-value structure. May be confusing given our use of the term mapping.

Current favorites:
- json: accurately describes the structure of the input data. One downside is that since the whole document is JSON, users may see this field and think they should always use it. I think we can mitigate this concern through clear documentation.
- blob: kind of generic, but I like that it suggests the field contents are opaque and don’t need a pre-defined schema.
- dictionary: has similar issues to map about not accurately describing the structure of the input. However, it’s not easily confused with other terms.

@jtibshirani I agree with you on the ones you list as "Options I don’t think are very strong".
On the "current favourites" I have the following thoughts:
- json: This is the one I am leaning towards at the moment. The downsides are definitely a factor, but I agree that documentation should help here.
- blob: I'm not so keen on this one, as a blob sounds like something we don't touch and is stored/indexed and retrieved as is, which is what the binary field type does, so the name feels wrong to me.
- dictionary: I agree that it has the same problems as map. I'm not against this one, but I don't love it either.

Just throwing out a few ideas. Don't really think any of them are winners, but may spark an idea elsewhere. :)
- structure / structured: field has some kind of structure to it that we minimally parse. Sorta like object but without the overloaded use.
- deconstructed / dissected / extracted: we deconstruct/dissect/cut open the json to find its internal structure
- implicit / implied / latent: the field has some kind of implicit or hidden structure that we attempt to analyze into tokens

@jtibshirani thanks again for letting me know the feature branch was ready to look at! I created an issue on the Kibana repo to start tracking our research.
At the moment the biggest issue I'm seeing is that Kibana has no way to know what sub fields might be present in the objects. This prevents us from autocompleting those field names in the query bar and it also prevents the user from creating filters (the pills below the query bar) on those fields because we currently present them with a dropdown to select the field, populated from our index pattern's field list. I realize this is sort of the point of the new type, but I'm wondering if ES could somehow track which sub field names it has seen and expose that information to Kibana? I think it would dramatically improve the user experience for querying on these fields.
Thanks @Bargs for taking a look, I have some questions that I will ping you about offline.
@jsoriano would you mind if we moved your question over to the original issue? I was hoping to keep the discussion here focused on implementation details as opposed to use cases.
A note to document the results of performance benchmarks. In summary, the results looked good overall; the only surprise was the small increase in index size when using an embedded_json field.
For the testing set-up, I ran the metricbeat track on an n1-standard-8 GCP instance. In the baseline, the track is run without modifications, and all fields are mapped individually.
To test the performance of JSON fields, the object field system.process.cgroup was changed to embedded_json in the mappings. Some statistics about the field:
In the context of metricbeat data, system.process.cgroup is not a perfect candidate for an embedded_json field. If these benchmarks are added to our standard rally tracks, it would be good to extend the metricbeat data with a field like docker.container.labels, which is a more natural fit for the field type.
Term Query
To test query performance, the following operation was added:
{
  "name": "term_query",
  "operation-type": "search",
  "cache": false,
  "body": {
    "size": 50,
    "query": {
      "term": {
        "system.process.cgroup.blkio.id": "runsvdir.service"
      }
    }
  }
}
As expected, queries perform very similarly to the baseline, where the subfield had been mapped individually as a keyword.
| Metric | Task | Baseline | Contender | Diff | Unit |
| 50th percentile service time | term_query | 20.0905 | 20.9434 | 0.8529 | ms |
| 90th percentile service time | term_query | 22.0021 | 24.3978 | 2.39568 | ms |
| 99th percentile service time | term_query | 60.7888 | 55.5902 | -5.19856 | ms |
| 100th percentile service time | term_query | 64.9998 | 56.051 | -8.9488 | ms |
Terms Aggregation
The following terms aggregation was also tested:
{
  "name": "terms_agg",
  "operation-type": "search",
  "cache": false,
  "body": {
    "size": 0,
    "query": {
      "match_all": {}
    },
    "aggs": {
      "blkio_ids": {
        "terms": { "field": "system.process.cgroup.blkio.id" }
      }
    }
  }
}
Terms aggregations were slower than the baseline, but the performance was still acceptable. From the profiling output, the bulk of the time is spent in KeyedJsonAtomicFieldData#advanceExact, calling into GlobalOrdinalMapping#nextOrd. This makes sense given the set-up, since each embedded JSON field system.process.cgroup contains a large number of distinct key-value pairs that must be traversed before landing on the right key.
| Metric | Task | Baseline | Contender | Diff | Unit |
| 50th percentile service time | terms_agg | 31.028 | 45.4184 | 14.3904 | ms |
| 90th percentile service time | terms_agg | 33.0771 | 46.7582 | 13.6811 | ms |
| 99th percentile service time | terms_agg | 34.1942 | 50.3088 | 16.1146 | ms |
| 100th percentile service time | terms_agg | 40.26 | 50.9181 | 10.658 | ms |
Indexing Performance
Indexing throughput and service time looked good: the tests showed no decline in performance. To confirm the effect, I also repeated these indexing tests with an (unrealistic) set-up where all top-level fields were mapped as embedded_json.
| Metric | Task | Baseline | Contender | Diff | Unit |
| Min Throughput | index-append | 966.739 | 952.316 | -14.4233 | docs/s |
| Median Throughput | index-append | 12036.5 | 13927.7 | 1891.22 | docs/s |
| Max Throughput | index-append | 14785.7 | 17560.6 | 2774.91 | docs/s |
| 50th percentile service time | index-append | 4553.08 | 3702.44 | -850.649 | ms |
| 90th percentile service time | index-append | 8235.1 | 7104.33 | -1130.77 | ms |
| 99th percentile service time | index-append | 12366 | 12074.3 | -291.721 | ms |
| 100th percentile service time | index-append | 12615.3 | 12994.3 | 378.96 | ms |
Index Size
In all tests I ran, index size actually increased by a small amount. This was a bit counterintuitive for me, as I had assumed that using embedded_json could help save space by using a single field instead of multiple distinct ones. I'm guessing that the difference is due to the fact that in the baseline, many of the subfields are mapped as numbers, whereas with embedded_json they are treated as keywords. As above, the effect was confirmed by repeating the tests in a set-up where all top-level fields were mapped as embedded_json.
| Metric | Task | Baseline | Contender | Diff | Unit |
| Index size | | 1.14885 | 1.22382 | 0.07497 | GB |
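To put the absolute differences in the tables above into proportion, the relative change can be computed from the baseline and contender columns. This is only a small arithmetic sketch; the helper name relative_change is hypothetical, and the input numbers are copied from the reported results.

```python
def relative_change(baseline: float, contender: float) -> float:
    """Return the contender's change relative to the baseline, in percent."""
    return (contender - baseline) / baseline * 100.0


# Values copied from the benchmark tables in this comment.
index_size_pct = relative_change(1.14885, 1.22382)   # index size, GB
terms_agg_p50_pct = relative_change(31.028, 45.4184)  # terms_agg 50th percentile, ms

print(f"index size: {index_size_pct:+.1f}%")
print(f"terms_agg p50 service time: {terms_agg_p50_pct:+.1f}%")
```

On these numbers the index grows by roughly 6.5%, while the median terms aggregation service time grows by roughly 46%.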
Thanks for testing, this looks great.
bulk of the time is spent in KeyedJsonAtomicFieldData#advanceExact, calling into GlobalOrdinalMapping#nextOrd
This suggests that we first map segment ordinals to global ordinals before checking whether the global ordinal is in the expected range. We could probably speed it up by computing the range of valid segment ordinals and then calling nextOrd on the segment-level SortedSetDocValues rather than GlobalOrdinalMapping.
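The idea of computing the range of valid ordinals up front can be illustrated with a toy model. This is a sketch, not Lucene's API: a sorted Python list stands in for a segment's term dictionary, a sorted list of integers stands in for one document's ordinals, and the names key_ord_range and values_for_key are hypothetical. Because terms are sorted, all tokens for one key occupy a contiguous ordinal range, which can be found once with binary search and then used as a cheap filter.

```python
from bisect import bisect_left

SEPARATOR = "\0"  # assumed delimiter between the key and the value


def key_ord_range(sorted_terms, key):
    """Return the [lo, hi) ordinal range of terms that start with key + SEPARATOR."""
    lo = bisect_left(sorted_terms, key + SEPARATOR)
    # The next character after the separator bounds the range from above.
    hi = bisect_left(sorted_terms, key + chr(ord(SEPARATOR) + 1))
    return lo, hi


def values_for_key(sorted_terms, doc_ords, key):
    """Return the values a document holds for one key, given its sorted ordinals."""
    lo, hi = key_ord_range(sorted_terms, key)
    values = []
    # Skip directly to the first ordinal in range instead of scanning from the start.
    for ordinal in doc_ords[bisect_left(doc_ords, lo):]:
        if ordinal >= hi:
            break
        values.append(sorted_terms[ordinal][len(key) + len(SEPARATOR):])
    return values


terms = sorted([
    "blkio.id\0runsvdir.service",
    "blkio.path\0/system.slice",
    "cpu.id\0runsvdir.service",
    "id\0abc",
])
doc_ords = [0, 1, 2, 3]  # a document containing every term

print(values_for_key(terms, doc_ords, "blkio.id"))
```

In this model the range check is done on segment-level ordinals only, so no per-ordinal mapping step is needed before deciding whether a value belongs to the requested key.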
If you still have the index handy, I'd be curious to know what the difference is in terms of Lucene memory usage. I'm expecting a bit more, but I'd be curious to know how much more.
This suggests that we first map segment ordinals to global ordinals before checking whether the global ordinal is in the expected range. We could probably speed it up by computing the range of valid segment ordinals...
That makes sense, I will look into whether it's possible to do this in a clean way. I think this benchmark presents a tough case for aggregations, in that each embedded_json field contains many unique key-value pairs (in many use cases the field could contain far fewer).
As for lucene memory usage, here is the relevant output from the rally benchmarks. Let me know if there are any other measurements that would be interesting:
| Metric | Task | Baseline | Contender | Diff | Unit |
| Heap used for segments | | 2.03671 | 1.8357 | -0.20101 | MB |
| Heap used for doc values | | 1.02371 | 0.573326 | -0.45038 | MB |
| Heap used for terms | | 0.648928 | 0.921001 | 0.27207 | MB |
| Heap used for points | | 0.125407 | 0.104141 | -0.02127 | MB |
Thanks for running this test. I expected memory usage to be higher for embedded_json, so I'm happy to be proven wrong!
The initial version of the feature was merged in #42541 and backported to 7.3. I filed #43805 to track follow-up improvements.
Main issue: #25312 Feature branch: https://github.com/elastic/elasticsearch/tree/object-fields
Note: this field type was previously called embedded_json, so many PRs + comments will refer to that name.

Motivation
Documents sometimes contain large objects, where only a small number of the fields are frequently used in searches. By default, we create dynamic mappings for all key-value pairs in the object, and index each one as a separate field. This has a number of downsides:
In some cases, the number of field keys is not just a large known number, but unbounded. Here, it can be difficult to successfully model the data at all.
Feature Summary
This feature will allow an entire JSON object to be indexed into a field, and provide limited search functionality over the field's contents. Given an object field header of the form {"content-type": "text/html", "referer": "https://google.com"}, its content will be analyzed into the individual tokens content-type\0text/html, referer\0https://google.com (where \0 is some suitable delimiter). Additionally, tokens are created for each value alone: text/html, https://google.com. Each leaf value in the object becomes its own token, and no further analysis is applied to the individual values.

In addition to being able to retrieve the JSON blob (through fetching source, or as a stored field), we plan to support queries of the following forms:
- header, value: application/json, for example {"term": {"header": "application/json"}}
- header.content-type, value: application/json, for example {"term": {"header.content-type": "application/json"}}
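The tokenization and the two query forms described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the actual analyzer: the function names flatten, tokenize, and term_matches are hypothetical, \0 is used as the placeholder delimiter from the description, nested keys are joined with dots, and the handling of arrays and nulls is a guess.

```python
import json

SEPARATOR = "\0"  # placeholder delimiter, as in the description above


def flatten(obj, prefix=""):
    """Yield (dotted key path, leaf value as string) pairs for a JSON object."""
    if isinstance(obj, dict):
        for key, value in obj.items():
            yield from flatten(value, f"{prefix}.{key}" if prefix else key)
    elif isinstance(obj, list):
        # Assumption: array elements share the key path of the array itself.
        for item in obj:
            yield from flatten(item, prefix)
    elif obj is not None:
        # Assumption: nulls are skipped; non-string leaves use their JSON form.
        yield prefix, obj if isinstance(obj, str) else json.dumps(obj)


def tokenize(obj):
    """Return (keyed tokens, keyless tokens) for one object value."""
    keyed, keyless = [], []
    for path, value in flatten(obj):
        keyed.append(f"{path}{SEPARATOR}{value}")
        keyless.append(value)
    return keyed, keyless


def term_matches(field, query_field, query_value, obj):
    """Emulate a term query against either the whole field or one key."""
    keyed, keyless = tokenize(obj)
    if query_field == field:
        # Query on the root field: only the keyless tokens are searched.
        return query_value in keyless
    if query_field.startswith(field + "."):
        key = query_field[len(field) + 1:]
        return f"{key}{SEPARATOR}{query_value}" in keyed
    return False


header = {"content-type": "text/html", "referer": "https://google.com"}
keyed, keyless = tokenize(header)
print(keyed)
print(keyless)
```

In this sketch a root-level query never sees the keyed tokens, which mirrors the point that searching the field for a prefixed token returns nothing.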
Note that it is not possible to search the prefixed tokens directly, i.e. the following query will not return results: {"term": {"header": "content-type\0application/json"}}.

As a first pass, the following query types will be allowed: term, terms, terms_set, range (without special support for numerics), prefix, match family (insofar as they work for keyword fields), query_string, simple_query_string, exists.

In this first version, it will not be possible to refer to field keys using wildcards, as in {"header.content-*": "application/json"}. Under the proposed API/implementation, supporting field wildcards would add significant complexity and uncertainty around performance.

Potential Extensions
- copy_to to work on entire objects, so that the same JSON blob could be added both as a 'queryable object' field, and also as normal object with explicit subfield definitions.
- prefix_length, we could likely support wildcard, regexp, and fuzzy queries.
- match_phrase.

Implementation Plan

Core items:
- {"header": "application/json"}. #33923
- {"header.content-type": "application/json"}. #34207 #34621
- embedded_json. #40712