elastic / elasticsearch

Free and Open Source, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch

Flattened object fields design + implementation #33003

Closed jtibshirani closed 5 years ago

jtibshirani commented 6 years ago

Main issue: #25312
Feature branch: https://github.com/elastic/elasticsearch/tree/object-fields

Note: this field type was previously called embedded_json, so many PRs + comments will refer to that name.

Motivation

Documents sometimes contain large objects, where only a small number of the fields are frequently used in searches. By default, we create dynamic mappings for all key-value pairs in the object, and index each one as a separate field. This has a number of downsides:

In some cases, the number of field keys is not just large, but unbounded. Here, it can be difficult to model the data successfully at all.

Feature Summary

This feature will allow an entire JSON object to be indexed into a field, and provide limited search functionality over the field's contents. Given an object field header of the form {"content-type": "text/html", "referer": "https://google.com"}, its content will be analyzed into the individual tokens content-type\0text/html, referer\0https://google.com (where \0 is some suitable delimiter). Additionally, tokens are created for each value alone: text/html, https://google.com. Each leaf value in the object becomes its own token, and no further analysis is applied to the individual values.
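For illustration, here is a minimal sketch of the kind of mapping this would enable. The index-creation body below is a hedged example: the field name header comes from the example above, and the type name follows the embedded_json naming used on the feature branch.

{
  "mappings": {
    "properties": {
      "header": { "type": "embedded_json" }
    }
  }
}

Indexing the header object shown above into such a field would then produce both the prefixed tokens and the bare value tokens described.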

In addition to being able to retrieve the JSON blob (through fetching source, or as a stored field), we plan to support queries of the following forms:

Note that it is not possible to search the prefixed tokens directly, i.e. the following query will not return results: {"term": {"header": "content-type\0application/json"}}.

As a first pass, the following query types will be allowed: term, terms, terms_set, range (without special support for numerics), prefix, match family (insofar as they work for keyword fields), query_string, simple_query_string, exists.

In this first version, it will not be possible to refer to field keys using wildcards, as in {"header.content-*": "application/json"}. Under the proposed API/implementation, supporting field wildcards would add significant complexity and uncertainty around performance.
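For concreteness, here is a hedged sketch of the two query shapes this implies, based on the header example above. First, a keyed lookup on a specific key, using the dotted header.content-type syntax suggested by the wildcard example:

{
  "query": {
    "term": { "header.content-type": "text/html" }
  }
}

And second, a query against the whole field, which matches on any of the bare value tokens:

{
  "query": {
    "term": { "header": "text/html" }
  }
}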

Potential Extensions

Implementation Plan

Core items:

elasticmachine commented 6 years ago

Pinging @elastic/es-search-aggs

jpountz commented 6 years ago

The initial example on #25312 suggests that object fields would be indexed like text, since it indexes city: "New York" as [ "city:new", "city:york" ], but your plan here doesn't mention analysis and gives examples with fields that are usually indexed as keywords, like referer and content-type. Which route do we want to follow? Keyword-like indexing could be achieved with a keyword analyzer, but I feel like there is more demand for actual keywords, which might enable support for aggregations in the future, as you mentioned.

Additionally, tokens are created for each value alone: text/html, https://google.com

Why do we plan to index the individual tokens alone? I suspect most users won't want or need to search the entire object, meaning this would double the size of the inverted index for a capability they don't need. Since you already mentioned having some sort of copy_to support, could that be the way for users who do need this to copy the content of all fields to another catch-all field?
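For reference, the existing copy_to mechanism on individually mapped keyword fields looks roughly like this (the field names here are hypothetical); the suggestion is that an equivalent catch-all could serve the users who actually want to search the whole object:

{
  "mappings": {
    "properties": {
      "content-type": { "type": "keyword", "copy_to": "all_headers" },
      "referer": { "type": "keyword", "copy_to": "all_headers" },
      "all_headers": { "type": "keyword" }
    }
  }
}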

As a first pass, the following query types will be allowed: term, terms, terms_set, range (without special support for numerics), prefix, match family (insofar as they work for keyword fields), common, query_string, simple_query_string, exists. Highlighting will be supported.

If we go with keyword-style indexing, then we should probably skip highlighting, which is only useful with text fields? (matched queries are typically used instead for structured content)

Explore adding support for aggregations + sorting. This idea needs a lot more research, but could maybe be accomplished by creating additional 'doc value fields', then adding a filtering layer when fetching doc values that checks for the field prefix.

+1 I suspect it will be quite easy actually.

Create benchmarks for searching based on object keys.

That would be nice of course, but I'm not worried about it being slow since term queries on an indexed object would translate to a term query at the lucene level.

jtibshirani commented 6 years ago

Thanks @jpountz for your thoughts.

I feel like there is more ask for actual keywords which might enable support for aggregations in the future like you mentioned.

I agree — in the potential use cases we’ve seen, the data is better modelled as keywords than text. The most critical feature is the ability to filter by an exact match on key-value pairs, and performing aggregations and sorting on the values would also be nice. A couple examples of these use cases:

My sense is that we should focus on keywords for now, but in the future we could consider support for some simple analysis/normalization, pending feedback on the feature.

Why do we plan to index individual tokens alone?

It could be nice if users were able to search an entire object field (e.g. {"headers": "en.wikipedia.org"}), since the object's keys might be unknown or non-standardized. This is admittedly quite speculative, and partially just based on our thoughts on the original issue — I’ll try to collect more feedback here.

I'm not worried about it being slow since term queries on an indexed object would translate to a term query at the lucene level.

Right, that makes sense! I’ll just plan on a sanity check here.

jtibshirani commented 6 years ago

I also wanted to clarify a point that was a bit fuzzy to me until I did a prototype. Under the proposed implementation, we are planning to index the entire JSON blob as a lucene field, and apply a special analyzer to create tokens that resemble keywords. This is in contrast to an approach where we create a new field mapping for each key in the object (which I think would be messy and negate some of the benefit of the feature).

Taking this approach means that in kibana and other clients, the field will be displayed as a single block of JSON. I created a quick example using the prototype implementation:

(Screenshot: the object field rendered as a single block of JSON in a Kibana document view.)

As these JSON blobs can be quite large, highlighting seemed useful in showing where the match actually occurred. I also wonder if we should support highlighting for consistency with keyword fields, as some clients (like kibana) do depend on highlighting for displaying these matches?
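For reference, the request a client like Kibana would issue here is just the standard highlight syntax (the index and field names below are hypothetical); the open question is what the highlighted fragment of a large JSON blob should look like:

{
  "query": {
    "term": { "headers": "en.wikipedia.org" }
  },
  "highlight": {
    "fields": { "headers": {} }
  }
}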

jpountz commented 6 years ago

I'm curious how this will work in practice, as mixing pre/post tags with a JSON structure sounds challenging. The other thing that worries me a bit is that if we want to support any highlighter that is able to use indexed offsets or term vectors, then when we extract the JSON object from the source document at search time, we must make sure that it produces exactly the same string as what was passed to the object mapper so that offsets are comparable; any added or removed spaces, line breaks, or reordering of keys would break highlighting.

jpountz commented 6 years ago

Now that I read my comment again, the latter doesn't make sense: we will not allow enabling term vectors or indexing offsets, so the highlighter will have to recompute the matched offsets anyway.

jtibshirani commented 6 years ago

I don't think I've dug enough into the details of highlighting to understand the concerns, but my takeaway is that it may be tricky to find a robust approach (and we should be open to punting on highlighting for v1).

As for the question around non-prefixed tokens, what do you think about this plan? Whether or not to index non-prefixed tokens can be controlled through a flag, to give users the opportunity to try it out without forcing them to double their inverted index size. From my initial experiment, indexing the raw tokens doesn't add much more work/complexity. We can mark the feature 'experimental' at first, to allow time to collect feedback about this flag, and also about analysis, a copy_to mechanism, etc. Different people I've spoken with have had different intuitions on this point, and it's been hard to come to a good decision a priori.

jpountz commented 6 years ago

Whether or not to index non-prefixed tokens can be controlled through a flag, to give users the opportunity to try it out without forcing them to double their inverted index size. From my initial experiment, indexing the raw tokens doesn't add much more work/complexity.

I agree that the complexity of the implementation is fine; I'm more concerned about the API, as we should strive to have as few switches as possible, especially for a v1. To me, the question of this switch boils down to the problem we are trying to solve: either we want to allow users to actually index objects, in which case indexing raw values makes sense, or we want to allow users to avoid the overhead of mappings and Lucene fields when indexing keywords, in which case it makes less sense.

jtibshirani commented 6 years ago

That is a nice way to frame it! I am thinking of it as the former (providing a true 'object' field). I think it fits better with the use cases/data we’ve seen, which center on indexing opaque JSON objects (Metricbeat data, user-provided blobs of data, etc.). To me, the most compelling use for this feature is in being able to work with object data that is difficult to model otherwise, not just saving on indexing cost when working with keywords. I will try to get some more consensus/clarity on this point, and then loop back.

jtibshirani commented 6 years ago

We had a discussion offline, and came to the following conclusions:

jtibshirani commented 5 years ago

@colings86 @romseygeek I’ve given some thought to naming and have laid out some options. It would be great to get your opinions as well.

Options I don’t think are very strong:

Current favorites:

colings86 commented 5 years ago

@jtibshirani I agree with you on the ones you list as "Options I don’t think are very strong".

On the "current favourites" I have the following thoughts:

polyfractal commented 5 years ago

Just throwing out a few ideas. Don't really think any of them are winners, but may spark an idea elsewhere. :)

Bargs commented 5 years ago

@jtibshirani thanks again for letting me know the feature branch was ready to look at! I created an issue on the Kibana repo to start tracking our research.

At the moment the biggest issue I'm seeing is that Kibana has no way to know what sub fields might be present in the objects. This prevents us from autocompleting those field names in the query bar and it also prevents the user from creating filters (the pills below the query bar) on those fields because we currently present them with a dropdown to select the field, populated from our index pattern's field list. I realize this is sort of the point of the new type, but I'm wondering if ES could somehow track which sub field names it has seen and expose that information to Kibana? I think it would dramatically improve the user experience for querying on these fields.

jsoriano commented 5 years ago

(Moved to https://github.com/elastic/elasticsearch/issues/25312#issuecomment-442389905)

jtibshirani commented 5 years ago

Thanks @Bargs for taking a look, I have some questions that I will ping you about offline.

@jsoriano would you mind if we moved your question over to the original issue? I was hoping to keep the discussion here focused on implementation details as opposed to use cases.

jsoriano commented 5 years ago

@jtibshirani sure, moved.

jtibshirani commented 5 years ago

A note to document the results of the performance benchmarks. In summary, the results looked good overall; the only surprise was a small increase in index size when using an embedded_json field.

For the testing set-up, I ran the metricbeat track on an n1-standard-8 GCP instance. In the baseline, the track is run without modifications, and all fields are mapped individually.

To test the performance of JSON fields, the object field system.process.cgroup was changed to embedded_json in the mappings. Some statistics about the field:

In the context of metricbeat data, system.process.cgroup is not a perfect candidate for an embedded_json field. If these benchmarks are added to our standard rally tracks, it would be good to extend the metricbeat data with a field like docker.container.labels, which is a more natural fit for the field type.
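To make the mapping change above concrete, it amounted to swapping the cgroup object definition for the new field type, roughly as follows (this is an abbreviated sketch of the relevant part of the track's mappings):

{
  "properties": {
    "system": {
      "properties": {
        "process": {
          "properties": {
            "cgroup": { "type": "embedded_json" }
          }
        }
      }
    }
  }
}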

Term Query

To test query performance, the following operation was added:

{
  "name": "term_query",
  "operation-type": "search",
  "cache": false,
  "body": {
    "size": 50,
    "query": {
      "term": {
        "system.process.cgroup.blkio.id": "runsvdir.service"
      }
    }
  }
}

As expected, queries perform very similarly to the baseline, where the subfield had been mapped individually as a keyword.

|                        Metric |         Task |   Baseline |   Contender |     Diff |   Unit |
|  50th percentile service time |   term_query |    20.0905 |     20.9434 |   0.8529 |     ms |
|  90th percentile service time |   term_query |    22.0021 |     24.3978 |  2.39568 |     ms |
|  99th percentile service time |   term_query |    60.7888 |     55.5902 | -5.19856 |     ms |
| 100th percentile service time |   term_query |    64.9998 |      56.051 |  -8.9488 |     ms |

Terms Aggregation

The following terms aggregation was also tested:

{
  "name": "terms_agg",
  "operation-type": "search",
  "cache": false,
  "body": {
    "size": 0,
    "query": {
      "match_all": {}
    },
    "aggs": {
      "blkio_ids": {
        "terms": { "field": "system.process.cgroup.blkio.id" }
      }
    }
  }
}

Terms aggregations were slower than the baseline, but the performance was still acceptable. From the profiling output, the bulk of the time is spent in KeyedJsonAtomicFieldData#advanceExact, calling into GlobalOrdinalMapping#nextOrd. This makes sense given the set-up, since each embedded JSON field system.process.cgroup contains a large number of distinct key-value pairs that must be traversed before landing on the right key.

|                        Metric |         Task |   Baseline |   Contender |     Diff |   Unit |
|  50th percentile service time |    terms_agg |     31.028 |     45.4184 |  14.3904 |     ms |
|  90th percentile service time |    terms_agg |    33.0771 |     46.7582 |  13.6811 |     ms |
|  99th percentile service time |    terms_agg |    34.1942 |     50.3088 |  16.1146 |     ms |
| 100th percentile service time |    terms_agg |      40.26 |     50.9181 |   10.658 |     ms |

Indexing Performance

Indexing throughput and service time looked good; the tests showed no decline in performance. To confirm the effect, I also repeated these indexing tests with an (unrealistic) set-up where all top-level fields were mapped as embedded_json.

|                        Metric |         Task |   Baseline |   Contender |     Diff |   Unit |
|                Min Throughput | index-append |    966.739 |     952.316 | -14.4233 | docs/s |
|             Median Throughput | index-append |    12036.5 |     13927.7 |  1891.22 | docs/s |
|                Max Throughput | index-append |    14785.7 |     17560.6 |  2774.91 | docs/s |
|  50th percentile service time | index-append |    4553.08 |     3702.44 | -850.649 |     ms |
|  90th percentile service time | index-append |     8235.1 |     7104.33 | -1130.77 |     ms |
|  99th percentile service time | index-append |      12366 |     12074.3 | -291.721 |     ms |
| 100th percentile service time | index-append |    12615.3 |     12994.3 |   378.96 |     ms |

Index Size

In all tests I ran, index size actually increased by a small amount. This was a bit counterintuitive to me, as I had assumed that using embedded_json could help save space by using a single field instead of multiple distinct ones. I'm guessing that the difference is due to the fact that in the baseline, many of the subfields are mapped as numbers, whereas with embedded_json they are treated as keywords. As above, the effect was confirmed by repeating the tests in a set-up where all top-level fields were mapped as embedded_json.

|                        Metric |         Task |   Baseline |   Contender |     Diff |   Unit |
|                    Index size |              |    1.14885 |     1.22382 |  0.07497 |     GB |

jpountz commented 5 years ago

Thanks for testing, this looks great.

bulk of the time is spent in KeyedJsonAtomicFieldData#advanceExact, calling into GlobalOrdinalMapping#nextOrd

This suggests that we first map segment ordinals to global ordinals before checking whether the global ordinal is in the expected range. We could probably speed it up by computing the range of valid segment ordinals and then calling nextOrd on the segment-level SortedSetDocValues rather than GlobalOrdinalMapping.

If you still have the index handy, I'd be curious to know what the difference is in terms of Lucene memory usage. I'm expecting a bit more, but I'd be curious to know how much more.

jtibshirani commented 5 years ago

This suggests that we first map segment ordinals to global ordinals before checking whether the global ordinal is in the expected range. We could probably speed it up by computing the range of valid segment ordinals...

That makes sense, I will take a look at whether it's possible to do this in a clean way. I think this benchmark presents a tough case for aggregations, in that each embedded_json field contains many unique key-value pairs (in many use cases the field would contain far fewer).

As for lucene memory usage, here is the relevant output from the rally benchmarks. Let me know if there are any other measurements that would be interesting:

|                        Metric |         Task |   Baseline |   Contender |     Diff |   Unit |
|        Heap used for segments |              |    2.03671 |      1.8357 | -0.20101 |     MB |
|      Heap used for doc values |              |    1.02371 |    0.573326 | -0.45038 |     MB |
|           Heap used for terms |              |   0.648928 |    0.921001 |  0.27207 |     MB |
|          Heap used for points |              |   0.125407 |    0.104141 | -0.02127 |     MB |

jpountz commented 5 years ago

Thanks for running this test. I expected memory usage to be higher for embedded_json, so I'm happy to be proven wrong!

jtibshirani commented 5 years ago

The initial version of the feature was merged in #42541 and backported to 7.3. I filed #43805 to track follow-up improvements.