elastic / elasticsearch

Free and Open Source, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch

Flattened object fields design + implementation #33003

Closed jtibshirani closed 5 years ago

jtibshirani commented 6 years ago

Main issue: #25312
Feature branch: https://github.com/elastic/elasticsearch/tree/object-fields

Note: this field type was previously called embedded_json, so many PRs + comments will refer to that name.

Motivation

Documents sometimes contain large objects, where only a small number of the fields are frequently used in searches. By default, we create dynamic mappings for all key-value pairs in the object, and index each one as a separate field. This has a number of downsides:

In some cases, the number of field keys is not just large, but unbounded. Here, it can be difficult to model the data successfully at all.

Feature Summary

This feature will allow an entire JSON object to be indexed into a field, and provide limited search functionality over the field's contents. Given an object field header of the form {"content-type": "text/html", "referer": "https://google.com"}, its content will be analyzed into the individual tokens content-type\0text/html, referer\0https://google.com (where \0 is some suitable delimiter). Additionally, tokens are created for each value alone: text/html, https://google.com. Each leaf value in the object becomes its own token, and no further analysis is applied to the individual values.
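For illustration, here is a minimal sketch of the kind of mapping this would enable. The index-creation body below is a hedged example: the field name header comes from the example above, and the type name follows the embedded_json naming used on the feature branch.

{
  "mappings": {
    "properties": {
      "header": { "type": "embedded_json" }
    }
  }
}

Indexing the header object shown above into such a field would then produce both the prefixed tokens and the bare value tokens described.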

In addition to being able to retrieve the JSON blob (through fetching source, or as a stored field), we plan to support queries of the following forms:

Note that it is not possible to search the prefixed tokens directly, i.e. the following query will not return results: {"term": {"header": "content-type\0application/json"}}.

As a first pass, the following query types will be allowed: term, terms, terms_set, range (without special support for numerics), prefix, match family (insofar as they work for keyword fields), query_string, simple_query_string, exists.

In this first version, it will not be possible to refer to field keys using wildcards, as in {"header.content-*": "application/json"}. Under the proposed API/implementation, supporting field wildcards would add significant complexity and uncertainty around performance.
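For concreteness, here is a hedged sketch of the two query shapes this implies, based on the header example above. First, a keyed lookup on a specific key, using the dotted header.content-type syntax suggested by the wildcard example:

{
  "query": {
    "term": { "header.content-type": "text/html" }
  }
}

And second, a query against the whole field, which matches on any of the bare value tokens:

{
  "query": {
    "term": { "header": "text/html" }
  }
}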

Potential Extensions

Implementation Plan

Core items:

elasticmachine commented 6 years ago

Pinging @elastic/es-search-aggs

jpountz commented 6 years ago

The initial example on #25312 suggests that object fields would be indexed like text, since it indexes city: "New York" as [ "city:new", "city:york" ], but your plan here doesn't mention analysis and gives examples with fields that are usually indexed as keywords, like referer and content-type. Which route do we want to follow? Keyword-like indexing could be achieved with a keyword analyzer, but I feel like there is more demand for actual keywords, which might enable support for aggregations in the future, as you mentioned.

Additionally, tokens are created for each value alone: text/html, https://google.com

Why do we plan to index the individual tokens alone? I suspect most users won't want or need to search the entire object, meaning this would double the size of the inverted index for a capability they don't need. Since you already mentioned having some sort of copy_to support, could that be the way for users who do need this to copy the content of all fields to another catch-all field?
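For reference, the existing copy_to mechanism on individually mapped keyword fields looks roughly like this (the field names here are hypothetical); the suggestion is that an equivalent catch-all could serve the users who actually want to search the whole object:

{
  "mappings": {
    "properties": {
      "content-type": { "type": "keyword", "copy_to": "all_headers" },
      "referer": { "type": "keyword", "copy_to": "all_headers" },
      "all_headers": { "type": "keyword" }
    }
  }
}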

As a first pass, the following query types will be allowed: term, terms, terms_set, range (without special support for numerics), prefix, match family (insofar as they work for keyword fields), common, query_string, simple_query_string, exists. Highlighting will be supported.

If we go with keyword-style indexing, then we should probably skip highlighting, which is only useful with text fields? (matched queries are typically used instead for structured content)

Explore adding support for aggregations + sorting. This idea needs a lot more research, but could maybe be accomplished by creating additional 'doc value fields', then adding a filtering layer when fetching doc values that checks for the field prefix.

+1 I suspect it will be quite easy actually.

Create benchmarks for searching based on object keys.

That would be nice of course, but I'm not worried about it being slow since term queries on an indexed object would translate to a term query at the lucene level.

jtibshirani commented 6 years ago

Thanks @jpountz for your thoughts.

I feel like there is more ask for actual keywords which might enable support for aggregations in the future like you mentioned.

I agree — in the potential use cases we’ve seen, the data is better modelled as keywords than text. The most critical feature is the ability to filter by an exact match on key-value pairs, and performing aggregations and sorting on the values would also be nice. A couple examples of these use cases:

My sense is that we should focus on keywords for now, but in the future we could consider support for some simple analysis/normalization, pending feedback on the feature.

Why do we plan to index individual tokens alone?

It could be nice if users were able to search an entire object field (e.g. {"headers": "en.wikipedia.org"}), since the object's keys might be unknown or non-standardized. This is admittedly quite speculative, and partially just based on our thoughts on the original issue — I’ll try to collect more feedback here.

I'm not worried about it being slow since term queries on an indexed object would translate to a term query at the lucene level.

Right, that makes sense! I’ll just plan on a sanity check here.

jtibshirani commented 6 years ago

I also wanted to clarify a point that was a bit fuzzy to me until I did a prototype. Under the proposed implementation, we are planning to index the entire JSON blob as a lucene field, and apply a special analyzer to create tokens that resemble keywords. This is in contrast to an approach where we create a new field mapping for each key in the object (which I think would be messy and negate some of the benefit of the feature).

Taking this approach means that in kibana and other clients, the field will be displayed as a single block of JSON. I created a quick example using the prototype implementation:

(Screenshot: the object field rendered as a single block of JSON in a Kibana document view.)

As these JSON blobs can be quite large, highlighting seemed useful in showing where the match actually occurred. I also wonder if we should support highlighting for consistency with keyword fields, as some clients (like kibana) do depend on highlighting for displaying these matches?
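For reference, the request a client like Kibana would issue here is just the standard highlight syntax (the index and field names below are hypothetical); the open question is what the highlighted fragment of a large JSON blob should look like:

{
  "query": {
    "term": { "headers": "en.wikipedia.org" }
  },
  "highlight": {
    "fields": { "headers": {} }
  }
}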

jpountz commented 6 years ago

I'm curious how this will work in practice, as mixing pre/post tags with a JSON structure sounds challenging. The other thing that worries me a bit is that if we want to support any highlighter that is able to use indexed offsets or term vectors, then when we extract the JSON object from the source document at search time, we must make sure that it produces exactly the same string as what was passed to the object mapper so that offsets are comparable; any added or removed spaces, line breaks, or reordering of keys would break highlighting.

jpountz commented 6 years ago

Now that I read my comment again, the latter doesn't make sense: we will not allow enabling term vectors or indexing offsets, so the highlighter will have to recompute the matched offsets anyway.

jtibshirani commented 6 years ago

I don't think I've dug enough into the details of highlighting to understand the concerns, but my takeaway is that it may be tricky to find a robust approach (and we should be open to punting on highlighting for v1).

As for the question around non-prefixed tokens, what do you think about this plan? Whether or not to index non-prefixed tokens can be controlled through a flag, to give users the opportunity to try it out without forcing them to double their inverted index size. From my initial experiment, indexing the raw tokens doesn't add much more work/complexity. We can mark the feature 'experimental' at first, to allow time to collect feedback about this flag, and also about analysis, a copy_to mechanism, etc. Different people I've spoken with have had different intuitions on this point, and it's been hard to come to a good decision a priori.

jpountz commented 6 years ago

Whether or not to index non-prefixed tokens can be controlled through a flag, to give users the opportunity to try it out without forcing them to double their inverted index size. From my initial experiment, indexing the raw tokens doesn't add much more work/complexity.

I agree that the complexity of the implementation is fine; I'm more concerned about the API, as we should strive to have as few switches as possible, especially for a v1. To me, the question of this switch boils down to the problem we are trying to solve: either we want to allow users to actually index objects, in which case indexing raw values makes sense, or we want to allow users to avoid the overhead of mappings and Lucene fields when indexing keywords, in which case it makes less sense.

jtibshirani commented 6 years ago

That is a nice way to frame it! I am thinking of it as the former (providing a true 'object' field). I think it fits better with the use cases/data we’ve seen, which center on indexing opaque JSON objects (Metricbeat data, user-provided blobs of data, etc.). To me, the most compelling use for this feature is in being able to work with object data that is difficult to model otherwise, not just saving on indexing cost when working with keywords. I will try to get some more consensus/clarity on this point, and then loop back.

jtibshirani commented 6 years ago

We had a discussion offline, and came to the following conclusions:

jtibshirani commented 5 years ago

@colings86 @romseygeek I’ve given some thought to naming and have laid out some options. It would be great to get your opinions as well.

Options I don’t think are very strong:

Current favorites:

colings86 commented 5 years ago

@jtibshirani I agree with you on the ones you list as "Options I don’t think are very strong".

On the "current favourites" I have the following thoughts:

polyfractal commented 5 years ago

Just throwing out a few ideas. Don't really think any of them are winners, but may spark an idea elsewhere. :)

Bargs commented 5 years ago

@jtibshirani thanks again for letting me know the feature branch was ready to look at! I created an issue on the Kibana repo to start tracking our research.

At the moment the biggest issue I'm seeing is that Kibana has no way to know what sub fields might be present in the objects. This prevents us from autocompleting those field names in the query bar and it also prevents the user from creating filters (the pills below the query bar) on those fields because we currently present them with a dropdown to select the field, populated from our index pattern's field list. I realize this is sort of the point of the new type, but I'm wondering if ES could somehow track which sub field names it has seen and expose that information to Kibana? I think it would dramatically improve the user experience for querying on these fields.

jsoriano commented 5 years ago

(Moved to https://github.com/elastic/elasticsearch/issues/25312#issuecomment-442389905)

jtibshirani commented 5 years ago

Thanks @Bargs for taking a look, I have some questions that I will ping you about offline.

@jsoriano would you mind if we moved your question over to the original issue? I was hoping to keep the discussion here focused on implementation details as opposed to use cases.

jsoriano commented 5 years ago

@jtibshirani sure, moved.

jtibshirani commented 5 years ago

A note to document the results of the performance benchmarks. In summary, the results looked good overall; the only surprise was a small increase in index size when using an embedded_json field.

For the testing set-up, I ran the metricbeat track on an n1-standard-8 GCP instance. In the baseline, the track is run without modifications, and all fields are mapped individually.

To test the performance of JSON fields, the object field system.process.cgroup was changed to embedded_json in the mappings. Some statistics about the field:

In the context of metricbeat data, system.process.cgroup is not a perfect candidate for an embedded_json field. If these benchmarks are added to our standard rally tracks, it would be good to extend the metricbeat data with a field like docker.container.labels, which is a more natural fit for the field type.
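To make the mapping change above concrete, it amounted to swapping the cgroup object definition for the new field type, roughly as follows (this is an abbreviated sketch of the relevant part of the track's mappings):

{
  "properties": {
    "system": {
      "properties": {
        "process": {
          "properties": {
            "cgroup": { "type": "embedded_json" }
          }
        }
      }
    }
  }
}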

Term Query

To test query performance, the following operation was added:

{
  "name": "term_query",
  "operation-type": "search",
  "cache": false,
  "body": {
    "size": 50,
    "query": {
      "term": {
        "system.process.cgroup.blkio.id": "runsvdir.service"
      }
    }
  }
}

As expected, queries perform very similarly to the baseline, where the subfield had been mapped individually as a keyword.

|                        Metric |         Task |   Baseline |   Contender |     Diff |   Unit |
|  50th percentile service time |   term_query |    20.0905 |     20.9434 |   0.8529 |     ms |
|  90th percentile service time |   term_query |    22.0021 |     24.3978 |  2.39568 |     ms |
|  99th percentile service time |   term_query |    60.7888 |     55.5902 | -5.19856 |     ms |
| 100th percentile service time |   term_query |    64.9998 |      56.051 |  -8.9488 |     ms |

Terms Aggregation

The following terms aggregation was also tested:

{
  "name": "terms_agg",
  "operation-type": "search",
  "cache": false,
  "body": {
    "size": 0,
    "query": {
      "match_all": {}
    },
    "aggs": {
      "blkio_ids": {
        "terms": { "field": "system.process.cgroup.blkio.id" }
      }
    }
  }
}

Terms aggregations were slower than the baseline, but the performance was still acceptable. From the profiling output, the bulk of the time is spent in KeyedJsonAtomicFieldData#advanceExact, calling into GlobalOrdinalMapping#nextOrd. This makes sense given the set-up, since each embedded JSON field system.process.cgroup contains a large number of distinct key-value pairs that must be traversed before landing on the right key.

|                        Metric |         Task |   Baseline |   Contender |     Diff |   Unit |
|  50th percentile service time |    terms_agg |     31.028 |     45.4184 |  14.3904 |     ms |
|  90th percentile service time |    terms_agg |    33.0771 |     46.7582 |  13.6811 |     ms |
|  99th percentile service time |    terms_agg |    34.1942 |     50.3088 |  16.1146 |     ms |
| 100th percentile service time |    terms_agg |      40.26 |     50.9181 |   10.658 |     ms |

Indexing Performance

Indexing throughput and service time looked good; the tests showed no decline in performance. To confirm the effect, I also repeated these indexing tests with an (unrealistic) set-up where all top-level fields were mapped as embedded_json.

|                        Metric |         Task |   Baseline |   Contender |     Diff |   Unit |
|                Min Throughput | index-append |    966.739 |     952.316 | -14.4233 | docs/s |
|             Median Throughput | index-append |    12036.5 |     13927.7 |  1891.22 | docs/s |
|                Max Throughput | index-append |    14785.7 |     17560.6 |  2774.91 | docs/s |
|  50th percentile service time | index-append |    4553.08 |     3702.44 | -850.649 |     ms |
|  90th percentile service time | index-append |     8235.1 |     7104.33 | -1130.77 |     ms |
|  99th percentile service time | index-append |      12366 |     12074.3 | -291.721 |     ms |
| 100th percentile service time | index-append |    12615.3 |     12994.3 |   378.96 |     ms |

Index Size

In all tests I ran, index size actually increased by a small amount. This was a bit counterintuitive to me, as I had assumed that using embedded_json could help save space by using a single field instead of multiple distinct ones. I'm guessing that the difference is due to the fact that in the baseline, many of the subfields are mapped as numbers, whereas with embedded_json they are treated as keywords. As above, the effect was confirmed by repeating the tests in a set-up where all top-level fields were mapped as embedded_json.

|                        Metric |         Task |   Baseline |   Contender |     Diff |   Unit |
|                    Index size |              |    1.14885 |     1.22382 |  0.07497 |     GB |

jpountz commented 5 years ago

Thanks for testing, this looks great.

bulk of the time is spent in KeyedJsonAtomicFieldData#advanceExact, calling into GlobalOrdinalMapping#nextOrd

This suggests that we first map segment ordinals to global ordinals before checking whether the global ordinal is in the expected range. We could probably speed it up by computing the range of valid segment ordinals and then calling nextOrd on the segment-level SortedSetDocValues rather than GlobalOrdinalMapping.

If you still have the index handy, I'd be curious to know what the difference is in terms of Lucene memory usage. I'm expecting a bit more, but I'd be curious to know how much more.

jtibshirani commented 5 years ago

This suggests that we first map segment ordinals to global ordinals before checking whether the global ordinal is in the expected range. We could probably speed it up by computing the range of valid segment ordinals...

That makes sense, I will take a look at whether it's possible to do this in a clean way. I think this benchmark presents a tough case for aggregations, in that each embedded_json field contains many unique key-value pairs (in many use cases the field would contain far fewer).

As for lucene memory usage, here is the relevant output from the rally benchmarks. Let me know if there are any other measurements that would be interesting:

|                        Metric |         Task |   Baseline |   Contender |     Diff |   Unit |
|        Heap used for segments |              |    2.03671 |      1.8357 | -0.20101 |     MB |
|      Heap used for doc values |              |    1.02371 |    0.573326 | -0.45038 |     MB |
|           Heap used for terms |              |   0.648928 |    0.921001 |  0.27207 |     MB |
|          Heap used for points |              |   0.125407 |    0.104141 | -0.02127 |     MB |

jpountz commented 5 years ago

Thanks for running this test. I expected memory usage to be higher for embedded_json, so I'm happy to be proven wrong!

jtibshirani commented 5 years ago

The initial version of the feature was merged in #42541 and backported to 7.3. I filed #43805 to track follow-up improvements.