elastic / integrations

Elastic Integrations
https://www.elastic.co/integrations
Other
25 stars 438 forks source link

Set `index: false` on fields that are rarely used for filtering #3419

Closed jpountz closed 1 month ago

jpountz commented 2 years ago

Elasticsearch 8.1 introduced a feature that we refer to as doc-value-only fields. The benefit of this feature is that keyword, date, ip, boolean, geo_point and numeric fields that have index set to false can still be used for filtering and report as searchable in Kibana, ie. this mapping change is completely transparent to Kibana. Effectively, setting index to false makes Elasticsearch perform like a columnar data store. While queries do not have index structures to take advantage of, they remain relatively efficient thanks to the I/O efficiency and fast decoding of Elasticsearch's columnar storage (doc values). This enables users to trade a bit search performance for reduced index-time CPU and lower disk usage, since inverted index structures no longer need to be computed at index time and stored on disk. As highlighted in the blog post, setting index: false triggered a 20% indexing speedup as well as a 20% index size reduction on a dataset we benchmarked against.

Notes:

My intuition is that the majority of fields that our integrations produce are never used for filtering. Setting index: false on these fields would both improve ingestion rates and reduce disk usage, without affecting search performance. In the long term, we might want to change the way that we are thinking of it, mapping fields as doc-value-only fields by default, and only indexing important dimensions of our datasets, such as host.name and @timestamp on Logging datasets.

Among potential candidates for doc-value-only mappings, gauges and counters feel especially safe, as their values mostly make sense in aggregates, not in isolation. This makes index: true pretty useless on these fields. Maybe a first rollout could consist of giving doc-value-only mappings to gauges and counters of our metrics integrations?

jsoriano commented 2 years ago

Maybe a first rollout could consist of giving doc-value-only mappings to gauges and counters of our metrics integrations?

We have been recently discussing about adding feature flags to Fleet to help trying this kind of things. I think it could make sense to have a flag to set index: false by default on gauges and counters.

cc @joshdover @elastic/ecosystem

ruflin commented 2 years ago

What happens if someone sets a dimension field to index: false in the context of TSDB. Should dimension fields be required to be indexed? @nik9000 you might be able to help?

joshdover commented 2 years ago

I think it could make sense to have a flag to set index: false by default on gauges and counters.

Makes sense to me to implement this as a default, though I also think we may need a way for package authors to override the default behavior in some cases. How would we identify gauge and counter fields today?

csoulios commented 2 years ago

What happens if someone sets a dimension field to index: false in the context of TSDB. Should dimension fields be required to be indexed?

Current implementation requires that dimension fields are indexed and have doc_values. If you set "index": false for a dimension field, you will get a validation error.

See relevant code for keyword field validation

nik9000 commented 2 years ago

The implementation for dimensions fields that dimension fields are indexed and have doc_values. If you set "index": false for a dimension field, you will get a validation error.

Yeah, that. Our expectation is that folks will filter on the time_series_dimensions. We might want to soften that some day, but I haven't thought enough about it to have a super strong opinion here.

jpountz commented 2 years ago

For context, this question was not coming from the perspective of relaxing this check, is was coming from the perspective of "if we do something that is wrong in our integrations, how likely is Elasticsearch to complain about it?".

jpountz commented 2 years ago

For reference I opened a PR that disables indexing on number fields on default templates for metrics data: https://github.com/elastic/elasticsearch/pull/87100.

jpountz commented 2 years ago

The above PR is now merged and will be in 8.4.

joshdover commented 2 years ago

@jsoriano What do you think about expanding the scope of https://github.com/elastic/kibana/issues/132818 to include this functionality as well, potentially as a separate flag. This would allow us to validate the tradeoffs of doc-value-only fields before rolling it out as the default.

jsoriano commented 2 years ago

@joshdover yes, it looks like we are going to need some feature flags, so far we have at least TSDB mode, synthetic source and this index disabling. It'd be great if all these flags are available in a consistent way.

andresrc commented 2 years ago

Hi @jpountz ,

A couple of questions related to once https://github.com/elastic/elasticsearch/pull/87100 is available:

Thanks.

jpountz commented 2 years ago

@andresrc The change will apply to numeric fields of metrics-*-* data streams that use the built-in template for metrics. So in practice, this would apply when Agent pushes metrics data, but Fleet did not install an index template with a higher priority. I know the default index template for logs gets used for custom logs data sources, I'm less clear how common it is for the default index template for metrics to be used in practice with Agent, maybe @joshdover or @felixbarny have more details.

What would be the changes to make in the mappings, if any, to leverage doc-value-only fields?

You need to set index: false on a field of type numeric / date / ip / keyword and it will automatically use doc values to run queries.

felixbarny commented 2 years ago

This will require changes in the field mapping for all metric-type integrations. That's because our integrations almost exclusively rely on explicit mappings rather than dynamic mappings for metrics.

For example, the substatus metric mappings in the the Nginx integration have to be updated to set (index: false):

https://github.com/elastic/integrations/blob/a88cdda534598bad20652f43e18ec24bfc2d20fe/packages/nginx/data_stream/stubstatus/fields/fields.yml#L8-L11

Here's another example in the prometheus integration, but I'm not familiar with that syntax (such as object_type and the wildcard in the name) and how it translates to Elasticsearch mappings. https://github.com/elastic/integrations/blob/a88cdda534598bad20652f43e18ec24bfc2d20fe/packages/prometheus/data_stream/collector/fields/fields.yml#L22-L24

jsoriano commented 2 years ago

This will require changes in the field mapping for all metric-type integrations.

We are considering making Fleet to enable this by default in all numeric fields. First behind an option, but later we may do it everywhere if we see that the benefits outweight possible performance regressions, see https://github.com/elastic/kibana/issues/132818.

Here's another example in the prometheus integration, but I'm not familiar with that syntax (such as object_type and the wildcard in the name) and how it translates to Elasticsearch mappings.

They should translate to dynamic mappings, using the name with the wildcard as path_match, but this is not correctly working, there is an open issue about this: https://github.com/elastic/kibana/issues/129344

joshdover commented 2 years ago

Opened an issue for adding an "experimental" toggle for this, similar to how we did for synthetic source: https://github.com/elastic/kibana/issues/144357

ruflin commented 1 year ago

@kpollich I would really like to see this moving forward too in the context of https://github.com/elastic/kibana/issues/132818 Elasticsearch already uses it in its default templates since 8.4: https://github.com/elastic/integrations/issues/3419#issuecomment-1154802792 @jpountz Any complaints so far?

ruflin commented 1 year ago

@kpollich Any changes / plans on this front? For now assigned it to you as I think all the changes are now in the hands of the Fleet team.

elasticmachine commented 1 year ago

Pinging @elastic/fleet (Team:Fleet)

kpollich commented 1 year ago

Hi @ruflin - we haven't looked at this in any more detail recently, but @jsoriano's comment above still matches my understanding here. We have the experimental toggles for doc-values-only implemented, and could do some benchmarking to see if it's worth enabling this behavior by default for all numeric fields, but we haven't prioritized that work.

cc @nimarezainia @mukeshelastic

ruflin commented 1 year ago

We have the experimental toggles for doc-values-only implemented, and could do some benchmarking to see if it's worth enabling this behavior by default for all numeric fields, but we haven't prioritized that work.

I wonder if this even needed as this issue is filed by the Elasticsearch because it is worth enabling was made the default in Elasticsearch for metrics-- in 8.4!

felixbarny commented 1 year ago

IIRC, time_series_metric fields in TSDS are doc-value-only.

joshdover commented 1 year ago

I'm comfortable making this change without the required benchmarks, if Observability tech leads are comfortable with it. Note this will mostly affect Observability integrations, so we need to communicate this widely to all integration developers and O11y PMs before we release this.

@jen-huang this seems like low-hanging fruit from our side which could have a large impact on metrics storage costs. Could we prioritize this in an upcoming sprint?

joshdover commented 1 year ago

One thing - do we think this will be surprising for customers if this happens during a stack upgrade? I'd prefer it be explicitly part of a package upgrade and displayed in the changelog we show in the UI.

Should we add this as a top-level option on the data_stream/manifest.yml files in packages to rolling out the change this way?

ruflin commented 1 year ago

One thing - do we think this will be surprising for customers if this happens during a stack upgrade? I'd prefer it be explicitly part of a package upgrade and displayed in the changelog we show in the UI.

Agree, I prefer it happening on a package upgrade over stack upgrade.

Should we add this as a top-level option on the data_stream/manifest.yml files in packages to rolling out the change this way? The catch here is that if we do it that way, I rather would directly convert these packages to TSDB and then we also get all these benefits by default. The idea is that some of the benefits we get also for packages that are not adjusted to TSDB yet.

I'm pretty confident that this will not have an impact for the majority of the use cases. As usual, there will be exceptions like http error code 200 which will likely be used for filtering. As with the previous features, ideally a user can opt-out. There is always the @custom option or we could have a toggle.

ruflin commented 1 year ago

With the risk of paddling back on this: It is a change that ideally we would have done a year ago. TSDS was not ready and it was our best option to improve storage cost. But now TSDS is around and can be used in packages. Maybe instead of spending time on this one, we should invest the time in having the best TSDS experience in Fleet and speed up adoption of TSDB in our packages, this will also solve any unexpected issue as the change is per package.

To be more precise on the improvements I'm thinking of is the discussion we had around making the toogles better / go away eventually:

Screenshot 2023-04-04 at 10 54 54

@kpollich Do we have a Github issue for this? If yes, can you reference it here?

ruflin commented 1 year ago

Following proposal to get this moving: Fleet adapts for the dynamic templates the same defaults as Elasticsearch. Everything else stays as is. It is up to the integration packages to adopt TSDB. The effect of this is that the default mapping for all metrics changes that don't have a specific mapping set.

The current metrics default templates from Elasticsearch can be found here: https://github.com/elastic/elasticsearch/pull/87100 and look as following:

      "dynamic_templates": [
        {
          "long_metrics": {
            "match_mapping_type": "long",
            "mapping": {
              "type": "long",
              "index": false
            }
          }
        },
        {
          "double_metrics": {
            "match_mapping_type": "double",
            "mapping": {
              "type": "float",
              "index": false
            }
          }
        }
      ],

In Fleet, the dynamic templates would have to be expanded somewhere around here: https://github.com/elastic/kibana/blob/main/x-pack/plugins/fleet/server/services/epm/elasticsearch/template/template.ts#L126 with the above propoerties. What it means is that next time a package would upgrade, it would get these new dynamic mappings. For most data streams, this would very likely have no affect but benefits all the metric data streams that don't specify each field.

Important: This change is only for metrics-*-*.

kpollich commented 1 year ago

To be more precise on the improvements I'm thinking of is the discussion we had around making the toogles better / go away eventually: @kpollich Do we have a Github issue for this? If yes, can you reference it here?

https://github.com/elastic/kibana/issues/150913

I think it'd be best to move these toggles to stack management's component template editor UI since that's the actual underlying asset we're editing here. Separate conversation, though.

botelastic[bot] commented 7 months ago

Hi! We just realized that we haven't looked into this issue in a while. We're sorry! We're labeling this issue as Stale to make it hit our filters and make sure we get back to it as soon as possible. In the meantime, it'd be extremely helpful if you could take a look at it as well and confirm its relevance. A simple comment with a nice emoji will be enough :+1. Thank you for your contribution!