elastic / elasticsearch

Free and Open Source, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch

Allow truncation of fields in search #72453

Open timroes opened 3 years ago

timroes commented 3 years ago

It would be useful if there were an option to specify that all fields returned via the `fields` option in the search API should be truncated after a specific length. This is different from #60329 insofar as I am suggesting pure "output truncation" and not changing anything about how the fields are stored/indexed in ES.

The use case is that in Kibana we periodically run into situations where users have stored very large text in their fields and we crash when trying to show it in Discover (see https://github.com/elastic/kibana/issues/98263 for an example). While we offer users a field filter in the index pattern where they can exclude those fields from being fetched at all, sometimes users don't know about that in advance, and it might also just be individual documents that contain a very long value. Thus it would be nice to apply a default truncation to all fields (most likely a rather high one that just safeguards us from crashing Kibana).

We also can't do that in Kibana, since we don't only fail when rendering: we already often fail because the too-large response is handled/transformed/processed in our querying pipeline and causes too high memory consumption. Having fields truncated would also save a bit of bandwidth, though I'd consider that a secondary benefit.

I think it would be nice if we could specify a truncation length and ES would truncate the output and somehow indicate in the output that it got truncated, e.g. (though this is just one idea) by appending a (configurable) truncation token to the field value, which we can parse out again (similar to the highlight markers).
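To make this concrete, a purely hypothetical sketch of what such a request could look like; `truncate_after` and `truncation_token` are made-up names for illustration and do not exist in the search API today:

```
# hypothetical sketch: "truncate_after" and "truncation_token" are not real parameters
POST logs-*/_search
{
  "_source": false,
  "fields": [
    "@timestamp",
    { "field": "message", "truncate_after": 1024, "truncation_token": "[…truncated]" }
  ]
}
```

The response would then return at most 1024 characters of `message` per hit, with the token appended whenever a value was cut off, so clients like Kibana could detect and surface the truncation.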

Related Kibana issue: https://github.com/elastic/kibana/issues/98497

This currently doesn't have a strong priority on the Kibana side, but it would be a nice-to-have performance improvement for our users.

gmmorris commented 3 years ago

I'd like to add some context here as this has impacted our ability to debug production issues on cloud, and I suspect customers will have hit this kind of thing too.

When supporting a customer recently we had to switch on verbose logging in their Kibana deployment. We then used Cloud's Monitoring cluster to view their logs in an attempt to identify the underlying issue. Sadly we found this impossible as Discover kept crashing on our own monitoring cluster, because certain log lines in the Kibana Server log contained a huge JSON document that had been stringified and repeated every couple of seconds.

The end result was that we couldn't investigate the logs of a cluster on our own cloud, and had to ask the Cloud team to fetch the raw logs from the server for us. This slowed down the support case and made it harder for us to identify the root cause of an issue that had completely crashed the customer's Kibana instances.

Handling this kind of issue gracefully in Kibana is extremely valuable, but as @timroes has pointed out, by the time this data reaches Kibana it is often too late to prevent the negative impact.

elasticmachine commented 3 years ago

Pinging @elastic/es-search (Team:Search)

jimczi commented 3 years ago

One amplifying factor here is the lack of server-side pagination in Kibana. Retrieving 500 hits all the time feels wasteful. Do we have plans to move the pagination to the server, @timroes? We might still need to truncate big fields, but it would be easier if we returned only the hits that need to be visible.

timroes commented 3 years ago

While both of those things (reducing field length and server-side pagination) play a role in improving Discover, I don't consider one to solve the other; I treat them as separate problems. We can easily run into this case here while only loading 20 documents (because of server-side pagination), and vice versa, even with truncation we can run into cases where only pagination will help (e.g. with too many fields). Thus we should consider both of those solutions and not pick one in favor of the other.

dadoonet commented 3 years ago

For some projects like FSCrawler, a lot of content can be generated. Think of indexing a big PDF document. When you use Kibana or any other tool, search indeed gets a bit slow, since for each document you might retrieve a lot of content over the network. So having a way to truncate the content of a field on the Elasticsearch side could make sense.

I'm wondering if we could have a more generic tool like the ingest pipeline, but as an output pipeline for all Elasticsearch documents. That way we could let users do whatever transformation they want by leveraging the pipeline feature and all the existing processors.

Then "truncation" could be a simple Script Processor call.

karlseguin commented 3 years ago

I believe this issue is related to https://github.com/elastic/kibana/issues/11457

I feel like the solution to this problem is to only select the displayed fields when building the table and lazily load the entire document once it's expanded. This is what Datadog does and, for us, it's the difference between usable and not.

timroes commented 3 years ago

@karlseguin Yes, that is the Kibana meta issue for this (I think I should bring that into a bit more structured form). There are multiple (non-exclusive) solutions to address that problem. This one here is for tracking the requirements for field truncation, which is one part of the solution. The solution you mentioned is parallel to that, and tracked via https://github.com/elastic/kibana/issues/35787 (and won't require any changes on the Elasticsearch side).

jtibshirani commented 3 years ago

The problem statement makes sense to me. Even if Kibana is careful not to request too much data (sets a reasonable size, only fetches a few fields that are necessary), it can't anticipate that some result documents contain very large field values.

A question to better understand the context: how do we expect the user to interact with these very large fields? Was the field not that useful in the first place because it's unexpectedly way too large, so browsing a truncated result is enough? Or could they want to drill down into that single document and see the full value?

karlseguin commented 3 years ago

@jtibshirani We definitely always want the data. If we truncate it, it'll be at ingestion.

Kibana should always lazy load the expanded view of the document. Personally, I would start with this and see if more is needed. I think this will take care of 99% of complaints, and it's a relatively minor change.

For people who still have the issue:

1. I expect most people would just remove the large field from the default list.
2. Kibana could even detect this and warn users about large fields displayed by default.
3. Still not OK? Then I'd look at what can be done in Elasticsearch to support the case (e.g., configurable field truncation).

timroes commented 3 years ago

> A question to better understand the context: how do we expect the user to interact with these very large fields? Was the field not that useful in the first place because it's unexpectedly way too large, so browsing a truncated result is enough? Or could they want to drill down into that single document and see the full value?

There are actually multiple different possibilities that I've seen happening. There are cases where the field itself wasn't incredibly useful altogether and could arguably be disregarded as a field, but I haven't seen that too often. The two more common cases I've seen:

Thus I don't think we need to be strict (from a requirements side) about returning a truncated value; we would also be fine with getting "no value" back (but with an indication that something was left out) for those fields.

mayya-sharipova commented 3 years ago

@timroes On a slightly tangential question: what is the biggest size of a search response that Kibana can process and Discover can display? We've seen users trying to store a > 500 MB async search response; can Kibana handle displaying such huge responses?

timroes commented 3 years ago

@mayya-sharipova I don't think we have a clear "limit" in mind for that, since it highly depends on the local machine and browser configuration. Also, there is a very large transition area where it "just starts to get slow", and performance might be perceived very differently by users. That said, even with a really good modern browser/computer, I think >500 MB JSON responses will cause significant performance issues in Kibana that most users would perceive as unbearable, and weaker machines would be more likely to run into out-of-memory scenarios.

kertal commented 1 year ago

+1 for this nice-to-have, since from time to time users struggle after having ingested large documents, and this causes trouble using Kibana. A search result with plenty of large strings can cause issues depending on the spec of the host machine. I just triaged it with very large ingested data, a single document of 70 MB (thx @wwang500 for providing the doc 🦕). It took 110 s to load and render; for 94 s the system was busy, and I don't know what exactly it was doing, I guess Chrome was gasping for air / memory.

(Screenshot from 2023-03-28 showing the load/render timing described above)

I'm aware it's an edge case, just wanted to reproduce what one of our users reported in https://github.com/elastic/kibana/issues/153363. They found a good solution by splitting up their data.

Looking beyond that, 100 docs of 0.7 MB each sound less like an edge case, but they would have similar performance issues. Having the ability to define a maxLength for text-based fields in a search request, which would truncate the output, would be a helpful way to deal with it.
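Until something like that exists, roughly the same effect can be approximated per request with a runtime field that emits a truncated copy of the value, at the cost of a slower, script-based fetch. A minimal sketch, assuming a `message` field and an illustrative 1000-character cap:

```
# workaround sketch: field name and cap are assumptions
POST logs-*/_search
{
  "runtime_mappings": {
    "message_preview": {
      "type": "keyword",
      "script": {
        "source": "def v = params._source['message']; if (v instanceof String) { emit(v.length() > 1000 ? v.substring(0, 1000) : v); }"
      }
    }
  },
  "fields": ["message_preview"],
  "_source": false
}
```

This only helps when the client knows up front which fields may be oversized, which is exactly the limitation a built-in maxLength would remove.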

elasticsearchmachine commented 2 months ago

Pinging @elastic/es-search-foundations (Team:Search Foundations)