elastic / kibana

Your window into the Elastic Stack
https://www.elastic.co/products/kibana

Suggestion: Crash Kibana on unhandled promise rejection #77469

Open · watson opened 4 years ago

watson commented 4 years ago

Currently, Kibana will exit with exit code 1 in case of an unhandled promise rejection during CI or development. If Kibana is started with the --no-warnings flag (which we do automatically in production when running bin/kibana), then Kibana will simply log a warning-level message to the logs.
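
For context, here is a minimal sketch of the two behaviours described above (this is an illustration, not Kibana's actual startup code; the NODE_ENV check is just a stand-in for however the mode is chosen):

```js
// Crash-vs-warn handling of unhandled promise rejections (illustrative only).
const crashOnUnhandledRejection = process.env.NODE_ENV !== 'production'; // assumption for the sketch

process.on('unhandledRejection', (reason) => {
  if (crashOnUnhandledRejection) {
    console.error('Unhandled promise rejection detected:', reason);
    process.exit(1); // what CI/development does today, and what this issue proposes for production
  } else {
    console.warn('Unhandled promise rejection detected:', reason); // current production behaviour
  }
});

// Any promise that rejects without a .catch() handler ends up here, e.g.:
Promise.reject(new Error('boom'));
```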

However, an unhandled rejection should IMO be considered a program error, after which the program is in an unknown state. It's therefore advisable to crash the program. This is, by the way, going to be the default behaviour in Node.js starting from version 15.

I therefore suggest that we make this the default behaviour of Kibana even when running in production. We should possibly target this for Kibana v8.0.0 in case someone relies on the current warn-only behaviour (which I hope no one does).

elasticmachine commented 4 years ago

Pinging @elastic/kibana-operations (Team:Operations)

elasticmachine commented 4 years ago

Pinging @elastic/kibana-platform (Team:Platform)

rudolf commented 4 years ago

I agree that we should rather crash from v8.x

watson commented 4 years ago

I just had a chat with @legrego about this and we had the following thoughts:

pgayvallet commented 2 years ago

@watson I guess we missed the 8.0 train here. Given this is now the default behavior in dev mode, do we consider this good enough, or are we still planning on having this behavior in production mode too?

rudolf commented 2 years ago

There was some discussion about this for 7.16 in https://github.com/elastic/kibana/issues/98650 and the numbers weren't very good in 7.x

Currently, 1.6% of 8.x clusters on Cloud have had an unhandled rejection over a 4-week period. On any given day about 1% of clusters have an unhandled rejection. There are a couple of outlier clusters with multiple rejections per day which we need to investigate, but I feel like these numbers might be good enough to start transitioning towards crashing.

We should aim to roll this out incrementally:

  1. Add a deprecation warning recommending that users remove the --no-warnings flag (can we even detect this from JS? see the sketch after this list)
  2. Remove the flag on Elastic's internal clusters
  3. Do an incremental roll-out on cloud
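
A hedged sketch of how the detection question in step 1 might be answered: the flags Node was started with are exposed via process.execArgv and NODE_OPTIONS, though the exact check below is an assumption, not existing Kibana code.

```js
// Hypothetical detection of the --no-warnings flag from inside the process.
const startedWithNoWarnings =
  process.execArgv.includes('--no-warnings') ||
  (process.env.NODE_OPTIONS ?? '').includes('--no-warnings');

if (startedWithNoWarnings) {
  // Hypothetical deprecation message; the wording and delivery mechanism are assumptions.
  console.warn(
    'Deprecation: --no-warnings suppresses unhandled rejection warnings and will no longer be set by default.'
  );
}
```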

As a first step we need to make it possible to opt out of --no-warnings. Since this flag is added by the kibana startup script, users would have to edit the script to remove it. Perhaps we can remove this flag from the script and add it to config/node.options instead. That way users can remove the flag by changing configuration files. In order to do this we would need to modify the existing config/node.options file to add the flag if it's not present.
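
For illustration, the config/node.options entry could look something like this (the comments and exact layout are assumptions, not the shipped file):

```
## Node.js CLI options passed to Kibana, one per line.
## Deleting the line below re-enables Node warnings, including
## unhandled promise rejection warnings.
--no-warnings
```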

pgayvallet commented 2 years ago

I feel like these numbers might be good enough to start transitioning towards crashing.

To be absolutely honest, apart from degrading the overall user experience and generating frustration, I still don't see the upside of having Kibana crash on an unhandled promise rejection instead of printing a warning. Maybe someone can enlighten me here? Did we identify cases where Kibana was effectively in an unknown state after an unhandled rejection, where we would have preferred the process to terminate instead?

rudolf commented 2 years ago

Maybe someone can enlighten me here? Did we identify cases where Kibana was effectively in an unknown state after an unhandled rejection, where we would have preferred the process to terminate instead?

We have had SDHs where the primary symptom was "Kibana is slow" together with continued memory growth that looked like a memory leak. After investigation, the memory wasn't consumed by the heap but by Node.js "external memory". External memory doesn't show up in heap snapshots, but we did see a huge number of open file descriptors (IIRC 30k outgoing socket connections) while Kibana was handling only a small amount of traffic.

I only realised later that this is one of the potential side effects of a rejected promise, because Node.js doesn't free up the file descriptor. So at the time I didn't check the logs for "Unhandled promise rejection detected", and it's at best a strong suspicion.
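
For illustration only (not taken from Kibana's code), a hedged sketch of the kind of pattern that can leak a descriptor when a rejection goes unhandled:

```js
// If readFile() rejects, close() is never reached; if the caller doesn't
// catch either, the rejection is unhandled and the descriptor stays open.
import { open } from 'node:fs/promises';

async function readConfig(path) {
  const handle = await open(path, 'r');       // file descriptor allocated here
  const data = await handle.readFile('utf8'); // a rejection here skips the cleanup
  await handle.close();
  return data;
}

// No .catch() attached: an error surfaces as an unhandled rejection and the
// handle is only reclaimed much later (if at all), similar to the leaked
// sockets described above.
readConfig('./config/kibana.yml');
```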

I think you raise an important point about the user experience. We should aim to fix all unhandled promise rejections, but it's inevitable that we will ship unhandled rejection bugs. If an unhandled rejection means some leaked memory and, eventually, degraded performance, maybe that's still better than Kibana being offline because it's constantly restarting.

I couldn't find any official position from Node.js, just issues with huge threads. The best write-up I could find was https://github.com/promises-aplus/unhandled-rejections-spec/issues/1, and its comparison of uncaught exceptions and unhandled rejections is useful.

watson commented 2 years ago

@pgayvallet Just echoing what @rudolf said. The way I look at it is that we're setting ourselves up for hard-to-debug problems (e.g. memory leaks) by allowing unhandled rejections in a long-running process. IMO it's better to rip the band-aid off quickly and crash the program. That's also why Node.js core has decided to make this the default behavior in newer Node.js versions.

pgayvallet commented 2 years ago

That's also why Node.js core has decided to make this the default behavior in newer Node.js versions

It's way too early in the week to get me started on most of NodeJS's decisions :trollface:

But thank you both for the clarifications, and I agree that it therefore makes sense to align.

On any given day about 1% of clusters have an unhandled rejection. There are a couple of outlier clusters with multiple rejections per day which we need to investigate, but I feel like these numbers might be good enough to start transitioning towards crashing.

We probably need confirmation from the higher layers though; I don't think we alone can take such an impactful decision. 1% of clusters per day can still be considered extremely high compared to the average SLA, imho.

rudolf commented 2 years ago

1% of clusters per day can still be considered extremely high compared to the average SLA, imho.

Yeah I think we need to be able to better quantify this in terms of how much downtime clusters could experience.

afharo commented 6 months ago

We might be able to tackle this after https://github.com/elastic/kibana/pull/181456

pgayvallet commented 6 months ago

I already expressed my concerns on this issue, so I won't elaborate on them further.

If we are to do it, I'd just like to make sure we'll be adding the escape hatch that was discussed in this issue, so customers can toggle back to the previous behavior:

We might want to provide a flag for users to switch Kibana back to the old non-crashing behaviour if they really want to (this is just a safeguard in case there's something we haven't thought of)
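
For illustration only, one possible shape for such an escape hatch (an assumption, not a committed design): Node itself has an --unhandled-rejections flag whose warn mode keeps the old log-only behaviour, so a config/node.options entry could be enough:

```
## Hypothetical escape hatch (not a shipped Kibana setting): keep the old
## warn-only behaviour instead of crashing on unhandled rejections.
--unhandled-rejections=warn
```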

afharo commented 6 months ago

++ IMO, even before making it a default, I'd enable it progressively in controlled scenarios: