watson opened this issue 4 years ago
Pinging @elastic/kibana-operations (Team:Operations)
Pinging @elastic/kibana-platform (Team:Platform)
I agree that we should rather crash from v8.x
I just had a chat with @legrego about this and we had the following thoughts:
@watson I guess we missed the 8.0 train here. Given this is now the default behavior in dev mode, do we consider this good enough, or are we still planning on having this behavior in production mode too?
There was some discussion about this for 7.16 in https://github.com/elastic/kibana/issues/98650 and the numbers weren't very good in 7.x
Currently 1.6% of 8.x clusters on cloud got an unhandled rejection over a 4 week period. On any given day about 1% of clusters have an unhandled rejection. There are a couple of outlier clusters with multiple rejections per day which we need to investigate but I feel like these numbers might be good enough to start transitioning towards crashing.
We should aim to roll this out incrementally. As a first step we need to make it possible to opt out of the `--no-warnings` flag (can we even detect this from JS?). Since this flag is added by the `kibana` startup script, users would have to edit the script to remove it. Perhaps we can remove this flag from the script and add it to `config/node.options` instead. That way users can remove the flag by changing configuration files. In order to do this we would need to modify the existing `config/node.options` file to add the flag if it's not present.
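On the "can we even detect this from JS?" question, here is a minimal sketch, assuming the flag shows up in `process.execArgv`; whether flags injected via `NODE_OPTIONS` or `config/node.options` appear there depends on the Node.js version, so this would need verification:

```ts
// Sketch only: check whether the current process was launched with --no-warnings.
// Assumption: the flag is visible in process.execArgv; this needs to be verified
// for flags coming from NODE_OPTIONS / config/node.options.
const hasNoWarningsFlag = process.execArgv.includes('--no-warnings');

if (hasNoWarningsFlag) {
  // In this mode Kibana currently only logs unhandled rejections instead of crashing.
  console.warn('Kibana was started with --no-warnings; unhandled rejections will only be logged.');
}
```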
I feel like these numbers might be good enough to start transitioning towards crashing.
Apart from degrading the overall user experience and generating frustration, I'm still not sure I see the upside of having Kibana crash on an unhandled promise rejection instead of printing a warning, to be absolutely honest. Maybe someone can enlighten me here? Have we identified cases where Kibana is effectively in an unknown state after an unhandled rejection, where we would have preferred the process to terminate instead?
Maybe someone can enlighten me here? Have we identified cases where Kibana is effectively in an unknown state after an unhandled rejection, where we would have preferred the process to terminate instead?
We have had SDHs where the primary symptom was "Kibana is slow" together with continued memory growth that looked like a memory leak. After investigation, the memory wasn't consumed by the heap but by Node.js "external memory". External memory doesn't show up in heap snapshots, but we did see a huge number of open file descriptors (IIRC 30k outgoing socket connections) while Kibana was handling only a small amount of traffic.
I only later realised that this is one of the potential side effects of a rejected promise, because Node.js doesn't free up the file descriptor. So I didn't investigate the logs for "Unhandled promise rejection detected", and it's at best a strong suspicion.
I think you raise an important point about the user experience. We should aim to fix all unhandled promise rejections, but it's inevitable that we will ship unhandled rejection bugs. If an unhandled rejection means some leaked memory and, after a long time, degraded performance, maybe that's still better than Kibana being offline because it's constantly restarting.
I couldn't find any official position from Node.js, just issues with huge threads. The best written-up doc I could find was https://github.com/promises-aplus/unhandled-rejections-spec/issues/1 and its comparison of uncaught exceptions and unhandled promise rejections is useful.
@pgayvallet Just echoing what @rudolf said. The way I look at it is that we're setting ourselves up for hard-to-debug problems by allowing unhandled rejections in a long-running process (e.g. memory leaks). IMO it's better to rip the bandaid off quickly and crash the program. That's also why Node.js core has decided to make this the default behavior in newer Node.js versions.
That's also why Node.js core has decided to make this the default behavior in newer Node.js versions
It's way too early in the week to get me started on most of Node.js's decisions :trollface:
But thank you both for the clarifications, and I agree that it therefore makes sense to align.
On any given day about 1% of clusters have an unhandled rejection. There are a couple of outlier clusters with multiple rejections per day which we need to investigate but I feel like these numbers might be good enough to start transitioning towards crashing.
We probably need confirmation from the higher layers though; I don't think we alone can take such an impactful decision. 1% of clusters per day can still be considered extremely high compared to the average SLA, IMHO.
1% of clusters per day can still be considered extremely high compared to the average SLA, IMHO.
Yeah I think we need to be able to better quantify this in terms of how much downtime clusters could experience.
We might be able to tackle this after https://github.com/elastic/kibana/pull/181456
I already expressed my concerns on this issue, so I won't elaborate on them further.
If we are to do it, I'd just like to make sure we'll be adding the escape hatch that was discussed in this issue, to let customers toggle back to the previous behavior:
We might want to provide a flag for users to switch Kibana back to the old non-crashing behaviour if they really want to (this is just a safeguard in case there's something we haven't thought of)
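For illustration only, a hedged sketch of what such an escape hatch could look like at the process level; the opt-out name (`KBN_UNHANDLED_REJECTIONS`) is invented here, and the real toggle could just as well live in `kibana.yml` or `config/node.options`:

```ts
// Hypothetical escape hatch (the env var name is made up for this sketch):
// by default an unhandled rejection terminates the process, but operators can
// opt back into the old warn-only behaviour.
const mode = process.env.KBN_UNHANDLED_REJECTIONS ?? 'crash';

process.on('unhandledRejection', (reason) => {
  if (mode === 'warn') {
    console.warn('Detected an unhandled Promise rejection:', reason);
    return;
  }
  console.error('Detected an unhandled Promise rejection, terminating:', reason);
  process.exit(1);
});
```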
++ IMO, even before making it a default, I'd enable it progressively in controlled scenarios:
Currently, Kibana will exit with exit code 1 in case of an unhandled promise rejection during CI or development. If Kibana is started with the `--no-warnings` flag (which we do automatically in production when running `bin/kibana`), then Kibana will simply log a warning-level message to the logs.

However, an unhandled rejection should IMO be considered a program error and the program is now in an unknown state. It's therefore advisable to crash the program. This functionality is, by the way, going to be the default behaviour in Node.js starting from version 15.
I therefore suggest that we make this the default behaviour of Kibana even when running in production. We should possibly target this for Kibana v8.0.0 in case someone relies on this functionality (which I hope no one does).
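A tiny repro of the behaviour being discussed (not Kibana code): on Node.js 15 and later this exits non-zero as soon as the rejection goes unhandled, while running it with `node --unhandled-rejections=warn` only prints a warning and keeps the process alive, which roughly mirrors today's warn-only production behaviour under `--no-warnings`.

```ts
// repro.ts -- nothing ever attaches a .catch() to this promise, so the
// rejection is reported as unhandled once the current turn of the event loop ends.
Promise.reject(new Error('boom'));

// Keeps the process alive so the difference is visible:
// - Node.js >= 15 (default --unhandled-rejections=throw): the process crashes with exit code 1.
// - node --unhandled-rejections=warn repro.js: a warning is printed and "still alive" keeps logging.
setInterval(() => console.log('still alive'), 1000);
```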