allegro / turnilo

Business intelligence, data exploration and visualization web application for Druid, formerly known as Swiv and Pivot
https://allegro.github.io/turnilo/
Apache License 2.0
725 stars 169 forks source link

Error handling for timeouts #705

Open adrianmroz-allegro opened 3 years ago

adrianmroz-allegro commented 3 years ago

If druid returns {"error":"could not compute","message":"timeout"} we should show timeout to user

If some gateway between user and druid returns 504 we should show timeout to user.

mkuthan commented 3 years ago

I would like to get HTTP status end error message with root cause on UI. Currently I have to use Devtools to inspect response from Turnilo server (or from LB between Turnilo server and client).

How to map errors on Turnilo server? Some proposals:

Timeout error from Druid broker - HTTP 408, 500 or 502 Query errors from Druid broker - HTTP 400 or 500 429 errors from Druid Broker - HTTP 429 or 500 Gateway Timeout (502) from LB in front on Druid broker(s) - HTTP 502 , 500 or 408 ???

Or just map all errors as HTTP 500 with proper error message/details and handle HTTP 500 in a special way on the client (hide the status and show the root cause / details). For all other errors show the HTTP status code with details if available (e.g for HTTP 502 Bad gateway from LB between Turnilo server and client)

mkuthan commented 3 years ago

All errors handled on Turnilo Server and returned to the clients should be logged (with all available details).

adrianmroz-allegro commented 3 years ago

I would like to get HTTP status end error message with root cause on UI

I don't think that's correct. We should show user only actionable information. Right now we have two group of errors:

Of course all information should be preserved in logs/sentry for administrator to recover.

mkuthan commented 3 years ago

I agree that actionable information is better. But at the end you will not be able to figure out what was the reason of the error if the user sends the screenshot with generic error only. How to correlate this screenshot with the logs? So I think that some details on the screen are really needed, it is a help for admin not for end user :) Remember that some errors are network/lb errors - and there is no chance to see those errors in the Turnilo server log.

Let's be more specific, if Druid returns 429, Turnilo server could return 500 with message "Server is overloaded, please try later". This message is totally fine for end user as well. In the scenario when server for all errors returns 500, the actionable message must be provided by the server because the server knows the root cause. The generic message "Server error" on the client is unusable at all.

If LB returns 502 Bad Gateway, we should show exactly this message to the client (perhaps with some hint to refresh the page).

adrianmroz-allegro commented 3 years ago

But at the end you will not be able to figure out what was the reason of the error if the user sends the screenshot with generic error only. How to correlate this screenshot with the logs?

Report error to sentry and show report id on screen. Just like we do with frontend errors. I don't know how to correlate logs with reported error, but that's another issue.

Remember that some errors are network/lb errors - and there is no chance to see those errors in the Turnilo server log.

And will be reported as network errors to sentry.

adrianmroz-allegro commented 3 years ago

Error messages types on client

Bonus task: identify which errors should result in which type

User should get generic/minimal error message. If we have sentry (or something similar) we should present error id. Error details should be visible in hidden accordion (or something?)

Server errors

Bonus task: Format messages in Logger.ts so Kibana can understand levels, stack traces, timestamps etc.