apache / superset

Apache Superset is a Data Visualization and Data Exploration Platform
https://superset.apache.org/
Apache License 2.0

Superset becomes unresponsive if a database is not responding #14053

Open aaronfeng opened 3 years ago

aaronfeng commented 3 years ago

Expected results

If there's an issue with a single database it should not crash the whole system.

Actual results

A couple of days ago we experienced a production outage: all of the web frontends (~10) became unresponsive and were removed by the ELB. After restarting all of the web frontends we were able to load some of the pages, but Superset became unresponsive again shortly after. Web server logs didn't reveal any obvious errors. Eventually we noticed that the Databases tab didn't load at all after a rolling restart. It turned out that our Hive server was not able to accept new connections because it had hit its thread limit. After rebooting the Hive server, Superset started to function properly again.

I don't believe many people were trying to load the Databases tab, but people were trying to run ad hoc queries using the SQL Editor. Loading the SQL Editor caused the Databases dropdown to load, which I believe is similar to loading the Databases tab. During this time the Databases and Schema dropdowns were blank.

We are running the Superset 1.0.1 Docker image.

Screenshots

Didn't take any screenshots, but the Databases tab was completely blank, as if it was still trying to load.

amitmiran137 commented 3 years ago

Sounds severe! If you could upload logs, that would be helpful @aaronfeng

aaronfeng commented 3 years ago

@amitmiran137 unfortunately I didn't see anything that was useful in the logs. I assume you mean web server logs? It was a lot of trial and error to debug the issue.

CraigChaffee commented 3 years ago

I dealt with this directly. From what I can tell, several operations fail or take unnecessarily long if a DB connection fails, when they definitely shouldn't be affected. This behavior multiplied the complexity and extent of an otherwise simple outage, blocking access for all users on all other DBs.

When a database connection takes a long time to respond, we've seen the following endpoints also time out:

- List databases API: /api/v1/database/ (breaks SQL Lab)
- List Databases page: /databaseview/list/
- Database Edit/Save

It's not clear why any one of these pages would need to verify all connections before loading data. I haven't dived into the code, but this seems like a major design flaw. During our outage there was no way to debug or modify anything through the web interface. What's worse is that by design Superset doesn't allow deletion of entities whenever there are dependent tables (no delete on cascade), so permanently deleting any database is colossally painful. Anyone who has tried to delete one has been blocked by a maze of foreign key constraints. I realise not cascading on delete is safer, but not offering a safe way to do this makes administration unreasonably cumbersome.
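
To illustrate the point about listing pages not waiting on every connection, here is a minimal sketch of checking several databases concurrently with a bounded timeout, so that one unresponsive server costs at most one timeout rather than stalling the whole page. This is not Superset's code: `is_reachable` and `probe_all` are hypothetical names, and a plain TCP probe is an assumption made for brevity. A server that accepts the TCP connection but then hangs (as a Hive server at its thread limit might) would still pass this check, so a real check would also need a timeout on the protocol handshake.

```python
import socket
from concurrent.futures import ThreadPoolExecutor


def is_reachable(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port opens within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, DNS failures and timeouts.
        return False


def probe_all(targets: dict[str, tuple[str, int]], timeout: float = 2.0) -> dict[str, bool]:
    """Probe every target concurrently (hypothetical helper, not a Superset API).

    Because the probes run in parallel, the total wait is roughly one
    timeout, not the sum of the timeouts of all unreachable targets.
    """
    with ThreadPoolExecutor(max_workers=max(1, len(targets))) as pool:
        futures = {
            name: pool.submit(is_reachable, host, port, timeout)
            for name, (host, port) in targets.items()
        }
        return {name: future.result() for name, future in futures.items()}
```

A caller would pass a mapping such as `{"hive": (host, port), "trino": (host, port)}` and render the list immediately, marking unreachable entries instead of blocking on them.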

stale[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. For admin, please label this issue .pinned to prevent stale bot from closing the issue.

rusackas commented 1 year ago

While this issue is obviously stale, it still sounds gross. Did anyone on this thread happen to run into this again or gain any further insight they can add? Is this still a risk in (significantly) newer versions of Superset, e.g. 2.1.0?

CC @betodealmeida @nytai @bkyryliuk in case they have any insight into whether this is addressable and/or ought to remain open (despite the reported version of Superset no longer being officially supported).

betodealmeida commented 1 year ago

As a rule of thumb we separate API endpoints that hit only the metadata database from API endpoints that hit analytical databases. Requests to the latter should be asynchronous and non-blocking (e.g., that's how we load function names for the autocomplete in SQL Lab). That being said, it could be that there are places where we're not doing that properly. I remember fixing a few use cases (including the function names), but it would be nice to do an audit.
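
The rule of thumb above can be sketched generically as follows: calls that touch an analytical database run off the request path with a deadline, and the caller gets a fallback value if the deadline passes. This is only an illustration under stated assumptions, not how Superset implements it; `call_with_deadline` and the pool size are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

# Hypothetical shared pool for calls that reach analytical databases.
_pool = ThreadPoolExecutor(max_workers=4)


def call_with_deadline(fn, *args, timeout=1.0, fallback=None):
    """Run fn(*args) in a worker thread and wait at most `timeout` seconds.

    Returns fn's result if it finishes in time, otherwise `fallback`.
    """
    future = _pool.submit(fn, *args)
    try:
        return future.result(timeout=timeout)
    except FutureTimeout:
        # cancel() only helps if the task has not started yet; a call that
        # is already running keeps its worker busy until it returns.
        future.cancel()
        return fallback
```

One design caveat: the deadline frees the caller, not the worker. If every worker is stuck on the same hung database, the pool itself is exhausted, which is why a driver-level connection or query timeout is still needed alongside a pattern like this.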

iercan commented 7 months ago

We encounter this issue periodically on Superset 2.1.1. When Superset becomes unresponsive, we inspect Trino and identify a long-running query. Once we terminate that query, everything returns to normal.

rusackas commented 7 months ago

Just a note that we no longer support Superset 2.x. Is anyone able to repro this in 3.x?

swaresh commented 6 months ago

I have encountered this issue with Superset 3.1.0 as well. Any pointers on how to resolve this?

SGH-N commented 3 weeks ago

I've encountered the same issue with Superset 4.0.2 on both Chrome and Firefox browsers. The query runs without any problems when using DBeaver (JDBC client) and completes in just 2 seconds.
