apache / superset

Apache Superset is a Data Visualization and Data Exploration Platform
https://superset.apache.org/
Apache License 2.0
61.22k stars 13.33k forks source link

[SIP-141] Global Async Queries 2.0 #29515

Closed villebro closed 2 weeks ago

villebro commented 1 month ago

[SIP-141] Proposal for Global Async Queries 2.0

Motivation

With [SIP-39] Global Async Query Support (GAQ for short) still being behind an experimental feature flag, and not actively maintained, I've been thinking about ways we could simplify the architecture, and finally make this feature generally available in a forthcoming Superset release. I feel the following issues have all done their part to keep this feature from gaining wide community traction:

Having said all this, the feature is still as relevant today as it was when the original SIP was opened, and I think stabilizing this feature is very important because Superset's current synchronous query execution model causes lots of issues:

Proposed Change

To simplify the architecture and reuse existing functionality, I propose the following:

The async execution flow is changed to be similar to SQL Lab async execution, with the following changes:

Some random thoughts:

New or Changed Public Interfaces

New dependencies

In the base proposal, I suggest not adding any new dependencies, and simply supporting polling. However, we may consider using Server-sent events as noted by @betodealmeida .

Migration Plan and Compatibility

  1. Remove feature flag + related code, Websocket worker code
  2. Remove references to websocket deployment/service from Helm chart
  3. Add new feature flag SIMPLIFIED_GLOBAL_ASYNC_QUERIES

Rejected Alternatives

nytai commented 1 month ago

To support automatic cancellation of queries, we add a new optional field poll_ttl to the query context, which makes it possible to automatically cancel queries that are not being actively polled. Every time the cache key is polled, the latest poll time is updated on the metadata object. While executing, the worker periodically checks the metadata object, and if the poll_ttl is defined, and if the last poll time exceeds the TTL, the query is cancelled. This ensures that if a person closes a dashboard with lots of long running queries, the queries are automatically cancelled if nobody is actively waiting for the results. By default, frontend requests have poll_ttl set to whichever value is set in the config (DEFAULT_CHART_DATA_POLL_TTL). Cache warmup requests would likely not have a poll_ttl set, so as to avoid unnecessary polling.

I'd suggest making this configurable or to respect the "cancel queries on window onload event" that exists in the db connection settings. Some workflows involve dashboard being accessed multiple times a day, so there is some benefit to running and caching the queries even though there isn't someone actively waiting for them. Thinking of a case where someone opens a dashboard, it takes too long to load so they pack up and commute to the office, and open up the dashboard once they're at their desk.

michael-s-molina commented 1 month ago

I've been thinking.... the problems of the sync model are clearly described by this SIP. The proposed solution is to use an async model with polling, but @betodealmeida and @rusackas have a valid point about latency for real-time data and also possible push features. Server-side events could be used but it wouldn't be sufficient as we clearly need bi-directional communication. Considering all requirements, shouldn't we use this opportunity to actually fully adopt WebSockets as the single solution to reduce the complexity of many modes of operation? It seems it would address all our requirements and avoid a situation where we add async polling now to later introduce another method for push features.

villebro commented 1 month ago

I feel the issue ultimately comes down to maintainability of the feature. Currently, the main issues that we're facing with synchronous queries are as follows:

  1. no deduplication of queries
  2. no mechanism for cancelling running queries
  3. queries locking webserver threads, making it difficult to scale infra appropriately

While the current WebSocket solution addresses the third point above, the other points are not addressed by the current solution, and in my experience, they're far more critical for enterprise deployments. I'm all for hardening and even expanding the use of WebSocket solution if we can find people to maintain and develop it further. However, given the low community involvement on the superset-websocket codebase, I feel the maintenance burden of this component exceeds the benefit of it in the current architecture.

If we do feel WebSockets are critical for the future roadmap, I feel we still need to simplify the architecture, and support query deduplication and cancellation, as they are not within the scope of the current GAQ solution.

michael-s-molina commented 1 month ago

If we do feel WebSockets are critical for the future roadmap, I feel we still need to simplify the architecture, and support query deduplication and cancellation, as they are not within the scope of the current GAQ solution.

@villebro Exactly! This 👆🏼 what I meant by fully adopting WebSockets. We would need to simplify its architecture, add missing features and use it as the primary solution in Superset to ensure maintainability.

villebro commented 1 month ago

A quick summary from the discussion on this SIP in the Town Hall meeting on 2024-07-12:

The SIP will shortly be updated to reflect these changes. Furthermore, a weekly meeting has been setup for coordinating the effort to redesign the WS implementation to bring the core features and changes proposed in this SIP, most notably query deduplication, management UI for queued/running queries, and architectural simplification. The meeting will take place on Mondays at noon PST (all interested are welcome to join!)

harshit2283 commented 1 month ago

To add to the general consensus, WebSockets are crucial for large-scale deployments, especially when handling a significant number of active users. Long polling is inefficient for this purpose, and horizontal scaling isn't ideal since it requires scaling the supporting infrastructure as well.

I'd be happy to contribute towards making WebSockets mainstream.

gempir commented 1 month ago

I think the biggest hinderance to the adoption of the new WebSocket server is mostly the missing documentation.

The most I could find this in the README https://github.com/apache/superset/blob/c7b8ae9013dcbffb3bb0dda4dc1d57cc9c18048c/superset-websocket/README.md?plain=1#L73

And overall it feels like a feature that sounds very much in development if there is nothing about it in the official documentation.

That's why at least at my company we did not deploy the websocket server.

villebro commented 2 weeks ago

We are closing this SIP and will be opening a new one to reflect a new direction that was established during discussions. But in summary:

villebro commented 2 weeks ago

Follow-up SIP here: #29839