apache / superset

Apache Superset is a Data Visualization and Data Exploration Platform
https://superset.apache.org/
Apache License 2.0

[SIP-26] Proposal for Implementing Connection Pooling for Analytics Database Connections #8574

Closed · willbarrett closed this 3 years ago

willbarrett commented 4 years ago

[SIP] Proposal for Implementing Connection Pooling for Analytics Database Connections

Motivation

Currently, Superset’s connections to analytics databases do not have long-lived connection pools. In most instances, a database connection is spawned immediately before a query is executed and discarded after a single use. This introduces a small amount of latency into every query. While most queries run against data warehouses are expected to be longer-running than a typical web application query, this latency will be noticeable when performing operations such as loading schema and table lists for display in the UI, or loading table definitions and previews. A more serious concern is that the number of open database connections to analytics databases is only bounded by the number of threads available to the application across all processes. Under peak load, this can lead to hammering databases with a large number of connection requests and queries simultaneously. This does not allow us to provide meaningful upper bounds for the number of available database connections. Implementing connection pooling at the process level will allow us to provide a configurable maximum number of connections that Superset is able to leverage.

Proposed Change

I recommend we add a singleton object to hold a SQLAlchemy Engine instance for each configured database in the application. I believe that engines should not be instantiated on startup, but instead instantiated on first use to avoid unnecessary connection negotiation.
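
To make the idea concrete, here is a minimal sketch of what such a registry could look like. This is illustrative only: the class name `EngineRegistry` and its method names are hypothetical, not existing Superset APIs.

```python
import threading
from typing import Dict

from sqlalchemy import create_engine
from sqlalchemy.engine import Engine


class EngineRegistry:
    """Process-level singleton mapping database IDs to SQLAlchemy engines."""

    def __init__(self) -> None:
        self._engines: Dict[int, Engine] = {}
        self._lock = threading.Lock()

    def get_engine(self, database_id: int, sqlalchemy_uri: str, **engine_kwargs) -> Engine:
        # Engines are created lazily on first use, avoiding connection
        # negotiation at application startup.
        with self._lock:
            if database_id not in self._engines:
                self._engines[database_id] = create_engine(sqlalchemy_uri, **engine_kwargs)
            return self._engines[database_id]

    def invalidate(self, database_id: int) -> None:
        # Called after a Database record is created or updated so that this
        # process rebuilds its engine (and pool) with the new settings.
        with self._lock:
            engine = self._engines.pop(database_id, None)
        if engine is not None:
            engine.dispose()


# Shared within a single process only; each worker process gets its own copy.
engine_registry = EngineRegistry()
```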

I further recommend that we use the SQLAlchemy QueuePool as the default pool implementation while retaining the ability to configure Superset to use a NullPool, configurable via the Database setup system. I would like to make the pool_size and max_overflow properties configurable, as well as whether to treat the queue as FIFO or LIFO and the pool_pre_ping option, and customization of the connect_args passed on engine instantiation (which controls things like connection timeouts). I believe that LIFO queues will be preferable for infrequently-accessed database connections, as they will generally maintain a lower number of connections in the pool, and thus should be the default. I would also recommend that for LIFO queues we default to the pool_pre_ping option to trigger pool member invalidation when necessary, as stale connections are more likely under the LIFO configuration.
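
As an illustration of what these proposed defaults would translate to at engine-creation time, here is a sketch using SQLAlchemy's standard `create_engine` parameters; the URI and the specific values are placeholders, not recommended settings.

```python
from sqlalchemy import create_engine
from sqlalchemy.pool import NullPool

# Proposed default: a LIFO QueuePool with pre-ping so stale connections are
# recycled before being handed out.
pooled_engine = create_engine(
    "postgresql://user:pass@analytics-db/warehouse",  # placeholder URI
    pool_size=5,          # connections kept open in the pool
    max_overflow=10,      # extra connections allowed under burst load
    pool_use_lifo=True,   # LIFO keeps the working set of connections small
    pool_pre_ping=True,   # validate connections before handing them out
    connect_args={"connect_timeout": 10},  # passed through to the DBAPI driver
)

# Opt-out: databases that cannot tolerate held-open connections keep the
# current behavior by selecting a NullPool.
unpooled_engine = create_engine(
    "postgresql://user:pass@analytics-db/warehouse",
    poolclass=NullPool,
    connect_args={"connect_timeout": 10},
)
```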

As part of this work, I recommend moving engine instantiation code out of the Database model and into its own module, probably as part of the singleton that will maintain an in-memory list of database pools. We will need to update the code that alters database records to reinitialize each process's engine after a Database record is created or updated.

One further change will be in regards to Celery’s connection pooling. Right now, we use the NullPool in Celery and instantiate database connections when needed. For Celery, I would recommend moving to the StaticPool, which will create one database connection per worker process. Because Celery reuses worker processes, this will reduce the overhead on backgrounded queries. An alternative would be to move to threaded workers (gevent or eventlet) and maintain the same pool configuration as the UI. I’d love suggestions from the community on what to recommend here.
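
For the Celery side, a rough sketch of the StaticPool idea follows. The `worker_process_init` signal is one plausible place to build a per-process engine after the fork; the URI is a placeholder and none of this is settled.

```python
from celery.signals import worker_process_init
from sqlalchemy import create_engine
from sqlalchemy.pool import StaticPool

worker_engine = None  # one engine (and one connection) per worker process


@worker_process_init.connect
def init_worker_engine(**kwargs):
    # Build the engine inside the forked worker so the connection is never
    # shared across the fork boundary.
    global worker_engine
    worker_engine = create_engine(
        "postgresql://user:pass@analytics-db/warehouse",  # placeholder URI
        poolclass=StaticPool,  # a single reusable connection per process
    )
```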

New or Changed Public Interfaces

This change should have minimal impact on the UI, the primary change being the addition of more configuration options in the Databases section. I would recommend having sensible defaults and hiding the pool setup under an Advanced configuration section. I plan to provide guidance on the meaning of the pool_size, max_overflow, and FIFO vs LIFO configuration parameters, both in the UI and in new documentation. The configuration approach will be hybrid, allowing global configuration of defaults in config.py, with overrides available on a per-database basis in the UI.
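
A small sketch of what the hybrid configuration could look like, assuming a hypothetical `DEFAULT_DB_POOL_PARAMS` key in `config.py`; the helper function is illustrative rather than an existing Superset function.

```python
# In superset_config.py: global pool defaults (hypothetical key name).
DEFAULT_DB_POOL_PARAMS = {
    "pool_size": 5,
    "max_overflow": 10,
    "pool_use_lifo": True,
    "pool_pre_ping": True,
}


# At engine-creation time, per-database overrides saved via the UI would be
# layered on top of the global defaults (illustrative helper).
def effective_pool_params(db_overrides: dict) -> dict:
    params = dict(DEFAULT_DB_POOL_PARAMS)
    params.update(db_overrides or {})
    return params
```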

New dependencies

No additional dependencies will be necessary.

Migration Plan and Compatibility

A database migration will be necessary to add an additional field to the DBs table to hold connection pooling arguments.
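
For illustration, the migration could look roughly like the Alembic sketch below; the column name `connection_pool_params` and the revision identifiers are placeholders.

```python
import sqlalchemy as sa
from alembic import op

# Placeholder revision identifiers.
revision = "xxxxxxxxxxxx"
down_revision = "yyyyyyyyyyyy"


def upgrade():
    # "dbs" is the table backing Superset's Database model; the column name
    # is hypothetical.
    op.add_column("dbs", sa.Column("connection_pool_params", sa.Text(), nullable=True))


def downgrade():
    op.drop_column("dbs", "connection_pool_params")
```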

No URLs will change as part of this work. I would like feedback from the community, particularly engineers at Airbnb, Lyft, and other organizations with large Superset installs, on what sensible defaults for connection pools would look like.

Rejected Alternatives

The primary alternative rejected is the current, connection-pool-less state. While this state allows for only the number of connections needed at any given time to be in use, it falls down with regard to performance and the predictability of the number of open connections at any given time.

I also considered the other connection pool implementations in SQLAlchemy, but it appears that our use-case is best served by the QueuePool implementation.

One additional piece I considered was providing users an option to configure an overall, rather than per-process, maximum number of connections. In that case, processes would need to “check out” the ability to make a connection from a distributed lock built in Redis, or the max size would need to be large enough to provide at least one connection per live process. While I think this would be a better experience for most users, I’m concerned about the additional application complexity required by such a change. Would processes need to register themselves in Redis on boot so we could get a correct count of the number of live processes? What happens when we need to scale up beyond the global maximum number of database connections? I think solving those problems is not easy, and most use-cases will be well-enough served by a per-process max number of connections.


mistercrunch commented 4 years ago

There's a challenge here around the fact that each subprocess (gunicorn worker and celery worker) gets its own pool, and that each one of those can connect to multiple databases. Depending on whether you configure these things to use threads or workers (subprocesses), you can very quickly end up with a number of open connections that is vastly bigger than the number of active connections. One problem is that while you probably want to cap the number of connections to your db, you want to have a dynamic number of workers as you need more capacity.

There's also a challenge around the fact that threading and SQLAlchemy pools have intricate issues. There are endless Stack Overflow threads documenting this.

Another thought is that we may want to limit concurrency to analytics databases, but this approach doesn't achieve that in any way, as there's no global state shared across servers/workers. This would need to be handled through some sort of environment-global state (Redis?) that would keep track of the number of active connections.

What's the cost of spawning / destroying a connection? On the server / on the client? Milliseconds of CPU time? Worth it?

willbarrett commented 4 years ago

Yes, there is a challenge around each process getting its own pool. I see your point that for some systems (Redshift is getting my side-eye here) having connections held open could be a larger problem. I'll amend the recommendation above to make one of the options a NullPool. This would retain the existing behavior for databases that are unable to handle enough open connections.

RE: the cost of spawning/destroying the connection, I think it's impossible to come up with a really solid specific number. The range likely runs from around 10 milliseconds when the server is a more application-focused one (Postgres, MySQL, etc.) living on the same network, up to potentially multiple seconds for systems separated geographically or with a chattier protocol for initiating a database connection. Under load, these numbers can get quite large.

A goal down the line would be to limit total connections, but I'd like to push that off into a future SIP. I believe a reasonable way to attack that would be to implement a custom pool that leverages Redis as a distributed lock for the system. The Redis lookups for this system will potentially add a fair amount of latency, so that's something we should discuss separately in my mind.

RE: the intricacies of SQLAlchemy in a threaded environment, it appears that connection pools and engines are thread-safe, but that sessions and connections are not. This makes sense intellectually - the connection pool and engine are designed to protect the non-thread-safe resources they contain. None of this is safe across a process boundary, though, so the multiprocessing module in Python is a danger to connection pools. We already have this issue when it comes to the metadata database. Post-fork, any existing connection pools would need to be recreated. From my research, some database drivers implement connection establishment as blocking calls, which will hurt multithreading due to the Global Interpreter Lock. I think for us to really achieve the best throughput we will want to use lazy-loaded connections from process-based Celery workers that then become long-running connections. This, however, is multiple SIPs away, and I anticipate that we will need to retain the ability to run all queries in the foreground for the foreseeable future.
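
To illustrate the post-fork concern, here is one possible way to recreate pools across a fork boundary. This is a sketch under the assumption that the parent process already holds an engine; the hook mechanism shown (`os.register_at_fork`) is just one option.

```python
import os

from sqlalchemy import create_engine

engine = create_engine("postgresql://user:pass@analytics-db/warehouse")  # placeholder URI


def _dispose_pool_in_child() -> None:
    # Drop any pooled connections inherited from the parent process; the
    # child's pool refills lazily on first use.
    engine.dispose()


# Python 3.7+: run the hook in the child process after every fork.
os.register_at_fork(after_in_child=_dispose_pool_in_child)
```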

RE: is it worth it? I think that depends heavily on the workload. In terms of freeing up processor cycles on the web server, it could be very worth it. If there is a substantial geographical separation between Superset and the database accepting the connection, or if connections are slow to instantiate on that server, it will definitely be very worth it. I think providing the option of connection pooling could greatly accelerate certain workloads, though you have convinced me that retaining the option of a NullPool is a wise choice.

mistercrunch commented 4 years ago

Another thing to cover as part of this SIP is the configurability of the pools per database. With potentially heterogeneous params based on pool type, it's hard to strike a balance between something comprehensive and static vs. flexible. We could have both: maybe different presets in a dropdown list, and the possibility to override with actual pool objects in a dict in superset_config.py ...

Planning on making another comment addressing your points above.

suddjian commented 4 years ago

A global redis-based pool would be very useful from a user perspective. Doing that in a future SIP makes a lot of sense though, since right now there is no connection limit at all.

I suggest avoiding offering any configuration in the UI until then, as a config intended to be applied per-process would be difficult to reliably translate to a global config.

mistercrunch commented 4 years ago

I think allowing for people to configure their connection pool is a great thing, let's provide ways to do this. I see two main approaches as well as a hybrid.

1 - by config: Doing it as configuration, in superset_config.py. If going that route, I think it's best to give users the full power of configuration as code, and allow them to pass an instantiated SQLAlchemy pool object. This enables them to use whatever pool class they see fit, and whatever parameters they want. In theory they could even create their own pool class here and do as they see fit. This should probably be a dict keyed by database_name using instantiated pool objects as values (a rough sketch follows after this list).

2 - in UI: Doing it in the UI, and exposing only the common pool type(s) and parameters that matter. This is likely to be simple at first (only allow QueuePool and common parameters like pool_size, max_overflow, timeout, use_lifo). An alternative and less form-heavy approach would be to use the current Extra JSON approach and add a key for queue_pool_parameters, documenting how people can pass things like pool_size, max_overflow, timeout, use_lifo.

3 - hybrid: A hybrid approach could be easy to implement too, where we look for the dict config and fall back on the UI config.
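
To illustrate approach 1 (and how the hybrid in 3 would fall back), here is a rough sketch of what the superset_config.py dict could look like; the `DB_CONNECTION_POOLS` key is hypothetical and the connection details are placeholders.

```python
import psycopg2  # or whichever DBAPI driver the database in question uses

from sqlalchemy.pool import QueuePool

# Hypothetical config key: database_name -> instantiated SQLAlchemy pool.
# Superset would pass the matching pool to create_engine(..., pool=...) and
# fall back to the UI-configured settings when no entry exists.
DB_CONNECTION_POOLS = {
    "examples": QueuePool(
        creator=lambda: psycopg2.connect(
            host="analytics-db", dbname="warehouse", user="user", password="pass"
        ),
        pool_size=5,
        max_overflow=10,
        use_lifo=True,
    ),
}
```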

For reference, where Extra looks like today:

[Screenshot of the current Extra field in the database configuration form]

villebro commented 4 years ago

Personally I like the idea of supporting a hybrid approach, giving precedence to the code-based config. However, given that Superset usually runs on multiple concurrent worker processes, I think the only way of achieving true pooling would require some sort of locking outside Python scope (Redis being the top contender as mentioned above). While it does pose its own set of challenges (not to mention added complexity), somehow it feels simple enough to be manageable, especially if it can be rolled out as an opt-in feature. Therefore I'd vote to at least try building a Redis locking POC, as it should be pretty quick to put together and see what type of overhead or other problems it might introduce.

mistercrunch commented 4 years ago

I think it should be possible to build a SQLAlchemy-compatible pool called CappedDistributedPool that'd leverage Redis or something else (as long as it's not Zookeeper :) to limit concurrency, and hook it in using the configuration hook.
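
A very rough sketch of what such a cap could look like, using SQLAlchemy's checkout/checkin pool events and a shared Redis counter rather than a full custom pool class; CappedDistributedPool doesn't exist yet, and the key name and limit below are illustrative.

```python
import redis
from sqlalchemy import create_engine, event, exc

r = redis.Redis(host="localhost", port=6379)

GLOBAL_CAP = 50  # illustrative global limit across all processes
CAP_KEY = "superset:active_connections:warehouse"  # illustrative key

engine = create_engine("postgresql://user:pass@analytics-db/warehouse")  # placeholder URI


@event.listens_for(engine, "checkout")
def cap_checkout(dbapi_connection, connection_record, connection_proxy):
    # Increment the shared counter and refuse the checkout if the global
    # cap across all workers has been reached.
    if r.incr(CAP_KEY) > GLOBAL_CAP:
        r.decr(CAP_KEY)
        raise exc.TimeoutError("global connection cap reached")


@event.listens_for(engine, "checkin")
def cap_checkin(dbapi_connection, connection_record):
    # Release the slot when the connection goes back into the pool.
    r.decr(CAP_KEY)
```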

willbarrett commented 4 years ago

I like the hybrid approach as well. I'm not excited about adding this configuration to extra though - I think it would deserve its own database columns, especially considering the table holding these records is unlikely to be massive, thus making migrations manageable. Unstructured data in an RDBMS is a pretty strong antipattern that I'd like to avoid exacerbating in this case.

I think there is a valid use-case for un-capped pools, especially when the database connections are to datastores like BigQuery and Athena - anywhere that supports massively concurrent access. Capped pools are more important for systems like Redshift, Postgres, MySQL, etc., where too many open connections can cause difficulty. I can put together a proof of concept of a Redis-lock-based, capped, distributed pool and we can take a look. I expect to be able to get to it in a couple of days.

mistercrunch commented 4 years ago

I think priority #1 is pool "configurability" per database (achieved with a superset_config dict hook), and #2 is some way to do this in the UI (important for Preset until we have some sort of per-tenant configuration hook).

I'd wait for a direct request prior to actually building a CappedDistributedPool POC and assume that it'd be possible as it becomes needed, building on top of #1 and #2.

mistercrunch commented 4 years ago

I don't feel strongly about adding a column or adding on to extra, but I think using extra here for this long tail of semi-structured configurability is ok given the use case. The "proper" 3NF model that would support all the options would be super unmanageable.

Hybrid data models (mixing structured and semi-structured fields) are becoming more common over time, and databases support them much better than they used to. That doesn't mean it's right, but it clearly reflects a reality of rapidly-evolving / complex schemas.

willbarrett commented 4 years ago

I'm willing to compromise on the unstructured column. I've updated the SIP to reflect this discussion.

metaperl commented 4 years ago

> I think priority #1 is pool "configurability" per database (achieved with a superset_config dict hook),

Related: https://github.com/apache/incubator-superset/issues/9029

etr2460 commented 4 years ago

A couple questions/comments:

Will this SIP also cover using connection pooling for connecting to the metadata database from Celery workers? That's a common issue that we've seen, where if too many connections get opened in Celery, we start to overwhelm the total connections available to MySQL (this may also be related to not closing connections after finishing an async query, though this is uncertain).

What would the user experience be if there are no available connections when making a sync query? Would that request stall and wait for a connection to become available, or instantly fail?

Finally, although this may be tangential to the work here, a CappedDistributedPool would be extremely useful for us, especially if it capped connections (essentially currently active queries) by both user and analytics db. This would let us rely on Superset to throttle/queue queries prior to sending them to dbs so we can make best use of our cache.

willbarrett commented 4 years ago

Some answers/responses for @etr2460

This SIP is for connection pooling for analytical databases, not the metadata database. I'd be happy to talk with you to understand your metadata DB issues, but do not want to consider that problem as part of this SIP.

I think the most reasonable behavior for a sync query that cannot check out a connection would be to block and wait for a connection to become available, with the failure state being an eventual timeout, but this isn't a strong opinion. I'd be interested in others' thoughts on the matter.

I understand the desire for a CappedDistributedPool, but want to treat that as a separate piece of work. The work proposed in this SIP would be a precondition for creating a CappedDistributedPool, so we'd be moving in the right direction.

mistercrunch commented 4 years ago

[tangential to this SIP and related to @etr2460 's comment] about Celery connection pooling to the metadata database, it seems reasonable to think that the connection requirements in the Celery context are different from the typical web-server-building-a-page use case, and it'd be good to offer configurability that is context aware (celery / web). Personally I don't think it requires a SIP though if it's just a configuration hook.

etr2460 commented 4 years ago

Thanks for the answers Will, it all makes sense to me. I agree that blocking and waiting is the right experience for now. No further questions!

villebro commented 4 years ago

I like the idea of being able to use connection pooling wherever appropriate, even for metadata connections if necessary.

willbarrett commented 4 years ago

@villebro completely agreed RE: connection pooling everywhere being desirable. Given the structure of the system though, pooling for the metadata database is a special case. Connection pooling should already be in use for foreground connections, but Celery presents special concerns, which is why I'd like to treat it separately. We definitely share the same end goal though!

junlincc commented 4 years ago

🏷database

rusackas commented 3 years ago

Approved!

rusackas commented 1 year ago

Does anyone here know the implementation status of this proposal?

thinkh commented 1 year ago

I would be also interested in an update for this proposal. Can anyone post something about this?

villebro commented 1 year ago

I think this SIP is stale. It may also need to be revised to cover Global Async Queries (or not, I haven't thought this through), as that's probably what we'll try to work towards in the long term.

thinkh commented 1 year ago

Ok, thanks for your reply.

john-bodley commented 1 year ago

So it seems like this functionality may already have been partially implemented, sans the configuration within the UI.

Specifically, the _get_sqla_engine method has the ability to either use a NullPool (the default) or fall back to the SQLAlchemy default which, if not set, reverts to the dialect-specific configuration (or QueuePool if not defined).

If the connection pool settings need to differ from the metadata database then one can leverage SQLAlchemy binds via Flask-SQLAlchemy's SQLALCHEMY_BINDS configuration.
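
For reference, a minimal example of the SQLALCHEMY_BINDS approach as it would appear in superset_config.py; the URIs and the bind name are placeholders.

```python
# Metadata database (the Flask-SQLAlchemy default bind).
SQLALCHEMY_DATABASE_URI = "postgresql://user:pass@meta-db/superset"

# Additional binds get their own engines, and therefore their own pools,
# separate from the metadata database.
SQLALCHEMY_BINDS = {
    "analytics": "postgresql://user:pass@analytics-db/warehouse",
}
```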

The one missing step is exposing the ability to choose between using the NullPool pool-class or the dialect pool-class when invoking the get_sqla_engine method.

villebro commented 1 year ago

Rereading this now, I wonder if the correct design for this should in fact exist downstream of Superset, and not within Superset, for the reasons explained by Max in his first post:

> There's a challenge here around the fact that each subprocess (gunicorn worker and celery worker) gets its own pool, and that each one of those can connect to multiple databases. Depending on whether you configure these things to use threads or workers (subprocesses), you can end up with a lot of connections very quickly that is vastly bigger than the number of active connections.

Especially in the context of horizontal scaling of workers, keeping connections open on the worker processes will quickly lead to lots of idle open connections that are not efficiently utilized. So if someone has seen a generic connection pooler for SQLAlchemy connections, similar to Pgpool-II, that could be an awesome solution to this problem (although I'm sure configuring it would be a challenge in itself..).

rusackas commented 5 months ago

Is anyone intending to continue with this? If not, it'll be considered stale and move to the closed/deferred column.

mistercrunch commented 5 months ago

As much as this matters for OLTP, where you potentially have high transactions per second and a super small unit of work (your average atomic operations are counted in low ms), I feel like it matters significantly less [proportionally] for analytics workloads, where your average latency is more like hundreds or thousands of milliseconds.

rusackas commented 5 months ago

Ok... calling it dead in the water... if anyone wants to rekindle this or keep the discussion going, just say the word and we can bring it back in the kanban.