elixir-ecto / db_connection

Database connection behaviour
http://hexdocs.pm/db_connection/DBConnection.html

Add built in connection pool #108

Closed fishcakez closed 6 years ago

fishcakez commented 6 years ago

Adds a built-in connection pool that will increase throughput by 3-7% for most single queries and load levels in Ecto. Other workloads may see 0-10% improvements.

The pool uses an ETS queue and some other ETS trickery to avoid tracking busy connections/clients. The latter feature (if this were the only pool style) would allow us to stop using the pdict, so that we can ensure no data leaks for non-closure transactions.
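
For a rough feel of the ETS-queue idea, here is a minimal sketch (names and layout are illustrative, not the PR's actual implementation):

```elixir
# Illustrative only: a FIFO queue in an ETS ordered_set, keyed by
# {monotonic time, monotonic unique integer} so keys are unique and
# time-ordered.
queue = :ets.new(:pool_queue, [:ordered_set, :public])

enqueue = fn item ->
  key = {System.monotonic_time(), :erlang.unique_integer([:monotonic])}
  :ets.insert(queue, {key, item})
  key
end

dequeue = fn ->
  case :ets.first(queue) do
    :"$end_of_table" ->
      :empty

    key ->
      # :ets.take/2 reads and deletes the row in one call
      [{^key, item}] = :ets.take(queue, key)
      {:ok, item}
  end
end

enqueue.(:conn1)
enqueue.(:conn2)
{:ok, :conn1} = dequeue.()
```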

We might be able to get better performance by removing either or both of these ETS use cases.

IO overload handling is better (and simpler) than the Sojourn pool's. In the future we could improve the case where the database is down or on the other side of a netsplit, because we can know when there are no connections.

CPU overload handling is not as good as the Sojourn pool's, but could be improved by limiting the queue size and dropping all new checkouts above that size.
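
A sketch of that load-shedding idea (the bound, names, and error shape are made up, not part of this PR):

```elixir
# Made-up sketch: refuse new checkout requests once the ETS queue passes
# a fixed size, so CPU overload sheds work instead of growing an
# unbounded backlog.
max_queue = 500
queue = :ets.new(:pool_queue, [:ordered_set, :public])

checkout = fn request ->
  if :ets.info(queue, :size) >= max_queue do
    {:error, :overloaded}
  else
    key = {System.monotonic_time(), :erlang.unique_integer([:monotonic])}
    :ets.insert(queue, {key, request})
    :ok
  end
end

:ok = checkout.(:request1)
```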

There is no overflow. This is something we could add in the future if we find a simple and good algorithm.

Idle ping handling is improved over other pools because pings will never synchronize and happen as rarely as possible.
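
A sketch of why pings need not synchronize, with assumed field names: each poll pings only the connections whose idle time actually exceeds the interval, rather than pinging every connection on one shared fixed timer.

```elixir
defmodule IdlePing do
  # Assumed shape, for illustration: on each poll, select only the
  # connections whose last use is older than the idle interval, so ping
  # timing tracks real idleness instead of a fixed synchronized schedule.
  def due_for_ping(conns, idle_interval, now) do
    Enum.filter(conns, fn %{last_used: last} -> now - last >= idle_interval end)
  end
end

now = System.monotonic_time(:millisecond)

conns = [
  %{id: 1, last_used: now - 2_000},
  %{id: 2, last_used: now - 100}
]

IdlePing.due_for_ping(conns, 1_000, now)
#=> only the connection with id: 1 is due for a ping
```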

Timeout handling is improved over other pools because an absolute deadline can be used (useful for multiple-repo or multi-service calls), and the queue time is included in the timeout when it is not absolute. Also, queue timeouts are DBConnection.ConnectionError errors, like in the Sojourn pool, so they are easier to handle.
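
The absolute-deadline pattern looks roughly like this (a sketch; the surrounding option plumbing is assumed, not the PR's exact API):

```elixir
# Illustrative only: derive each call's timeout from a single absolute
# deadline, so a multi-repo or multi-service operation shares one budget
# and queue time on earlier calls eats into later ones.
deadline = System.monotonic_time(:millisecond) + 5_000

remaining = fn ->
  max(deadline - System.monotonic_time(:millisecond), 0)
end

# each query would get only the time left in the overall budget, e.g.
# DBConnection.execute!(conn, query, params, timeout: remaining.())
remaining.()
```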

Process hibernation occurs efficiently and does not affect performance. It could be more efficient if we hibernated on backoff.

Overhead for pool_size: 1 is insignificant compared to DBConnection.Connection, so this pool could replace it.

josevalim commented 6 years ago

This looks great! What is the plan here? Do we plan to remove support for at least the Sojourn pool then? I think we should come up with a plan to remove code before adding even more. :)

I also recall you saying this supports ownership out of the box. Should we force this pool to be used whenever we want the ownership style tests?

fishcakez commented 6 years ago

> What is the plan here?

There isn't a plan here yet; I just knew we could have a faster pool.

> Do we plan to remove support for at least the Sojourn pool then?

I would like to do this very much. I would actually like to remove all the pools and just use this one and ownership. Or some other big simplification.

An alternative would be to always supervise connections away from the actual pool process, as done here, so that most of the logic lives in the same shared off-tree worker, with a light message-passing/queuing loop in the pooled worker. By connecting in another process we can improve the queuing in all existing pools (except Sojourn, which already does it). I think it would also allow us to simplify Ownership by not having a nested pool and always running pool_size proxy processes.
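
A rough sketch of that shape, with made-up pieces (a Task stands in for a real connection process):

```elixir
# Made-up sketch of "supervise connections away from the pool process":
# connections live under their own supervisor; the pool process only
# monitors them, so a connection crash arrives as a :DOWN message
# instead of taking the pool down with it.
{:ok, conn_sup} = DynamicSupervisor.start_link(strategy: :one_for_one)

{:ok, conn} =
  DynamicSupervisor.start_child(conn_sup, {Task, fn -> Process.sleep(:infinity) end})

# the pool would now enqueue `conn` as idle and watch it
ref = Process.monitor(conn)
```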

> I also recall you saying this supports ownership out of the box. Should we force this pool to be used whenever we want the ownership style tests?

🤔 This doesn't actually work with ownership because of https://github.com/elixir-ecto/db_connection/issues/106 (NB: this issue is trivial to solve for most cases by having a watcher per pool type). If we stopped allowing custom pool behaviours it could work by using the same message-passing semantics, i.e. a contract on messages rather than functions. This would also solve the issue of having to pass the pool option.

josevalim commented 6 years ago

> I would like to do this very much. I would actually like to remove all the pools and just use this one and ownership. Or some other big simplification.

👍

> If we stopped allowing custom pool behaviours it could work by using the same message-passing semantics

👍

fishcakez commented 6 years ago

We should add an expire_timeout or similar option that disconnects a connection on checkin/poll if it is older than a timeout.
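
Sketched out, with made-up shapes (expire_timeout is only the proposed name, not an existing option):

```elixir
defmodule Expire do
  # Made-up shape: on checkin, disconnect connections that have lived
  # longer than expire_timeout instead of returning them to the queue.
  def on_checkin(%{started_at: started} = conn, expire_timeout, now) do
    if now - started >= expire_timeout do
      {:disconnect, conn}
    else
      {:checkin, conn}
    end
  end
end

now = System.monotonic_time(:millisecond)
{:disconnect, _} = Expire.on_checkin(%{started_at: now - 60_000}, 30_000, now)
{:checkin, _} = Expire.on_checkin(%{started_at: now - 1_000}, 30_000, now)
```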

michalmuskala commented 6 years ago

Today @DavidAntaramian asked on Slack about gathering some statistics about the pool, in particular stats about saturation and load of the pool for optimising and monitoring the application.

If we're designing a new pool, maybe it would make sense to also include some APIs to gather this data or at least provide some hooks so it could be integrated with some metrics library?

michalmuskala commented 6 years ago

Also, regarding overflow: I think for database connections a way to dynamically change the pool size would actually be more useful than poolboy-style overflow, because of the connection cost.

If we had a way to gather metrics, it would become possible to have a component regulating the pool size (this could be integrated as an external library), growing it during overload and decreasing during quiet time, but smoothly.

fishcakez commented 6 years ago

> If we're designing a new pool, maybe it would make sense to also include some APIs to gather this data or at least provide some hooks so it could be integrated with some metrics library?

Yes, definitely, but I think we can worry about metrics after the implementation. It will be easier to do once we have removed poolboy/sojourn; we do not need to get ahead of ourselves :D.

> If we had a way to gather metrics, it would become possible to have a component regulating the pool size (this could be integrated as an external library), growing it during overload and decreasing during quiet time, but smoothly.

This is what the Sojourn pool does already, FWIW (and it provides hookable metrics). I am slightly 👎 on any overflow feature, to be honest, because of the complexity.

fishcakez commented 6 years ago

@michalmuskala I removed the monitoring as it's not required. I'm unsure how small the increase in performance is, but it simplifies the code slightly.

michalmuskala commented 6 years ago

@fishcakez I got some failures when I tried the changes:

```
07:34:36.676 [error] GenServer #PID<0.187.0> terminating
** (CaseClauseError) no case clause matching: {-576460751509, #Reference<0.4040485667.3719692292.85506>}
    (db_connection) lib/db_connection/connection_pool.ex:155: DBConnection.ConnectionPool.handle_info/2
    (stdlib) gen_server.erl:616: :gen_server.try_dispatch/4
    (stdlib) gen_server.erl:686: :gen_server.handle_msg/6
    (stdlib) proc_lib.erl:247: :proc_lib.init_p_do_apply/3
Last message: {:timeout, #Reference<0.4040485667.3719561220.85547>, {-576460750509, -576460751509}}
State: {:ready, #Reference<0.4040485667.3719692292.85497>, %{delay: 0, idle: #Reference<0.4040485667.3719561220.85548>, idle_interval: 1000, interval: 1000, next: -576460750509, poll: #Reference<0.4040485667.3719561220.85547>, slow: false, target: 50}}
```

fishcakez commented 6 years ago

Oh, there was a subtle bug in the old code that didn't get noticed until that change.

michalmuskala commented 6 years ago

Here are some benchmarks with raw db_connection, without actually connecting to a database:

https://gist.github.com/michalmuskala/e503d650718d122fbdb478372b83c46a

michalmuskala commented 6 years ago

And here are some with postgres connecting to a real database:

https://gist.github.com/michalmuskala/00d163ddd0c7eff1770b27cd28ad8dfb

redink commented 6 years ago

Any updates?

fishcakez commented 6 years ago

I've moved to a raw message instead of a GenServer call. This should be ready to merge; work to support the raw message, instead of an option, can happen in later PRs.
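
For illustration, the raw-message style looks roughly like this; the message shapes below are hypothetical, not the actual protocol:

```elixir
# Hypothetical message shapes, for illustration only: the caller sends a
# plain tagged message and awaits the reply itself, instead of paying for
# GenServer.call/3's wrapping on the hot checkout path.
pool = self() # stand-in for a real pool pid

ref = Process.monitor(pool)
send(pool, {:checkout, {self(), ref}})

# a real pool would reply with {ref, {:ok, conn}}; here nothing replies,
# so the `after` clause fires
receive do
  {^ref, {:ok, conn}} ->
    {:ok, conn}

  {:DOWN, ^ref, :process, _pid, reason} ->
    {:error, reason}
after
  100 ->
    Process.demonitor(ref, [:flush])
    {:error, :timeout}
end
```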

josevalim commented 6 years ago

Unique integer should be fast if it is not monotonic. Basically a simple counter per scheduler.

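For reference, a quick illustration of the difference with :erlang.unique_integer/1:

```elixir
# Without modifiers, :erlang.unique_integer/1 only guarantees uniqueness
# and uses per-scheduler counters, so it is cheap; [:monotonic] adds
# strict global ordering at extra synchronization cost.
fast = :erlang.unique_integer([])
ordered = :erlang.unique_integer([:monotonic])
{fast, ordered}
```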

fishcakez commented 6 years ago

Merging to unblock work on moving ownership to use this pool. We can improve this pool further later on.