elixir-ecto / db_connection

Database connection behaviour
http://hexdocs.pm/db_connection/DBConnection.html

Add expiration timeout that limits max life-time of a connection #99

Closed fishcakez closed 3 years ago

fishcakez commented 7 years ago

This timeout should be randomized to prevent sync disconnects.

fishcakez commented 6 years ago

We can do this as part of #108, and just for that pool because for poolboy/connection it is awkward to handle.

jeffrom commented 6 years ago

Hello, I'd like to reopen this. It would be nice to recycle connections so we can better handle the case where our replica connections are unbalanced.

ymtszw commented 5 years ago

Pinging, since this is an ongoing issue when we use a cloud database service with an automatic failover feature, such as Aurora (provisioned cluster) in AWS RDS.

From the RDS docs: https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/Aurora.Overview.Endpoints.html#Aurora.Endpoints.Cluster

The physical IP address pointed to by the cluster endpoint changes when the failover mechanism promotes a new DB instance to be the read-write primary instance for the cluster. If you use any form of connection pooling or other multiplexing, be prepared to flush or reduce the time-to-live for any cached DNS information. Doing so ensures that you don't try to establish a read-write connection to a DB instance that became unavailable or is now read-only after a failover.

With db_connection 2.1.0 (with Ecto 3.1.7 and MyXQL 0.2.6), failing over an Aurora cluster causes the behaviour shown below. (Assumption: the Elixir applications are connecting to an Aurora cluster via the Writer (primary) endpoint, using its DNS name. Reader endpoints inherently work as reverse proxies, so they should not be a problem.)

(Figure: Connection counts. The orange line is for the initial Writer instance and the blue line is for the initial Reader instance. In this setup the Reader endpoint is NOT used; the Elixir application connects to the cluster via the Writer endpoint only. The drop-off of the orange line after the 16:40 mark is the failover. Reconnections target the initial Writer instance, as you can see. An application rolling update at 17:00 "corrects" the situation; after that, newly started applications connect to the new Writer instance.)

This I believe is the same problem reported in https://github.com/xerions/mariaex/issues/180.

Several minutes after a failover, reconnections can definitely target the newly promoted Writer, since the DNS record has been updated. But immediately after the failover, reconnections will likely grab the same IP address, since the DNS record is either not yet updated, or updated but not yet seen by the instances running the application.

An obvious workaround for now is to force an application restart or trigger a deploy several minutes after the failover, as shown above.

BTW, I haven't tried issuing write requests toward the reconnected (now read-only) endpoint. Does it cause an exception and trigger reconnections? I might try that when I have time.

josevalim commented 5 years ago

Have you tried using this option? https://github.com/elixir-ecto/myxql/blob/b8848ba128fb9310aefff0965352373721e0801f/lib/myxql.ex#L104

That's what James proposes as the last suggestion in the linked issue. It exists on the new driver as well as Postgrex. This means you will continue running and reconnect automatically when a write is attempted.
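
For reference, here is a minimal sketch of how that option could be wired into an Ecto repo config backed by MyXQL. The repo name, host, and credentials are placeholders, and the specific error-code atom is an assumption (MySQL reports ER_OPTION_PREVENTS_STATEMENT when a write hits a read-only server):

config :my_app, MyApp.Repo,
  hostname: "my-cluster.cluster-xxxx.us-east-1.rds.amazonaws.com",
  username: "app",
  password: "secret",
  database: "app_prod",
  # Drop (and later re-establish) the connection when the demoted, now
  # read-only primary rejects a write after a failover.
  disconnect_on_error_codes: [:ER_OPTION_PREVENTS_STATEMENT]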

As James also said, if you are reconnecting to the reader even after a disconnect, then it is most likely a DNS cache issue at the OS level. Have you configured those in any way? We don't have any application local cache.

ymtszw commented 5 years ago

@josevalim Thanks, I have certainly read that in the past but haven't tried :P

It looks like it takes the approach of "if it fails in a certain manner on a write request, reconnect", right?

If so, it answers my own question in the last comment,

I haven't tried issuing write requests toward the reconnected (now read-only) endpoint. Does it cause an exception and trigger reconnections?

...and the answer seems to be "yes, with the right MyXQL configuration," which is good news. I will definitely try it out.

ymtszw commented 5 years ago

BTW I do understand that the DNS cache is at the OS level, nowhere within applications.

The issue probably boils down to whether proactive reconnection via a connection TTL is required in some way, or whether the "lazily reconnect on failed writes" approach suffices.

schabou commented 4 years ago

Hello!

Currently running into another type of clash with Aurora. Aurora handles load balancing between read replicas using a DNS round-robin strategy. I'm seeing an issue where a k8s-managed Elixir system spins up several pods to deal with a spike in traffic, and all the connections end up being pooled against the same replica, leading to a large imbalance in the distribution of connections across the cluster.

I was wondering if you had any suggestions on how to mitigate this in the meantime?

josevalim commented 4 years ago

Hi @jvoegele! I remember we had recent discussions around this. Did you try some approaches? Can you share what worked out for you?

jvoegele commented 4 years ago

Hi @josevalim, @schabou et al.

I have implemented a "dirty hack" in our application code that--while not perfect by any means--does somewhat effectively work around the Aurora load balancing issue. What I did was use the configure hook that is available for the Repo configuration to introduce a small delay in between connections:

In /config/config.exs:

config :galaxy, Galaxy.Repo,
  configure: {Galaxy.Repo, :pre_connect, [[max_delay: 2_000]]}

In /lib/galaxy/repo.ex:

  def pre_connect(repo_opts, opts) do
    # Sleep a random amount (up to max_delay ms) before each connection is
    # established, so connections are spread out over time rather than all
    # hitting the DNS round robin at the same instant.
    # Note: Logger.debug/1 requires `require Logger` in the module.
    max_delay = Keyword.get(opts, :max_delay, 2_000)
    sleep_time = :rand.uniform(max_delay)
    Logger.debug("pre_connect sleep: #{sleep_time}")
    Process.sleep(sleep_time)
    repo_opts
  end

This has worked well enough to keep us afloat until we come up with a real solution to this problem. Even using this hack, the load is not perfectly distributed, but it is much better than without it.

Hope that helps!

schabou commented 4 years ago

I'll give this a whirl! This should tide us over; we're trying to reduce our overall load / connections to the DB cluster, so we just need something to bridge that gap. Thanks so much for the help @jvoegele @josevalim

michaeldg commented 4 years ago

Hi,

I think two issues are being mixed here: 1) a delay (timeout) before opening or regaining a connection, and 2) recycling connections after they have been used (enough).

Point #1 is useful and is usually implemented by supporting 'warm up' commands: command(s) that are executed once a connection is established, plus another parameter for something to be executed when a connection is retrieved from the pool. With that in place, you could set 'SELECT SLEEP(1)' as the start-up command to implement a one-second sleep. In my opinion, this is out of scope for this issue.

Point #2 is what this issue is about: a useful feature that most DB drivers already have. I work mostly with MariaDB and MySQL, so my comment might be a bit MySQL-ecosystem oriented, but I think most of it applies to other RDBMS flavors as well. There are various reasons to implement this:
a) Memory leaks in any part of the chain (application, connection pool, driver, load balancer, proxy, database server) are worked around.
b) Dynamic configuration parameters that have session scope keep their old value until the connection is re-established.
c) Without recycling, changed circumstances such as no-longer-working user credentials can go unnoticed until the application restarts.
d) Some failover mechanisms (for example an L4 load balancer with direct server return) rely on connections being re-established to promote a new node to primary, or to do a switchover (planned failover).
e) The protocol between the (MySQL) client and (MySQL) server has nothing built in to keep the connection alive. On quiet systems this can lead to wait_timeout being reached unnoticed, the connection having disappeared, and therefore an error when that connection is reused.

TL;DR: please implement connection recycling (after x executed queries, or after x connection lifetime); it is a useful feature.

josevalim commented 4 years ago

Yes, originally the issue was about 2., but some comments related to 1. have also appeared. This is still a desired feature, so if anyone wants to implement it, please let us know and we will be happy to provide guidance.

fishcakez commented 4 years ago

In the meantime, when using MariaDB/MySQL with MyXQL, it is possible to work around c) and d) with the disconnect_on_error_codes start_link/repo option (off by default). If facing c), you may also need the configure option to handle dynamic rotation of credentials, but even combined that may still not be enough if you can't disconnect on any permission error. It would also still require seeing the error in order to recover - similarly, when facing d), there will still be one error per connection.

For e) there is the idle_timeout option that can work around it.

FWIW, we see a) - too many prepared queries combined across a cluster's connection pools - and have been able to shrink the connection pool size to work around it, and d) - master/read-only failover not switching unless the application is restarted (we use mariaex).

jeremyjh commented 4 years ago

My interest is in #2.a unfortunately - we have a slow memory leak in the postgres process related to the pg connection lifetime. On days without deployments we see our postgres memory usage grow quite high over the course of a few hours. Restarting the app process releases all that memory in postgres so I imagine cycling the connections would as well. I found this issue while looking for some kind of :max_lifetime setting on the pool members as a workaround. If no one is working on it I'd be willing to do so, but would definitely appreciate some guidance.

The one thought I have is that we should not implement strictly the same lifetime in each member of the pool; some percentage of the total lifetime should be used to randomly distribute reconnects over time.

stavro commented 4 years ago

@fishcakez @josevalim Do you have any recommendations on the code structure for this change?

Perhaps something along the lines of: the expiration timestamp is stored in the connection state, and in the ConnectionPool, during check-in, if the connection has been living too long, we attempt to disconnect it before it is handed to new requests?

We really need this behavior, and I wouldn't mind someone from our team taking a whirl at it if we can align on what your expectations might be.

josevalim commented 4 years ago

Honestly, I am not sure about the best way to implement it without actually trying to implement it. But it may also be that the best way to implement it is in the connection itself (for example, in Postgrex).

michaeldg commented 4 years ago

Implementing it is quite easy: whenever a connection is retrieved from the pool, the system checks whether the connection is too old or has executed too many transactions. Implementing it here makes it work that way independent of which RDBMS is behind it. In my opinion that is the best way to go.

itayadler commented 3 years ago

Can this be linked to the tcp:closed exception, @ymtszw? We get a bunch of those in production at random times, and I suspect it's also related to the issue you're describing; we also use the Amazon Aurora PostgreSQL serverless flavour.

mathiasose commented 3 years ago

I'm also very interested in this. We've actually been seeing some multi-minute outages because we essentially have a big memory leak in the database, since the connections all hog memory that they never release. The database reaches a threshold where it's near running out of memory and starts evicting processes and whatnot, making it temporarily unavailable to the application.

We're looking into how to handle out-of-memory-problems in the DB better, but it would also be nice to have the option to set max-age on connections on the application side.

josevalim commented 3 years ago

A PR for this would be appreciated. Here is a simple approach based on what @michaeldg proposed. All changes will happen in DBConnection.Connection:

  1. On handle_info connected/after_connect, compute when the connection will go stale
  2. On the pool_update, check whether the connection has gone stale by comparing the current time with the stale time; if it has, return {:disconnect, ..., ...}

Other than that, we need a test. :)
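
A very rough sketch of those two steps (the :max_lifetime option, the state field, and the helper names here are hypothetical, not existing DBConnection internals):

# 1. when the connection (re)connects, remember when it should go stale
defp stale_deadline(opts) do
  case Keyword.get(opts, :max_lifetime) do
    nil -> :infinity
    ms -> System.monotonic_time(:millisecond) + ms
  end
end

# 2. on pool_update, disconnect once the deadline has passed
defp check_stale(%{stale_at: :infinity} = state), do: {:ok, state}

defp check_stale(%{stale_at: stale_at} = state) do
  if System.monotonic_time(:millisecond) >= stale_at do
    err = DBConnection.ConnectionError.exception("connection exceeded its max lifetime")
    {:disconnect, err, state}
  else
    {:ok, state}
  end
end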

jeremyjh commented 3 years ago

@mathiasose I thought it might be worth sharing that one source of our leaks in Postgres was periodic refreshes of materialized views. Certain data structures are never released from memory in the connection's process in the DB; I'm not sure what that has to do with refreshing the views, but it was easy to reproduce and isolate memory growth associated with those statements. Once we moved all those refreshes to short-lived processes we haven't had any issues with memory in our Postgres database.

josevalim commented 3 years ago

A similar feature to the one requested here has been implemented here: https://github.com/elixir-ecto/db_connection/pull/240

For example, if you want connections to live at most for 15 minutes, you can have a separate process that calls DBConnection.disconnect_all/3 in a loop every 15 minutes. Having a function like above is also good if you want to trigger reconnection on specific moments.

gheorghina commented 3 years ago

A similar feature to the one requested here has been implemented here: #240

For example, if you want connections to live at most for 15 minutes, you can have a separate process that calls DBConnection.disconnect_all/3 in a loop every 15 minutes. Having a function like above is also good if you want to trigger reconnection on specific moments.

I was working on custom monitoring as well last week. Wouldn't it make more sense to disconnect all connections only if required? I would be interested in this feature at an increased frequency to reduce the downtime period, but I would like to avoid having all db_connections recreated every 30 or 60 seconds. Would you have concerns about always disconnecting frequently even if not necessary?

josevalim commented 3 years ago

@gheorghina the original case which made me implement #240 is to disconnect all - but only when required. They would use their infrastructure and rpc calls to force disconnections. So that should be supported on #240. Disconnecting even if not necessary should probably be avoided - but that seems to be what some people need. :)

gheorghina commented 3 years ago

@gheorghina the original case which made me implement #240 is to disconnect all - but only when required. They would use their infrastructure and rpc calls to force disconnections. So that should be supported on #240. Disconnecting even if not necessary should probably be avoided - but that seems to be what some people need. :)

"when required" means something else for me. In AWS after an RDS restart we can end up with a subset of ECS tasks having their db connections hanging with DB network connection prio to restarts, making them unhealthy since they cannot serve any ecto request.

Therefore I was thinking it is worth checking somehow whether the db_connections are actually not working before deciding to disconnect and recreate all of them at a higher frequency. I am not sure if there are scenarios where only a subset of db_connections are not working, but in my case, every time I reproduce the problem it is all or nothing.

josevalim commented 3 years ago

If you are using Postgres, then you can ask it to reconnect when it does writes in read-only transactions, which is typically the case in those situations. The latest version supports multiple endpoints to make it swifter too.

gheorghina commented 3 years ago

If you are using Postgres, then you can ask it to reconnect when it does writes in read-only transactions, which is typically the case in those situations. The latest version supports multiple endpoints to make it swifter too.

We have both Postgres and MariaDB, but as per the above, I understand this can also happen with Aurora.

josevalim commented 3 years ago

MyXQL is similar with disconnect on error codes but it doesn't have built-in multi-host support. The behaviour is the same though: they should all reconnect automatically if you configure them to do so.

gheorghina commented 3 years ago

MyXQL is similar with disconnect on error codes but it doesn't have built-in multi-host support. The behaviour is the same though: they should all reconnect automatically if you configure them to do so.

OK, then we will wait for the release of #240. Thank you!

michaeldg commented 3 years ago

In my opinion #240 is not a solution for this issue. In a busy application, if there are hundreds or maybe even a thousand connections closed at once, and all those connections reconnect at once, this would cause a hiccup. Also, what if some connections are busy? Would disconnect_all wait for all those queries to finish running?

Hopefully the same underlying infrastructure can be used to implement this necessary feature (for MariaDB, MySQL, and possibly other RDBMSs).

josevalim commented 3 years ago

Also, what if some connections are busy? Would disconnect_all wait for all those queries to finish running?

Yes, please check the docs. It disconnects all connections within a time frame once they are used and/or pinged. So you could give it 30 seconds and they would all disconnect randomly within that interval.

I took the time to do the work, so I would appreciate if there is some time put into reading the docs and reviewing the code before stating it isn't an appropriate solution. :)

And for those willing to give it a try, you should be able to call the new API directly, even if you are using Postgrex/MyXQL/etc. If anyone runs into any issue, please let me know!

spencerdcarlson commented 3 years ago

The Postgrex.start_link options documentation helped me find what I was looking for:

:disconnect_on_error_codes - List of error code atoms that when encountered will disconnect the connection. This is useful when using Postgrex against systems that support failover, which when it occurs will emit certain error codes e.g. :read_only_sql_transaction (default: []);
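
A minimal sketch of wiring that up when starting Postgrex directly (host and credentials are placeholders):

{:ok, conn} =
  Postgrex.start_link(
    hostname: "writer.cluster-xxxx.us-east-1.rds.amazonaws.com",
    username: "postgres",
    password: "postgres",
    database: "app_prod",
    # Disconnect when a write fails against the demoted, now read-only primary,
    # so the reconnect can pick up the new primary via the updated DNS record.
    disconnect_on_error_codes: [:read_only_sql_transaction]
  )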

josevalim commented 3 years ago

Did anyone give the PR above a try? Any comments?

visciang commented 3 years ago

A PR for this would be appreciated. Here is a simple approach based on what @michaeldg proposed. All changes will happen in DBConnection.Connection:

  1. On handle_info connected/after_connect, compute when the connection will go stale
  2. On the pool_update, check whether the connection has gone stale by comparing the current time with the stale time; if it has, return {:disconnect, ..., ...}

Other than that, we need a test. :)

@josevalim I'm not sure it will work as expected.

I've done a quick test triggering the "connection max lifetime" logic in handle_cast :connected/:after_connect. The connected / after_connected events are triggered when a connection is established and then not again, unless the connection is lost / broken. After that, handle_cast(:ping) kicks in periodically and calls the pool_update() function.

ref: https://github.com/elixir-ecto/db_connection/compare/master...visciang:master

If we hook in there, the result is something like:

  1. the pool starts
  2. connections go :connected (here we establish the connection_end_of_life)
  3. the pool_ping periodic task runs and calls the pool_update() function
  4. pool_update disconnects the connection when it expires

So the connection recycling is driven by the pool_ping period, not by the pool checkout action as suggested by @michaeldg. Alternatively, we could correlate the pool_ping period (idle_interval - defaults to 1s) with the max_lifetime, but that's not a good path to take in my opinion.

(Connection recycling on checkout is the default behavior I've seen in Python's SQLAlchemy, for example - pool_recycle.)

I hope to find some time to get a better picture of the pool checkout flow, to understand whether it's possible to easily hook in there, and to review https://github.com/elixir-ecto/db_connection/pull/240

josevalim commented 3 years ago

Yes, we need to consider when they are pinged and/or used. The current pool design is decentralized, which means the pool only knows about the connections it has in itself, so that’s the only option available. That’s what was implemented in the PR.

josevalim commented 3 years ago

I don’t think doing it on checkout is beneficial, because you don’t want to check out something only to find out it is not ready. The less latency on checkout, the better.

michaeldg commented 3 years ago

Although I agree with the principle, on MySQL/MariaDB systems creating a connection is extremely fast. A lot of applications (PHP for example) create a connection for each web request. This does not add significant latency to the request.

visciang commented 3 years ago

Yes @josevalim , makes sense!

The only small difference between applying the check on check-in versus on checkout is that in the first case you can check a still-young connection back into the pool, not use it for a long period, then check it out and end up working with an old (now expired) connection; whereas if you apply the logic on checkout, you are sure the connection is not past its deadline when acquired, though it could still expire while being used. And for sure, as you said, we would then move and pay the reconnect latency at checkout.

Anyway, closing the connection on ping and check-in / checkout is a best effort to recycle the idle connection as soon as its age is over the limit. I'm personally OK with the check-in approach.

What's more, for me this feature is only a workaround for the process-based connection management of Postgres: the worker process behind a long-lived connection can allocate more and more memory, and there's no way to work around this on the Postgres side (as far as I know).

I'll "play" with #240.

Just a very personal note about the solution implemented in #240.

As a user of the library (directly or indirectly via postgrex) I would like to get this feature "batteries included", without the need to have a separate process (user application space) that calls DBConnection.disconnect_all/3 periodically.

It would be great to have this (optional) periodic task running in the DBConnection app supervisor.

The user experience could be something like:

Postgrex.start_link([
  name: :db,
  hostname: "localhost",
  username: "postgres",
  password: "postgres",
  database: "postgres",
  ...
  pool_size: 5,
  # (or max_lifetime:), where pool_recycle defaults to nil (the current behavior, no connection recycle)
  pool_recycle: 60 * 60 * 1_000
])

What do you think? It shouldn't be too complex (famous last words) or outside the library's "responsibility" domain.

josevalim commented 3 years ago

@michaeldg good point! Not the case for PG and TDS, afaik.

josevalim commented 3 years ago

@visciang agreed on making it built into the pool, but it probably deserves some proper testing before it is made first class. :)

kentbull commented 2 years ago

A similar feature to the one requested here has been implemented here: #240

For example, if you want connections to live at most for 15 minutes, you can have a separate process that calls DBConnection.disconnect_all/3 in a loop every 15 minutes. Having a function like above is also good if you want to trigger reconnection on specific moments.

How do I get the pool value from the existing, auto-configured pool so that in my own process I can pass it to disconnect_all/3?

josevalim commented 2 years ago

Hi @kentbull, I am not sure I get the question, but that's likely because I no longer have the context in my brain. Can you please try again providing more info on what you want to achieve? Thanks.

kentbull commented 2 years ago

Yes. My goal is to recycle each of the connections in the default connection pool provided by Postgrex to Ecto. Correct me if I am understanding the default behavior incorrectly. The reason to recycle all of the connections is to clear out, on a regular basis, the DB-server-side cache of objects associated with each connection. Each of these objects consumes increasing amounts of RAM on the PostgreSQL DB server due to the process model of PostgreSQL, so my client's app, which connects with a high connection count (1500+), continually increases the amount of RAM each connection uses as the lifetime of the database connection increases.

I believe I mis-remembered the function spec. It looks like I don't need access to the pool value. It's the conn value instead, the first argument to disconnect_all/3:

def disconnect_all(conn, interval, opts \\ []) when interval >= 0 do
  pool = Keyword.get(opts, :pool, DBConnection.ConnectionPool)
  interval = System.convert_time_unit(interval, :millisecond, :native)
  pool.disconnect_all(conn, interval, opts)
end

I see I just need access to the DB connection conn state. How do I get access to this state? I am thinking I need to manually add things to my application's process tree in application.ex so I can essentially wrap the database connection process with a GenServer that would allow me to get the conn state and pass it into a GenServer that triggers disconnect_all/3. Would this work? Or do you have a better solution in mind?

Additional Context

From what I've read and deduced, the high connection count coupled with the long lifetime of connections is one of the causes of my current problem. The RAM of the DB server gets eaten up and crosses a health threshold (95%), and my cloud provider automatically triggers a failover event, which then causes cascading failures across the rest of my cluster of containers until the DB server restarts and the DB connection pools are recreated.

Since the API server has high volume and high peak usage with various different and unique database objects, PostgreSQL copies each individual object into the connection cache, and thus the RAM consumed by each connection grows over time. For 1500 connections, even at 20MB per connection, this would be about 30GB of RAM. We are still profiling and digging into the precise memory usage per connection in the PostgreSQL server, though we know we were able to prevent failovers once we started restarting our Elixir container cluster. Each restart of our Elixir container cluster showed a corresponding decrease in the quantity of DB RAM used and then a further increase of RAM over time.


This is why we believe the increasing RAM usage of long-lived PostgreSQL connections is the root cause of our high RAM usage on the PostgreSQL server.

Once we implement disconnect_all/3, we expect the RAM usage to remain at a reasonable level, which should prevent the DB from failing over due to RAM exhaustion.

So, the solution seems to be to recycle database connections every fifteen minutes or so, or whatever the timeframe needs to be.

josevalim commented 2 years ago

@kentbull if you are using Postgrex directly, then you will do this:

{:ok, pid} = Postgrex.start_link(opts)
DBConnection.disconnect_all(pid, 15_000, opts)

If you are using Ecto, you will do this:

%{pid: pid, opts: opts} = Ecto.Adapter.lookup_meta(MyRepo)
DBConnection.disconnect_all(pid, 15_000, opts)

kentbull commented 2 years ago

Thank you @josevalim. We are currently pushing this fix to production and I'll report back on how it worked for us.

kentbull commented 2 years ago

This worked perfectly! Thank you @josevalim

Here's a memory graph of our DB instance.


The large spikes on the left are rolling restarts of our cluster, similar to the large spikes from automated failovers.

The small spikes and dips on the left are showing the results of implementing the DBConnection.disconnect_all/3 code.

As far as I am concerned this problem is solved.

Here's a version of the code (for posterity)


defmodule MyApp.DatabaseConnectionRecycler do
  use GenServer

  def start_link(_) do
    GenServer.start_link(__MODULE__, %{}, name: __MODULE__)
  end

  def init(state) do
    schedule_recycling()
    {:ok, state}
  end

  def handle_info(:recycle, state) do
    # MyRepo is the application's Ecto repo. Look up its pool pid and options,
    # then ask DBConnection to recycle every connection, spread out over a
    # 15-minute interval, and schedule the next pass.
    %{pid: pid, opts: opts} = Ecto.Adapter.lookup_meta(MyRepo)
    DBConnection.disconnect_all(pid, 15 * 60 * 1000, opts)
    schedule_recycling()

    {:noreply, state}
  end

  # Run a recycling pass once per hour.
  defp schedule_recycling do
    Process.send_after(self(), :recycle, 60 * 60 * 1000)
  end
end
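
If it helps anyone adopting this, the recycler just needs to be started under the application's supervision tree; a sketch using the module names from the snippet above:

# in lib/my_app/application.ex, inside start/2
children = [
  MyRepo,
  MyApp.DatabaseConnectionRecycler
]

Supervisor.start_link(children, strategy: :one_for_one, name: MyApp.Supervisor)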

josevalim commented 2 years ago

@kentbull I appreciate the feedback very much! Btw, if you are interested, it would be awesome if you could write a blog post about it. I bet people would be happy to read it. :)

bartekupartek commented 2 years ago

As I can see on the graph, the memory usage is growing over smaller intervals, which is preventing the DB server from failing over, but I'm curious how it affects transactions that are in progress. Is the disconnect_all reconnecting all connections or only idle ones? If all, wouldn't it be possible to disconnect only the idle ones?

Seeking the real reason for the growing memory size: it looks like Postgres caches queries per connection session and its garbage collector doesn't do anything with them. Please correct me if I'm wrong: when we restart a connection, it has to cold-start and begin caching again from the first occurrence of each new query? Assuming we would have short intervals between connection restarts, wouldn't prepare: :unnamed be an alternative, with the additional overhead of a single round trip combining prepare and execute, so that prepared queries are not cached at the connection level, thus preventing the cause of OOM over a long connection lifespan? Any other alternatives?

josevalim commented 2 years ago

Is the disconnect_all reconnecting all connections or only idle ones?

I have clarified this above and the docs also cover it, but it reconnects as connections are checked-in or pinged, over a time interval. So it won't interrupt any ongoing operation and it also won't disconnect everything immediately. It spaces them out over a controlled time period.

Please correct me if I'm wrong: when we restart a connection, it has to cold-start and begin caching again from the first occurrence of each new query?

Correct. You could use prepare: :unnamed, but this means no prepared statement caching whatsoever. For a busy server, there are a lot of cache hits you can get in an interval of 15 minutes.
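
For anyone weighing that trade-off, a minimal sketch of the setting in an Ecto repo config (repo and app names are placeholders):

config :my_app, MyApp.Repo,
  # Use the unnamed prepared statement for every query: nothing is cached
  # per connection on the server, at the cost of re-parsing each query.
  prepare: :unnamed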

bartekupartek commented 2 years ago

Thanks a ton. prepare: :unnamed indeed has too many cons, like performance or safety (SQL injection). It's worth mentioning that in a database with many schemas, memory usage is multiplied even for the same statements, and that would be hard to work around. My question is: would it be technically possible to share statement caching between connections for the same dbname?