elixir-ecto / db_connection

Database connection behaviour
http://hexdocs.pm/db_connection/DBConnection.html

Adding in :idle_threshold configuration parameter to limit max idle time of connections #216

Closed zackmaril-pd closed 4 years ago

zackmaril-pd commented 4 years ago

Hello! We've had an issue at my company with idle connections timing out due to database connection resets. We did an investigation into this, identified the problem and we are opening this PR to see if you are interested in the solution we found. We've been testing this out internally and it is working well. We wanted to make sure we communicated with you early on to see if you were interested in accepting these changes.

Summary

We are proposing the addition of an idle_threshold configuration setting to the db_connection library that allows applications to specify the maximum amount of time that any connection in the connection pool can be idle. There's a non-obvious interaction between some of the existing configuration settings that causes these errors, and this new setting prevents those misconfigurations from occurring.

Motivation and Context

Over the past year or so, various teams have run into an issue with Elixir applications that connect to MySQL databases. When an application wasn't actively querying the database and the connections were allowed to idle for too long, the connections would start to be disconnected by the database.

So, what we were seeing was a bunch of errors about connections being reset, because the connections had been hitting the wait_timeout limit set by our database. We experimented and found that if the number of connections multiplied by the idle interval was greater than the wait_timeout, then we would start seeing the db_connection resets happening often. This did not meet our expectations at all. When we first read through the codebase of db_connection, we were under the impression that every idle interval all of the idle connections would be refreshed. Instead, it seemed like only one connection was getting refreshed during each idle interval, no matter how many connections were in the pool.
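As a concrete illustration of the failure condition (the numbers below are hypothetical, not from our production setup), with only one connection refreshed per interval, a full sweep of the pool takes pool_size * idle_interval milliseconds, and resets appear once that exceeds wait_timeout:

```elixir
# Hypothetical numbers to illustrate the failure condition.
pool_size = 20
idle_interval = 1_000   # ms between idle sweeps
wait_timeout = 10_000   # ms, the database's idle cutoff

# With only one connection refreshed per interval, sweeping the
# whole pool takes pool_size * idle_interval milliseconds.
full_sweep = pool_size * idle_interval

# The oldest connections exceed wait_timeout before being refreshed.
full_sweep > wait_timeout
#=> true: connections go stale and the database closes them
```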

After tracing the function calls, we did see only a single connection (or rarely two) getting refreshed once per interval. So, let's say that we have a pool of connections. The idle_interval timer would go off and drop_idle would be called, with the timestamp of the first element in the queue passed in as last_queued_in_native.

The first time drop_idle is called, everything works as expected. The timestamp of the first element in the queue is compared to the timestamp, last_queued_in_native, that was passed in to drop_idle, and the first connection in the queue is found to be idle. So that connection gets refreshed, and drop_idle recursively calls itself with those same arguments.

Here's the problem we ran into. The last_queued_in_native timestamp that is used is the timestamp of the first element in the queue. The queue is ordered by age, oldest to youngest, so that one connection will be refreshed but none of the other connections will be. The logic used to determine whether a connection will be refreshed depends on the guard statement below:

when queued_in_native <= last_queued_in_native and status == :ready ->

queued_in_native is the timestamp of the last time the first element in the queue was used or refreshed. The first time drop_idle gets called, queued_in_native is equal to last_queued_in_native, so the connection is refreshed. The second time drop_idle gets called, queued_in_native is larger than last_queued_in_native, because these are monotonically increasing timestamps and the second item in the queue is newer than the first. This means that only one connection gets refreshed every idle interval (or sometimes two at the very beginning, if the timestamps of the two connections happen to be exactly the same).

So, the behavior that we would expect from the logic above matches what we observed experimentally. Only one connection can ever get refreshed each idle interval, so if the number of connections you have multiplied by how long it takes to refresh each one (idle_interval) is greater than the wait_timeout, then you'll start seeing consistent, but not deterministic, timeouts of connections due to them going stale and getting closed by the database.
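To make the failure mode concrete, here is a small standalone sketch (not the library's actual code, which also dereferences connection holders and pings; the timestamps are hypothetical) that applies the same guard to a queue of monotonically increasing check-in timestamps:

```elixir
defmodule OldDropIdle do
  # Simplified model of the buggy logic: the queue is ordered oldest
  # to newest, and last_queued_in_native is fixed to the timestamp of
  # the *first* element before the sweep starts.
  def refreshed(queue) do
    last_queued_in_native = hd(queue)

    Enum.take_while(queue, fn queued_in_native ->
      queued_in_native <= last_queued_in_native
    end)
  end
end

# Three idle connections, checked in at strictly increasing timestamps:
OldDropIdle.refreshed([500, 600, 700])
#=> [500] -- only the head of the queue ever satisfies the guard
```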

idle_threshold

We think this happens because it isn't clear what the criteria are for a connection to be considered idle and in need of a refresh. Our response is to introduce a clear, configurable definition of what it means for a connection to be idle. No more thinking about the number of connections or the length of the interval. All an application developer would need to consider is the maximum time they are comfortable having their connections idle for. The problematic guard statement from above would be changed to something like the following:

when (time - queued_in_native) > idle_threshold and status == :ready ->

This takes the amount of time the connection has been idle (the difference between time and queued_in_native), and if that is larger than idle_threshold, we refresh the connection. In our local experiments, this has been shown to work with many hundreds of connections and relatively long idle_intervals. It's also much easier to document and understand, and it matches some people's expectations about how a connection pooling library should behave.
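Again as a standalone sketch (illustrative only, with hypothetical timestamps rather than real monotonic-clock values), the proposed guard refreshes every connection whose idle time exceeds the threshold, independent of its position relative to the head of the queue:

```elixir
defmodule ThresholdDropIdle do
  # Proposed logic: compare each connection's idle time against a
  # configurable idle_threshold rather than against the head's timestamp.
  def refreshed(queue, time, idle_threshold) do
    Enum.take_while(queue, fn queued_in_native ->
      time - queued_in_native > idle_threshold
    end)
  end
end

# At time 2_000 with a 1_000 ms threshold, both stale connections
# (idle for 1_500 and 1_400 ms) qualify; the fresh one (idle 100 ms)
# does not.
ThresholdDropIdle.refreshed([500, 600, 1_900], 2_000, 1_000)
#=> [500, 600]
```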

This PR contains the code we've put together so far to implement idle_threshold. It's working internally for us and hasn't run into any problems yet.

josevalim commented 4 years ago

Hi @zackmaril-pd, thank you for the great write up. 💯 This is a bug, :idle_interval was supposed to behave as you described. After reading your detailed report, I think the solution is rather to remove these two lines:

https://github.com/elixir-ecto/db_connection/blob/d1a5a6575469228b7be41ecc89230a18906c83e4/lib/db_connection/connection_pool.ex#L132-L133

Imagine we have two connections, they were checked in at timestamps 500 and 600 (ms).

At moment 0, we emit the first interval check. We will check at interval 1000 if there is any timestamp older than 0. When interval 1000 arrives, we see that 500 is more than 0, so we don't drop anything, and we schedule another check to see if there is a connection older than 500 (the last queued-in time). Then, by the time the next interval check happens at timestamp 2000, we drop only 500.

However, what we want is this: at moment 0, we emit the first interval check. We will check at interval 1000 if there is any timestamp older than 0. When interval 1000 arrives, we see that 500 is more than 0, so we don't drop anything, and we schedule another check to see if there is a connection older than 1000 (the current time).

Removing those lines will force that.
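The difference between the two schedules can be sketched as follows (hypothetical timestamps, simplified to a plain filter over check-in times rather than the library's actual queue code):

```elixir
# Connections checked in at timestamps 500 and 600 (ms).
checkins = [500, 600]

# Buggy schedule: the next sweep's cutoff is the last queued-in time,
# so at t=2_000 only timestamps <= 500 are considered idle.
buggy_cutoff = 500
Enum.filter(checkins, &(&1 <= buggy_cutoff))
#=> [500]

# Fixed schedule: the next sweep's cutoff is the time the check was
# scheduled (t=1_000), so at t=2_000 both idle connections are dropped.
fixed_cutoff = 1_000
Enum.filter(checkins, &(&1 <= fixed_cutoff))
#=> [500, 600]
```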

I will push a commit soon and I would really appreciate it if you could try it out and let us know. If it works, I will ship a new release. The commit looks a bit more elaborate than removing two LOC, but that's because I slightly refactored the code so we no longer pass three arguments to start_idle.

I was also able to verify the broken behaviour in tests. I could see that 99% of the time we would never ping two connections. With my fix, we do ping both 99% of the time. Unfortunately, it is hard to write a test because of the remaining 1%, so I dropped a comment instead.

zackmaril-pd commented 4 years ago

Great!! I've tested your fix on my test cases and found no issues. I've also asked other teams internally to test your fix out and will send a message if anything comes up for them.

zackmaril-pd commented 4 years ago

Update: a team tried using this fix and it clearly fixed the issue for them. I also saw that 2.2.2 was released and so we are recommending that everyone use that release moving forward. Thank you for fixing this so quickly!

josevalim commented 4 years ago

Awesome, glad to hear it is working well!