elixir-ecto / db_connection

Database connection behaviour
http://hexdocs.pm/db_connection/DBConnection.html

Dyno load (CPU load) spikes with db_connection >= 2.2.2 #265

Closed: jesseahouser closed this issue 2 years ago

jesseahouser commented 2 years ago

Issue Description

Following some dependency package upgrades, our team observed dyno load spikes a few times per day. Prior to these upgrades, we had experienced years of consistent, normal dyno load levels.

Context and Details

This Elixir app is hosted in a Heroku Private Space, and dyno load is their measure of CPU load (https://devcenter.heroku.com/articles/metrics#dyno-load). Dyno load maxes out for approximately 15-30 minutes with each spike, then returns to normal load on its own. We are running two web servers, and the spikes affect both, are unsynchronized, and are not dependent on traffic. This Elixir app maintains connections to three different Postgres databases with a pool size of 40 each.
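
For reference, the pool size for each database lives in the app's Ecto repo configuration; a minimal sketch of how one of the three pools might be set up (the app name, repo module, and env var are assumptions, not our actual code):

```elixir
# config/runtime.exs (repo module and env var names are hypothetical)
import Config

config :my_app, MyApp.Repo,
  url: System.get_env("DATABASE_URL"),
  # one of three Postgres pools, each configured with 40 connections
  pool_size: 40
```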

To Reproduce

We began an investigation to narrow down which package(s) might be related to the issue. This revealed that the spikes are tied to db_connection: we can reproduce the behavior by changing only the db_connection version to >= 2.2.2 in mix.lock.
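
One way to run such an A/B comparison is to pin db_connection explicitly in mix.exs and let `mix deps.get` regenerate mix.lock; a sketch under assumed surrounding deps, not our actual file:

```elixir
# mix.exs (illustrative deps list)
defp deps do
  [
    {:ecto_sql, "~> 3.4"},
    {:postgrex, ">= 0.0.0"},
    # db_connection is a transitive dependency, so an override is needed to pin it;
    # switch between 2.2.1 (no spikes) and 2.2.2 (spikes observed) to compare
    {:db_connection, "2.2.1", override: true}
  ]
end
```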

Screenshot

(Image: dyno_load_db_connection)

josevalim commented 2 years ago

2.2.2 contains only one commit, which makes sure all connections are pinged within the idle_interval. You can see the initial report here: https://github.com/elixir-ecto/db_connection/pull/216

Are you sure that you need a pool of 40 connections? If they are not being used, it means that they will indeed be pinged every second. You can consider either increasing the idle interval or decreasing the pool size. We can also add an idle threshold configuration if you really believe the pool size is justified.
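
Both suggestions map to ordinary repo options; a minimal sketch of where they would go (the app name, repo module, and values are illustrative assumptions, not recommendations):

```elixir
# config/prod.exs (illustrative values only)
import Config

config :my_app, MyApp.Repo,
  # option 1: shrink the pool if most connections sit idle
  pool_size: 20,
  # option 2: ping idle connections less aggressively (the default is 1_000 ms)
  idle_interval: 30_000
```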

josevalim commented 2 years ago

Also: awesome job on isolating the issue and great report!

jesseahouser commented 2 years ago

@josevalim Thank you for your reply and suggestions. We have an update to report.

Considerations

The following currently-available options were considered:

  1. Decrease pool_size. This is not an attractive option for our use case, as those resources are projected to be needed during peak periods.
  2. Increase idle_interval. This is a lower-risk option for our use case. db_connection <= 2.2.1 (which pinged a single idle connection per idle_interval) was working well, and the app’s connections to its Postgres databases were not negatively impacted in the way that #216 described. We hypothesized that increasing idle_interval by an order of magnitude or two would not hurt performance and might eliminate the dyno load spikes we observed with db_connection >= 2.2.2.

Actions and results

  1. While on db_connection 2.2.1, we increased idle_interval from the default (1000 ms) to 100000 ms using an environment variable, similar to the way we specify pool_size (see the config sketch after this list). This had no discernible adverse effect.
  2. We then upgraded to db_connection 2.4.2, keeping idle_interval at 100000 ms, and monitored dyno load. Over the past three days we have observed no dyno load spikes and no negative performance impact; any spikes that may still occur are too short-lived to be noticeable or harmful.
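
For completeness, a sketch of the environment-variable-driven configuration described in step 1 (the variable names, repo module, and fallback defaults are assumptions):

```elixir
# config/runtime.exs (hypothetical names; defaults mirror the values described above)
import Config

config :my_app, MyApp.Repo,
  url: System.get_env("DATABASE_URL"),
  pool_size: String.to_integer(System.get_env("POOL_SIZE") || "40"),
  # raised from the 1000 ms default to 100000 ms to reduce ping overhead
  idle_interval: String.to_integer(System.get_env("IDLE_INTERVAL") || "100000")
```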

Conclusion

We’ll continue to monitor dyno load and performance, but thus far we have reasonable confidence that this solution (increasing idle_interval) meets our current needs. If an idle_limit option is implemented in a future release per c1791c7, it would offer an additional level of control over pinging idle connections, which, given our experience, we would see as a benefit.

Thank you again for your communication and contributions!