arturictus / sidekiq_alive

Liveness probe for Sidekiq in Kubernetes deployments

Redis connection timeout issue in sidekiq_alive (occasionally) during shutdown of sidekiq #100

Closed: koconnor-ampion closed this issue 6 months ago

koconnor-ampion commented 10 months ago

Hi All,

We are running sidekiq_alive 2.3 with sidekiq 7.0.8 (and Rails 6.1.7.4). This is running outside Kubernetes in Google Cloud Platform's AppEngine.

We have our deployment set up so when a new version is deployed, the current (soon to be old) version running sidekiq is put into quiet mode and left to finish any jobs it is currently running. This action causes sidekiq_alive to call unregister_current_instance.

Occasionally, the unregister_current_instance call fails with a SidekiqAlive: Timeout::Error. This does not happen every time, but it does get picked up by our Honeybadger integration and raises questions about system health (and it leaves the old host's queues in place, since our shutdown_callback never gets a chance to run).

This really isn't the end of the world as you can probably guess, but I think that there is a relatively easy solution.

The problem appears to be related to the concurrency setting for sidekiq and the number of connections (defined by the concurrency in sidekiq_alive) available in the capsule created for sidekiq_alive.

If the sidekiq_alive concurrency (fixed to 2 at the moment) is less than the concurrency for sidekiq (very often the case), the worker threads of sidekiq will allocate all of the sidekiq_alive connections to the various sidekiq threads.

Normally this doesn't cause any problems: the connections are used and returned to the pool quickly, and the 1-second timeout to retrieve one from the pool is fine. However, unregister_current_instance can occasionally end up starved of connections and time out.

If you agree that this is a reasonable hypothesis, I believe there is a relatively simple solution: allow the concurrency for sidekiq_alive to be configurable (still defaulting to 2), so a system can set the sidekiq_alive concurrency to one more than its sidekiq concurrency. That way there will always be a Redis connection available for shutdown.

Environments that feel their Redis connections are a constraint can leave the default and run as they currently do.

The pool used by sidekiq is mperham's connection_pool gem, and it only creates new connections when they are first requested, so the 'extra' connection for shutdown is not created until it is actually needed.
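To make the starvation concrete, here is a small sketch using the connection_pool gem directly (not sidekiq_alive code; the pool size and timeout just mirror the current capsule settings):

require "connection_pool"
require "redis"

# Same shape as the sidekiq_alive capsule pool: 2 connections, 1-second checkout timeout.
pool = ConnectionPool.new(size: 2, timeout: 1) { Redis.new }

# Connections are created lazily: nothing talks to Redis until the first checkout.
holders = 2.times.map do
  Thread.new do
    pool.with do |redis|
      redis.ping
      sleep 2 # hold both connections for a while
    end
  end
end

sleep 0.1
begin
  pool.with { |redis| redis.ping } # third checkout waits 1 second, then gives up
rescue ConnectionPool::TimeoutError => e # a subclass of Timeout::Error
  puts "starved: #{e.message}"
end

holders.each(&:join)

With a pool size of 3, the third slot would never open a Redis connection unless something actually checked it out, which is why a default-plus-one setting costs nothing while idle.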

I can provide a simple PR for this if desired.

Kevin

diff --git a/lib/sidekiq_alive.rb b/lib/sidekiq_alive.rb
index 5f104f6..7879c02 100644
--- a/lib/sidekiq_alive.rb
+++ b/lib/sidekiq_alive.rb
@@ -20,7 +20,7 @@ module SidekiqAlive

           if Helpers.sidekiq_7
             sq_config.capsule(CAPSULE_NAME) do |cap|
-              cap.concurrency = 2
+              cap.concurrency = config.concurrency
               cap.queues = [current_queue]
             end
           else
diff --git a/lib/sidekiq_alive/config.rb b/lib/sidekiq_alive/config.rb
index bb48896..a05e618 100644
--- a/lib/sidekiq_alive/config.rb
+++ b/lib/sidekiq_alive/config.rb
@@ -15,7 +15,8 @@ module SidekiqAlive
                   :server,
                   :custom_liveness_probe,
                   :logger,
-                  :shutdown_callback
+                  :shutdown_callback,
+                  :concurrency

     def initialize
       set_defaults
@@ -33,6 +34,7 @@ module SidekiqAlive
       @server = ENV.fetch("SIDEKIQ_ALIVE_SERVER", "webrick")
       @custom_liveness_probe = proc { true }
       @shutdown_callback = proc {}
+      @concurrency = Integer(ENV.fetch("SIDEKIQ_ALIVE_CONCURRENCY", 2), exception: false) || 2
     end

     def registration_ttl
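With that change applied, opting in would look something like this (the value 11 is purely illustrative, one more than a sidekiq concurrency of 10; config.concurrency and SIDEKIQ_ALIVE_CONCURRENCY only exist once the diff above is in):

# config/initializers/sidekiq_alive.rb (hypothetical; relies on the change above)
SidekiqAlive.setup do |config|
  # One more than the app's sidekiq concurrency, so a connection is always
  # left over for unregister_current_instance at shutdown.
  config.concurrency = 11
end

# Or via the proposed environment variable, with no code changes:
#   SIDEKIQ_ALIVE_CONCURRENCY=11 bundle exec sidekiq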
andrcuns commented 10 months ago

@koconnor-ampion

the worker threads of sidekiq will allocate all of the sidekiq_alive connections to the various sidekiq threads

Is this true? I was under the impression that a capsule-specific pool would only be used by that capsule's threads (in our case I think we would technically need just one connection, because only one sidekiq_alive job runs at a time). But maybe I am mistaken.

I am experiencing this same issue occasionally as well (I initially added 2 threads by default just to be safe), but my app has plenty of connections.

But in any case, I'm supportive of adding this as a configurable parameter.

koconnor-ampion commented 10 months ago

@andrcuns I'll admit I was surprised too. The debugging of it was rather a bear. I ended up adding a bit of logging to the connection pool gem (in frustration to be sure).

When two connections were in the pool, I could see each of them get pulled out as the sidekiq threads supporting the Google App Engine health check started (I had configured that check to use the same check as sidekiq_alive). It is a quick check, as you know, so the connections were returned, and usually there was no problem for the quiet handler at shutdown. Occasionally, though, both were in use and the quiet handler hit the timeout.

To be honest, the insight came when I started playing with the two concurrencies and found that if the sidekiq concurrency was more than the sidekiq_alive concurrency, all the connections in the sidekiq_alive capsule's pool got consumed. If I made the sidekiq_alive concurrency one more than the sidekiq concurrency, there was always one left for the quiet handler, even if it was never needed. (The extra connection would never be created either, due to the lazy behavior of the connection_pool gem.)

So it appears to be our use of the same alive test for Google App Engine that made us experience this occasional connection starvation. Without that, it is very unlikely for anyone to run into this and the default of 2 would be fine.

Thanks, though, for being supportive of the additional configuration. I'll put that diff above in a PR for you to look at.

Kevin

andrcuns commented 10 months ago

So it appears to be our use of the same alive test for Google App Engine that made us experience this occasional connection starvation. Without that, it is very unlikely for anyone to run into this and the default of 2 would be fine.

That's the main part that confuses me, because I run into the same problem with a pretty barebones app that is idle most of the time. But what I pinpointed is that I get this error during app shutdown. So it might be the same error with a different cause. I have some suspicion that our lifecycle handling here in sidekiq_alive doesn't work correctly and it tries to acquire a connection when sidekiq itself has already shut down the pool, but this is just a guess.

koconnor-ampion commented 10 months ago

I have some suspicion that our lifecycle handling here in sidekiq_alive doesn't work correctly and it tries to acquire a connection when sidekiq itself has already shut down the pool, but this is just a guess.

Hmm, that is an interesting theory. I admit I did not check the shutdown of the pool, but I expected the capsule to still be around (and that certainly appears to be the case) and that the pool would not be shut down until the capsule was deallocated.

I'll keep an eye out for that and try running a few tests looking at that specifically. Maybe I'll catch something interesting.

andrcuns commented 10 months ago

Anyhow, feel free to submit a PR to make this option configurable and check whether it fixes your issue.

koconnor-ampion commented 10 months ago

Thanks, @andrcuns, there you are, ready for review, I believe.

andrcuns commented 8 months ago

@koconnor-ampion Did the additional configuration solve the issue for you? I am still getting the occasional timeout error with 2 threads configured.

koconnor-ampion commented 8 months ago

I feel like we are seeing it less, but it still happens. At least for us as an organization, it is not a very high priority. We have only seen it happen when we are shutting down an instance. The effect is that the queue for that sidekiq_alive instance is never deleted from Redis (and we get a Honeybadger report). We can go in and clean up the old queues, but the host name part of the queue is a non-conventional string picked up from the GCP AppEngine instance where the sidekiq_alive instance runs, so we have to be careful with the deletion process.

I'm rather disappointed that this did not work. The logging I had temporarily added to the pool seemed to indicate that each thread would only pull one connection and return it when finished, so adding the extra one seemed to be the answer to having a connection available even if all the other threads were busy.

Frustrating, but really just a niggling issue that pulls my attention every once in a while when I get annoyed with those errors or someone says: hey, what about these? I'd really be interested if you find a solution as well.

Kevin


andrcuns commented 8 months ago

We have only seen it happen when we are shutting down an instance

Yeah, this is exactly when I see this error as well. But because I have a continuous deployment setup, it does end up being triggered rather often.

I have a suspicion this is due to performing the instance unregister and shutdown callback twice; I don't think that should be the case: https://github.com/arturictus/sidekiq_alive/blob/9d26a9bbe1701726b945e679ae81c39c94cbe0b3/lib/sidekiq_alive.rb#L42
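Roughly this shape, as a sketch (not the actual sidekiq_alive source, just the suspected double invocation, assuming the unregister is wired to both lifecycle events):

# Hypothetical sketch: handlers wired to both lifecycle events mean the
# unregister (and the shutdown callback) run twice, once when the process
# is quieted and again when it shuts down.
Sidekiq.configure_server do |config|
  config.on(:quiet) do
    SidekiqAlive.unregister_current_instance # first call, on quiet
  end

  config.on(:shutdown) do
    SidekiqAlive.unregister_current_instance # second call, on shutdown
  end
end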

koconnor-ampion commented 8 months ago

I have a suspicion this is due to performing instance unregister and shutdown callback twice,

You may be onto something there... We also have a deployment workflow (triggered for production, but continuous for staging). When we deploy a new version, we quiet the sidekiq instances for the old version. Then we have a process that runs once an hour and interrogates all the instances of quieted versions; if all the instances of a quieted version are no longer busy, they are shut down. I do note that both shutdown and quiet call the unregister and the shutdown callback. Ours are fairly well separated in time, but maybe yours are not.


koconnor-ampion commented 8 months ago

Your email made me look at sidekiq this morning. I noticed that there was a recent (3 days ago) merge having to do with Redis connection pooling: https://github.com/sidekiq/sidekiq/pull/6118. I'm pretty sure I recall that sidekiq_alive creates its own capsule, though, so maybe this is a red herring.


andrcuns commented 8 months ago

That was just a documentation fix, so I doubt it is related

koconnor-ampion commented 8 months ago

Well, that will teach me to look into the PR details before getting excited!


andrcuns commented 8 months ago

I'm fairly certain this issue is related to how we handle sidekiq signals.

When sidekiq is shutting down, I think it might be shutting down the capsule connection pool:

[2023-12-02 10:57:24 +0200] INFO  -- Shutting down
[2023-12-02 10:57:24 +0200] INFO  -- Scheduler exiting...
[2023-12-02 10:57:24 +0200] INFO  -- Terminating quiet threads for default capsule
[2023-12-02 10:57:24 +0200] INFO  -- Terminating quiet threads for sidekiq-alive capsule
[2023-12-02 10:57:24 +0200] INFO  -- Scheduler exiting...
[2023-12-02 10:57:24 +0200] DEBUG -- Firing quiet event
[2023-12-02 10:57:24 +0200] INFO  -- Shutting down sidekiq-alive!

Because sidekiq-alive will still try to use the sidekiq-alive capsule's connection pool to perform shutdown operations, it can end up without any connection, hence the error: https://github.com/mperham/connection_pool#shutdown. I think the solution is to either use the internal capsule for shutdown operations or maybe reload the pool (I think the former is probably better and more correct).
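For reference, a standalone sketch of that connection_pool behavior (whether our Timeout::Error actually comes from this exact path is still a guess):

require "connection_pool"
require "redis"

pool = ConnectionPool.new(size: 2, timeout: 1) { Redis.new }
pool.with { |redis| redis.ping } # fine while the pool is alive

# Roughly what happens to the capsule pool when sidekiq terminates it.
pool.shutdown { |redis| redis.close }

begin
  pool.with { |redis| redis.ping }
rescue ConnectionPool::PoolShuttingDownError => e
  puts "checkout after shutdown fails: #{e.class}"
end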

koconnor-ampion commented 8 months ago

Good sleuthing. I had been looking at when connections were pulled from and returned to the pool; I hadn't thought of looking at when they were shut down. I'm guessing that reloading a shut-down pool is not the best solution either. An alternative to the internal capsule might be to pull a connection after startup and hold onto it, since the doc indicates existing checkouts will still work, but that does seem rather wasteful. The internal capsule is more likely the better path.


andrcuns commented 6 months ago

Should be fixed by #109

koconnor-ampion commented 6 months ago

Awesome, thanks for the update! We will be taking a look.


koconnor-ampion commented 5 months ago

We have had this running for a couple of weeks now and have not seen this issue. Thanks @andrcuns