arturictus / sidekiq_alive

Liveness probe for Sidekiq in Kubernetes deployments
MIT License
194 stars 57 forks

Remove found_keys.blank? for paginated results #59

Closed lucas-aragno closed 2 years ago

lucas-aragno commented 2 years ago

This PR is somewhat related to this issue. We were having this problem in our environments, and we tried the approach described in this comment.

We found out that the deep_scan method doesn't work as expected.

It looks like, if it doesn't find a match on the "first page" of the scan, it returns an empty result:

deep_scan("SIDEKIQ_REGISTERED_INSTANCE::*")
===== next cursor 3968  found keys []
=> [

(we added the puts "===== next cursor #{next_cursor} found keys #{found_keys}" to the method)

Since next_cursor wasn't "0" and no keys were found, it just halted and returned [], which is strange, since it should keep scanning rather than stop when found_keys is blank?

Nevertheless, I think the only termination condition for deep_scan should be that the cursor is "0", if we go by the SCAN description from the Redis docs:

SCAN family functions do not guarantee that the number of elements returned per call are in a given range. The commands are also allowed to return zero elements, and the client should not consider the iteration complete as long as the returned cursor is not zero.
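That contract can be sketched without a real Redis. In the snippet below, `FakeRedis` is an invented stub (not part of the gem) whose SCAN returns an empty page mid-iteration, showing why any implementation that stops on an empty found_keys, rather than on cursor "0", would miss keys:

```ruby
# Sketch only: FakeRedis stands in for a real Redis client; the page data
# is invented to include an empty page mid-iteration.
class FakeRedis
  # cursor => [next_cursor, keys_on_this_page]
  PAGES = {
    "0"  => ["7",  []],                                 # empty page, scan NOT done
    "7"  => ["42", ["SIDEKIQ_REGISTERED_INSTANCE::a"]],
    "42" => ["0",  ["SIDEKIQ_REGISTERED_INSTANCE::b"]]  # next cursor "0" => done
  }.freeze

  def scan(cursor, match: nil, count: nil) # match/count ignored by this stub
    PAGES.fetch(cursor.to_s)
  end
end

# Correct iteration: the ONLY stop condition is a returned cursor of "0".
def scan_all(redis, keyword)
  keys = []
  cursor = 0
  loop do
    cursor, found_keys = redis.scan(cursor, match: keyword)
    keys += found_keys
    break if cursor.to_i == 0
  end
  keys
end

p scan_all(FakeRedis.new, "SIDEKIQ_REGISTERED_INSTANCE::*")
# => ["SIDEKIQ_REGISTERED_INSTANCE::a", "SIDEKIQ_REGISTERED_INSTANCE::b"]
```

A version that bailed out on the first empty found_keys would return [] here, which is exactly the behavior reported above.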

Besides that, it seems to only return results if they are found on the first iteration; it looks like it will also break if it needs to perform further iterations. You can see this by updating a spec in sidekiq_alive_spec.rb:

  it '::registered_instances' do
    (1..100).step(5) do |n|
      SidekiqAlive.redis.set("#{n}-value", Time.now.to_i, ex: 60)
    end
    expect(SidekiqAlive.registered_instances).to eq []
    SidekiqAlive.register_current_instance
    expect(SidekiqAlive.registered_instances.count).to eq 1
    expect(SidekiqAlive.registered_instances.first).to include 'test-hostname'
  end

This will generate some extra keys in Redis before registering and looking up the instance. The spec then fails with the following error:

=== found keys []  next cursor "10"
::registered_instances (FAILED - 1)

Failures:

  1) SidekiqAlive ::registered_instances
     Failure/Error: return keys if next_cursor == "0" || found_keys.blank?

Because the key was after the first 10, next_cursor == "0" was false, so it went on to call found_keys.blank? on [], and Array does not support the .blank? method in plain Ruby (it comes from ActiveSupport).
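A quick check confirms this: with no ActiveSupport loaded, Array has no #blank?, so the condition itself raises:

```ruby
# Plain Ruby, no ActiveSupport: #blank? is not defined on Array,
# so calling it raises NoMethodError, matching the spec failure above.
found_keys = []
result =
  begin
    found_keys.blank?
  rescue NoMethodError
    :no_method_error
  end
puts result # prints "no_method_error"
```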

So our solution was to remove || found_keys.blank? altogether, and it seems to be performing well so far. I think that part of the condition introduces more issues than it fixes: most folks will get the same output when the results fit on the first page, and this change better supports multi-page scans.

arathunku commented 2 years ago

We're also running into this issue and this PR resolves it.

arathunku commented 2 years ago

After almost 2 weeks of testing this patch in production, we ran into (what look like) infinite deep_scan calls on Redis v4, and we had to roll it back.

irb(main):028:0> SidekiqAlive.registered_instances
/usr/src/app/vendor/bundle/ruby/3.0.0/gems/redis-4.5.1/lib/redis/connection/ruby.rb:36:in `read': stack level too deep (SystemStackError)
irb(main):029:1* module SidekiqAlive
irb(main):030:2*     def self.deep_scan(keyword, keys = [], cursor = 0)
irb(main):031:2*     next_cursor, found_keys = *redis { |r| r }.scan(cursor, match: keyword)
irb(main):032:2*     keys += found_keys
irb(main):033:2*     return keys if next_cursor == "0" || found_keys.blank?
irb(main):034:2*     deep_scan(keyword, keys, next_cursor)
irb(main):035:1*     end
irb(main):036:1*
irb(main):037:0> end
=> :deep_scan
irb(main):038:0> SidekiqAlive.registered_instances
=> []

This is because we have a ton of keys inside redis, and it's running out of stack.
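Roughly illustrated (the depth below is illustrative, not the gem's code): plain Ruby has no tail-call optimization by default, so each recursive deep_scan call adds a stack frame, and enough pages will overflow the VM stack:

```ruby
# Each SCAN page costs one stack frame in the recursive version; with
# enough keys (pages), Ruby raises SystemStackError, as seen above.
def recursive_pages(remaining, acc = [])
  return acc if remaining.zero?
  recursive_pages(remaining - 1, acc << remaining)
end

overflowed =
  begin
    recursive_pages(200_000) # far deeper than the default VM stack allows
    false
  rescue SystemStackError
    true
  end
puts overflowed # prints "true" with default stack settings
```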

Changing this into a loop instead of a recursive call should resolve the issue. I'll be doing more tests on that.

lucas-aragno commented 2 years ago

@arathunku I actually found the same issue as you after testing this fix and came up with this (1000 worked for us but maybe you want to make it configurable):

    keys = []
    cursor = 0
    loop do
      cursor, found_keys = SidekiqAlive.redis.scan(cursor, match: keyword, count: 1000)
      keys += found_keys
      break if cursor.to_i == 0
    end
    keys

It's working alright for us. I may update this PR with the fix today. I haven't done that yet because I hadn't seen any participation from maintainers, tbh.

lmk if that helps.
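Wrapped into a method with the page size made configurable, it might look like the sketch below. The `FakeRedis` stub and the `count:` keyword default are illustrative, not the gem's API; a real call would go through SidekiqAlive.redis:

```ruby
# Sketch: iterative deep_scan with a configurable SCAN count hint.
# FakeRedis is a stand-in so this runs without a Redis server.
class FakeRedis
  def initialize(keys, page_size: 2)
    @pages = keys.each_slice(page_size).to_a
  end

  # Mimics SCAN: returns [next_cursor, page]; next cursor "0" means done.
  def scan(cursor, match: nil, count: nil)
    i = cursor.to_i
    next_cursor = (i + 1 < @pages.size) ? (i + 1).to_s : "0"
    [next_cursor, @pages[i] || []]
  end
end

def deep_scan(redis, keyword, count: 1_000)
  keys = []
  cursor = 0
  loop do
    cursor, found_keys = redis.scan(cursor, match: keyword, count: count)
    keys += found_keys
    break if cursor.to_i == 0
  end
  keys
end

redis = FakeRedis.new(%w[key1 key2 key3 key4 key5])
p deep_scan(redis, "*") # => ["key1", "key2", "key3", "key4", "key5"]
```

Because the loop uses constant stack space, the number of keys in Redis no longer matters.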

arathunku commented 2 years ago

It will 100% help! Thanks for the snippet, I'll test it too; it shouldn't run out of stack anymore :) I think the test instance I was running it on had over 60k keys.

We're running on a custom fork with additional changes due to: https://github.com/arturictus/sidekiq_alive/pull/40#discussion_r752892803

lucas-aragno commented 2 years ago

If this works for you lmk and I'll update this PR and see if we can get it merged on master!

arathunku commented 2 years ago

@lucas-aragno we've been using it on test and prod instances for over a week now, 0 problems :)

lucas-aragno commented 2 years ago

@arathunku perfect! I'll update this PR soon.

arturictus commented 2 years ago

Hi @lucas-aragno and @arathunku, I've been out for some time now, sorry for not answering. What is the conclusion for this PR?

The code that is working for both is this one?

    keys = []
    cursor = 0
    loop do
      cursor, found_keys = SidekiqAlive.redis.scan(cursor, match: keyword, count: 1000)
      keys += found_keys
      break if cursor.to_i == 0
    end
    keys

Which means that this PR is invalid or has to be updated, if I understand correctly.

Thanks for the help

lucas-aragno commented 2 years ago

Hey @arturictus! I still have to update this PR with that last code. I'll do that today and re-request the review. Thank you!

lucas-aragno commented 2 years ago

@arathunku @arturictus I just rebased and pushed the new changes. Could you folks take a look?

arathunku commented 2 years ago

Looks OK to me, that's the version we've been using on prod since January. Thank you @lucas-aragno and @arturictus for helping with this

lucas-aragno commented 2 years ago

Great! Then I think it's up to @arturictus now :)

arturictus commented 2 years ago

Thanks @lucas-aragno and @arathunku. Merging and releasing now