contribsys / faktory_worker_ruby

Faktory worker for Ruby
GNU Lesser General Public License v3.0

Jobs are not being fetched for certain queues #65

Open motymichaely opened 2 years ago

motymichaely commented 2 years ago

We have an issue in which jobs are not being processed.

We are running a Faktory worker in Ruby 2.7.2 using a configuration file:

$ bundle exec faktory-worker -C config/faktory/worker.yml

The worker.yml file looks like this:

:queues:
  - jobs
:concurrency: 10
:timeout: 25

An example JobWorker class:

class JobWorker
  include Faktory::Job

  def perform(arg1, arg2)
    puts "Arg1: #{arg1}"
    puts "Arg2: #{arg2}"
  end
end

Pushing a job to Faktory:

jid = SecureRandom.hex(12)
Faktory::Client.new.push(
  jid: jid,
  queue: "jobs",
  jobtype: "JobWorker",
  args: ["arg1", "arg2"]
)

The job is stuck in the jobs queue. I would expect the job to be processed by the worker.

I suspect it's related to the fact that JobWorker doesn't declare a queue option, but our requirement is to have JobWorker processed by several processes pulling from different queues.

What am I missing?
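For reference, here is a minimal sketch of what a class-level default queue could look like, assuming FWR's faktory_options class method behaves like its Sidekiq counterpart; the queue: field passed to Faktory::Client#push should still be what routes a job pushed that way:

class JobWorker
  include Faktory::Job
  # Hypothetical default: jobs enqueued via perform_async would go to "jobs".
  faktory_options queue: "jobs"

  def perform(arg1, arg2)
    puts "Arg1: #{arg1}"
    puts "Arg2: #{arg2}"
  end
end

# Enqueued with the class-level default queue:
JobWorker.perform_async("arg1", "arg2")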

mperham commented 2 years ago

I don't see anything obviously wrong. 🤷🏼‍♂️

motymichaely commented 2 years ago

@mperham - How can I debug this? Is there any way to get a list of jobs per queue?

mperham commented 2 years ago

I'm not sure I understand? You can go to the Queues tab and click the name of the queue to view the currently enqueued jobs.

You haven't said which Faktory or FWR version you are using - have you checked the changelogs and/or upgraded to the latest to see if that fixes your issue?
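If a programmatic check is easier than the web UI, something along these lines should list queue depths. This is only a sketch: it assumes Faktory::Client#info returns the parsed INFO payload and that per-queue sizes live under the "faktory" => "queues" keys, which may vary by server version:

require "faktory"

# Ask the server for its INFO payload and print the size of each queue.
info = Faktory::Client.new.info
queues = info.dig("faktory", "queues") || {}
queues.each { |name, size| puts "#{name}: #{size}" }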

motymichaely commented 2 years ago

@mperham - the FWR version is 1.1.0 (the latest) and the Faktory server is Faktory Enterprise 1.6.0, also the latest.

In the Faktory server web UI I can see the queue and the jobs, but I suspect the issue is on the client side: the jobs are not being fetched.

I tried calling get_track for one of the jobs in the queue, and its state comes back as unknown even though I can see it in the queue (through the web UI):

Faktory::Client.new(debug: true).get_track("1ef1c6777b9be3585acaf187")
< +HI {"v":2,"i":4545,"s":"7aa6160a43951fff"}
> HELLO {"wid":"","hostname":"f61e7a94-387c-40a6-8fae-3c50edcd009b","pid":3,"labels":["ruby-2.7.2"],"username":"test","v":2,"pwdhash":"redacted"}
< +OK
> TRACK GET 1ef1c6777b9be3585acaf187
< $68
=> {"jid"=>"1ef1c6777b9be3585acaf187", "state"=>"unknown", "updated_at"=>""}


motymichaely commented 2 years ago

@mperham - Any chance it's related to throttles? The throttle for this queue is set to 5, yet I see 0 Free and 0 Taken in the UI.

Is there any way to detect if there are some orphan throttled workers?

mperham commented 2 years ago

Absolutely it could be throttles. You didn't mention that the queue in question had a throttle. 0 Free and 0 Taken is a bad sign, I believe those two numbers should always add up to 5 in your case. It's possible the throttle tokens are leaking somehow. What version of Faktory are you running?

motymichaely commented 2 years ago

Enterprise 1.6.

How can I debug leaked throttle tokens?


mperham commented 2 years ago

The latest release is 1.5.5, so I'm confused.

To investigate, you'll need to open up the underlying Redis dataset and inspect the structures. If you have a throttle for queue "foo", the relevant keys are:

List free:foo (available tokens)
ZSet taken:foo (tokens in use)
Hash throttle:foo (metrics)

Those counts tell me that free and taken are both empty, meaning the tokens within them are gone somehow. If you restart Faktory, it should reset the throttle and restore the missing tokens.

Can you show me your throttle configuration?

motymichaely commented 2 years ago

@mperham - Thanks.

About the release: the /debug page reports Faktory Enterprise 1.6.0.

Not sure if I did it right, but I wasn't able to get any data for those keys for one of our throttled queues named mq_api:

redis /var/lib/faktory/db/redis.sock> LRANGE "free:mq_api" 0 -1
(empty array)
redis /var/lib/faktory/db/redis.sock> ZRANGE "taken:mq_api" 0 -1
(empty array)
redis /var/lib/faktory/db/redis.sock> HKEYS throttle:mq_api
(empty array)

Here's a snippet of our throttle configuration:

[throttles.mq_api]
concurrency = 5
timeout = 3600

I will now restart Faktory to see if that solves the issue.

motymichaely commented 2 years ago

Restarting the Faktory server solved the issue. Now it's worth asking what would cause throttle tokens to go missing.

mperham commented 2 years ago

A bug of some sort. I will review the code tomorrow. Thanks for confirming.


mperham commented 2 years ago

How busy is this queue and throttle? Are you hitting it all the time or will it sit idle for weeks? It's possible for the lock array to expire out from under you, but that requires it to be idle for 30 days...

motymichaely commented 2 years ago

The queue doesn't seem to be busy at all, and yes, it's likely to sit idle for 30 days. Is there any way to control that?

mperham commented 2 years ago

Ok, that's good news and an easy fix. I will extend the TTL to 365 days in the next version.

motymichaely commented 2 years ago

Great. Thank you for looking into it!