Restarting Faktory removes non-empty queues from UI and statsd metrics

contribsys / faktory

Language-agnostic persistent background job server

https://contribsys.com/faktory/

Restarting Faktory removes non-empty queues from UI and statsd metrics #298

Closed: jdreaver closed this issue 4 years ago

jdreaver commented 4 years ago

Hello!

We run Faktory Pro on ECS via AWS EC2. We have an EFS file system that holds all of the Faktory file system state. We periodically upgrade the underlying EC2 instance, which of course requires us to restart Faktory.

We noticed that when Faktory restarts, the queue list UI at https://faktory.freckle.com/queues gets cleared. It shows no queues, and we only see the queues after they get more jobs enqueued to them. It appears that the queues are not actually cleared under the hood, because our job consumers are still running and accepting jobs (we see them running DB queries, doing work, and reporting progress). However, we can't see the queues in the UI. Also, the statsd metrics for the missing queues are no longer emitted.

Is this known behavior? Should the UI always reflect the latest state of all queues, even after a restart?

We are on the latest Faktory Pro version 1.4.0.

mperham commented 4 years ago

Yep, this is a known issue. Since queue names don't have a well-known prefix in Redis, I can't scan for them on startup. As jobs push, schedule or retry, Faktory will learn the current queue set and since Faktory is designed to run 24/7, the thought was that this period of ignorance should be infrequent and short.

How often are you restarting Faktory?

jdreaver commented 4 years ago

We restart once per week to ensure we are on the latest Amazon Linux version. We could definitely update Amazon Linux less often, but we err on the side of more frequent updates so we can diagnose any potential issues more quickly.

Some of the job queues in question are populated once per night or once per week. We noticed this issue because our weekly job is running a bit slow, and the restart at the beginning of the week caused the queue to disappear from the UI.

mperham commented 4 years ago

The quick and easy hack workaround is to add your queues to the Statsd latency list:

https://github.com/contribsys/faktory/wiki/Pro-Metrics#latency

Faktory will know about it so it should always appear in the Web UI. If the queues are mostly empty, you won't see much overhead in checking latency.

jdreaver commented 4 years ago

Sounds good!

Can I make a feature request here then? :smile: Would it be possible to add a way to automatically track latency for all jobs? All of our queues are empty or close to empty most of the time, so I don't think it will be a big performance hit for us.

mperham commented 4 years ago

That's possible although it treads the line of "not a best practice" because latency checks are relatively expensive and if you have 100s of queues, the overhead can become significant. In other words, it starts to look like a footgun at scale. That's what I'm paid to worry about...

jdreaver commented 4 years ago

That makes sense. We can try to DRY our list of queue names in our automation for the time being, or just whitelist some queues we really care about.

jdreaver commented 4 years ago

I found another workaround: I went to the Busy tab, found one instance of our job executing, and clicked on the job name. This took me to https://faktory.freckle.com/queues/<job-name>. Once I went back to https://faktory.freckle.com/queues, the queue was back and showed all the enqueued jobs.

mperham commented 4 years ago

Ah yes, constructing the queue URL is perfect. No overhead at all, you can script it and use curl right after restart.
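For anyone who wants to script this, here is a minimal sketch of the idea in Python rather than curl. It only relies on the behavior described above (requesting `/queues/<name>` makes the queue show up again). The base URL, the queue names, and the helper names `queue_urls` and `touch_queues` are all placeholders I made up for illustration, and it does not handle a password-protected Web UI, so treat it as a starting point rather than a tested recipe.

```python
"""Sketch: request each known queue page in the Faktory Web UI after a restart."""
import sys
import urllib.error
import urllib.parse
import urllib.request

# Assumed values: replace with your own Web UI address and queue names.
BASE_URL = "http://localhost:7420"
QUEUES = ["default", "nightly", "weekly"]


def queue_urls(base_url, queues):
    """Build one /queues/<name> URL per queue, percent-encoding each name."""
    base = base_url.rstrip("/")
    return [f"{base}/queues/{urllib.parse.quote(name, safe='')}" for name in queues]


def touch_queues(urls, timeout=10.0):
    """GET each URL; return {url: HTTP status, or error text on failure}."""
    results = {}
    for url in urls:
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                results[url] = resp.status
        except urllib.error.HTTPError as exc:
            # The server answered, but with an error status (e.g. 401, 404).
            results[url] = exc.code
        except (urllib.error.URLError, OSError) as exc:
            # The server could not be reached at all.
            results[url] = str(exc)
    return results


if __name__ == "__main__":
    outcome = touch_queues(queue_urls(BASE_URL, QUEUES))
    for url, status in outcome.items():
        print(f"{status}\t{url}")
    # Exit non-zero if any queue page did not return 200.
    sys.exit(0 if all(s == 200 for s in outcome.values()) else 1)
```

Running this once from the restart automation, after Faktory is accepting connections, keeps the queue list in one place, which also covers the "DRY our list of queue names" point mentioned earlier.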
