edgurgel / verk

A job processing system that just verks! 🧛‍
https://hex.pm/packages/verk
MIT License

Mention how to configure the node_id for Heroku #155

Closed · myobie closed this 6 years ago

coveralls commented 6 years ago


Coverage remained the same at 87.892% when pulling e3b6aba2121a1b0a014e1dba73098fde888cf042 on myobie:patch-1 into 5d24f6673a184b1c4f04f5d828f8e3f3189b47df on edgurgel:master.

edgurgel commented 6 years ago

Thanks, @myobie !

sebastianseilund commented 6 years ago

_It is possible that two dynos with the same name could overlap for a short time during a dyno restart._

I'm curious about what happens in that case. As far as I can tell from the code, the second dyno will put all in-progress jobs back on the queue, which will cause those jobs to run at least twice (or more, if they fail or this happens multiple times). Is that just how it's supposed to be, or am I missing something?

Also, what happens if you scale down the number of dynos (say from 4 to 3)? Any jobs in progress on dyno 4 (with the DYNO env variable set to jobs.4) might be aborted before they finish, and they would be stuck in the inprogress list for a node_id that won't come back up (any time soon, at least).
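
(For reference, the Heroku setup I'm describing boils down to deriving node_id from the DYNO variable, something like the sketch below, assuming node_id is read from Verk's application config and that the config is evaluated at boot on the dyno:)

```elixir
# Rough sketch: DYNO is set by Heroku to values like "web.1" or "jobs.4";
# falling back to "1" mirrors the default mentioned below.
config :verk, node_id: System.get_env("DYNO") || "1"
```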

Similarly, I'm curious how other people use the node_id config when running in e.g. Kubernetes (or anywhere else where you have ephemeral Docker containers that come and go).

With node_id defaulting to "1", I imagine that every single new container that comes online will re-enqueue all in-progress jobs.

Any guidance would be greatly appreciated. Thanks! :)

edgurgel commented 6 years ago

Ideally you want to use node_ids that are unique.

> I'm curious about what happens in that case. As far as I can tell from the code, the second dyno will put all in-progress jobs back on the queue, which will cause those jobs to run at least twice (or more, if they fail or this happens multiple times). Is that just how it's supposed to be, or am I missing something?

Exactly.

> Also, what happens if you scale down the number of dynos (say from 4 to 3)? Any jobs in progress on dyno 4 (with the DYNO env variable set to jobs.4) might be aborted before they finish, and they would be stuck in the inprogress list for a node_id that won't come back up (any time soon, at least).

The jobs from dyno 4 will be lost until dyno 4 comes back. There's nothing in place to guarantee that these jobs will be picked up by someone else (say node 3). We could mitigate this issue by moving jobs back to the queues before shutdown, but an unclean shutdown would still leave Verk unprotected.

Because Verk instances have no coordination, it's hard to decide and control who owns what :(

Please feel free to propose a solution to redesign this part! I went for simplicity and left the job of keeping things consistent to the user :P

sebastianseilund commented 6 years ago

Thank you for taking time to answer this :)

Before I read deeper into Verk, I assumed that tasks were RPOPLPUSHed (or similar) from a queue list to an inprogress list that is not node-specific, and that some mechanism would move tasks from inprogress back to the queue after a timeout (which could be configured per queue). That would mean that a task has to finish within this timeout, or it will be executed twice (because we believe the first run was aborted for some reason). This mechanism could run on all nodes, and I imagine it could periodically scan the inprogress list for tasks that have exceeded their timeout and push them back to the queue.
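
Very roughly, I'm picturing something like the sketch below (using Redix directly; the key names and the 60-second timeout are made up, and none of this is what Verk actually does today):

```elixir
# Sketch of the idea above: a shared (non node-specific) inprogress list plus a
# sorted set of start times, so any node can requeue jobs that exceeded their
# timeout. All key names are hypothetical, not Verk's actual keys.
defmodule RequeueSketch do
  @timeout_ms 60_000

  # Consumer side: atomically move a job from the queue to the shared
  # inprogress list, then record when it started. (The two commands together
  # aren't atomic, but that's enough to show the idea. On success the worker
  # would LREM/ZREM its own job; not shown here.)
  def dequeue(conn) do
    case Redix.command!(conn, ["RPOPLPUSH", "queue:default", "inprogress:default"]) do
      nil ->
        nil

      job ->
        Redix.command!(conn, ["ZADD", "inprogress:started_at", to_string(now()), job])
        job
    end
  end

  # Reaper side: any node can run this periodically to push timed-out jobs
  # back onto the queue (ignoring races between concurrent reapers).
  def requeue_expired(conn) do
    cutoff = now() - @timeout_ms

    conn
    |> Redix.command!(["ZRANGEBYSCORE", "inprogress:started_at", "-inf", to_string(cutoff)])
    |> Enum.each(fn job ->
      Redix.command!(conn, ["LREM", "inprogress:default", "1", job])
      Redix.command!(conn, ["ZREM", "inprogress:started_at", job])
      Redix.command!(conn, ["LPUSH", "queue:default", job])
    end)
  end

  defp now(), do: System.system_time(:millisecond)
end
```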

That's how products like SQS/RabbitMQ work, I believe (at least from the consumer side).

From what I can tell, both Exq and Toniq work the same way as Verk, which is why I get the feeling that I'm misjudging this as a problem. But I really can't tell. The fact that jobs from dyno 4 in the example above would be lost sounds problematic to me, especially if you scale up and down a lot. I don't see a way to use verk/exq/toniq in e.g. Kubernetes with rolling upgrades (where unique and persistent node_ids sound impossible) without either using a random node_id (which would leave the jobs of any killed container stuck) or a static node_id (which would cause all in-progress jobs to be re-queued and run multiple times on every container start).

What do you think? How do you use Verk yourself in your deployed applications?

edgurgel commented 6 years ago

> This mechanism could run on all nodes, and I imagine it could periodically scan the inprogress list for tasks that have exceeded their timeout and push them back to the queue.

This is an option yes!

> That's how products like SQS/RabbitMQ work, I believe (at least from the consumer side).

SQS has that feature, but AFAIK RabbitMQ has no timeout for consuming a message. It will just wait for the channel/connection to drop and then enqueue the messages back onto the queue.

> What do you think? How do you use Verk yourself in your deployed applications?

We currently don't auto-scale Verk, but it's something that we would love to have. Right now our load is very predictable, so it has never pushed us to solve this issue.

We could have some sort of "heartbeat" for each registered instance_id (a random node id + IP + host + PID?), and once this heartbeat is X seconds old (maybe using a TTL on Redis), enqueue its jobs back? Keep a set of all instance_ids somewhere to know which ones need to be checked? Remove them once they are cleaned up? (I probably need to think more about this, TBH)
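
Very roughly, something along these lines (just a sketch of the idea, nothing Verk does today; all key names and the 30-second TTL are made up):

```elixir
# Sketch of the heartbeat idea: every instance refreshes a TTL'd key and joins
# a set of known instance_ids; any instance can sweep dead ones and requeue
# their inprogress jobs. All key names and the TTL are hypothetical.
defmodule HeartbeatSketch do
  @ttl_seconds 30

  # Called periodically by each running instance (well within the TTL).
  def beat(conn, instance_id) do
    Redix.command!(conn, ["SADD", "verk:instances", instance_id])
    Redix.command!(conn, ["SETEX", "verk:heartbeat:#{instance_id}", to_string(@ttl_seconds), "alive"])
  end

  # Can run on any instance: if a heartbeat key has expired, the instance is
  # considered dead, so move its inprogress jobs back and forget about it.
  def sweep(conn, queue) do
    for id <- Redix.command!(conn, ["SMEMBERS", "verk:instances"]),
        Redix.command!(conn, ["EXISTS", "verk:heartbeat:#{id}"]) == 0 do
      requeue_all(conn, "inprogress:#{queue}:#{id}", "queue:#{queue}")
      Redix.command!(conn, ["SREM", "verk:instances", id])
    end
  end

  # Drain one node-specific inprogress list back onto the queue.
  defp requeue_all(conn, inprogress, queue) do
    case Redix.command!(conn, ["RPOPLPUSH", inprogress, queue]) do
      nil -> :ok
      _job -> requeue_all(conn, inprogress, queue)
    end
  end
end
```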

If you have any suggestions please feel free to contribute :)

sebastianseilund commented 6 years ago

You're right about RabbitMQ - I forgot that clients keep a connection open over AMQP.

I'll open a new issue to see if we can find a good solution to support hosts that come and go + auto-scaling. If we can, I'd be happy to contribute.