edgurgel / verk

A job processing system that just verks! 🧛‍
https://hex.pm/packages/verk
MIT License

Resilient Verk - Fix scaling up and down #159

Closed edgurgel closed 5 years ago

edgurgel commented 6 years ago

Hey team, here is my first stab at solving this issue: https://github.com/edgurgel/verk/pull/159/files

The idea is:

(frequency) = 60 seconds?

On start, each node generates a new id (we can check if it's actually new by the result of SADD):
    SADD nodes node_id
    PSETEX verk:node:#{node_id} 2 * frequency alive
Each time a node starts working on a queue, the queue name is added to the "node:queues" set;
Each time a node stops working on a queue, the queue name is removed from the "node:queues" set;
Every frequency seconds we set the node key to expire in 2 * frequency:
    PSETEX verk:node:#{node_id} 2 * frequency alive
    We also check the keys of all nodes. If a key has expired, that node is dead.
    To restore, we go through all the running queues of that node and enqueue their jobs from in progress back to the queue. Each "enqueue back from in progress" is atomic (<3 Lua), so we won't have duplicates.
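
To make the steps above concrete, here is a minimal Python sketch of the same scheme, run against a tiny in-memory stand-in for Redis rather than a real server. Everything in it is an assumption for illustration, not code from the PR: the `FakeRedis` class, the function names, the per-node `verk:node:<id>:queues` key, the `queue:<name>` and `inprogress:<name>:<node_id>` list names, and the cleanup of a dead node's entries after restoring. PSETEX takes its TTL in milliseconds, so the sketch converts the 60 second frequency. In real Redis the "enqueue back" step would live in a Lua script to stay atomic; here it is a plain list operation.

```python
FREQUENCY_MS = 60_000  # the proposed 60 second frequency, in milliseconds


class FakeRedis:
    """In-memory stand-in for the few Redis commands the scheme needs."""

    def __init__(self):
        self.now_ms = 0      # manual clock, advanced by the caller
        self.sets = {}       # key -> set of members
        self.lists = {}      # key -> list of jobs
        self.expiring = {}   # key -> (value, expires_at_ms)

    def sadd(self, key, member):
        members = self.sets.setdefault(key, set())
        added = member not in members
        members.add(member)
        return 1 if added else 0  # like SADD: number of new members

    def srem(self, key, member):
        self.sets.get(key, set()).discard(member)

    def smembers(self, key):
        return set(self.sets.get(key, set()))

    def psetex(self, key, ttl_ms, value):
        self.expiring[key] = (value, self.now_ms + ttl_ms)

    def exists(self, key):
        entry = self.expiring.get(key)
        return entry is not None and entry[1] > self.now_ms


def node_key(node_id):
    return f"verk:node:{node_id}"


def queues_key(node_id):
    # Hypothetical key name for the "node:queues" set of one node.
    return f"verk:node:{node_id}:queues"


def heartbeat(r, node_id):
    # Refresh the node key so it expires in 2 * frequency.
    r.psetex(node_key(node_id), 2 * FREQUENCY_MS, "alive")


def register(r, node_id):
    # SADD returns 0 when the id is already known, so the id is not new.
    if r.sadd("nodes", node_id) == 0:
        raise ValueError(f"node id {node_id!r} is already in use")
    heartbeat(r, node_id)


def restore_dead_nodes(r):
    """Re-enqueue in-progress jobs of every node whose key has expired."""
    dead = []
    for node_id in sorted(r.smembers("nodes")):
        if r.exists(node_key(node_id)):
            continue  # key still present, node is alive
        for queue in r.smembers(queues_key(node_id)):
            # In real Redis this move would be one Lua script (atomic).
            jobs = r.lists.pop(f"inprogress:{queue}:{node_id}", [])
            r.lists.setdefault(f"queue:{queue}", []).extend(jobs)
        # Assumption: forget the dead node so it is not restored twice.
        r.sets.pop(queues_key(node_id), None)
        r.srem("nodes", node_id)
        dead.append(node_id)
    return dead
```

With a TTL of 2 * frequency and a heartbeat every frequency, a heartbeat that arrives a little late does not immediately get a node declared dead, at the cost of waiting up to two periods before its jobs are recovered.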

We may need to review some edge cases, like what happens if we still have unfinished jobs while removing a queue from the list of running queues, but I will work on them case by case.

I need to review this as clearly it's just a stab at the final solution. I've played with some instances running locally and so far so good.

Related to https://github.com/edgurgel/verk/issues/157

edgurgel commented 6 years ago

Hey! After releasing 1.4, my plan is to somehow introduce this as "experimental" so it won't affect current users and they can try it out. I should have something "ready" in 2 weeks? I need to figure out how to run integration tests.

tlvenn commented 6 years ago

Hi @edgurgel, any update?

edgurgel commented 6 years ago

@tlvenn, not really, but I intend to get back to this, probably next week. I need to find a simple way of making this optional for now so that we can release a non-major version 🤔

Maybe if no node_id was defined, Verk could generate one and keep track of these automatically generated ids, etc.

I also don't know which kind of configuration we should expose, for example:

And I'm not 100% sure how to run nice integration tests running 2 Verk instances etc...

Coordination is hard 😢

mikeastock commented 6 years ago

@edgurgel is this PR ready to be tested as is?

edgurgel commented 6 years ago

@mikeastock yes, it works as expected! I need to add more tests and settle a few other considerations. My goal is to have the next minor version ship with an option to use this to control your node ids. My "release date" is the end of October, maybe before that.

edgurgel commented 5 years ago

I said end of October but it will probably be mid November 😶

I'm adding some tests and ensuring this can be used as optional until it's robust enough to be used by all users.

coveralls commented 5 years ago

Coverage Status

Coverage decreased (-4.6%) to 83.333% when pulling 1949baec0bbe1e92496b07ed5fd110e3f9e02e8f on resilient-verk into 9365c712bc4062d09cf78a397bcc09aaf1b7494c on master.

coveralls commented 5 years ago

Coverage Status

Coverage increased (+0.4%) to 89.194% when pulling d7e674a821c97e8b2f6dc84b5d3698eb3ba9b324 on resilient-verk into 741ee4878cf653286ab2eff57f3e7885bd4b428e on master.

tlvenn commented 5 years ago

@edgurgel maybe a xmas gift in the end ? ;)

edgurgel commented 5 years ago

@tlvenn, you joke, but that's the plan :D! I will have some free time before Christmas haha :)

edgurgel commented 5 years ago

I'm very close btw! Happy New Year! 🎉

tlvenn commented 5 years ago

Happy new year to you too @edgurgel !

edgurgel commented 5 years ago

    SADD nodes node_id
    PSETEX verk:node:#{node_id} 2 * frequency

How to use:

If it's not true it won't use this new code. It will basically work as before.

edgurgel commented 5 years ago

I still have some minor things to change but the bulk of the work is done 👍