514-labs / moose

The developer framework for your data & analytics stack
https://www.moosejs.com
MIT License
44 stars 8 forks

Improved handling of redis-based leadership lock #1898

Closed cjus closed 4 days ago

cjus commented 5 days ago

The key updates are:

Tested with local multi-moose instances and lock handoff between processes.

vercel[bot] commented 5 days ago

The latest updates on your projects. Learn more about Vercel for Git ↗︎

| Name | Status | Preview | Comments | Updated (UTC) |
| --- | --- | --- | --- | --- |
| framework-docs | ✅ Ready (Inspect) | Visit Preview | 💬 Add feedback | Nov 24, 2024 3:17am |
| framework-landing | ✅ Ready (Inspect) | Visit Preview | 💬 Add feedback | Nov 24, 2024 3:17am |
| moose-logs-frontend | ✅ Ready (Inspect) | Visit Preview | 💬 Add feedback | Nov 24, 2024 3:17am |
| moose-product-analytics | ✅ Ready (Inspect) | Visit Preview | 💬 Add feedback | Nov 24, 2024 3:17am |
cjus commented 4 days ago

Shouldn't the leader do the leadership tasks no matter what?

Leadership tasks are only performed when an instance becomes the leader for the first time, i.e., when a process has newly acquired the leadership lock rather than merely renewed one it already held. That is the purpose of the is_new_acquisition flag. The key to this PR working is the presence of multiple process instances of moose running: leadership passes between instances as leaders come and go. With a single instance, the lock is never created "for the first time" while that process still holds it; only if the instance loses the lock does the lock expire and become available for a new acquisition.
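The acquire/renew distinction described above can be sketched as follows. This is a minimal in-memory stand-in for the Redis-backed lock; the names `LeadershipLock`, `try_acquire`, and `is_new_acquisition` are illustrative and are not moose's actual API.

```python
class LeadershipLock:
    """In-memory simulation of a TTL-based leadership lock (illustrative only)."""

    def __init__(self, ttl_seconds: float):
        self.ttl = ttl_seconds
        self._holder = None        # instance id of the current leader, if any
        self._expires_at = 0.0     # timestamp at which the lock expires

    def try_acquire(self, instance_id: str, now: float) -> tuple[bool, bool]:
        """Returns (is_leader, is_new_acquisition)."""
        if self._holder is None or now >= self._expires_at:
            # Lock is free or expired: brand-new acquisition, so the
            # one-time leadership tasks should run exactly here.
            self._holder = instance_id
            self._expires_at = now + self.ttl
            return True, True
        if self._holder == instance_id:
            # Already the leader: renew the TTL, but this is NOT a new
            # acquisition, so leadership tasks are skipped.
            self._expires_at = now + self.ttl
            return True, False
        # Someone else holds an unexpired lock.
        return False, False
```

Under this model a single instance that keeps renewing never sees `is_new_acquisition=True` again, while a second instance picks up leadership (and runs the one-time tasks) only after the first instance's lock expires.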

In long conversations with GitHub Copilot and the o1-preview LLMs, it was clear that they did not fully understand that multiple instances would be in play. When I introduced that concept, o1-preview identified the possibility that, because we set a unique ID based on the container ID, a restarted container might not realize it is the leader, due to how the ID ties into the key used to create the Redis lock. That said, it didn't realize that multiple moose instances would not run in the same container.

It seems to me that the earlier goal of using the container ID to identify the process instance ID isn't ideal. Do you think we should handle that edge-case in this or another PR?

This is a known hard problem: in distributed systems there will always be some window of uncertainty during leadership transitions. After deeper digging, it's possible to improve leadership detection by using a Lua script for the atomic operation. That's an improvement because Redis guarantees that Lua scripts execute atomically, so no other Redis operation can interleave with the script's execution.

These are proposed either as actions on this PR or as improvements for a fast follow.

callicles commented 4 days ago

I don't think I understand why container IDs don't work for identifying the process. Unless I am mistaken, you have not explained why that would not make sense. Is the issue that the pod name changes through a pod restart?

I'm fine letting this through, but I am not sure about the follow-ups.

cjus commented 4 days ago

> I don't think I understand why container IDs don't work for identifying the process. Unless I am mistaken, you have not explained why that would not make sense. Is the issue that the pod name changes through a pod restart?
>
> I'm fine letting this through, but I am not sure about the follow-ups.

The reason container hostnames might not be a good choice for unique IDs is that restarting a process within the same container yields the same ID. Naturally, that is not the case if a new container is created instead.
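One way around the restart-in-the-same-container edge case is to mix a per-process random suffix into the ID, so every process start gets a fresh identity even on the same host. This is a hypothetical alternative, not what moose currently does; `make_instance_id` is a made-up name.

```python
import socket
import uuid

def make_instance_id() -> str:
    # Hostname keeps the ID human-readable (which container/pod this is);
    # the random suffix guarantees uniqueness across process restarts
    # inside the same container.
    return f"{socket.gethostname()}-{uuid.uuid4().hex[:8]}"
```

With an ID like this, a restarted process can never mistake a stale lock written under its predecessor's ID for its own, at the cost of always having to wait out the old lock's TTL.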