How to model crash transitions ?

damnMeddlingKid commented 3 weeks ago

I want to model a serial action which can crash. Heres an example

fair action DoWork:
    if current_state == State.Start:
        lock_service.acquire_lock()
        do_something
        do_another_thing
        lock_service.release_lock()

it is ok to yield anywhere in this action, but if the action crashes I would like to cleanup some state and release the lock.

right now im using a work around where i make the action atomic and use something like

if MaybeFail():
   return

to place failures in between lines.

jayaprabhakar commented 3 weeks ago

I was considering adding try-except-else-finally clause of python, or to support other types of flows where it can yield but never crash. but for most real examples, they were not required, so I didn't bother to add.

Can you let me know the exact case when you would need this? Based on the name lock_service, I assume, the lock is managed by a separate microservice/database.

In that case, no same-action cleanup solution would really work, when the model is translated into real code during development. The real solution would require a separate fair action to detect and cleanup.

damnMeddlingKid commented 3 weeks ago

Can you let me know the exact case when you would need this? Based on the name lock_service, I assume, the lock is managed by a separate microservice/database.

Yeah exactly. We have a process that acquires a lock from the lock service and then that process can die while holding the lock. In production we have a periodic health check that can detect that the process has died and then it performs some cleanup.

But i'm not sure how to model that detection.

jayaprabhakar commented 3 weeks ago

This will depend on how much you would want to model and the what happens when the node is working and thinks it is still holding the lock, but the lock service thinks the node failed and released the lock for some other process to acquire.

In the real world, the implementation most likely will have a timeout on the lock so it will get released after the timeout, and the nodes acquiring it will send periodic heartbeats, the heartbeats acts as lease renewal of the lock. The heartbeat interval will be significantly shorter than the lock release timeout. Before each critical operation, it must check if it still holds the lock. (That is, the lock was last renewed less than the heartbeat interval)

Since the system you are modeling, you are not interested in how the locks renewal and leases are implemented, you don't need to model the heartbeats, renewal etc, all we need to know is if the node still holds the lock.

You can simply maintain a holds_lock boolean variable. Before each critical step, atomically check holds_lock and execute the step. A separate fair action Timeout to release the lock.

fair action DoWork:
    if current_state == State.Start:
        lock_service.acquire_lock()
        holds_lock = True
        do_something()
        do_another_thing()
        # these two are optional for correctness, but as a spec good to mention this.
        holds_lock = False
        lock_service.release_lock()

func do_something():
    atomic:
       if not holds_lock:
         pass # Just to ensure it marks this block as enabled
         return
       ....

func do_another_thing():
    # some steps
    atomic:
       if not holds_lock:
         pass
         return
       critical_section

      # more steps

fair action LockTimeout:
    holds_lock = False
    # calling lock_service.release_lock() is optional, that is an 
    # implementation optimization than modeling for correctness.

jayaprabhakar commented 3 weeks ago

Btw, the spec might be slightly simpler if lock_service.acquire_lock() returned a lock that you store as a state variable. The holds_lock will be within the lock record. But for now, start with whatever is the simplest

damnMeddlingKid commented 3 weeks ago

Thanks!, it's similar to what i've been doing. I didn't realize that i need a pass to activate the block, but it makes sense it needs some statement execution to be enabled.

In my actual model there are several more lines and i'd like to model failing at any point, so it gets a bit tedious to check "holds_lock" everywhere. Implementing try/except makes sense to me, but maybe with different terminology. It's up to the author to think about if the model reflects reality ?.

jayaprabhakar commented 1 week ago

In a real implementation, isn't the lock always checked (either explicitly to check if the lock hasn't timeout or implicitly where the individual operations have a much shorter timeout compared to the lock timeout) before any critical operation?

In most cases require holds_lock should be sufficient. In that case, this should be short and makes the implicit assumption explicit.

fizzbee-io / fizzbee

How to model crash transitions ? #77