OpenCyphal-Garage / cyraft

Raft algorithm (for pycyphal)
2 stars 2 forks source link

TimerHandle.cancel() not working as expected? #8

Closed maksimdrachov closed 1 year ago

maksimdrachov commented 1 year ago

The leader election "works", however there's a couple of small issues I'm still struggling with. Fundamentally they are all related to scheduling/cancelling the delayed callbacks (for term/election timeouts).

Running tests/raft_leader_election.py:

1) pytest -k _unittest_raft_fsm_1 --pdb

image

The above warning-message appears. I don't see why though. At the end I'm using cancel(); which should take care care of cancelling both callbacks:

https://github.com/OpenCyphal-Garage/cyraft/blob/a8e2121532546285fbd43818ed6bfa7a46ffd62b/cyraft/node.py#L717-L727

The thing is, it does take care of cancelling those callbacks, for example if add an additional 10 seconds of sleep (expecting that if the callback was not cancelled properly it would have enough time to execute), it doesn't execute anything:

image

image

(Nothing happens.)

2) Now comes the confusing part, the previous the unit tests should indicate that the heartbeat mechanism works correctly, however it starts to mess up with the last 2 unit tests? For example running:

pytest -k _unittest_raft_fsm_3 --pdb

image

Even though node 43 gets the heartbeat message as expected, it resets the election timeout timer, it still somehow ends up timing out on the election timeout set at the start of the unit test?

Am I missing something?

pavel-kirienko commented 1 year ago

The above warning-message appears. I don't see why though

IIRC when you cancel a coroutine, it throws an asyncio.CancelledError which has to be consumed somehow; otherwise, you get the warning.

Now comes the confusing part, the previous the unit tests should indicate that the heartbeat mechanism works correctly, however it starts to mess up with the last 2 unit tests?

This means that some resources have not been finalized in the previous tests.

maksimdrachov commented 1 year ago

Ok, class pay attention!

If you need to schedule a delayed callback with a regular function, you can do the following:

self._timer = loop.call_at(self._next_timeout, self._on_timeout)

However, if you'd like to schedule a delayed callback with a coroutine, you'll have to do the following:

self._timer = loop.call_at(self._next_timeout, lambda: self._on_timeout())

If you do the following instead:

self._timer = loop.call_at(self._next_timeout, asyncio.create_task, self._on_timeout())

It will work, however if you try to cancel this task it will issue a warning:

RuntimeWarning: coroutine 'RaftNode._on_timeout' was never awaited

(Due to the task already being scheduled, and calling cancel on the TimerHandle won't change this fact)

pavel-kirienko commented 1 year ago

If you need to schedule a delayed callback with a regular function, you can do the following:

Makes sense.

However, if you'd like to schedule a delayed callback with a coroutine, you'll have to do the following:

Here, self._on_timeout is equivalent to lambda: self._on_timeout() aside from the extra level of indirection in the latter. The lambda will return an awaitable item that will have to be awaited by the caller, but the timer implementation will not do it. Why does this work?

maksimdrachov commented 1 year ago

Sorry, correction:

self._timer = loop.call_at(self._next_timeout, lambda: self._on_timeout())

should be:

self._timer = loop.call_at(
  self._next_timeout, lambda: asyncio.create_task(self._on_timeout())
)
pavel-kirienko commented 1 year ago

Is this not equivalent to:

self._timer = loop.call_at(
  self._next_timeout, asyncio.create_task, self._on_timeout()
)

EDIT: disregard that, it is not. The time of invocation of on_timeout is different.