Open MOZGIII opened 1 month ago
The root cause has been boiled down to how locking inside of the invoke
interacts with the locking at next_raw_update
in a situation where, like we have in the example, we select the next_raw_update
together with something else (in our case a timer) and inside that other select branch we do the invoke
.
That leads to a deadlock, where invoke
tries to lock on the sender
mutex, and waits infinitely; normally this wouldn't be a problem as the current holder of the lock would make progress and relieve the invoke
from needing to acquire the lock on sender by just returning the result - however our this case, the sender
is locked by next_raw_update
, and we are not in a select
statement (macro) itself, but in a select
branch - meaning that next_raw_update
future is not polled anymore, despite actually being ready to progress. This is the ultimate reason why the sender.step
isn't being polled, and why the sender
can't make progress; couple that together with a lock on sender
at invoke
- and we get a deadlock.
Note that if invoke
wouldn't try to lock on sender
and would just wait for rx
to get ready we'd still get the same deadlock, since the current sender
driving future can't progress.
There are no issues with tokio
- well, really, as expected - in fact, it can't do anything at that point because the composed future state machine simply does not allow to poll on that sender.send
code branch.
I'd argue that grammers-client
implementation should be altered to make it impossible to get in a situation where the client can't progress - especially since it was clearly designed to allow the progress to be made in this kind of a situation - meaning I'd consider this a bug.
@Lonami how are you willing to proceed? I can work on a fix for this when I get back to it - but at the moment I'm busy elsewhere.
I still need enough brain power to read through and truly understand the problem. I don't know when that'll be.
I'm not under any time pressure to see this fixed. So if you have a clear path forward on your mind, that'd be best.
This is a very peculiar issue, but fortunately I have managed to come up with an relatively small example that reproduces this bug: https://github.com/MOZGIII/grammers-halt
The issue manifests with this:
If the code was working as intended, the log line
main::reconciler: before iter next
should've been followed by something else - either an error, ormain::reconciler: after iter next
; however the code stalls and does not progress.My investigation showed:
grammers_mtsender::Sender
by thegrammers_client::Client
call ofnext_raw_update
call under the hood - which is intended; in other words this is not deadlock in a general sense. Tested this via tokio console from async and lldb for sync mutexes.tokio::sync::Mutex
specifically, since I have tried swapping the mutex implementation in question withasync_mutex::Mutex
and got the same outcome - the app stalled.The
select
chain at https://github.com/Lonami/grammers/blob/724188fd4c95a835308cdca051cee4e96e012f21/lib/grammers-mtsender/src/lib.rs#L333-L338 does not progress, despite the conditions for the progression of the individual futures being fulfilled; I do not mean theselect
is the issue here, because it seems that the reason they do not progress is because the whole task doesn't get scheduled - i.e. the futures (or, rather, the combinedselect
future) do not progress because are not polled, meaning the task is simply not schedule - not because the futures are polled but are stuck. I have managed to test this by trying this code here:Screenshot of the code here.
![telegram-cloud-photo-size-1-4961085924855361127-y](https://github.com/user-attachments/assets/6a993ce5-dad3-4e76-ad66-233041017da8) Sorry for the screenshot, didn't commit it, and this is all I got now.I wrapped the https://github.com/Lonami/grammers/blob/724188fd4c95a835308cdca051cee4e96e012f21/lib/grammers-mtsender/src/lib.rs#L321 with
tokio::spawn
and added logging inside thatspawn
ed future after thesleep
returns; then made theselect
wait onJoinHandle
of that spawned task. After a minute, I saw the log message - which means that timer worked - butselect
that was polling on theJoinHandle
didn't return - which means that the task that executes select was not scheduled uponJoinHandle
returning. Rather odd, yet maybe I'm missing something.I don't know what's going on there, but I suspect a
tokio
bug.I have managed to create a workaround - essentially the same logic but executed with two independent spawned tasks instead of structured concurrency, it can be found at https://github.com/MOZGIII/grammers-halt/tree/working1.
See https://github.com/MOZGIII/grammers-halt/compare/master...working1 for the changes that make the code work - does not solve the issue, but at least might be useful for someone who needs to work around this bug in the meantime.