Lonami / grammers

(tele)gramme.rs - use Telegram's API from Rust
https://t.me/gramme_rs
Apache License 2.0
562 stars 116 forks source link

Progress halt with a certain future composition on the grammers-client API user side #282

Open MOZGIII opened 1 month ago

MOZGIII commented 1 month ago

This is a very peculiar issue, but fortunately I have managed to come up with an relatively small example that reproduces this bug: https://github.com/MOZGIII/grammers-halt

The issue manifests with this:

telegram-cloud-photo-size-1-4961085924855361126-y

If the code was working as intended, the log line main::reconciler: before iter next should've been followed by something else - either an error, or main::reconciler: after iter next; however the code stalls and does not progress.

My investigation showed:

I don't know what's going on there, but I suspect a tokio bug.


I have managed to create a workaround - essentially the same logic but executed with two independent spawned tasks instead of structured concurrency, it can be found at https://github.com/MOZGIII/grammers-halt/tree/working1.

See https://github.com/MOZGIII/grammers-halt/compare/master...working1 for the changes that make the code work - does not solve the issue, but at least might be useful for someone who needs to work around this bug in the meantime.

MOZGIII commented 1 month ago

The root cause has been boiled down to how locking inside of the invoke interacts with the locking at next_raw_update in a situation where, like we have in the example, we select the next_raw_update together with something else (in our case a timer) and inside that other select branch we do the invoke. That leads to a deadlock, where invoke tries to lock on the sender mutex, and waits infinitely; normally this wouldn't be a problem as the current holder of the lock would make progress and relieve the invoke from needing to acquire the lock on sender by just returning the result - however our this case, the sender is locked by next_raw_update, and we are not in a select statement (macro) itself, but in a select branch - meaning that next_raw_update future is not polled anymore, despite actually being ready to progress. This is the ultimate reason why the sender.step isn't being polled, and why the sender can't make progress; couple that together with a lock on sender at invoke - and we get a deadlock.

Note that if invoke wouldn't try to lock on sender and would just wait for rx to get ready we'd still get the same deadlock, since the current sender driving future can't progress.

There are no issues with tokio - well, really, as expected - in fact, it can't do anything at that point because the composed future state machine simply does not allow to poll on that sender.send code branch.

I'd argue that grammers-client implementation should be altered to make it impossible to get in a situation where the client can't progress - especially since it was clearly designed to allow the progress to be made in this kind of a situation - meaning I'd consider this a bug.

MOZGIII commented 1 month ago

@Lonami how are you willing to proceed? I can work on a fix for this when I get back to it - but at the moment I'm busy elsewhere.

Lonami commented 1 month ago

I still need enough brain power to read through and truly understand the problem. I don't know when that'll be.

I'm not under any time pressure to see this fixed. So if you have a clear path forward on your mind, that'd be best.