Closed dpc closed 3 months ago
maybe calling tokio block_on from inside tokio spawn_blocking
ref https://docs.rs/tokio/latest/tokio/task/fn.block_in_place.html
oh this explains why #3989 is failing, because it replaces sync mutex with async mutex, so an .await
point
yep 28b036126d361325bd9c167b96b3ad9e7cae2a1c fixes CI for #3989
yep 28b0361 fixes CI for #3989
have you tried that on https://github.com/fedimint/fedimint/pull/3995? it seems it doesn't fix the problem there
yep https://github.com/fedimint/fedimint/commit/28b036126d361325bd9c167b96b3ad9e7cae2a1c fixes CI for https://github.com/fedimint/fedimint/pull/3989
This might be random, and generally the whole problem is only exposed by #3995, so I would hope/expect the proper fix to #3995 (at least stop it from hanging alephbft tasks as I described). If it isn't ... something might still be wrong.
I need to find the caller alephbft code and check how exactly is it calling blocking Write::flush
operations. I suspect if offloaded these to blocking thread, possibly via own block_in_place
call. So we're now in a situation where alepbft offloaded already ,we are running on blocking threads and trying to async again ... which might required something special to work. Or just causes tokio bug.
I assume tokio::runtime::Handle::block_on also runs other tokio tasks while blocked
and futures::executor::block_on is a simple loop https://docs.rs/futures-executor/0.3.30/src/futures_executor/local_pool.rs.html#89-101
@joschisan there are two issues here:
aleph-bft
uses the std::io::Write
trait in a very weird way. The implemented semantics are out of spec imo (switching where you write to based on flush being called …).I think we should upstream fixes for this, which would likely take the form of a new async trait for their writer object (same for reader too I guess).
i will try changing upstream UnitSaver to async, it looks simple.
using tokio::task::block_in_place
with tokio::runtime::Handle::current().block_on(..)
still deadlocks
async unit writer and unit reader for alephbft - https://github.com/Maan2003/AlephBFT/commit/57e0cfc404454866cc4f658369810a99cbb923b6
nix build env - https://github.com/Maan2003/AlephBFT/commit/55af52d8b880467497dea94e7c76c5869c379227
async unit writer and unit reader for alephbft - Maan2003/AlephBFT@57e0cfc
nix build env - Maan2003/AlephBFT@55af52d
nice! left a small comment there
9338d99cba (https://github.com/fedimint/fedimint/pull/4073) fixes it on top of #3995
EDIT: looks like i broke reconnect test (federation restarts) :fearful:
Oh, so did I read it right that the old alephbft implementation was calling a blocking io trait without offloading it from the executor? That would explain why things were weird when attempting to fix it on our end. :+1:
Nice job squashing this async lasagna a bit!
Oh, so did I read it right that the old alephbft implementation was calling a blocking io trait without offloading it from the executor?
yes, too many function colors. makes my head spin async(aleph-bft) => sync(UnitSaver::flush) => async(db transaction) => sync (rocksdb)
EDIT: looks like i broke reconnect test (federation restarts) 😨
Probably a flake. Triggered the job again
I already restarted job once, it might be a real bug. UnitReader is used during startup
yes, I didn't copy correctly https://github.com/fedimint/fedimint/pull/4073/commits/32f9e28d5e637bc75ff31ba5dbf231d800b0a42c
@joschisan there are two issues here:
aleph-bft
uses thestd::io::Write
trait in a very weird way. The implemented semantics are out of spec imo (switching where you write to based on flush being called …).- Only having a sync function to do IO isn't ideal since a lot of io is implemented as async and mixing sync+async is generally a bad idea.
I think we should upstream fixes for this, which would likely take the form of a new async trait for their writer object (same for reader too I guess).
That's what I thought as well. Did just not get around to do it yet.
dev call: fork for now and use upstream if / when change is upstreamed
I forked AlephBFT into fedimint, let's create a patched version of the version we are currently using and then publish it to crates.io.
Just published fedimint-aleph-bft, last step is to switch fedimint over to it.
Just published fedimint-aleph-bft, last step is to switch fedimint over to it.
I can take care of that.
most changes are already done in https://github.com/fedimint/fedimint/pull/4073
While investigating something entirely different I've noticed that
UnitSaver::flush
is usingfutures::executor::block_on
which I don't think is entirely right thing to do when running in tokio Runtime (and thus everything hanged whentokio::task::yield_now
was added to the mix). Unfortunately these kinds of issues require at least 200 IQ points to solve so I could use some help. :D