fedimint / fedimint

Federated E-Cash Mint
https://fedimint.org/
MIT License
536 stars 209 forks source link

`UnitSaver::flush` is using wrong `block_on`? #4054

Closed dpc closed 3 months ago

dpc commented 4 months ago

While investigating something entirely different I've noticed that UnitSaver::flush is using futures::executor::block_on which I don't think is entirely right thing to do when running in tokio Runtime (and thus everything hanged when tokio::task::yield_now was added to the mix). Unfortunately these kinds of issues require at least 200 IQ points to solve so I could use some help. :D

maan2003 commented 4 months ago

maybe calling tokio block_on from inside tokio spawn_blocking

ref https://docs.rs/tokio/latest/tokio/task/fn.block_in_place.html

maan2003 commented 4 months ago

oh this explains why #3989 is failing, because it replaces sync mutex with async mutex, so an .await point

maan2003 commented 4 months ago

yep 28b036126d361325bd9c167b96b3ad9e7cae2a1c fixes CI for #3989

douglaz commented 4 months ago

yep 28b0361 fixes CI for #3989

have you tried that on https://github.com/fedimint/fedimint/pull/3995? it seems it doesn't fix the problem there

dpc commented 4 months ago

yep https://github.com/fedimint/fedimint/commit/28b036126d361325bd9c167b96b3ad9e7cae2a1c fixes CI for https://github.com/fedimint/fedimint/pull/3989

This might be random, and generally the whole problem is only exposed by #3995, so I would hope/expect the proper fix to #3995 (at least stop it from hanging alephbft tasks as I described). If it isn't ... something might still be wrong.

I need to find the caller alephbft code and check how exactly is it calling blocking Write::flush operations. I suspect if offloaded these to blocking thread, possibly via own block_in_place call. So we're now in a situation where alepbft offloaded already ,we are running on blocking threads and trying to async again ... which might required something special to work. Or just causes tokio bug.

maan2003 commented 3 months ago

I assume tokio::runtime::Handle::block_on also runs other tokio tasks while blocked

maan2003 commented 3 months ago

and futures::executor::block_on is a simple loop https://docs.rs/futures-executor/0.3.30/src/futures_executor/local_pool.rs.html#89-101

elsirion commented 3 months ago

@joschisan there are two issues here:

I think we should upstream fixes for this, which would likely take the form of a new async trait for their writer object (same for reader too I guess).

maan2003 commented 3 months ago

i will try changing upstream UnitSaver to async, it looks simple.

maan2003 commented 3 months ago

using tokio::task::block_in_place with tokio::runtime::Handle::current().block_on(..) still deadlocks

maan2003 commented 3 months ago

async unit writer and unit reader for alephbft - https://github.com/Maan2003/AlephBFT/commit/57e0cfc404454866cc4f658369810a99cbb923b6

nix build env - https://github.com/Maan2003/AlephBFT/commit/55af52d8b880467497dea94e7c76c5869c379227

douglaz commented 3 months ago

async unit writer and unit reader for alephbft - Maan2003/AlephBFT@57e0cfc

nix build env - Maan2003/AlephBFT@55af52d

nice! left a small comment there

maan2003 commented 3 months ago

9338d99cba (https://github.com/fedimint/fedimint/pull/4073) fixes it on top of #3995

EDIT: looks like i broke reconnect test (federation restarts) :fearful:

dpc commented 3 months ago

Oh, so did I read it right that the old alephbft implementation was calling a blocking io trait without offloading it from the executor? That would explain why things were weird when attempting to fix it on our end. :+1:

Nice job squashing this async lasagna a bit!

maan2003 commented 3 months ago

Oh, so did I read it right that the old alephbft implementation was calling a blocking io trait without offloading it from the executor?

yes, too many function colors. makes my head spin async(aleph-bft) => sync(UnitSaver::flush) => async(db transaction) => sync (rocksdb)

douglaz commented 3 months ago

EDIT: looks like i broke reconnect test (federation restarts) 😨

Probably a flake. Triggered the job again

maan2003 commented 3 months ago

I already restarted job once, it might be a real bug. UnitReader is used during startup

maan2003 commented 3 months ago

yes, I didn't copy correctly https://github.com/fedimint/fedimint/pull/4073/commits/32f9e28d5e637bc75ff31ba5dbf231d800b0a42c

joschisan commented 3 months ago

@joschisan there are two issues here:

  • aleph-bft uses the std::io::Write trait in a very weird way. The implemented semantics are out of spec imo (switching where you write to based on flush being called …).
  • Only having a sync function to do IO isn't ideal since a lot of io is implemented as async and mixing sync+async is generally a bad idea.

I think we should upstream fixes for this, which would likely take the form of a new async trait for their writer object (same for reader too I guess).

That's what I thought as well. Did just not get around to do it yet.

maan2003 commented 3 months ago

https://github.com/Cardinal-Cryptography/AlephBFT/pull/399

justinmoon commented 3 months ago

dev call: fork for now and use upstream if / when change is upstreamed

elsirion commented 3 months ago

I forked AlephBFT into fedimint, let's create a patched version of the version we are currently using and then publish it to crates.io.

maan2003 commented 3 months ago

https://github.com/fedimint/fedimint-alephbft/pull/1

elsirion commented 3 months ago

Just published fedimint-aleph-bft, last step is to switch fedimint over to it.

joschisan commented 3 months ago

Just published fedimint-aleph-bft, last step is to switch fedimint over to it.

I can take care of that.

maan2003 commented 3 months ago

most changes are already done in https://github.com/fedimint/fedimint/pull/4073