EspressoSystems / seahorse

Ledger-agnostic wallet toolkit for the CAP protocol
https://seahorse.docs.espressosys.com
GNU General Public License v3.0

Dropping `Keystore` in async context blocks executor thread #285

Closed: jbearer closed this issue 1 year ago

jbearer commented 2 years ago

The destructor of Keystore may block the thread it is called from, because dropping the AsyncScope inside it blocks. Since Keystore::new is async, one would have to go to some trouble to drop the keystore in a non-async context (such as returning it from a top-level future and dropping it in main without using async_std::main). This means that when the destructor blocks, it almost always blocks an executor thread. When running with ASYNC_STD_THREAD_COUNT=1, this hangs the executor; currently, the test suite hangs with ASYNC_STD_THREAD_COUNT=1.

The details of why the destructor blocks are subtle. Briefly, AsyncScope is only safe because, upon being dropped, it cancels and then joins all the tasks that have been spawned in that scope, thus ensuring that no task is still running after its lifetime has ended. The destructor of AsyncScope calls cancel() followed by collect().await. However, there is no way in current Rust to return control to an async executor from a drop handler; drop handlers must be synchronous. To execute the async collect() future, the destructor therefore uses block_on. When block_on is called in an async context, it calls the future's poll function and, if the future is pending, blocks the calling thread. This deadlocks if there is only one executor thread, since there is no other executor thread left to complete the future that block_on is waiting on.
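
To make the failure mode concrete, here is a minimal sketch (not Seahorse code; the Scope type below just mimics what the AsyncScope destructor does) that hangs when run with ASYNC_STD_THREAD_COUNT=1:

use async_std::task;
use futures::executor::block_on;
use futures::future::{self, AbortHandle, Abortable};

struct Scope {
    abort: AbortHandle,
    handle: Option<task::JoinHandle<()>>,
}

impl Drop for Scope {
    fn drop(&mut self) {
        // Mirror what AsyncScope's destructor does: cancel, then wait.
        self.abort.abort();
        // block_on parks the current thread until the aborted task gets
        // polled one more time, which requires a *different* executor
        // thread. With ASYNC_STD_THREAD_COUNT=1 there is none, so we hang.
        block_on(self.handle.take().unwrap());
    }
}

fn main() {
    task::block_on(async {
        // Spawn so that the drop below happens on the executor pool, which
        // is where a Keystore is almost always dropped in practice.
        task::spawn(async {
            let (abort, reg) = AbortHandle::new_pair();
            let handle = task::spawn(async move {
                let _ = Abortable::new(future::pending::<()>(), reg).await;
            });
            let _scope = Scope { abort, handle: Some(handle) };
            // `_scope` is dropped here, on the pool's only thread.
        })
        .await;
    });
}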

Now, it is true that calling poll on an Abortable future after the future has been aborted will never return Pending (since we are dealing with a single thread, and aborts are orchestrated via a single, atomic memory location, "after" has a well-defined and intuitive meaning here). This appears to be the assumption AsyncScope is making to avoid pending in the destructor, since indeed abort is called on all of the abort handles before any of the tasks are awaited. However, the catch is that the AsyncScope destructor does not await the Abortable future directly. It awaits the task handle for the task on which the future was spawned. Polling a task handle does not poll the future running on that task: it simply checks the task's status, completing only if the future has already finished and pending otherwise. Moreover, an Abortable future does not complete merely because abort is called; it only completes the next time it is polled and actually checks the abort flag. Since nothing in the destructor of AsyncScope causes the future itself to be polled, the destructor will block until another executor thread runs the aborted future one more time; if there are no other executor threads, we deadlock.
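
The last point is easy to verify in isolation with the futures crate: abort only sets a flag, and the Abortable future resolves to Err(Aborted) only when something polls it afterwards. A standalone sketch:

use std::future::Future;
use std::pin::pin;
use std::task::{Context, Poll};
use futures::future::{self, AbortHandle, Abortable};
use futures::task::noop_waker;

fn main() {
    let (handle, reg) = AbortHandle::new_pair();
    // Wrap a future that would otherwise never finish.
    let mut fut = pin!(Abortable::new(future::pending::<()>(), reg));

    let waker = noop_waker();
    let mut cx = Context::from_waker(&waker);

    // Before the abort, polling pends, as expected.
    assert!(matches!(fut.as_mut().poll(&mut cx), Poll::Pending));

    handle.abort();

    // The abort flag is now set...
    assert!(fut.is_aborted());
    // ...but the future only completes (with Err(Aborted)) once something
    // polls it again and it observes the flag. Nothing in AsyncScope's
    // destructor performs that poll; it only awaits the task handle.
    assert!(matches!(fut.as_mut().poll(&mut cx), Poll::Ready(Err(_))));
}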

There is an easy way and a hard way to fix this. The easy way is to ensure all futures spawned in an AsyncScope are joined in an async function (using await rather than block_on) before the destructor is called. This could be accomplished by adding the following method to Keystore:

pub async fn close(mut self) {
    // Tell every task spawned in the scope to abort...
    self.task_scope.cancel();
    // ...then await them here, in async context, instead of letting the
    // destructor block_on them from an executor thread.
    self.task_scope.collect::<Vec<_>>().await;
}

Callers would have to remember to call this function instead of dropping the Keystore, so this is not an ideal API. We may be able to catch mistakes with Clippy by adding #[must_use] to Keystore. In any case, forgetting to call close would have no adverse effects as long as the executor was running with at least 2 threads. (Generally, we would require one more thread than is required to complete all of the tasks the AsyncScope is joining. However, completing an aborted task can always be done with just one executor thread.)
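
For illustration, the caller-side contract would look roughly like this (the Keystore below is a stub standing in for the real one; the #[must_use] message is just a suggestion):

// #[must_use] on the type only warns when a returned Keystore is ignored
// outright, so it is a partial safeguard at best, as noted above.
#[must_use = "call `close().await` instead of letting the Keystore drop"]
struct Keystore;

impl Keystore {
    async fn new() -> Self { Keystore }
    // Explicit, awaitable shutdown: cancel() + collect().await, as above.
    async fn close(self) {}
}

fn main() {
    async_std::task::block_on(async {
        let keystore = Keystore::new().await;
        // ... use the keystore ...
        keystore.close().await; // joins background tasks without blocking an executor thread
    });
}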

The better, harder way to fix this is to avoid AsyncScope altogether. There are two ways we could do this. The first, and in my opinion the best, is to remove the 'a lifetime from all of Seahorse, so that all of the data structures used by the keystore tasks (e.g. LedgerState) are 'static. 'a is the lifetime of the reference to the UniversalParam used by the keystore's proving keys. This lifetime is always 'static, and in fact reef already requires the lifetime to be 'static. Therefore, we could simply have ProverKeySet<'static> in LedgerState and remove the lifetime everywhere else, giving us 'static futures that can be spawned normally (using async_std::task::spawn) without AsyncScope. As a bonus, this would simplify the code and interfaces, and would likely allow us to retire the workaround for the async lifetime compiler bug.
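
As a rough sketch of what this buys us (with stand-in types, not the real LedgerState), background tasks become ordinary 'static spawns, and there is nothing left for a destructor to join:

use std::sync::Arc;
use async_std::sync::Mutex;
use async_std::task::{self, JoinHandle};

// Stand-in for the state shared with background tasks; with the universal
// parameter required to be 'static, nothing here borrows anymore.
struct LedgerState {
    block_height: u64,
}

fn spawn_sync_task(state: Arc<Mutex<LedgerState>>) -> JoinHandle<()> {
    // The future captures only Arc'd, 'static data, so a plain spawn works;
    // no AsyncScope, and no blocking destructor, is needed.
    task::spawn(async move {
        let mut state = state.lock().await;
        state.block_height += 1;
    })
}

fn main() {
    task::block_on(async {
        let state = Arc::new(Mutex::new(LedgerState { block_height: 0 }));
        spawn_sync_task(state.clone()).await;
        assert_eq!(state.lock().await.block_height, 1);
    });
}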

This problem can also be fixed without forcing the UniversalParam reference to be 'static. The lifetime 'a is only used in the proving keys, which are only needed for transaction building, not by any of the background tasks. We could split the proving keys from the data that is actually required by the background tasks, and pass only that required, 'static data to the futures, again giving us 'static futures that can be spawned normally (a rough sketch of this split appears at the end of this comment). This would complicate the definition of the LedgerState model (probably requiring its own internal Arc<Mutex<_>> around the 'static data) for dubious gain.

Non-static universal params are only beneficial in an application that dynamically creates different universal params, so that it can free the ones it no longer needs. The only application I can think of where dynamically created universal params would be remotely useful is something like MetaMask, where a single application can switch between multiple chains, perhaps each with its own universal setup. But even then, the number of different chains will be in the tens at most, and the memory overhead of making all of them static would be relatively small. In the absolute worst case, which I consider very unlikely to be necessary, we could even run each chain's wallet in its own process and reap the whole process when we are done with that chain. Therefore, I believe the tiny benefit of keeping the lifetime generic does not justify the extra complexity of this change relative to the current state of things, and even less so relative to the potential simplification we could get by removing the lifetime.
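
For completeness, the split described above would look roughly like this (illustrative names only, not actual Seahorse types):

use std::sync::Arc;
use async_std::sync::Mutex;

struct UniversalParam; // stand-in for the universal setup

// The 'a lifetime survives only here, in the transaction-building keys.
struct ProverKeySet<'a> {
    param: &'a UniversalParam,
}

// The 'static data the background tasks actually need.
struct SyncState {
    block_height: u64,
}

struct Keystore<'a> {
    proving_keys: ProverKeySet<'a>,    // never passed to spawned futures
    sync_state: Arc<Mutex<SyncState>>, // cloned into 'static spawned futures
}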