filecoin-project / rust-fil-proofs

Proofs for Filecoin in Rust
Other
489 stars 314 forks source link

fix: better error handling #1751

Open vmx opened 5 months ago

vmx commented 5 months ago

https://github.com/filecoin-project/rust-fil-proofs/pull/1748 fixed a problem where errors (asserts) where not properly propagated upwards. There were threads just hanging without making any progress.

I usually follow the development paradigm of "Make it work, make it right, make it fast" 1, where I interpret the "make it right" as "make it simple". With "simple" I mean things like clear code, easy to follow, no new paradigms, following the style of the existing code base.

I'd like to be more concrete about that for this change, on why I find this version "simpler".

The core issue was an assert within a spawned thread in a yastl pool. Someone new to the code base coould re-introduce such a problem easily, hence it would be best if we can actually prevent that. With removing the yastl thread pool, where it's not really needed for better performance, we can easily prevent that.

We see three difference uses of the yastl thread pool within the proof.rs. One is using it to pipeline operations, another one is for a highly parallizing an operation. Usually for data parallism we use Rayon in this code base. In the pipelining use case, there is Rayon used within pipeline. The yastl thread pool will spawn only a limited set of threads, so this looks alright. For the highly parallelised case, we only use yastl and not Rayon (although we should look into just Rayon instead).

The third use, where we mix yastl and Rayon for highly parallel operations is removed with this PR. This is intentional. Having two thread pools, which by default use as many threads as cores could easily overprovision a system. This could potentially lead to unintended slow downs. Another benefit of just using Rayon is, that the number of threads can be controlled with an environment variable. This gives more control when several instances of this code is run in parallel, which is the case for some storage providers.

Switching back from errors to asserts. I don't know the reason, why the asserts were changed to errors. Hence I'm switching it back to asserts, as now it's easily possible. If one looks at the diff between the version prior to https://github.com/filecoin-project/rust-fil-proofs/pull/1728 and this, then the diff is pretty minimal an straight forward. Also one verification again runs only in debug mode and not also in release mode. Though if errors are desired, they can be easily be introduce by switching the asserts to anyhow's ensure!() macro.

Following the error handling needs less context. With this change, the asserts are happening in the code right away, it works the way people coding Rust are used to. Prior to this change, more context is needed. Taking the "invalid comm_d"-error as an example. It happens on line 456 2. Now reading through the code what it means: first look for the invalid_comm_d variable. It defines an instance of the InvalidChallengeCoordinate struct, which we take a quick look at and see that's a local one, for specifcally error handling. That instance is wrapped in an Arc which is interesting at a first glance, as in Rust shared state is usually tried to be avoided where possible. When looking into the usage of invalid_comm_d it becomes clear that we need the Arc, as the might assign a different value to it in the error case. We need a mutable reference here, but again usually in Rust immutable types are preferred. So if we can avoid the Arc as well some mutability, it's a win for being more Rust idiomatic, hence easier for people familiar with Rust.

For me that falls under the "code is written once, but read many times" category. So making it easy to read is a win.

In the lower part of the change, the yastl usage for high data parallelism is removed in favour of Rayon, see above for some of the reasons. Also using Rayon here seems to use the thread pool more efficiently, at least on the machine I've tested it on (with 64 threads).

When looking at the diff between this change and the commit prior to https://github.com/filecoin-project/rust-fil-proofs/pull/1748

git diff https://github.com/filecoin-project/rust-fil-proofs/commit/8f5bd86be36a55e33b9b293ba22ea13ca1f28163.. -w -- storage-proofs-porep/src/stacked/vanilla/proof.rs

or via GitHub.

Then the changes are very minimal, which I also count as a sign for being "simpler".

vmx commented 5 months ago

This PR is easiest review without whitespace: https://github.com/filecoin-project/rust-fil-proofs/pull/1751/files?diff=split&w=1