Create binaries to run phases independently

vmx commented 1 year ago

Description

Currently the Filecoin proofs are consumed by Lotus via the FFI as a single library. In addition to the library use case, the idea is to provide separate binaries for each phase (or perhaps even more fine-grained, but as a start the phases should be fine). This would serve several needs that have come up in the past:

- Easier testing: Let's say you have a bug that only shows when unsealing previously sealed data (like https://github.com/filecoin-project/rust-fil-proofs/issues/1647). The whole process takes many hours. If you could run many of the phases only once, and then re-run parts of the pipeline to narrow down the issue, it could save a lot of time.
- Benchmarking:
  - Improving current setups: you either want to benchmark a code change or different hardware setups. Currently you'd run the whole process and then probably look at the logs to see how long things took. With separate binaries, you could prepare things up to the specific step you want to benchmark and only iterate on that (also brought up at https://github.com/filecoin-project/rust-fil-proofs/issues/1676#issuecomment-1462085894).
  - Comparing to other implementations: Recently Supranational published a PC2 implementation as a standalone binary. To compare it with the current implementation, it would need to be integrated into the current code base. If there were binaries already, you could run the PC2 binary on the same input data and compare the results.
- Flexible deployments: You might want to orchestrate the sealing process with your own tools. Currently it's an engineering task that requires Rust knowledge, as you'd need to call directly into the Rust code if you want to run specific pieces or want additional monitoring. With separate binaries it becomes more of a dev-ops kind of problem, where you can build tooling around them.
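
To make the orchestration idea more concrete, here is a minimal sketch of an external runner, assuming there is one binary per phase. The binary names (`fil-proofs-<phase>`) and the `--dir` flag are made up for illustration, as no such binaries exist yet. The phase list uses the usual sealing abbreviations (AP, PC1, PC2, C1, C2).

```python
import subprocess
from pathlib import Path

# Sealing phases in execution order.
PHASES = ["ap", "pc1", "pc2", "c1", "c2"]


def build_commands(work_dir, start_at="ap", stop_after="c2"):
    """Return one command line per phase from start_at to stop_after (inclusive)."""
    first = PHASES.index(start_at)
    last = PHASES.index(stop_after)
    if first > last:
        raise ValueError(f"{start_at} comes after {stop_after}")
    return [
        # Hypothetical CLI: `fil-proofs-<phase> --dir <work_dir>`.
        [f"fil-proofs-{phase}", "--dir", str(Path(work_dir))]
        for phase in PHASES[first:last + 1]
    ]


def run_pipeline(commands, run=subprocess.run):
    """Run the commands in order; check=True aborts at the first failing phase."""
    for cmd in commands:
        run(cmd, check=True)
```

Re-running only part of the pipeline on existing data would then be something like `run_pipeline(build_commands("/mnt/sector", start_at="pc2"))`.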

Acceptance criteria

There are binaries that can be run in sequence to do the full lifecycle of sealing and unsealing a sector.

Risks + pitfalls

This may lead to refactoring if the current internal APIs do not fit. I see that as a good thing, though, as the APIs should already be flexible enough to make this work.

Where to begin

benchy does already partly support running certain phases only. But it's not that flexible and has known issues.

cryptonemo commented 1 year ago

To be clear, Proofs is a library and will remain that way. Binaries would be an enhancement, using the library.

vmx commented 1 year ago

> To be clear, Proofs is a library and will remain that way. Binaries would be an enhancement, using the library.

Thanks for calling this out. I've changed the first paragraph to make this clearer.

RobQuistNL commented 1 year ago

This would be an awesome feature to have - it would greatly help with benchmarking separate stages and working on improvements.

It would be very nice to have a way to validate that the result of the benchmark is correct, too. Not sure if that's inherently possible as we're skipping some steps though.

An example would be:

cargo run --bin benchy -- single-step -- ap --sectornumber 123 --size 512MiB --result /mnt/benchfiles # Generates "unsealed" sector file (/mnt/benchfiles/unsealed/123/)
cargo run --bin benchy -- single-step -- pc1 --sectornumber 123 --result /mnt/benchfiles # Uses the "unsealed" sector file from the AP step, generates the layer files in the "cache" folder (/mnt/benchfiles/cache/123/) (if i'm not mistaken, PC1 in lotus-worker stores it there too)
cargo run --bin benchy -- single-step -- pc2 --sectornumber 123 --result /mnt/benchfiles # Uses the layer files from the PC1 step, generates its files in the "cache" folder (/mnt/benchfiles/cache/123/) (if i'm not mistaken, PC2 in lotus-worker stores it there)

and so on for C1 / C2
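
On the validation question: one option, assuming a phase produces byte-identical output for identical input (that is an assumption here, not something I've verified for every phase), is to hash the files of a known-good run and of the benchmarked run and compare them. A rough sketch, where the directory layout is whatever the binaries end up writing:

```python
import hashlib
from pathlib import Path


def digest_tree(root):
    """Map each file path (relative to root) to its SHA-256 hex digest."""
    root = Path(root)
    digests = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            h = hashlib.sha256()
            with path.open("rb") as f:
                # Read in 1 MiB chunks so large layer files are not loaded at once.
                for chunk in iter(lambda: f.read(1 << 20), b""):
                    h.update(chunk)
            digests[path.relative_to(root).as_posix()] = h.hexdigest()
    return digests


def differing_files(dir_a, dir_b):
    """Return the sorted relative paths that are missing on one side or differ."""
    a, b = digest_tree(dir_a), digest_tree(dir_b)
    return sorted(name for name in a.keys() | b.keys() if a.get(name) != b.get(name))
```

An empty list from `differing_files(reference_dir, benchmark_dir)` would mean both runs produced the same files. The same check could be used to compare two implementations of one phase.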

lovel8 commented 1 year ago

@vmx It is recommended to support the following functional requirements:

1. For performance testing
- Add configuration support for the total number of task cycle executions, to verify the stability of the program and of its computational efficiency.
- Add support for configuring the number of concurrent task executions in each stage, such as 30 P1s and 4 P2s concurrently, to adapt to real system resources (CPU, GPU, memory limitations) and achieve maximum resource utilization.
- Add a statistics log of maximum system resource usage during runtime (e.g. CPU, GPU, memory) for analysis and optimization.
2. For problem diagnosis: add support for rerunning benchy from the failing phase (e.g. P2) after a Lotus panic, to reproduce and debug the problem.

vmx commented 1 year ago

> 1. For performance testing
>
> - Add configuration support for the total number of task cycle executions, to verify the stability of the program and of its computational efficiency.
> - Add support for configuring the number of concurrent task executions in each stage, such as 30 P1s and 4 P2s concurrently, to adapt to real system resources (CPU, GPU, memory limitations) and achieve maximum resource utilization.
> - Add a statistics log of maximum system resource usage during runtime (e.g. CPU, GPU, memory) for analysis and optimization.

Those are probably out of scope. The idea is to have binaries, so that you can build those tools on-top of it. You could create your own runners that do exactly the testing that you need.
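
As a rough sketch of what such a runner could look like, here is a per-stage concurrency limit built on a thread pool. The limits are the numbers from the request above. How a stage is actually invoked is left to a caller-supplied `make_command` function, since the binaries and their flags don't exist yet.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

# Example limits from the request above; tune them to the actual hardware.
LIMITS = {"pc1": 30, "pc2": 4}


def run_stage(stage, sector_ids, make_command, run=subprocess.run):
    """Run one stage for all sectors, with at most LIMITS[stage] running at once."""

    def task(sector_id):
        # make_command is supplied by the caller and returns the argv list
        # for this stage and sector (hypothetical, depends on the final CLI).
        run(make_command(stage, sector_id), check=True)
        return sector_id

    with ThreadPoolExecutor(max_workers=LIMITS[stage]) as pool:
        # pool.map keeps input order and re-raises the first failure.
        return list(pool.map(task, sector_ids))
```

This finishes a stage for every sector before moving on to the next one. A real scheduler would probably overlap stages, but that is exactly the kind of policy that can live outside of the proofs code.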

> 2. For problem diagnosis: add support for rerunning benchy from the failing phase (e.g. P2) after a Lotus panic, to reproduce and debug the problem.

Yes, ideally it should be possible to run just a certain step on the data you already have.

vmx commented 1 year ago

Some of the requirements re-formulated as user stories:

As a storage provider I'd like to

- be able to write my own workflow/scheduling/custom solution, so that I can reach better resource utilization.
- be able to stop and resume jobs, so that I can reach better resource utilization.
- have more fine-grained control over which parts of the proving pipeline are run at which point in time, so that I can optimize for different priorities of deals.
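
For the stop/resume story, a minimal sketch of how external tooling could track progress per sector, assuming each phase is a separate process that either completes or can be re-run from scratch. The JSON state file and its format are purely illustrative and not part of any existing tool.

```python
import json
from pathlib import Path

# Sealing phases in execution order.
PHASES = ["ap", "pc1", "pc2", "c1", "c2"]


def _load(state_file):
    """Read the list of completed phases; a missing file means nothing is done."""
    path = Path(state_file)
    return json.loads(path.read_text()) if path.exists() else []


def next_phase(state_file):
    """Return the first phase not yet recorded as done, or None if all are done."""
    done = _load(state_file)
    for phase in PHASES:
        if phase not in done:
            return phase
    return None


def mark_done(state_file, phase):
    """Record a phase as completed (idempotent)."""
    done = _load(state_file)
    if phase not in done:
        done.append(phase)
    Path(state_file).write_text(json.dumps(done))
```

A scheduler would call `next_phase()` when it picks a job up again, run that phase, and call `mark_done()` only after the process exited successfully.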

If anyone has more, please share them here.

RobQuistNL commented 1 year ago

Yes! :)

Clear documentation (or examples) on how to run the various parts, what data they need & generate, how to pass this data through etc.

The Supranational updates would also be easier to integrate this way.
