filecoin-project / rust-fil-proofs

Proofs for Filecoin in Rust

SupraSeal test issues #1736

Closed cryptonemo closed 9 months ago

cryptonemo commented 10 months ago

Description

Seeing failures on local hardware during tests when cuda-supraseal is enabled:

Example failure from miner-1

test test_seal_lifecycle_upgrade_4kib_base_8 ... ok

failures:

---- test_seal_lifecycle_upgrade_16kib_base_8 stdout ----
Error: not on curve

failures:
    test_seal_lifecycle_upgrade_16kib_base_8

test result: FAILED. 2 passed; 1 failed; 0 ignored; 0 measured; 29 filtered out; finished in 19.50s

A separate failure, also on miner-1. Note that the test that failed in the previous run passes in this one, which suggests this could be a hardware issue:

test test_seal_lifecycle_upgrade_16kib_base_8 ... ok

failures:

---- test_seal_lifecycle_upgrade_2kib_base_8 stdout ----
Error: Compound proof failed to verify

failures:
    test_seal_lifecycle_upgrade_2kib_base_8

Example failure from local machine:

test test_seal_lifecycle_upgrade_16kib_base_8 ... ok

failures:

---- test_seal_lifecycle_upgrade_2kib_base_8 stdout ----
Error: Compound proof failed to verify

failures:
    test_seal_lifecycle_upgrade_2kib_base_8

test result: FAILED. 2 passed; 1 failed; 0 ignored; 0 measured; 29 filtered out; finished in 35.65s

error: test failed, to rerun pass `-p filecoin-proofs --test api`

Note that on both machines, using the cuda feature instead of cuda-supraseal works every time (I cannot get it to fail, even with repeated runs).
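For anyone trying to reproduce the intermittent failures, a minimal sketch of a repeat-run helper is shown below. The function name `repeat_run` is hypothetical and not part of rust-fil-proofs; it simply runs whatever command it is given a fixed number of times, discards the command's output, and reports how many runs failed.

```shell
# Hypothetical helper (not part of rust-fil-proofs): run a command N times
# and report how many of those runs exited with a non-zero status.
repeat_run() {
    runs="$1"   # number of repetitions
    shift       # the remaining arguments are the command to run
    failures=0
    i=0
    while [ "$i" -lt "$runs" ]; do
        i=$((i + 1))
        # The command's own output is discarded; only its exit status counts.
        if ! "$@" >/dev/null 2>&1; then
            failures=$((failures + 1))
        fi
    done
    echo "$failures/$runs runs failed"
    # Return success only if every run passed.
    [ "$failures" -eq 0 ]
}
```

As an example, `repeat_run 20 cargo test --release --features cuda-supraseal --test api test_seal_lifecycle_upgrade_2kib_base_8 -- --ignored` would repeat the single-test invocation used elsewhere in this thread; the count of 20 is arbitrary.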

vmx commented 10 months ago

This sounds like https://github.com/supranational/supra_seal/issues/32. Can you please check your kernel and GCC versions (and possibly try other ones)?
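A small sketch of how the toolchain versions could be collected in one go, so they can be pasted into the issue. The function name `report_versions` is hypothetical; it only assumes that each tool accepts a `--version` flag, which holds for `gcc` and `nvcc`.

```shell
# Hypothetical helper: print the --version output of each named tool,
# or note that the tool is not installed.
report_versions() {
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            printf '== %s ==\n' "$tool"
            "$tool" --version 2>&1
        else
            printf '%s: not found\n' "$tool"
        fi
    done
}
```

Running `uname -a` followed by `report_versions gcc nvcc` would cover the kernel, GCC, and CUDA compiler versions asked about here.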

vmx commented 10 months ago

For me this works:

$ RUST_LOG=trace cargo test --release --features cuda-supraseal --test api test_seal_lifecycle_upgrade_2kib_base_8 -- --ignored
…
test test_seal_lifecycle_upgrade_2kib_base_8 ... ok

test result: ok. 1 passed; 0 failed; 0 ignored; 0 measured; 31 filtered out; finished in 11.33s

I'm on a machine (worker-gpu-7) with:

$ uname -a
Linux worker-gpu-7 5.4.0-94-generic #106-Ubuntu SMP Thu Jan 6 23:58:14 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
$ gcc --version
gcc (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0
$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Tue_Feb__7_19:32:13_PST_2023
Cuda compilation tools, release 12.1, V12.1.66
Build cuda_12.1.r12.1/compiler.32415258_0

It also works flawlessly on miner-2:

$ uname -a
Linux miner-2 5.4.0-144-generic #161-Ubuntu SMP Fri Feb 3 14:49:04 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
$ gcc --version
gcc (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0
$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Tue_May__3_18:49:52_PDT_2022
Cuda compilation tools, release 11.7, V11.7.64
Build cuda_11.7.r11.7/compiler.31294372_0

vmx commented 9 months ago

I think we've attributed this to an environment issue, so I'm closing it. If it still needs more attention, feel free to re-open it.