filecoin-project / lotus

Reference implementation of the Filecoin protocol, written in Go
https://lotus.filecoin.io/

checkCommit sanity check error on calibration network #2601

Closed by kernelogic 4 years ago

kernelogic commented 4 years ago

I have been getting these errors since moving to the new calibration network. The machine was working fine on the old testnet.

2020-07-25T13:40:11.225 INFO filcrypto::proofs::api > verify_seal: start
2020-07-25T13:40:11.230 INFO filecoin_proofs::api::seal > verify_seal:start
2020-07-25T13:40:11.233 INFO filecoin_proofs::caches > trying parameters memory cache for: STACKED[34359738368]-verifying-key
2020-07-25T13:40:11.233 INFO filecoin_proofs::caches > no params in memory cache for STACKED[34359738368]-verifying-key
2020-07-25T13:40:11.236 INFO storage_proofs_core::parameter_cache > parameter set identifier for cache: layered_drgporep::PublicParams{ graph: stacked_graph::StackedGraph{expansion_degree: 8 base_graph: drgraph::BucketGraph{size: 1073741824; degree: 6; hasher: poseidon_hasher} }, challenges: LayerChallenges { layers: 11, max_count: 18 }, tree: merkletree-poseidon_hasher-8-8-0 }
2020-07-25T13:40:11.237 INFO storage_proofs_core::parameter_cache > ensuring that all ancestor directories for: "/var/tmp/filecoin-proof-parameters/v27-stacked-proof-of-replication-merkletree-poseidon_hasher-8-8-0-sha256_hasher-82a357d2f2ca81dc61bb45f4a762807aedee1b0a53fd6c4e77b46a01bfef7820.vk" exist
2020-07-25T13:40:11.237 INFO storage_proofs_core::parameter_cache > checking cache_path: "/var/tmp/filecoin-proof-parameters/v27-stacked-proof-of-replication-merkletree-poseidon_hasher-8-8-0-sha256_hasher-82a357d2f2ca81dc61bb45f4a762807aedee1b0a53fd6c4e77b46a01bfef7820.vk" for verifying key
2020-07-25T13:40:11.307 INFO storage_proofs_core::parameter_cache > read verifying key from cache "/var/tmp/filecoin-proof-parameters/v27-stacked-proof-of-replication-merkletree-poseidon_hasher-8-8-0-sha256_hasher-82a357d2f2ca81dc61bb45f4a762807aedee1b0a53fd6c4e77b46a01bfef7820.vk"
2020-07-25T13:40:11.308 INFO filecoin_proofs::api::seal > got verifying key (34359738368) while verifying seal
2020-07-25T13:40:11.341 INFO filecoin_proofs::api::seal > verify_seal:finish
2020-07-25T13:40:11.341 INFO filcrypto::proofs::api > verify_seal: finish
2020-07-25T13:40:11.352-0700 WARN sectors storage-fsm@v0.0.0-20200720190000-2cfe2fe3c334/checks.go:145 on-chain sealed CID doesn't match!
2020-07-25T13:40:11.353 INFO filcrypto::proofs::api > verify_seal: start
2020-07-25T13:40:11.353 INFO filecoin_proofs::api::seal > verify_seal:start
2020-07-25T13:40:11.353 INFO filecoin_proofs::caches > trying parameters memory cache for: STACKED[34359738368]-verifying-key
2020-07-25T13:40:11.353 INFO filecoin_proofs::caches > found params in memory cache for STACKED[34359738368]-verifying-key
2020-07-25T13:40:11.353 INFO filecoin_proofs::api::seal > got verifying key (34359738368) while verifying seal
2020-07-25T13:40:11.353 INFO filcrypto::proofs::api > verify_seal: finish
2020-07-25T13:40:11.354-0700 ERROR sectors storage-fsm@v0.0.0-20200720190000-2cfe2fe3c334/fsm.go:26 unhandled sector error (0): checkCommit sanity check error:
    github.com/filecoin-project/storage-fsm.(*Sealing).handleCommitFailed
        /home/dev/go/pkg/mod/github.com/filecoin-project/storage-fsm@v0.0.0-20200720190000-2cfe2fe3c334/states_failed.go:184
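
In a paste like the one above, the WARN and ERROR entries are easy to miss among the INFO lines. The following is a minimal sketch for pulling them out, assuming the entry layout seen in this excerpt (timestamp, optional UTC offset, level, message). The function name `warnings_and_errors` is hypothetical and not part of lotus; it splits on timestamps, so it works whether the entries are on separate lines or run together.

```python
import re

# Timestamp as it appears in the excerpt, with an optional UTC offset
# such as "-0700" (assumption: other log formats are not handled).
TS = r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d+(?:[+-]\d{4})?"
LEVELS = r"(?:INFO|WARN|ERROR)"

# One entry runs from a timestamp up to the next timestamp+level pair
# (or to the end of the text).
ENTRY_RE = re.compile(
    rf"(?P<ts>{TS})\s+(?P<level>{LEVELS})\s+(?P<msg>.*?)"
    rf"(?=\s*{TS}\s+{LEVELS}\s|\Z)",
    re.DOTALL,
)


def warnings_and_errors(log_text):
    """Return (timestamp, level, message) for each WARN or ERROR entry."""
    return [
        # Collapse internal whitespace so multi-line traces become one string.
        (m.group("ts"), m.group("level"), " ".join(m.group("msg").split()))
        for m in ENTRY_RE.finditer(log_text)
        if m.group("level") in ("WARN", "ERROR")
    ]
```

Run against the log above, this should surface the `on-chain sealed CID doesn't match!` warning and the `checkCommit sanity check error` that follows it.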

magik6k commented 4 years ago

Did you overclock your RAM / have XMP enabled? If yes, can you try disabling it and check if that helps?

zhiwei-w-luo commented 4 years ago

I get this error on the calibration network while the miner machine is committing:

2020-07-27T21:24:23.395 INFO filcrypto::proofs::api > verify_seal: finish
2020-07-27T21:24:23.395+0800 ERROR sectors storage-fsm@v0.0.0-20200720190000-2cfe2fe3c334/fsm.go:26 unhandled sector error (4): checkCommit sanity check error:
    github.com/filecoin-project/storage-fsm.(*Sealing).handleCommitFailed
        /root/go/pkg/mod/github.com/filecoin-project/storage-fsm@v0.0.0-20200720190000-2cfe2fe3c334/states_failed.go:184

RobQuistNL commented 4 years ago

Could you also share the output of these commands?

sudo lshw -C memory
sudo lshw -C cpu
sudo dmidecode -t 2

They show the model numbers and other hardware information for your CPU, motherboard, and RAM.

zhiwei-w-luo commented 4 years ago

sudo lshw -C cpu

  *-cpu
       description: CPU
       product: AMD Ryzen 9 3950X 16-Core Processor
       vendor: Advanced Micro Devices [AMD]
       physical id: 34
       bus info: cpu@0
       version: AMD Ryzen 9 3950X 16-Core Processor
       serial: Unknown
       slot: AM4
       size: 2028MHz
       capacity: 3500MHz
       width: 64 bits
       clock: 100MHz

root@ubuntu-System-Product-Name:~# sudo dmidecode -t 2

# dmidecode 3.2
Getting SMBIOS data from sysfs.
SMBIOS 3.2.0 present.

Handle 0x0002, DMI type 2, 15 bytes
Base Board Information
    Manufacturer: ASUSTeK COMPUTER INC.
    Product Name: PRIME X570-P
    Version: Rev X.0x
    Serial Number: 200468461001035
    Asset Tag: Default string
    Features:
        Board is a hosting board
        Board is replaceable
    Location In Chassis: Default string
    Chassis Handle: 0x0003
    Type: Motherboard
    Contained Object Handles: 0

  *-firmware
       description: BIOS
       vendor: American Megatrends Inc.
       physical id: 0
       version: 2407
       date: 07/01/2020
       size: 64KiB
       capacity: 16MiB
       capabilities: pci apm upgrade shadowing cdboot bootselect socketedrom edd int13floppy1200 int13floppy720 int13floppy2880 int5printscreen int9keyboard int14serial int17printer acpi usb biosbootspecification uefi
  *-memory
       description: System Memory
       physical id: 2e
       slot: System board or motherboard
       size: 128GiB

kernelogic commented 4 years ago

Will disable XMP and try again.

zhiwei-w-luo commented 4 years ago

My machine uses an ASUS motherboard with D.O.C.P. disabled, and the error still appears.

kernelogic commented 4 years ago

Turning off XMP solved this issue. It is still odd that XMP needs to be off, though. Isn't it designed to be stable?

whyrusleeping commented 4 years ago

@kernelogic That would imply your overclock is not stable. What processor do you have? AMD CPUs are notoriously picky about memory.

yangjian102621 commented 4 years ago

@whyrusleeping Hi, we (who reported this bug on Slack just now) are still getting this error.

Our architecture is 1 miner + 10 P1 workers + 30 P2C2 workers.

The miner and the P2C2 workers have the same setup, except that the miner has 512GB of RAM.

More detailed information for this setup:

lshw -C memory

 *-firmware                
       description: BIOS
       vendor: American Megatrends Inc.
       physical id: 0
       version: 2.0b
       date: 07/26/2017
       size: 64KiB
       capacity: 15MiB
       capabilities: pci upgrade shadowing cdboot bootselect socketedrom edd int13floppy1200 int13floppy720 int13floppy2880 int5printscreen int9keyboard int14serial int17printer acpi usb biosbootspecification uefi
  *-memory:0
       description: System Memory
       physical id: 2b
       slot: System board or motherboard
       capabilities: ecc
       configuration: errordetection=multi-bit-ecc
     *-bank:0
          description: DIMM DDR4 Synchronous 2933 MHz (0.3 ns)
          product: 36ASF4G72PZ-2G9E2
          vendor: Micron
          physical id: 0
          serial: 26C90311
          slot: P1-DIMMA1
          size: 32GiB
          width: 64 bits
          clock: 2933MHz (0.3ns)
     *-bank:1
          description: DIMM DDR4 Synchronous 2933 MHz (0.3 ns)
          product: 36ASF4G72PZ-2G9E2
          vendor: Micron
          physical id: 1
          serial: 26C9031A
          slot: P1-DIMMA2
          size: 32GiB
          width: 64 bits
          clock: 2933MHz (0.3ns)
     *-bank:2
          description: DIMM DDR4 Synchronous 2933 MHz (0.3 ns)
          product: 36ASF4G72PZ-2G9E2
          vendor: Micron
          physical id: 2
          serial: 26C8E1E0
          slot: P1-DIMMB1
          size: 32GiB
          width: 64 bits
          clock: 2933MHz (0.3ns)
     *-bank:3
          description: DIMM DDR4 Synchronous 2400 MHz (0.4 ns)
          product: M393A4K40BB1-CRC
          vendor: Samsung
          physical id: 3
          serial: 365F159F
          slot: P1-DIMMB2
          size: 32GiB
          width: 64 bits
          clock: 2400MHz (0.4ns)

lshw -C cpu

*-cpu:0                   
       description: CPU
       product: Intel(R) Xeon(R) CPU E5-2683 v4 @ 2.10GHz
       vendor: Intel Corp.
       physical id: 57
       bus info: cpu@0
       version: Intel(R) Xeon(R) CPU E5-2683 v4 @ 2.10GHz
       slot: CPU1
       size: 1200MHz
       capacity: 4GHz
       width: 64 bits
       clock: 100MHz
       capabilities: x86-64 fpu fpu_exception wp vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cdp_l3 invpcid_single pti intel_ppin ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm rdt_a rdseed adx smap intel_pt xsaveopt cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts md_clear flush_l1d cpufreq
       configuration: cores=16 enabledcores=16 threads=32

sudo dmidecode -t 2

Base Board Information
    Manufacturer: Powerleader
    Product Name: X10DRG-Q
    Version: 1.10
    Serial Number: VM17BS018252
    Asset Tag: Default string
    Features:
        Board is a hosting board
        Board is replaceable
    Location In Chassis: Default string
    Chassis Handle: 0x0003
    Type: Motherboard
    Contained Object Handles: 0

P1 worker setup

More detailed information for the AMD P1 worker:

lshw -C memory

*-firmware                
       description: BIOS
       vendor: American Megatrends Inc.
       physical id: 0
       version: V3.00
       date: 03/05/2020
       size: 64KiB
       capacity: 15MiB
       capabilities: pci upgrade shadowing cdboot bootselect socketedrom edd int5printscreen int14serial int17printer acpi usb biosbootspecification uefi
  *-memory
       description: System Memory
       physical id: 1f
       slot: System board or motherboard
       size: 512GiB
       capacity: 2TiB
       capabilities: ecc
       configuration: errordetection=multi-bit-ecc
     *-bank:0
          description: DIMM DDR4 Synchronous LRDIMM 2667 MHz (0.4 ns)
          product: 72ASS8G72LZ-2G6D2
          vendor: Micron
          physical id: 0
          serial: 1FDA97D9
          slot: P0_UMC0_CH_A0
          size: 64GiB
          width: 64 bits
          clock: 2667MHz (0.4ns)

lshw -C cpu

*-cpu                     
       description: CPU
       product: AMD EPYC 7262 8-Core Processor
       vendor: Advanced Micro Devices [AMD]
       physical id: 25
       bus info: cpu@0
       version: AMD EPYC 7262 8-Core Processor
       serial: Unknown
       slot: P0
       size: 1496MHz
       capacity: 3400MHz
       width: 64 bits
       clock: 100MHz
       capabilities: x86-64 fpu fpu_exception wp vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp constant_tsc rep_good nopl xtopology nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate sme ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local clzero irperf xsaveerptr arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif umip rdpid overflow_recov succor smca cpufreq
       configuration: cores=8 enabledcores=8 threads=16

dmidecode -t 2

Base Board Information
    Manufacturer: TYAN
    Product Name: S8030GM2NE
    Version: empty
    Serial Number: CXZE2CK1701G
    Asset Tag: empty
    Features:
        Board is a hosting board
        Board is removable
        Board is replaceable
    Location In Chassis: empty
    Chassis Handle: 0x0003
    Type: Motherboard
    Contained Object Handles: 0

We've tried more than 5 times and still get this error frequently. As a result, we have more than 100 machines unable to join the calibration network.
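
Since the thread keeps coming back to memory configuration, one thing visible in the first `lshw -C memory` listing above is that bank:3 reports a 2400 MHz Samsung DIMM while banks 0 to 2 report 2933 MHz Micron DIMMs. Whether that is related to the error is not established here. The following is only a minimal sketch for surfacing such mismatches across many machines, assuming the `*-bank:N` / `vendor:` / `clock:` layout shown in the listings; the function names are hypothetical.

```python
import re
from collections import Counter


def dimm_speeds(lshw_memory_text):
    """Parse `lshw -C memory` text into a list of (bank, vendor, clock_mhz)."""
    banks = []
    current = None
    for raw in lshw_memory_text.splitlines():
        line = raw.strip()
        m = re.match(r"\*-bank:(\d+)$", line)
        if m:
            current = {"bank": int(m.group(1)), "vendor": None, "clock": None}
            banks.append(current)
            continue
        if line.startswith("*-"):
            current = None  # a non-bank section such as *-firmware or *-memory
            continue
        if current is None:
            continue
        if line.startswith("vendor:"):
            current["vendor"] = line.split(":", 1)[1].strip()
        m = re.match(r"clock:\s*(\d+)MHz", line)
        if m:
            current["clock"] = int(m.group(1))
    return [(b["bank"], b["vendor"], b["clock"]) for b in banks]


def mixed_speed_banks(banks):
    """Return the banks whose clock differs from the most common clock."""
    clocks = Counter(clock for _, _, clock in banks if clock is not None)
    if len(clocks) <= 1:
        return []
    common, _ = clocks.most_common(1)[0]
    return [b for b in banks if b[2] is not None and b[2] != common]
```

For the first listing, `mixed_speed_banks(dimm_speeds(text))` should report only bank 3. The AMD P1 worker listing shows a single bank, so nothing would be flagged there.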