AleoNet / snarkOS

A Decentralized Operating System for ZK Applications
http://snarkos.org
Apache License 2.0

[Bug] Illegal instruction (core dumped) when run #707

Closed xoptov closed 2 years ago

xoptov commented 3 years ago

🐛 Bug Report

Fatal error!

Illegal instruction (core dumped)

Steps to Reproduce

Just run snarkos in console

Expected Behavior

Exit with the message: Illegal instruction (core dumped)

Your Environment

Branch: staging 76b40b6 Merge pull request #700 from ljedrz/post_mining_thread
Rust: 1.51.0 (2fd73fabe 2021-03-23)
OS: Ubuntu 20.04

xoptov commented 3 years ago

image

ljedrz commented 3 years ago

Hmm, it's as if the processor didn't understand one of the compiled instructions, which prompts the following additional questions:

  1. Was snarkOS built in the environment it was launched from?
  2. Is this some virtual machine?

weikengchen commented 3 years ago

Today in the Discord channel someone also mentioned this issue.

This issue may be related to the fact that our current releases are built with some CPU instruction sets in mind.

Using this tool, https://github.com/pkgw/elfx86exts, we can see that the macOS release uses the following instruction sets:

MODE64 (push)
CMOV (cmova)
AVX (vmovdqu)
NOVLX (vpand)

We can see that the Linux release uses the following instruction sets:

MODE64 (call)
AVX (vmovups)
NOVLX (vmovups)
CMOV (cmovb)
BMI (tzcnt)
BMI2 (mulx)
SSE2 (mfence)
AVX2 (vpmovmskb)
SSE41 (pinsrq)
SSE1 (movups)
PCLMUL (vpclmulqdq)
SHA (sha256rnds2)
AES (vaesenc)
SSE3 (movddup)
SSSE3 (pshufb)
AVX512 (vmovdqu32)
SSE42 (pcmpgtq)

This seems like too many, and it also does not follow Rust's convention: for compatibility, a release build does not use recent instruction sets unless specifically asked to.
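
As a rough illustration of that default, the sketch below prints which instruction sets rustc was allowed to use for the crate being compiled. The function name `compile_time_features` and the choice of features are only illustrative assumptions; `cfg!(target_feature = "...")` itself is standard Rust. It reports nothing about C/C++ code compiled by build scripts.

```rust
// Sketch: list which x86 instruction sets rustc was allowed to use when
// compiling *this* crate. `cfg!(target_feature = "...")` is resolved at
// compile time, so it reflects the build flags, not the CPU the program
// later runs on, and not any C/C++ dependency.
// `compile_time_features` is a hypothetical helper name.
fn compile_time_features() -> Vec<(&'static str, bool)> {
    vec![
        ("sse2", cfg!(target_feature = "sse2")),
        ("avx", cfg!(target_feature = "avx")),
        ("avx2", cfg!(target_feature = "avx2")),
        ("bmi2", cfg!(target_feature = "bmi2")),
    ]
}

fn main() {
    for (name, enabled) in compile_time_features() {
        println!("{name}: {enabled}");
    }
}
```

With default flags on x86_64 one would expect only the baseline sets (such as sse2) to be reported as enabled; building with something like `RUSTFLAGS="-C target-cpu=native"` should instead reflect the build machine.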


To investigate how this occurs, I compiled snarkOS on an AWS Ubuntu 20.04 instance.

This instance has the following CPU flags: "fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology cpuid pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm cpuid_fault invpcid_single pti fsgsbase bmi1 avx2 smep bmi2 erms invpcid xsaveopt"

I compiled as in https://github.com/AleoHQ/snarkOS/blob/staging/.github/workflows/release.yml. The instruction sets of the compiled program are:

MODE64 (call)
AVX (vmovups)
NOVLX (vmovups)
CMOV (cmovb)
BMI (tzcnt)
BMI2 (mulx)
SSE2 (mfence)
AVX2 (vpmovmskb)
SSE41 (pinsrq)
SSE1 (movups)
PCLMUL (vpclmulqdq)
SHA (sha256rnds2)
AES (vaesenc)
SSE3 (movddup)
SSSE3 (pshufb)
AVX512 (vmovdqu32)
SSE42 (pcmpgtq)
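
To compare such listings between builds, the output can be reduced to a plain list of set names. The following is a minimal sketch, assuming only that each entry has the form `NAME (mnemonic)` as in the listings in this thread; `parse_instruction_sets` is a hypothetical helper and not part of elfx86exts.

```rust
/// Extract instruction-set names from elfx86exts-style output, where each
/// entry looks like `NAME (mnemonic)`. Any other text (cargo status lines,
/// "CPU Generation: ...") is ignored.
/// Hypothetical helper; the entry format is assumed from the listings above.
fn parse_instruction_sets(output: &str) -> Vec<String> {
    let tokens: Vec<&str> = output.split_whitespace().collect();
    let mut sets = Vec::new();
    // Look at each adjacent pair of tokens: a name followed by "(mnemonic)".
    for pair in tokens.windows(2) {
        let (name, mnemonic) = (pair[0], pair[1]);
        let is_mnemonic =
            mnemonic.len() > 2 && mnemonic.starts_with('(') && mnemonic.ends_with(')');
        if is_mnemonic && !name.starts_with('(') {
            sets.push(name.to_string());
        }
    }
    sets
}

fn main() {
    // The macOS listing quoted earlier in this comment.
    let sample = "MODE64 (push) CMOV (cmova) AVX (vmovdqu) NOVLX (vpand)";
    println!("{:?}", parse_instruction_sets(sample));
}
```

Because it works on whitespace-separated tokens, the same function accepts the listing whether the entries are on one line or one per line.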

Reading the intermediate build log, it seems that some library, not Rust itself, produces machine-dependent code.

One possibility is libsodium, which we use somewhere in the build (snarkos-network => snow => sodiumoxide => libsodium-sys => libsodium). This library is built directly with libsodium's ./configure, which uses as many instruction sets as possible and is beyond Rust's control.


Now I want to reproduce this problem and see if it has something to do with instruction sets.

AWS still has a few little-known old machine types.

I use a c1.xlarge, which has only flags "fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx rdtscp lm constant_tsc rep_good nopl cpuid pni pclmulqdq ssse3 cx16 pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes rdrand hypervisor lahf_lm cpuid_fault pti fsgsbase smep erms"

It outputs Illegal instruction (core dumped).

The c3.xlarge instance is the same. It has flags "fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx rdtscp lm constant_tsc rep_good nopl xtopology cpuid pni pclmulqdq ssse3 cx16 pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm cpuid_fault pti fsgsbase smep erms xsaveopt"

It outputs Illegal instruction (core dumped) as well.

Note that neither machine has BMI2 or AVX2.

However, one of my computers has BMI2 and AVX2 but not SHA or AVX512, and snarkOS runs smoothly on it. This suggests that not all of the flags are needed for snarkOS to run, although that is not certain: it might still crash later.
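
The manual comparison above (flags reported by the machine versus instruction sets found in the binary) can be sketched as follows. The mapping from elfx86exts names to /proc/cpuinfo flag names is an assumption made for illustration and is deliberately incomplete; `cpuinfo_flag` and `missing_sets` are hypothetical helpers.

```rust
use std::collections::HashSet;

/// Assumed mapping from an elfx86exts set name to the corresponding
/// /proc/cpuinfo flag. Incomplete on purpose; unknown names return None.
fn cpuinfo_flag(set: &str) -> Option<&'static str> {
    match set {
        "AVX" => Some("avx"),
        "AVX2" => Some("avx2"),
        "BMI" => Some("bmi1"),
        "BMI2" => Some("bmi2"),
        "SSE41" => Some("sse4_1"),
        "SSE42" => Some("sse4_2"),
        "AES" => Some("aes"),
        "SHA" => Some("sha_ni"),
        "AVX512" => Some("avx512f"),
        _ => None,
    }
}

/// Return the required sets whose flag is absent from `cpu_flags`.
/// Sets without a known mapping are skipped rather than reported.
fn missing_sets<'a>(cpu_flags: &str, required: &[&'a str]) -> Vec<&'a str> {
    // Split on whitespace so that "avx" never matches inside "avx2".
    let flags: HashSet<&str> = cpu_flags.split_whitespace().collect();
    required
        .iter()
        .copied()
        .filter(|set| match cpuinfo_flag(set) {
            Some(flag) => !flags.contains(flag),
            None => false,
        })
        .collect()
}

fn main() {
    // Excerpt of the c3.xlarge flags quoted above.
    let c3_flags = "sse sse2 pni pclmulqdq ssse3 sse4_1 sse4_2 popcnt aes xsave avx f16c rdrand";
    let required = ["AVX", "AVX2", "BMI", "BMI2", "SSE42", "SHA"];
    println!("missing: {:?}", missing_sets(c3_flags, &required));
}
```

A non-empty result only says that the binary contains instructions the CPU lacks. Whether those instructions are ever executed is a separate question, which is consistent with the machine without SHA/AVX512 running fine.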


Conclusions:

  1. Our Linux build seems to use too many instruction sets. This does not seem to be due to Rust, but to the C/C++ libraries that Rust calls into, which are built separately.

  2. Not all of the instruction sets are needed for the code to run correctly. It seems that AVX2 or BMI2 may be the ones causing the errors.

  3. There are likely many ways to handle this issue. One option is to not use the libsodium-resolver feature in snow. It seems that snarkos-network may do fine without libsodium.

  4. The macOS build is fine, which might simply be because GitHub's macOS build machine is slightly old.

I will follow up and see if removing the libsodium-resolver feature in snow works. If so, I will open a PR.
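
To check conclusion 2 on a particular machine, a small program can ask the CPU at run time which of the suspected sets it supports. This is only a sketch: `runtime_features` is a hypothetical helper, while `is_x86_feature_detected!` is the standard library macro for this check on x86/x86_64.

```rust
// Sketch: query the running CPU for the instruction sets discussed in
// this thread. Must be built without those sets enabled, otherwise this
// program itself could hit an illegal instruction on an older CPU.
#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
fn runtime_features() -> Vec<(&'static str, bool)> {
    vec![
        ("avx", std::is_x86_feature_detected!("avx")),
        ("avx2", std::is_x86_feature_detected!("avx2")),
        ("bmi1", std::is_x86_feature_detected!("bmi1")),
        ("bmi2", std::is_x86_feature_detected!("bmi2")),
        ("sha", std::is_x86_feature_detected!("sha")),
        ("avx512f", std::is_x86_feature_detected!("avx512f")),
    ]
}

// On other architectures there is nothing to report.
#[cfg(not(any(target_arch = "x86", target_arch = "x86_64")))]
fn runtime_features() -> Vec<(&'static str, bool)> {
    Vec::new()
}

fn main() {
    for (name, present) in runtime_features() {
        println!("{name}: {}", if present { "yes" } else { "no" });
    }
}
```

On the c1.xlarge and c3.xlarge instances described above, one would expect avx2 and bmi2 to be reported as missing.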

@damons

ljedrz commented 3 years ago

@weikengchen great analysis! Since snow 0.8 we might be able to use the default resolver instead, which should speed up compilation too, but I haven't tried doing it yet.

weikengchen commented 3 years ago

I need more testing. For now, I know that just removing the feature libsodium-resolver is not sufficient, since snarkos-network also explicitly uses the sodium resolver.

I need to first confirm whether getting rid of libsodium-resolver helps with how many CPU instruction sets are used.

ljedrz commented 3 years ago

Yes, libsodium-resolver is used precisely in the network code, so removing it would require the handshake code to be adjusted. Please confirm if this is the source of the extra instruction sets, and if it is, I'll try changing the snow resolver for the network.

weikengchen commented 3 years ago

Based on my local test, maybe libsodium is not the problem. Now I am suspecting a different one: librocksdb. Let me take a look at that one first...

weikengchen commented 3 years ago

This is because of the following output, which suggests that AVX2/BMI2 may come from librocksdb:

 cargo run -- ~/snarkOS/target/release/build/librocksdb-sys-5d8e55aecf4c245c/build_script_build-5d8e55aecf4c245c
    Finished dev [unoptimized + debuginfo] target(s) in 0.01s
     Running `target/debug/elfx86exts /home/ubuntu/snarkOS/target/release/build/librocksdb-sys-5d8e55aecf4c245c/build_script_build-5d8e55aecf4c245c`
MODE64 (call)
CMOV (cmovae)
SSE1 (movups)
AVX (vmovdqu)
NOVLX (vmovaps)
BWI (vpcmpeqb)
VLX (vpcmpeqb)
BMI (tzcnt)
SSE2 (mfence)
AVX512 (vcvtusi2sd)
AVX2 (vpand)
BMI2 (mulx)
CPU Generation: Unknown

weikengchen commented 3 years ago

Haven't yet found a good solution (nor got rid of these flags)... it is possible that there is yet another library that has something to do with it. Dropping rocksdb => librocksdb-sys => snappy does not help.

weikengchen commented 3 years ago

Update on this story:

  1. A package-by-package analysis points to rocksdb: https://gist.github.com/weikengchen/333395caa9d02b4e576b1a20aafe8c4b

  2. This is an issue in the Rust bindings for rocksdb that has existed for at least two years.

  3. Paritytech encountered this issue previously. They forked rocksdb and implemented a portable feature.

  4. I will try to push a PR to rust-rocksdb, in the hope that a later version can fix it.

ljedrz commented 3 years ago

Judging by the changes in rocksdb 0.17, this issue might get solved when we switch to using that release. Reference: https://github.com/AleoHQ/snarkOS/pull/976.

ljedrz commented 3 years ago

This issue likely no longer applies due to the node not using rocksdb anymore.

ljedrz commented 2 years ago

The testnet2 node is using rocksdb again, but this problem hasn't been reported in a long while now, so the issue likely no longer applies.