cerc-io / stack-orchestrator

Read-only mirror of https://git.vdb.to/cerc-io/stack-orchestrator
https://git.vdb.to/cerc-io/stack-orchestrator
GNU Affero General Public License v3.0
29 stars 19 forks source link

Illegal instruction crash in lighthouse cli on some CPUs #416

Closed roysc closed 1 year ago

roysc commented 1 year ago

The build script for this image is intermittently hitting SIGILL on Github action runners when running lcli insecure-validators. Likely related to https://github.com/sigp/lighthouse/issues/1395 and if so should be fixed by using the portable docker images instead of the native ones.

AFDudley commented 1 year ago

This solution to this would be for us to use our own machines for building, I'm confused as to why you're using github for this. @dboreham what's stopping us from building this off github?

roysc commented 1 year ago

This is happening in CI jobs for ipld-eth-server, which hasn't been migrated to Gitea, but if we have CI infra ready on Gitea I can migrate it.

dboreham commented 1 year ago

We have the Gitea CI infra, but Gitea broke Actions for mirrored repos in the release we're running. The fix was merged to main yesterday. I'm working on making a patched build for Shane to deploy, since running mystery meat main branch code seems like not a good idea.

dboreham commented 1 year ago

As far as the root cause, I wonder if Lighthouse people never implemented their fix? (because if they did, we wouldn't see the illegal instruction exception -- we're using their only container image for lcli, built on only a few weeks ago and according to their commit history the problem was fixed in 2020).

Here: https://github.com/sigp/lighthouse/blob/c547a11b0da48db6fdd03bca2c6ce2448bbcc3a9/lcli/Dockerfile#L8 it provides for building a binary that runs on all x86 CPUs, but I can find no evidence that they actually set that (default seems to be to not build the portable binary).

We could build the lcli container ourselves with the flag turned on and see if that fixes the problem.

roysc commented 1 year ago

It's passed by default into the build for the lcli image: https://github.com/sigp/lighthouse/blob/c547a11b0da48db6fdd03bca2c6ce2448bbcc3a9/.github/workflows/docker.yml#L145

And as part of the target suffix building the lighthouse image, bit convoluted https://github.com/sigp/lighthouse/blob/c547a11b0da48db6fdd03bca2c6ce2448bbcc3a9/.github/workflows/docker.yml#L78

dboreham commented 1 year ago

Presumably they've introduced another version of the problem somehow? We should be running the right binary already.

roysc commented 1 year ago

Yeah it doesn't quite make sense. I was speculating it could be an issue of shared libraries, since we are running lcli on the non-portable lighthouse container - but Rust would be statically linked by default.

Fortunately this doesn't happen super often, so we could probably punt on this until there's more clarity.

dboreham commented 1 year ago

What makes you suspect we're not using the portable lighthouse container? This is the container we're using: https://github.com/cerc-io/stack-orchestrator/blob/main/app/data/container-build/cerc-lighthouse/Dockerfile#L1

dboreham commented 1 year ago

Fortunately this doesn't happen super often, so we could probably punt on this until there's more clarity.

Presumably it happens when GH picks a runner that has a funky CPU for the job.

roysc commented 1 year ago

Presumably it happens when GH picks a runner that has a funky CPU for the job.

Right, that's what I figured

What makes you suspect we're not using the portable lighthouse container?

From my reading of their docker actions, the -modern suffix is used for the non-portable builds https://github.com/sigp/lighthouse/blob/c547a11b0da48db6fdd03bca2c6ce2448bbcc3a9/.github/workflows/docker.yml#L96

dboreham commented 1 year ago

Ah, so we want "not modern" to fix this problem. I was thinking the other way around. All pretty frustrating since it's due to someone not understanding how to ship a binary that selects its vectorized code according to the CPU it's actually running on.

If "modern" has worked for us so far, perhaps you're right in aiming to avoid running on an old CPU in CI as the optimal solution. Back to Gitea...

dboreham commented 1 year ago

Fix for the Gitea Actions in mirrored repos bug has been merged into the 1.19 branch, so we can deploy (and get CI working again on git/vdb.to).

https://github.com/cerc-io/hosting/issues/40

dboreham commented 1 year ago

lol this is from our Gitea CI just now:

./build_cl.sh: line 48:    63 Illegal instruction     (core dumped) lcli insecure-validators --count $VALIDATOR_COUNT --base-dir $DATADIR --node-count $BN_COUNT
make: *** [Makefile:9: genesis-cl] Error 132
dboreham commented 1 year ago

CPU in the case above is: https://www.intel.com/content/www/us/en/products/sku/64589/intel-xeon-processor-e52667-15m-cache-2-90-ghz-8-00-gts-intel-qpi/specifications.html Has only AVX.

I will see if I can figure out exactly what is going on, since the lcli binary is supposedly compiled for a generic x86-64 target.

dboreham commented 1 year ago

This is now happening consistently on our (resurrected) CI infra, e.g. : https://git.vdb.to/cerc-io/stack-orchestrator/actions/runs/276

Going to see if I can get a VM on the same host to debug on.

dboreham commented 1 year ago

Reproduced by extracting the lcli binary from the latest container image then running it with this command on an Atom CPU:

$  ./lcli insecure-validators --count 4 --base-dir $(pwd) --node-count 4
Validator 1
Illegal instruction (core dumped)
dboreham commented 1 year ago

I built the container myself, to check if there is some build provenance issue with the image. Still fails:

$ ./lcli-test insecure-validators --count 4 --base-dir $(pwd) --node-count 4
Validator 1
Illegal instruction (core dumped)
dboreham commented 1 year ago

So the problem is that the "portable" container build is in fact not portable.

dboreham commented 1 year ago

It turns out that their PORTABLE flag is not implemented:

$ grep -R PORTABLE .
./.github/workflows/docker.yml:                      --build-arg PORTABLE=true \
./lcli/Dockerfile:ARG PORTABLE
./lcli/Dockerfile:ENV PORTABLE $PORTABLE

The lighthouse (not lcli) build has targets of the form xxxx-portable, which likely are implemented properly.

The result is that there is no such thing as a portable/non-modern lcli build.

dboreham commented 1 year ago

Filed bug against lighthouse: https://github.com/sigp/lighthouse/issues/4370

dboreham commented 1 year ago

The most obvious solution is for us to build our own lcli container image, which we need to do for ARM support anyway.

dboreham commented 1 year ago

https://github.com/sigp/lighthouse/blob/c547a11b0da48db6fdd03bca2c6ce2448bbcc3a9/lcli/Cargo.toml#L9

dboreham commented 1 year ago

https://github.com/supranational/blst/blob/master/bindings/rust/Cargo.toml#LL32C14-L32C14

dboreham commented 1 year ago

https://github.com/supranational/blst/blob/2f2ce71b71b00a97fe5a449c591817c023ea44d5/bindings/rust/build.rs#L107

dboreham commented 1 year ago

With this fix, the container is built with portable binaries:

diff --git a/lcli/Dockerfile b/lcli/Dockerfile
index 98f33f215..485ebfd5b 100644
--- a/lcli/Dockerfile
+++ b/lcli/Dockerfile
@@ -6,7 +6,7 @@ RUN apt-get update && apt-get -y upgrade && apt-get install -y cmake libclang-de
 COPY . lighthouse
 ARG PORTABLE
 ENV PORTABLE $PORTABLE
-RUN cd lighthouse && make install-lcli
+RUN if [ "$PORTABLE" = true ] ; then export FEATURES=portable; fi && cd lighthouse && make install-lcli

 FROM ubuntu:22.04
 RUN apt-get update && apt-get -y upgrade && apt-get clean && rm -rf /var/lib/apt/lists/*
dboreham commented 1 year ago

Now we're building our own lcli container this issue should be resolved. Closing.