Closed roysc closed 1 year ago
This solution to this would be for us to use our own machines for building, I'm confused as to why you're using github for this. @dboreham what's stopping us from building this off github?
This is happening in CI jobs for ipld-eth-server, which hasn't been migrated to Gitea, but if we have CI infra ready on Gitea I can migrate it.
We have the Gitea CI infra, but Gitea broke Actions for mirrored repos in the release we're running. The fix was merged to main yesterday. I'm working on making a patched build for Shane to deploy, since running mystery meat main branch code seems like not a good idea.
As far as the root cause, I wonder if Lighthouse people never implemented their fix? (because if they did, we wouldn't see the illegal instruction exception -- we're using their only container image for lcli, built on only a few weeks ago and according to their commit history the problem was fixed in 2020).
Here: https://github.com/sigp/lighthouse/blob/c547a11b0da48db6fdd03bca2c6ce2448bbcc3a9/lcli/Dockerfile#L8 it provides for building a binary that runs on all x86 CPUs, but I can find no evidence that they actually set that (default seems to be to not build the portable binary).
We could build the lcli container ourselves with the flag turned on and see if that fixes the problem.
It's passed by default into the build for the lcli image: https://github.com/sigp/lighthouse/blob/c547a11b0da48db6fdd03bca2c6ce2448bbcc3a9/.github/workflows/docker.yml#L145
And as part of the target suffix building the lighthouse image, bit convoluted https://github.com/sigp/lighthouse/blob/c547a11b0da48db6fdd03bca2c6ce2448bbcc3a9/.github/workflows/docker.yml#L78
Presumably they've introduced another version of the problem somehow? We should be running the right binary already.
Yeah it doesn't quite make sense. I was speculating it could be an issue of shared libraries, since we are running lcli on the non-portable lighthouse
container - but Rust would be statically linked by default.
Fortunately this doesn't happen super often, so we could probably punt on this until there's more clarity.
What makes you suspect we're not using the portable lighthouse container? This is the container we're using: https://github.com/cerc-io/stack-orchestrator/blob/main/app/data/container-build/cerc-lighthouse/Dockerfile#L1
Fortunately this doesn't happen super often, so we could probably punt on this until there's more clarity.
Presumably it happens when GH picks a runner that has a funky CPU for the job.
Presumably it happens when GH picks a runner that has a funky CPU for the job.
Right, that's what I figured
What makes you suspect we're not using the portable lighthouse container?
From my reading of their docker actions, the -modern
suffix is used for the non-portable builds https://github.com/sigp/lighthouse/blob/c547a11b0da48db6fdd03bca2c6ce2448bbcc3a9/.github/workflows/docker.yml#L96
Ah, so we want "not modern" to fix this problem. I was thinking the other way around. All pretty frustrating since it's due to someone not understanding how to ship a binary that selects its vectorized code according to the CPU it's actually running on.
If "modern" has worked for us so far, perhaps you're right in aiming to avoid running on an old CPU in CI as the optimal solution. Back to Gitea...
Fix for the Gitea Actions in mirrored repos bug has been merged into the 1.19 branch, so we can deploy (and get CI working again on git/vdb.to).
lol this is from our Gitea CI just now:
./build_cl.sh: line 48: 63 Illegal instruction (core dumped) lcli insecure-validators --count $VALIDATOR_COUNT --base-dir $DATADIR --node-count $BN_COUNT
make: *** [Makefile:9: genesis-cl] Error 132
CPU in the case above is: https://www.intel.com/content/www/us/en/products/sku/64589/intel-xeon-processor-e52667-15m-cache-2-90-ghz-8-00-gts-intel-qpi/specifications.html Has only AVX.
I will see if I can figure out exactly what is going on, since the lcli binary is supposedly compiled for a generic x86-64 target.
This is now happening consistently on our (resurrected) CI infra, e.g. : https://git.vdb.to/cerc-io/stack-orchestrator/actions/runs/276
Going to see if I can get a VM on the same host to debug on.
Reproduced by extracting the lcli
binary from the latest container image then running it with this command on an Atom CPU:
$ ./lcli insecure-validators --count 4 --base-dir $(pwd) --node-count 4
Validator 1
Illegal instruction (core dumped)
I built the container myself, to check if there is some build provenance issue with the image. Still fails:
$ ./lcli-test insecure-validators --count 4 --base-dir $(pwd) --node-count 4
Validator 1
Illegal instruction (core dumped)
So the problem is that the "portable" container build is in fact not portable.
It turns out that their PORTABLE
flag is not implemented:
$ grep -R PORTABLE .
./.github/workflows/docker.yml: --build-arg PORTABLE=true \
./lcli/Dockerfile:ARG PORTABLE
./lcli/Dockerfile:ENV PORTABLE $PORTABLE
The lighthouse (not lcli) build has targets of the form xxxx-portable, which likely are implemented properly.
The result is that there is no such thing as a portable/non-modern lcli build.
Filed bug against lighthouse: https://github.com/sigp/lighthouse/issues/4370
The most obvious solution is for us to build our own lcli container image, which we need to do for ARM support anyway.
With this fix, the container is built with portable binaries:
diff --git a/lcli/Dockerfile b/lcli/Dockerfile
index 98f33f215..485ebfd5b 100644
--- a/lcli/Dockerfile
+++ b/lcli/Dockerfile
@@ -6,7 +6,7 @@ RUN apt-get update && apt-get -y upgrade && apt-get install -y cmake libclang-de
COPY . lighthouse
ARG PORTABLE
ENV PORTABLE $PORTABLE
-RUN cd lighthouse && make install-lcli
+RUN if [ "$PORTABLE" = true ] ; then export FEATURES=portable; fi && cd lighthouse && make install-lcli
FROM ubuntu:22.04
RUN apt-get update && apt-get -y upgrade && apt-get clean && rm -rf /var/lib/apt/lists/*
Now we're building our own lcli container this issue should be resolved. Closing.
The build script for this image is intermittently hitting SIGILL on Github action runners when running
lcli insecure-validators
. Likely related to https://github.com/sigp/lighthouse/issues/1395 and if so should be fixed by using the portable docker images instead of the native ones.