Closed (cmacknz closed 3 weeks ago)
Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane)
Quoting @rdner on Slack with more context on how to reproduce this:
I managed to reproduce this error locally using the same command (needs to be run in the root of the Beats repo):
docker run --env DEV=true --rm --env GOFLAGS="-mod=readonly -buildvcs=false" --env MAGEFILE_VERBOSE= --env MAGEFILE_TIMEOUT= --env SNAPSHOT=true -v $PWD:/go/src/github.com/elastic/beats -w /go/src/github.com/elastic/beats/x-pack/agentbeat docker.elastic.co/beats-dev/golang-crossbuild:1.22.8-arm --build-cmd "build/mage-linux-arm64 golangCrossBuild" --platforms linux/arm64
When I reverted my repository to https://github.com/elastic/beats/commit/a25c5a5dd79d92e97e1168b1f233419a847bb2b7 (the latest successful packaging run on the CI), the command succeeded.
So it's safe to say that this failure is due to https://github.com/elastic/beats/commit/89ed20d5ea412ae913fcff6730d3d1304410a990. I've created a revert PR: https://github.com/elastic/beats/pull/41269
@cmacknz as this is only happening when we are bumping the GCP SDK to a newer version, should we ask observability to investigate?
It happened with both the AWS SDK bump and the GCP SDK bump, but I'm not sure we can conclude that it has something to do with cloud SDKs. Both of those PRs have a large set of dependencies and it's more likely there is some conflict in an indirect dependency (possibly different each time) triggering a bug in the linker.
This is very likely a bug in the gcc cross-compiler or gold. Checking the docker.elastic.co/beats-dev/golang-crossbuild:1.22.8-arm image, the toolchain is quite old:
$ ld.gold --version
GNU gold (GNU Binutils for Debian 2.28) 1.14
Copyright (C) 2017 Free Software Foundation, Inc.
$ aarch64-linux-gnu-gcc --version
aarch64-linux-gnu-gcc (Debian 6.3.0-18) 6.3.0 20170516
Copyright (C) 2016 Free Software Foundation, Inc.
$ gcc --version
gcc (Debian 6.3.0-18+deb9u1) 6.3.0 20170516
Copyright (C) 2016 Free Software Foundation, Inc.
As you can see, the toolchain is gcc 6, from 2017. The current stable gcc is version 14. We should probably focus on getting these up to date; it is likely to improve compatibility in general.
Agree with @mauri870's comment; bumping the linker and gcc versions is definitely a good idea.
I spent some time debugging this when I saw it in the other Go 1.23.2 upgrade PR. Initially, I couldn't reproduce it, and everything was working fine even when I tried different Go versions. I was using the command PLATFORMS=linux/arm64 PACKAGES=tar.gz mage -v package.
The reproducer shared here does replicate the issue on my setup as well: https://github.com/elastic/beats/issues/41270#issuecomment-2417871431
I compared the Docker commands: the one invoked internally by this command (PLATFORMS=linux/arm64 PACKAGES=tar.gz mage -v package) and the one shared that reproduces the issue.
After experimenting for a while, I noticed that the DEV var is causing this.
When DEV=true, I can reproduce the issue. If DEV=false, I cannot.
So, here's another reproducer:
DEV=true PLATFORMS=linux/arm64 PACKAGES=tar.gz mage -v package
This fails too, but DEV=false works.
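To compare the two settings side by side, something like the following shell sketch could help. It wraps the mage invocation quoted above in a small helper; try_dev and the MAGE_CMD override are hypothetical names introduced here for illustration and are not part of the Beats tooling.

```shell
#!/usr/bin/env bash
# MAGE_CMD is a hypothetical override so the helper can be dry-run
# on a machine without mage installed. It defaults to the real mage.
MAGE_CMD="${MAGE_CMD:-mage}"

# Run the packaging target once with the given DEV value and report
# whether the command exited successfully. Build output is discarded.
try_dev() {
  local dev="$1"
  if DEV="$dev" PLATFORMS=linux/arm64 PACKAGES=tar.gz "$MAGE_CMD" -v package >/dev/null 2>&1; then
    echo "DEV=$dev: ok"
  else
    echo "DEV=$dev: failed"
  fi
}
```

Run from the Beats directory being packaged, `for dev in true false; do try_dev "$dev"; done` would print one result line per setting.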
I believe the culprit here is -gcflags=all=-N -l, which is being added here: https://github.com/elastic/beats/blob/7be47da326413fc9ae6e96a0c4b467ba481b6210/dev-tools/mage/build.go#L70
[ Update: I narrowed down exactly which flag triggers the issue: it is -N. Building with only -l (disabling inlining) works, but disabling optimizations with -N breaks the build. ]
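A rough way to isolate the flag outside of mage is to print one cross-build command per gcflags variant and run each by hand inside the crossbuild container. This is only a sketch: gcflags_variants and build_cmd are hypothetical helpers, and the environment shown (CGO_ENABLED=1, CC=aarch64-linux-gnu-gcc, building the package in the current directory) is an assumption about the setup, not what dev-tools/mage actually passes.

```shell
#!/usr/bin/env bash
# Hypothetical helper: list the gcflags values to try, one per line.
# "all=-N -l" is the combination the thread says DEV=true adds.
gcflags_variants() {
  printf '%s\n' 'all=-N -l' 'all=-N' 'all=-l'
}

# Hypothetical helper: print (not run) a cross-build command for one variant.
# The env vars and compiler name are assumptions for a linux/arm64 cgo build.
build_cmd() {
  printf 'CGO_ENABLED=1 GOOS=linux GOARCH=arm64 CC=aarch64-linux-gnu-gcc go build -gcflags="%s" .\n' "$1"
}

# Emit one command per variant so each can be run and compared manually.
gcflags_variants | while IFS= read -r flags; do
  build_cmd "$flags"
done
```

Printing the commands rather than executing them keeps the sketch safe to run anywhere. If the observation above holds, the two variants containing -N would be expected to fail at link time while all=-l succeeds.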
It seems there's an issue with the linker when compiler optimizations are disabled. Could someone else also check whether DEV=false solves this issue in their setup?
I think this also explains why the CI passes when packaging agentbeat: DEV=false is set during that step.
For snapshot builds I think we have DEV=true on, but not for staging.
I need to dig further to see where the snapshot DRA artifacts actually get used. If these end up in the official snapshot images, we need to turn DEV=true off; otherwise the way they are built doesn't match what we eventually release.
Upgrading the cross toolchain is a good idea. I recall we were limited by the version of glibc we needed to build the 7.17 branch on all supported platforms, but there have been a few revisions to the support matrix since then.
@cmacknz is there any case where we need DEV=true images? I can't come up with any.
It enables using a debugger. I would think locally rebuilding with DEV=true would be acceptable, since this capability is not in the release binaries anyway.
> We should probably focus on getting these up-to-date, it is likely that it will improve compatibility in general.

@mauri870
According to our support matrix https://www.elastic.co/support/matrix, we still support Debian 10 (released on 2019-07-06, EOL 2022-09-10) in the latest version of Beats (8.15.x).
This is the main reason why we're still crossbuilding using this image docker.elastic.co/beats-dev/golang-crossbuild:1.22.8-darwin-arm64-debian10. AFAIK, there is a strict dependency on a certain glibc version when we build the Beats binaries and this is why we have to use Debian 10 for building them. For more details, please read this thread https://github.com/elastic/beats/pull/34921#discussion_r1147469633
@shmsr thank you very much for your investigation. I opened a PR to remove the DEV=true mode when building snapshots here: https://github.com/elastic/beats/pull/41365
This should prevent such failures in the future. Unfortunately, I don't think we can update to a newer Debian version just yet.
The following error has been observed inconsistently in Beats packaging after dependency updates of the AWS and GCP SDKs, respectively.
This error was resolved by reverting the following two unrelated PRs:
The go.mod changes do not overlap for those two PRs. Some non-obvious problem in the dependency graphs is causing this.
I suspect this is related to a change in https://github.com/elastic/golang-crossbuild that only reproduces under specific but infrequent conditions. Possibly it is a bug in aarch64-linux-gnu-gcc, and updating the version included in the crossbuild image would resolve it.