elastic / beats

:tropical_fish: Beats - Lightweight shippers for Elasticsearch & Logstash
https://www.elastic.co/products/beats

Agentbeat packaging failures: aarch64-linux-gnu/bin/ld.gold: internal error in maybe_apply_stub, at ../../gold/aarch64.cc:5407 #41270

Closed cmacknz closed 3 weeks ago

cmacknz commented 3 weeks ago

The following error has been observed intermittently in Beats packaging after dependency updates of the AWS and GCP SDKs.

# github.com/elastic/beats/v7/x-pack/agentbeat
/usr/local/go/pkg/tool/linux_amd64/link: running aarch64-linux-gnu-gcc failed: exit status 1
/usr/lib/gcc-cross/aarch64-linux-gnu/6/../../../../aarch64-linux-gnu/bin/ld.gold: internal error in maybe_apply_stub, at ../../gold/aarch64.cc:5407
collect2: error: ld returned 1 exit status
Error: running "go build -o build/golang-crossbuild/agentbeat-linux-arm64 -buildmode pie -gcflags=all=-N -l -tags=agentbeat -ldflags -X github.com/elastic/beats/v7/libbeat/version.buildTime=2024-10-03T12:37:52Z -X github.com/elastic/beats/v7/libbeat/version.commit=698951e8c895eff9d03a8d0ececadba3b8b4c6bb" failed with exit code 1

This error was resolved by reverting the following two unrelated PRs:

The go.mod changes in those two PRs do not overlap. Some non-obvious problem in the dependency graph is causing this.

I suspect this is related to a change in https://github.com/elastic/golang-crossbuild that only reproduces under specific, infrequent conditions. It may be a bug in aarch64-linux-gnu-gcc, in which case updating the version included in the crossbuild image would resolve it.

elasticmachine commented 3 weeks ago

Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane)

cmacknz commented 3 weeks ago

Quoting @rdner on Slack with more context on how to reproduce this:

I managed to reproduce this error locally using the same command (needs to be run in the root of the Beats repo):

docker run --env DEV=true --rm --env GOFLAGS="-mod=readonly -buildvcs=false" --env MAGEFILE_VERBOSE= --env MAGEFILE_TIMEOUT= --env SNAPSHOT=true -v $PWD:/go/src/github.com/elastic/beats -w /go/src/github.com/elastic/beats/x-pack/agentbeat docker.elastic.co/beats-dev/golang-crossbuild:1.22.8-arm --build-cmd "build/mage-linux-arm64 golangCrossBuild" --platforms linux/arm64

When I reverted my repository to https://github.com/elastic/beats/commit/a25c5a5dd79d92e97e1168b1f233419a847bb2b7 (the latest successful packaging run on the CI), the command succeeded.

So it's safe to say that this failure is due to https://github.com/elastic/beats/commit/89ed20d5ea412ae913fcff6730d3d1304410a990. I've created a revert PR: https://github.com/elastic/beats/pull/41269
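To vary individual settings of the reproducer between runs (for example the checkout path or the value passed to DEV), the command can be wrapped in a small Bash helper. This is a minimal sketch, not something from the thread: the names `crossbuild_args`, `repro`, and `DRY_RUN` are hypothetical, the docker arguments are copied from the command quoted above, and it assumes `build/mage-linux-arm64` already exists in the checkout.

```shell
#!/usr/bin/env bash
# Sketch only: crossbuild_args, repro and DRY_RUN are hypothetical names.
# The docker arguments mirror the reproducer quoted in this thread.

# Fill CROSSBUILD_ARGS for a given DEV value and Beats checkout root.
crossbuild_args() {
  local dev="$1" root="$2"
  CROSSBUILD_ARGS=(
    run --env "DEV=${dev}" --rm
    --env "GOFLAGS=-mod=readonly -buildvcs=false"
    --env MAGEFILE_VERBOSE= --env MAGEFILE_TIMEOUT= --env SNAPSHOT=true
    -v "${root}:/go/src/github.com/elastic/beats"
    -w /go/src/github.com/elastic/beats/x-pack/agentbeat
    docker.elastic.co/beats-dev/golang-crossbuild:1.22.8-arm
    --build-cmd "build/mage-linux-arm64 golangCrossBuild"
    --platforms linux/arm64
  )
}

# Run the reproducer, or with DRY_RUN=1 print one argument per line.
repro() {
  crossbuild_args "${1:-true}" "${2:-$PWD}"
  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf '%s\n' docker "${CROSSBUILD_ARGS[@]}"
  else
    docker "${CROSSBUILD_ARGS[@]}"
  fi
}
```

Calling `repro` from the root of the Beats repo runs the original command; `DRY_RUN=1 repro false /path/to/beats` only prints the arguments it would pass to docker.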

pierrehilbert commented 3 weeks ago

@cmacknz as this is only happening when we are bumping the GCP SDK to a newer version, should we ask observability to investigate?

cmacknz commented 3 weeks ago

It happened with both the AWS SDK bump and the GCP SDK bump, but I'm not sure we can conclude that it has something to do with cloud SDKs. Both of those PRs have a large set of dependencies and it's more likely there is some conflict in an indirect dependency (possibly different each time) triggering a bug in the linker.

mauri870 commented 3 weeks ago

This is very likely a bug in the gcc cross-compiler or gold. Checking the docker.elastic.co/beats-dev/golang-crossbuild:1.22.8-arm image, the toolchain is quite old:

$ ld.gold --version
GNU gold (GNU Binutils for Debian 2.28) 1.14
Copyright (C) 2017 Free Software Foundation, Inc.

$ aarch64-linux-gnu-gcc --version
aarch64-linux-gnu-gcc (Debian 6.3.0-18) 6.3.0 20170516
Copyright (C) 2016 Free Software Foundation, Inc.

$ gcc --version
gcc (Debian 6.3.0-18+deb9u1) 6.3.0 20170516
Copyright (C) 2016 Free Software Foundation, Inc.

As you can see, the toolchain is GCC 6, from 2017. GCC stable is currently at version 14. We should probably focus on getting these up-to-date, it is likely that it will improve compatibility in general.
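A build script could flag an old cross toolchain before linking starts. The following is a minimal sketch under stated assumptions: `toolchain_older_than` is a hypothetical helper, and the minimum major version in the usage comment is an arbitrary example, since the thread does not establish which GCC or binutils release fixes the linker error.

```shell
#!/usr/bin/env bash
# Sketch only: toolchain_older_than is a hypothetical helper, and the
# minimum major version passed to it is an arbitrary example value.

# Succeeds (exit 0) when the major part of VERSION is below MIN_MAJOR.
toolchain_older_than() {
  local version="$1" min_major="$2"
  local major="${version%%.*}"   # "6.3.0" -> "6"
  [ "$major" -lt "$min_major" ]
}

# Example use inside the crossbuild image (not executed here):
#   ver="$(aarch64-linux-gnu-gcc -dumpversion)"
#   if toolchain_older_than "$ver" 8; then
#     echo "warning: cross gcc ${ver} is older than expected" >&2
#   fi
```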

shmsr commented 3 weeks ago

Agree with @mauri870's comment; bumping the linker and gcc versions is definitely a good idea.

I spent some time debugging this when I saw it in the other Go 1.23.2 upgrade PR. Initially, I couldn't reproduce it, and everything worked fine even when I tried different Go versions. I was using the command PLATFORMS=linux/arm64 PACKAGES=tar.gz mage -v package.

The reproducer shared here does replicate the issue on my setup as well: https://github.com/elastic/beats/issues/41270#issuecomment-2417871431

I compared the two Docker commands: the one invoked internally by PLATFORMS=linux/arm64 PACKAGES=tar.gz mage -v package, and the one shared above that reproduces the issue.

After experimenting for a while, I noticed that the DEV var is causing this.

When DEV=true, I can reproduce the issue. If DEV=false, I cannot.

So, here's another reproducer:

DEV=true PLATFORMS=linux/arm64 PACKAGES=tar.gz mage -v package

This fails too, but DEV=false works.

I believe the culprit here is -gcflags=all=-N -l, which is added here: https://github.com/elastic/beats/blob/7be47da326413fc9ae6e96a0c4b467ba481b6210/dev-tools/mage/build.go#L70 [Update: I tested which flag exactly triggers the issue, and it is -N. With only -l (disabling inlining) the build works, but disabling optimizations with -N breaks it.]
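The flag isolation described above can be sketched as a loop over the gcflags combinations. This is an illustration only: `try_gcflags` and `build_arm64` are hypothetical names, the output path is made up, and `build_arm64` assumes it runs in x-pack/agentbeat inside the crossbuild container with aarch64-linux-gnu-gcc on the PATH. The -buildmode, -tags, and -gcflags values are taken from the failing build command in this issue.

```shell
#!/usr/bin/env bash
# Sketch only: try_gcflags and build_arm64 are hypothetical helpers.

# Call BUILD_FN once per gcflags combination and report the result.
try_gcflags() {
  local build_fn="$1" flags
  for flags in "" "-l" "-N" "-N -l"; do
    if "$build_fn" "$flags" >/dev/null 2>&1; then
      printf 'ok   gcflags=all=%s\n' "$flags"
    else
      printf 'FAIL gcflags=all=%s\n' "$flags"
    fi
  done
}

# Assumed build step: cross-compile agentbeat for linux/arm64 with cgo,
# so that the external linker (aarch64-linux-gnu-gcc) is invoked.
build_arm64() {
  local args=(-buildmode pie -tags=agentbeat -o /tmp/agentbeat-arm64)
  if [ -n "$1" ]; then
    args+=("-gcflags=all=$1")   # omitted entirely when no flags are given
  fi
  CGO_ENABLED=1 GOOS=linux GOARCH=arm64 CC=aarch64-linux-gnu-gcc \
    go build "${args[@]}" .
}
```

Running `try_gcflags build_arm64` prints one line per combination, which makes it easy to see whether only the variants containing -N fail.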

It seems there's an issue with the linker when compiler optimizations are disabled. Could someone else check whether DEV=false resolves this issue in their setup?

I think this also explains why the CI passes when packaging agentbeat because DEV=false during that step.

cmacknz commented 3 weeks ago

For snapshot builds I think we have DEV=true on, but not for staging.

https://github.com/elastic/beats/blob/3492089397644e8395f5132f63f7ee60832b5d5a/.buildkite/packaging.pipeline.yml#L82-L91

I need to dig further to see where the snapshot DRA artifacts actually get used. If they end up in the official snapshot images, we need to turn DEV=true off for them; otherwise the way they are built doesn't match what we eventually release.

Upgrading the cross toolchain is a good idea. I recall we were limited by the version of glibc we needed to build the 7.17 branch on all supported platforms, but there have been a few revisions to the support matrix since then.

rdner commented 3 weeks ago

@cmacknz is there any case where we need DEV=true images? I can't come up with any.

cmacknz commented 3 weeks ago

It enables using a debugger. I would think locally rebuilding with DEV=true would be acceptable, since this capability is not in the release binaries anyway.

rdner commented 3 weeks ago

We should probably focus on getting these up-to-date, it is likely that it will improve compatibility in general.

@mauri870

According to our support matrix https://www.elastic.co/support/matrix, we still support Debian 10 (released on 2019-07-06, EOL 2022-09-10) in the latest version of Beats (8.15.x).

This is the main reason why we're still crossbuilding using this image docker.elastic.co/beats-dev/golang-crossbuild:1.22.8-darwin-arm64-debian10. AFAIK, there is a strict dependency on a certain glibc version when we build the Beats binaries and this is why we have to use Debian 10 for building them. For more details, please read this thread https://github.com/elastic/beats/pull/34921#discussion_r1147469633
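One way to see which glibc version a built binary actually requires is to look at the versioned symbols it imports. The sketch below assumes GNU grep, sed, and sort; `max_glibc_version` is a hypothetical helper name, and the objdump used to produce its input must support the binary's architecture (for example a cross or multiarch objdump for arm64).

```shell
#!/usr/bin/env bash
# Sketch only: max_glibc_version is a hypothetical helper name.

# Read `objdump -T` output on stdin and print the highest GLIBC_x.y
# symbol version mentioned in it (nothing if there is none).
max_glibc_version() {
  grep -o 'GLIBC_[0-9][0-9.]*' | sed 's/^GLIBC_//' | sort -u -V | tail -n 1
}

# Example use (not executed here; the path comes from the build log above):
#   aarch64-linux-gnu-objdump -T build/golang-crossbuild/agentbeat-linux-arm64 \
#     | max_glibc_version
```

The printed version is the minimum glibc the target system needs to provide for the binary to load, which is the constraint that ties the build image to an older Debian release.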

@shmsr thank you very much for your investigation. I opened a PR to remove the DEV=true mode when building snapshots: https://github.com/elastic/beats/pull/41365

This should prevent such failures in the future. Unfortunately, I don't think we can update to a newer Debian version just yet.