dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License
15.37k stars 4.75k forks source link

Decide on linker to use in runtime builds #84794

Closed sbomer closed 1 year ago

sbomer commented 1 year ago

https://github.com/dotnet/runtime/pull/84148 changed the linker that we use in runtime builds (including when running NativeAOT on our binaries like ILCompiler and crossgen2) to lld. This was motivated by a desire to move to cross-builds (mariner host OS, targeting old linux rootfs). Using lld simplifies the cross-compilation scenarios because it is a cross-linker, so doesn't require a separate bfd linker with appropriate target prefix. We have also been wanting to move away from binutils as a dependency. Note that it doesn't change the default linker for customer NativeAOT scenarios, which still use bfd.

There are a few concerns with this approach - see @am11's comments in https://github.com/dotnet/runtime/pull/84493#issuecomment-1502662155, https://github.com/dotnet/runtime/pull/84493#issuecomment-1504203581, https://github.com/dotnet/runtime/pull/84148#issuecomment-1506517575 for full context. In summary:

dotnet new api -aot (~41 K diff)

@am11 I hope that's a fair summary - please add to this if I missed anything.

So I'm opening this to discuss the choice of linker when building our product. Options are:

My preference is to continue using lld, but add separate test legs that validate the customer scenarios using bfd. One challenge is that our NativeAOT tests usually run NativeAOT on the build machine, not on helix - so they are testing a cross-compilation scenario, not the customer scenario. This is simpler than getting all NativeAOT prerequisites onto the target machines to test native compilation. I think it would be worth adding separate tests for NativeAOT compilation that run in an environment that represents customer scenarios (for example, ubuntu with bfd).

I'm opening this issue to continue the discussion and hopefully reach agreement on the approach.

@am11 @janvorli @agocke @MichalStrehovsky @jkotas

ghost commented 1 year ago

Tagging subscribers to this area: @agocke, @MichalStrehovsky, @jkotas See info in area-owners.md if you want to be subscribed.

Issue Details
https://github.com/dotnet/runtime/pull/84148 changed the linker that we use in runtime builds (including when running NativeAOT on our binaries like ILCompiler and crossgen2) to `lld`. This was motivated by a desire to move to cross-builds (mariner host OS, targeting old linux rootfs). Using `lld` simplifies the cross-compilation scenarios because it is a cross-linker, so doesn't require a separate `bfd` linker with appropriate target prefix. We have also been wanting to move away from `binutils` as a dependency. Note that it doesn't change the default linker for customer NativeAOT scenarios, which still use `bfd`. There are a few concerns with this approach - see @am11's comments in https://github.com/dotnet/runtime/pull/84493#issuecomment-1502662155, https://github.com/dotnet/runtime/pull/84493#issuecomment-1504203581, https://github.com/dotnet/runtime/pull/84148#issuecomment-1506517575 for full context. In summary: - NativeAOT support for `lld` was only added recently and has not been well tested. - NativeAOT uses `bfd` by default in user scenarios, so we should be testing this in ci. - `lld` introduces size regression compared to `bfd`. Numbers from https://github.com/dotnet/runtime/pull/84493#issuecomment-1504203581: ``` # dotnet new console (~2.2 K diff) - 1496072 + 1498192 # dotnet new api -aot (~41 K diff) - 12743672 + 12785208 ``` @am11 I hope that's a fair summary - please add to this if I missed anything. So I'm opening this to discuss the choice of linker when building our product. Options are: - Use `lld` as we do with https://github.com/dotnet/runtime/pull/84148. - Add `binutils` linker (need separate cross-linker for each supported architecture) to the mariner build images. We will need to pick the right prefixed linker to use based on the target platform. This is not currently supported in NativeAot. My preference is to continue using `lld`, but add separate test legs that validate the customer scenarios using `bfd`. One challenge is that our NativeAOT tests usually run NativeAOT on the build machine, not on helix - so they are testing a cross-compilation scenario, not the customer scenario. This is simpler than getting all NativeAOT prerequisites onto the target machines to test native compilation. I think it would be worth adding separate tests for NativeAOT compilation that run in an environment that represents customer scenarios (for example, ubuntu with `bfd`). I'm opening this issue to continue the discussion and hopefully reach agreement on the approach. @am11 @janvorli @agocke @MichalStrehovsky @jkotas
Author: sbomer
Assignees: -
Labels: `untriaged`, `area-NativeAOT-coreclr`
Milestone: -
agocke commented 1 year ago

I thought there was another consideration: we need a relatively recent version of lld to produce good results. How many distros ship with a recent-enough lld for us to use it as the default?

sbomer commented 1 year ago

We need at least lld 13 to use @am11's fix that improves output size: https://github.com/dotnet/runtime/pull/84493. But I am not suggesting to change the default for customers - that would stay bfd either way. We will be able to use @am11's fix in runtime builds as soon as we update to llvm 16 (which I'm working on).

ghost commented 1 year ago

Tagging subscribers to this area: @dotnet/runtime-infrastructure See info in area-owners.md if you want to be subscribed.

Issue Details
https://github.com/dotnet/runtime/pull/84148 changed the linker that we use in runtime builds (including when running NativeAOT on our binaries like ILCompiler and crossgen2) to `lld`. This was motivated by a desire to move to cross-builds (mariner host OS, targeting old linux rootfs). Using `lld` simplifies the cross-compilation scenarios because it is a cross-linker, so doesn't require a separate `bfd` linker with appropriate target prefix. We have also been wanting to move away from `binutils` as a dependency. Note that it doesn't change the default linker for customer NativeAOT scenarios, which still use `bfd`. There are a few concerns with this approach - see @am11's comments in https://github.com/dotnet/runtime/pull/84493#issuecomment-1502662155, https://github.com/dotnet/runtime/pull/84493#issuecomment-1504203581, https://github.com/dotnet/runtime/pull/84148#issuecomment-1506517575 for full context. In summary: - NativeAOT support for `lld` was only added recently and has not been well tested. - NativeAOT uses `bfd` by default in user scenarios, so we should be testing this in ci. - `lld` introduces size regression compared to `bfd`. Numbers from https://github.com/dotnet/runtime/pull/84493#issuecomment-1504203581: ``` # dotnet new console (~2.2 K diff) - 1496072 + 1498192 # dotnet new api -aot (~41 K diff) - 12743672 + 12785208 ``` @am11 I hope that's a fair summary - please add to this if I missed anything. So I'm opening this to discuss the choice of linker when building our product. Options are: - Use `lld` as we do with https://github.com/dotnet/runtime/pull/84148. - Add `binutils` linker (need separate cross-linker for each supported architecture) to the mariner build images. We will need to pick the right prefixed linker to use based on the target platform. This is not currently supported in NativeAot. My preference is to continue using `lld`, but add separate test legs that validate the customer scenarios using `bfd`. One challenge is that our NativeAOT tests usually run NativeAOT on the build machine, not on helix - so they are testing a cross-compilation scenario, not the customer scenario. This is simpler than getting all NativeAOT prerequisites onto the target machines to test native compilation. I think it would be worth adding separate tests for NativeAOT compilation that run in an environment that represents customer scenarios (for example, ubuntu with `bfd`). I'm opening this issue to continue the discussion and hopefully reach agreement on the approach. @am11 @janvorli @agocke @MichalStrehovsky @jkotas
Author: sbomer
Assignees: -
Labels: `area-Infrastructure`, `untriaged`, `area-NativeAOT-coreclr`
Milestone: -
am11 commented 1 year ago

Note that it doesn't change the default linker for customer NativeAOT scenarios, which still use bfd.

We recommend clang compiler in docs, we build NativeAOT runtime with llvm-libunwind, we use llvm-based objwriter to create the linkable objects, so lets make lld as a default and document it as a recommended linker for PublishAot for linux. This will be aligned with other aspects and understandable. The upside is that we will be testing with the same component in CI which we recommend for users in docs.

So lets apply the patch, turn default to lld and cleanup LinkerFlavor=lld sprinkled in various places. That will hopefully resolve this matter satisfying everyone.

Alternatives are making less sense to me:

We have also been wanting to move away from binutils as a dependency.

Why is this? binutils is used since .NET Core 1.0, there is no issue with licensing in using binutils to build any software; proprietary or otherwise.

We will need to pick the right prefixed linker to use based on the target platform. This is not currently supported in NativeAot.

This is not correct. The "toolchain" does this job. If configured properly, we don't need to manually specify prefixes for cross-compilation.

How many distros ship with a recent-enough lld for us to use it as the default?

You will find it pretty hard to discover a distro which is not using bfd as their default linker, let alone the versioning concerns. bfd, aka GNU linker, has decades of more experience than other linkers, still actively maintained, choice of top distros and beats other linkers in binary-size benchmarks.

@am11 I hope that's a fair summary - please add to this if I missed anything.

i.e. LinkerFlavor=lld if target==linux constructs added are missing and Compiler!=gcc condition; we set Compiler at the start of build: https://github.com/dotnet/runtime/blob/ed0f1c5d128d9d58079cc706f3e75dedbe80459b/eng/build.sh#L432 (maps to ./build.sh -gcc)

MichalStrehovsky commented 1 year ago

I don't have a strong opinion on this. We're already in a situation where we can't test E2E the thing that will be produced by the toolchain because the version of ld is out of our control. The solution to that could be to just ship lld in the ILCompiler package. But that's a can of worms since constructing the command line to invoke the linker is per-distro bespoke magic and it's also the reason why we use clang as the "linker" instead of invoking ld directly.

Ubuntu 18 comes with lld that is ancient. Ubuntu 22 has a good one. I don't have a good sense of the landscape. If we tell people to use lld but the version we need is hard to obtain, that would be a problem.

I also don't like testing with a completely different linker codebase (bfd vs lld) than what we use on customer machines. We did run into linker issues in the past. At least using the same codebase (even though a different version) would keep things more sane.

kg commented 1 year ago

This has broken builds on Debian for me.

ILCompiler -> /home/kate/Projects/dotnet-runtime-wasm/artifacts/bin/coreclr/linux.x64.Release/ilc/ilc.dll
  Generating native code
clang : error : invalid linker name in argument '-fuse-ld=lld' [/home/kate/Projects/dotnet-runtime-wasm/src/coreclr/tools/aot/ILCompiler/ILCompiler.csproj]
/home/kate/.nuget/packages/microsoft.dotnet.ilcompiler/8.0.0-preview.3.23174.8/build/Microsoft.NETCore.Native.targets(335,5): error MSB3073: The command ""/usr/bin/clang-11" "/home/kate/Projects/dotnet-runtime-wasm/artifacts/obj/coreclr/ILCompiler/x64/Release/native/ilc.o" -o "/home/kate/Projects/dotnet-runtime-wasm/artifacts/bin/coreclr/linux.x64.Release/ilc/native/ilc" -fuse-ld=lld /home/kate/.nuget/packages/runtime.linux-x64.microsoft.dotnet.ilcompiler/8.0.0-preview.3.23174.8/sdk/libbootstrapper.a /home/kate/.nuget/packages/runtime.linux-x64.microsoft.dotnet.ilcompiler/8.0.0-preview.3.23174.8/sdk/libRuntime.ServerGC.a /home/kate/.nuget/packages/runtime.linux-x64.microsoft.dotnet.ilcompiler/8.0.0-preview.3.23174.8/sdk/libeventpipe-disabled.a /home/kate/.nuget/packages/runtime.linux-x64.microsoft.dotnet.ilcompiler/8.0.0-preview.3.23174.8/sdk/libstdc++compat.a /home/kate/.nuget/packages/runtime.linux-x64.microsoft.dotnet.ilcompiler/8.0.0-preview.3.23174.8/sdk/libnumasupportdynamic.a /home/kate/.nuget/packages/runtime.linux-x64.microsoft.dotnet.ilcompiler/8.0.0-preview.3.23174.8/framework/libSystem.Native.a /home/kate/.nuget/packages/runtime.linux-x64.microsoft.dotnet.ilcompiler/8.0.0-preview.3.23174.8/framework/libSystem.Globalization.Native.a /home/kate/.nuget/packages/runtime.linux-x64.microsoft.dotnet.ilcompiler/8.0.0-preview.3.23174.8/framework/libSystem.IO.Compression.Native.a /home/kate/.nuget/packages/runtime.linux-x64.microsoft.dotnet.ilcompiler/8.0.0-preview.3.23174.8/framework/libSystem.Net.Security.Native.a /home/kate/.nuget/packages/runtime.linux-x64.microsoft.dotnet.ilcompiler/8.0.0-preview.3.23174.8/framework/libSystem.Security.Cryptography.Native.OpenSsl.a -g -Wl,-rpath,'$ORIGIN' -Wl,--build-id=sha1 -Wl,--as-needed -pthread -ldl -lz -lrt -lm -pie -Wl,-pie -Wl,-z,relro -Wl,-z,now -Wl,--discard-all" exited with code 1. [/home/kate/Projects/dotnet-runtime-wasm/src/coreclr/tools/aot/ILCompiler/ILCompiler.csproj]

I do have llvm, clang, and lld in my /usr/bin, but they have a version number on the end. I tried helping the build process out by using ln -s to put a 'lld' in my PATH pointing at the right one, but that didn't fix it. What do I do?

am11 commented 1 year ago

@kg, this is the situation when we have different versions of clang and lld installed. e.g. if I have clang-16, clang-14 and lld-14, installed, then the following things happen:

  1. ./build.sh via init-compiler will select clang-16 (because it is the latest version)
  2. init-compiler will introspect if fuse-ld=lld and based on the result, will select the linker
  3. then changes in https://github.com/dotnet/runtime/pull/84148 will "force" fuse-ld=lld unconditionally, disregarding #2's introspection based linker selection, and fail the build because we didn't install lld-16

This is exactly how

  • we can no longer build runtime repo with binutils alone.

community-driven scenarios are also affected..

Workaround is to apt install lld-<version of clang which build is picking up>

kg commented 1 year ago

Workaround is to apt install lld-<version of clang which build is picking up>

Thanks, that indeed worked. I did /usr/bin/clang -v which output Debian clang version 11.0.1-2, so sudo apt-get install lld-11 caused the build to succeed.

tmds commented 1 year ago

cc @omajid

sbomer commented 1 year ago

We have also been wanting to move away from binutils as a dependency.

Why is this?

I think mainly for the same reasoning as with lld lld - to simplify our official (cross) builds and avoid having to detect or install prefixed versions of objcopy and other tools. But I agree with your feedback that we should continue supporting binutils-based builds, at least for native builds.

We will need to pick the right prefixed linker to use based on the target platform. This is not currently supported in NativeAot.

This is not correct. The "toolchain" does this job. If configured properly, we don't need to manually specify prefixes for cross-compilation.

Thanks for the correction - fixed above. I got my wires crossed when I was thinking about the gcc support mentioned in https://github.com/dotnet/runtime/issues/78559.

we can no longer build runtime repo with binutils alone.

Filed https://github.com/dotnet/runtime/issues/84934 to track this.

Regarding runtime linker defaults: it sounds like we are all on board with using lld as the default for runtime builds, as long as it's optional.

Regarding customer NativeAot linker defaults: I don't have a strong opinion on what the default should be, but if we consider both of them to be supported, we should have testing in place for both, at least to cover basic scenarios, even if most of our testing uses the default.

agocke commented 1 year ago

Regarding runtime linker defaults: it sounds like we are all on board with using lld as the default for runtime builds, as long as it's optional.

I'm not quite sure what this means. I think lld should be the default for CI specifically, but I think ld should be the default in dev builds, same as the Native AOT default.

sbomer commented 1 year ago

I was thinking that the defaults used in ci should match those for the local dev scenario, but if there's a desire for the local dev scenario to use ld, that's certainly an option. My read of the discussion here was that we were actually leaning towards changing the NativeAOT defaults to use lld in the customer scenario too.

am11 commented 1 year ago

My read of the discussion here was that we were actually leaning towards changing the NativeAOT defaults to use lld in the customer scenario too.

Yup, or more generally, the linker the CI legs are frequently using to test all NativeAOT tests is the linker we should recommend in the docs. In case it wasn't obvious, LinkerFlavor supports all four linkers which clang's/gcc' fuse-ld option recognize: bfd, gold, lld and mold -- and (empirically speaking) they are normally favored in that order by linux devs; gold > lld. Also, fuse-ld=lld is supported by clang7+ and gcc9+, which is old enough to be a viable candidate for default.

agocke commented 1 year ago

I think we could credibly test both linkers. It may not be in PR, but it probably is not a big deal to spin up an ld-specific run of the test suite with Ubuntu-latest-LTS.

It seems like there's two separate considerations -- for our official build we want to keep the dependencies as tight as possible, and using LLVM for everything makes sense.

But unlike with coreclr, the vast majority of linking is going to happen on user machines, so it's not particularly relevant what linker the official build machines use.

In other words, if we think that bfd is a better experience for customers, I don't think it's bad to pick that and use it in most of our testing.

am11 commented 1 year ago

But unlike with coreclr, the vast majority of linking is going to happen on user machines, so it's not particularly relevant what linker the official build machines use.

Yes, no confusion here that this issue is about which linker is used to exercise the tests in CI; it is not about which linker is used to build official coreclr release.

The linker we use in CI is the linker we should recommend for user. It is a "singular" linker, because we normally do not force non-default NativeAOT options in tests. There are number of options which we don't test and they are left as experimental / community-supported; e.g. static lib/exe, no-pie, static-pie etc. The red flag is our usage of LinkerFlavor outside src/nativeaot/BuildIntegration directory introduced in https://github.com/dotnet/runtime/pull/84148 which is a clear deviation from what we set as default.

In other words, if we think that bfd is a better experience for customers, I don't think it's bad to pick that and use it in most of our testing.

Makes sense. Alternatively, we can set lld as default in BuildIntegration for linux, update Microsoft docs to include lld in apt-get install clang ... line, cleanup <LinkerFlavor>lld</LinkerFlavor> instances from rest of the repo and continue testing with lld. Personally, alternative is making more sense to me because:

  1. we are using other llvm components in NativeAOT: objwriter at build-time , llvm-objcopy (preferred with clang) at publish-time, llvm-libunwind at run-time and clang is used to compile NativeAOT Runtime as well as recommended linker-driver in docs.
  2. cross-PublishAot is currently only supported with clang.
  3. since we compile NativeAOT Runtime with clang, using bfd has potential to run into this issue on some platforms: https://github.com/dotnet/runtime/pull/52244 (the sole reason we switched coreclr to build with lld which may or may not be applicable to NativeAOT.. there is no easy way to tell with 100% certainty)
agocke commented 1 year ago

@jkotas @MichalStrehovsky @sbomer How do we feel about changing the default linker to lld?

Presumably we would still use bfd if lld is not installed, but this would become the default.

am11 commented 1 year ago

Presumably we would still use ld

nit: its name is bfd because ld can be anything; ld.lld gold.ld etc.

jkotas commented 1 year ago

How do we feel about changing the default linker to lld?

Fine with me.

MichalStrehovsky commented 1 year ago

How do we feel about changing the default linker to lld?

Sounds good. Only one question:

Presumably we would still use bfd if lld is not installed, but this would become the default.

Is the proposal that we would also check if lld is recent enough and consider it missing if it e.g. doesn't support the command line arguments we want to specify? I want to avoid situations when someone is e.g. on Ubuntu 18, installs lld, and gets worse off than if we used bfd (I might be wrong about this one - I don't know how old distros do we have to care about but do know that lld only became "good" "recently").

am11 commented 1 year ago

Is the proposal that we would also check if lld is recent enough and consider it missing if it e.g. doesn't support the command line arguments we want to specify?

If lld is available, we support it. We already have a version check whether or not use a linker script needed by lld-13+. The rest of the arguments are supported by both old and new versions. We just need a condition (like we have for llvm-objcopy) i.e.; if lld is available on linux, use it otherwise fallback to bfd.

MichalStrehovsky commented 1 year ago

What I meant is that lld is doing worse than bfd size-wise in general, but if we can't do linker script, it's significantly worse and users would be better off using bfd. I'm okay with the regression with recent lld's, but it's more than 5% for older llds and that makes me raise an eyebrow (correct me if I'm wrong).

am11 commented 1 year ago
The size difference with rid: linux-arm64, template: api, configuration: release, stripped: true, executable: true, static-executable: false are mixed (v10 is better than v11): clang/lld version published binary size comments
7 8.8M (9183120) first version to support fuse-ld=lld
8 8.8M (9183120)
9 8.8M (9190264)
10 8.7M (9058800)
11 8.7M (9060144)
12 8.7M (9060152)
13 8.7M (9060152)
14 8.7M (9058552)
15 8.7M (9058552)

I think we can keep it simple. The standing logic is that the newer toolchain brings about better optimizers and analyzers, and older libc is best to increase portability. This combination renders best results on any build machine. (for static executables, old libc is not necessary)

sbomer commented 1 year ago

Closing this since we decided that: