Quuxplusone / LLVMBugzillaTest

0 stars 0 forks source link

start-stop-gc default change breaks multiple projects with no viable transition option #51351

Open Quuxplusone opened 3 years ago

Quuxplusone commented 3 years ago
Bugzilla Link PR52384
Status NEW
Importance P release blocker
Reported by Jessica Clarke (jrtc27@jrtc27.com)
Reported on 2021-11-02 20:56:12 -0700
Last modified on 2021-11-17 13:44:37 -0800
Version 13.0
Hardware PC All
CC emaste@freebsd.org, harald@gigawatt.nl, i@maskray.me, keithbsmiley@gmail.com, llvm-bugs@lists.llvm.org, smithp352@googlemail.com
Fixed by commit(s)
Attachments
Blocks PR52147
Blocked by
See also
4bbcd63eea4950c38e29f9e29b1d11b7d7894469 (D96914) added a new -z start-stop-gc
to allow the existing "__start_foo retains a section called foo" behaviour to
be disabled, coordinated with GNU ld.

6d2d3bd0a61f5fc7fd9f61f48bc30e9ca77cc619 (committed without review 1.5 months
later, within the same release cycle (13)) changed the default of this option,
thereby breaking software that relied on the old semantics.

So far that list of software is:

* FreeBSD
* NetworkManager
* systemd
* LDC

and 13.0.1 final hasn't even been released yet. For some projects (FreeBSD,
NetworkManager and systemd), build systems can be adjusted to add -z no-start-
stop-gc and restore the traditional behaviour. However, some projects, like
LDC, which in this case is behaving like a driver for LD and thus, just like
Clang, cannot easily know what exact linker (GNU vs LLVM, and version) is being
used at run time, and thus cannot know whether the option even exists (so
cannot just unconditionally pass it without getting an error).

I do not believe this is an appropriate way to go about this transition, and so
the default change in 6d2d3bd0a61f5fc7fd9f61f48bc30e9ca77cc619 should be
reverted, both in main and in release/13.x, until such time as a proper
transition plan is put forward, akin to how -f(no-)common was handled (except
this time the errors aren't even bugs in the software, they're "your software
was fine but needs to change because we decided that behaviour is no longer
supported by default") where the option exists for enough years before the
default is flipped such that it can just be assumed to exist and the old
behaviour requested unconditionally for projects that need it.

There is the separate issue of the fact that such a plan still breaks static
libraries that wish to use linker sets, as it's up to the consumer of the
library to decide what semantics it wants, not the library (and I cannot
foresee (a) requiring people to use pkgconf-like mechanisms (b) those adding a -
z option to LDFLAGS being a good idea).

https://reviews.llvm.org/D96914 has a lengthier discussion that lead up to this
bug report, and continues.
Quuxplusone commented 3 years ago
tl;dr I strongly object to a revert.

If needed, we can consider adding workarounds (like for glibc) to 13.0.1.
The "blast radius" is small and has been overstated.

> So far that list of software is:
>
> * FreeBSD
> * NetworkManager
> * systemd
> * LDC

This is not accurate.

* FreeBSD (supposedly linker sets): evidence required. It had coordination with
emaste and dim
* NetworkManager (I have no idea)
* systemd: v248 onwards is compatible with ld.lld>=13.0.0 . This is unlikely
causing any distribution a problem (distros linking systemd with LLD is few.)
* LDC: https://github.com/ldc-developers/ldc/issues/3861 (with a usage model
not very recommended; see the lengthy discussion on
https://reviews.llvm.org/D96914)

> I do not believe this is an appropriate way to go about this transition, and
so the default change in 6d2d3bd0a61f5fc7fd9f61f48bc30e9ca77cc619 should be
reverted, both in main and in release/13.x, until such time as a proper
transition plan is put forward, akin to how -f(no-)common was handled (except
this time the errors aren't even bugs in the software, they're "your software
was fine but needs to change because we decided that behaviour is no longer
supported by default") where the option exists for enough years before the
default is flipped such that it can just be assumed to exist and the old
behaviour requested unconditionally for projects that need it.

See the lengthy discussion on https://reviews.llvm.org/D96914 .

My attitude is that some folks overly react and a bit overstate the problem.
C identifier section usage + --gc-sections using GNU ld>=2015-10 behavior
has a much much smaller impact than -f(no-)common.

The window of time with potentially problems is finite and evidences from some
large LLD adopters (Android, Chrome OS, Fuchsia, Sony, Meta, Alphabet) have
suggested that we've mostly passed the window.
Changing the behavior again only does more harm and more harm to metadata
section users.
They are typical "rolling-release" llvm-project users and they may observer
random binary size increase.
Quuxplusone commented 3 years ago

To summarise my views:

Note: provided the LLD change can be reverted for now and there is agreement that it will not be reinstated at least until GCC is fixed, I would be happy to help out in getting GCC fixed.

Quuxplusone commented 3 years ago

with no viable transition option

This is largely overstated. Removing -Wl,--gc-sections is an apparent option. It is broad and may be considered unfortunate. But sanitizers/PGO in the past did use this option to work around GNU ld or gold problems.

Adding -Wl,-z,nostart-stop-gc is a much better option. GNU ld doesn't even error for unrecognized options. There is a warning but it does no harm. ld.lld < 13.0.0 errors for -z nostart-stop-gc which could be unfortunate, but it doesn't make the configure-time check "unviable". A configure-time check is usually not too much burden given that many linker options need to be tested for portability anyway.

  • As mentioned in D96914, the approval of the original commit was conditional on the default behaviour remaining unchanged, and the default behaviour was then changed without additional review. That alone should be sufficient to warrant a revert even if no software was found to be broken by the change.

I did report to a few groups, including a SN systems linker person, Chrome OS, and FreeBSD. (I failed to report to Fuchsia. I think I likely notified FB though not 100% sure.)

git shortlog -sn -- test/ELF ELF (now I have a high rank. Sorry, I didn't want to state this) Rafael Espindola is a main contributor and has changed several GC aspects regarding _start/_stop symbols without review. The commit-without-review practice can surely be discussed but "warrant a revert even if no software was found to be broken by the change" definitely needs a re-think before you take it for granted. My commit could use an explicit review to make folks without the full context feel better. That was my mistake. While I agree that this switch commit is a revert candidate, for a variety of reasons I don't think revert is the right call.

  • Regardless of the above, the change to the default should not have been made until such time that has_attribute(retain) was made to work properly in GCC, so that code would have a reasonable upgrade path. __has_attribute(retain__) being effectively broken in GCC had already been reported by @MaskRay well before the default was flipped, so this issue was already known when that change was made.

For context, it is https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99587 However, I don't think it's a blocker for LLD behavior change. A piece of software can do

define retain _Pragma("GCC diagnostic ignored \"-Wattributes\"") attribute((used,retain))

to work with Clang and newer GCC. I know that this may do nothing with binutils and in this area Clang maintains 5 years even longer compatibility for buggy binutils behavior. As said, C identifier name section usage is low, much much smaller than COMMON symbols. We cannot wait for so long, especially that (again) the "blast radius" is small and there are good workarounds and Clang works fine.

Nowadays Clang+LLD is the main user and expects certain feature sets from LLD. There are some GCC+LLD users and we if reasonable we want to make that as smooth as possible. But if GCC folks are not engaged in fixing something, I don't think we should let the main Clang+LLD use case hurt if the GCC side issue is not really a blocker (as in this case).

  • If the change is not reverted, documentation should be updated to clearly document this in the release notes not just as a single line under "ELF Improvements", but under "Breaking changes" along with a simple explanation of how to modify code to work with the new behaviour.

I can prepare a change to move it to "Breaking changes".

I am also happy to add sec->name.startswith("__libc_") like workarounds for NetworkManager, if you tell me that sections it needs. You never link to the NetworkManager issue, which I think would be the productive way forward.

Note: provided the LLD change can be reverted for now and there is agreement that it will not be reinstated at least until GCC is fixed, I would be happy to help out in getting GCC fixed.

While I disagree with your request about revert, I'd appreciate if you can fix https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99587 or just let them give more attention on the issue.

Quuxplusone commented 3 years ago

(In reply to Fangrui Song from comment #3)

with no viable transition option

This is largely overstated. Removing -Wl,--gc-sections is an apparent option. It is broad and may be considered unfortunate. But sanitizers/PGO in the past did use this option to work around GNU ld or gold problems.

Adding -Wl,-z,nostart-stop-gc is a much better option. GNU ld doesn't even error for unrecognized options. There is a warning but it does no harm. ld.lld < 13.0.0 errors for -z nostart-stop-gc which could be unfortunate, but it doesn't make the configure-time check "unviable".

For the umpteenth time, you cannot write a configure-time check for a run-time property. If a toolchain is built on a system with LLD < 13 then it thinks the option is not supported, but if that build is then distributed to a user that has LLD >= 13 it is broken by the change of semantics. This is the problem you keep ignoring.

Quuxplusone commented 3 years ago

(In reply to Jessica Clarke from comment #4)

(In reply to Fangrui Song from comment #3)

with no viable transition option

This is largely overstated. Removing -Wl,--gc-sections is an apparent option. It is broad and may be considered unfortunate. But sanitizers/PGO in the past did use this option to work around GNU ld or gold problems.

Adding -Wl,-z,nostart-stop-gc is a much better option. GNU ld doesn't even error for unrecognized options. There is a warning but it does no harm. ld.lld < 13.0.0 errors for -z nostart-stop-gc which could be unfortunate, but it doesn't make the configure-time check "unviable".

For the umpteenth time, you cannot write a configure-time check for a run-time property. If a toolchain is built on a system with LLD < 13 then it thinks the option is not supported, but if that build is then distributed to a user that has LLD >= 13 it is broken by the change of semantics.

This is the problem you keep ignoring.

I think we probably often talk past each other. I have spent/wasted a large portion of the afternoon and night on repeating my points and felt quite tired now.

I have repeatedly said that such usage pattern falls on the edge of a supported interface. I cannot comfortably say this is supported, even it did work for some cases, Sure, you can always test the limit of toolchain compatibility and find this unappealing, but this is very very rare.

Quuxplusone commented 3 years ago
(In reply to Fangrui Song from comment #3)
> > - As mentioned in D96914, the approval of the original commit was
conditional on the default behaviour remaining unchanged, and the default
behaviour was then changed without additional review. That alone should be
sufficient to warrant a revert even if no software was found to be broken by
the change.
>
> I did report to a few groups, including a SN systems linker person, Chrome
> OS, and FreeBSD.
> (I failed to report to Fuchsia. I think I likely notified FB though not 100%
> sure.)

But you didn't notify the multiple people who specifically told you beforehad
that they foresaw problems if that default were to be changed.

> The commit-without-review practice can surely be discussed but "warrant a
> revert even if no software was found to be broken by the change" definitely
> needs a re-think before you take it for granted.

To be clear, I meant in a "you shouldn't have pushed, you should revert it",
not a "you weren't allowed to push, anyone can revert it" sense. To bypass
relevant people comes across badly and was needlessly antagonising
(unintentionally, I'm sure, but that's the effect it had). As far as
permissions go, sure, you're allowed to. At the same time @tstellar is allowed
to revert this on the release branch if he is so inclined. I do not know
whether he is and that's not the discussion we should be having.

> For context, it is https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99587
> However, I don't think it's a blocker for LLD behavior change. A piece of
> software can do
>
>   #define retain _Pragma("GCC diagnostic ignored \"-Wattributes\"")
> __attribute__((used,retain))

_Pragma can only appear in specific syntactic positions, not anywhere
__attribute__ can appear, so this does not work as a macro definition. Also
this suppresses -Wattributes for everything, while some software (not
NetworkManager) intentionally builds with -Werror=attributes, so warnings for
other attributes would ideally not be suppressed.

Our documentation says "Query for this attribute with __has_attribute(*)"
(https://clang.llvm.org/docs/AttributeReference.html) for most other
attributes. That's the way we should be able to recommend here as well:

#if __has_attribute(__retain__)
#define retain __attribute__((__retain__))
#else
#define retain
#endif

That's what we currently cannot recommend.

> Nowadays Clang+LLD is the main user and expects certain feature sets from
> LLD.
> There are some GCC+LLD users and we if reasonable we want to make that as
> smooth as possible.
> But if GCC folks are not engaged in fixing something, I don't think we
> should let the main Clang+LLD use case hurt if
> the GCC side issue is not really a blocker (as in this case).

This is again needlessly antagonising. There has been no indication on the GCC
bug report that the LLD default would change, or has changed, and that this
would therefore become important to address sooner.

> I can prepare a change to move it to "Breaking changes".
>
> I am also happy to add `sec->name.startswith("__libc_")`
> like workarounds for NetworkManager, if you tell me that sections it needs.
> You never link to the NetworkManager issue, which I think would be the
> productive way forward.

https://gitlab.freedesktop.org/NetworkManager/NetworkManager/-/merge_requests/1008

The section name in NetworkManager is "connection_defaults", not something that
LLD should hardcode.

> > Note: provided the LLD change can be reverted for now and there is
agreement that it will not be reinstated at least until GCC is fixed, I would
be happy to help out in getting GCC fixed.
>
> While I disagree with your request about revert, I'd appreciate if you can
> fix https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99587 or just let them give
> more attention on the issue.

The priority for me depends on whether this gets reverted in LLD first. If it
gets reverted first, we have time to fix it properly. If it doesn't, software
that needs updating will need updating in a way that handles GCC 11.2 anyway,
fixing it in GCC doesn't help with that, so that will be what I'll look into
first.