ValveSoftware / steam-runtime

A runtime environment for Steam applications

GNU symbol versioning long-term issues #383

Closed sylware closed 1 year ago

sylware commented 3 years ago

Compiling against a library with GNU symbol versioning on recent glibc/linux distros produces binaries that break on not-that-old glibc/linux distros. I don't know whether it is an upstream symbol versioning bug or whether it is by design and done on purpose. Should the steam runtime start to be built against glibc/libraries without gnu symbol versioning enabled? Should it be considered a significant addition of static linking in the runtime? I ask because devs are building their linux version of games on "latest of the latest" mainstream distros, which now breaks on distros that are not "glibc-state-of-the-art".

smcv commented 3 years ago

devs are building their linux version of games on "latest of the latest" mainstream distros

They should not be doing that. The design of the Steam Runtime is that game developers should be compiling their games in the Steam Runtime SDK, which is older than your distribution.

If game developers are not building their games in the recommended way, then the Steam Runtime can't help you to run them.

This is not really specific to GNU symbol versioning: even if we didn't have symbol versioning, games built in newer environments would be using individual symbols from their newer libraries that don't exist in the version of the same library found in your older distribution.
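With symbol versioning, the dependency on a newer library is at least visible in the binary. The following is a minimal Python sketch of that idea, not part of any Steam Runtime tooling: given a list of versioned symbol references, it reports the newest GLIBC version node among them, which is the oldest glibc that could satisfy the binary. The function names and the sample list are assumptions made for illustration.

```python
import re

# Matches references such as "__libc_start_main@GLIBC_2.34" or "memcpy@@GLIBC_2.14".
_GLIBC_REF = re.compile(r"@@?GLIBC_(\d+(?:\.\d+)*)$")


def glibc_version_of(symbol_ref):
    """Return the GLIBC_x.y version node of one reference as an int tuple, or None."""
    match = _GLIBC_REF.search(symbol_ref)
    if match is None:
        return None
    return tuple(int(part) for part in match.group(1).split("."))


def minimum_glibc(symbol_refs):
    """Return the newest version node referenced, i.e. the oldest glibc that satisfies all of them."""
    versions = [v for v in map(glibc_version_of, symbol_refs) if v is not None]
    return max(versions) if versions else None


# Hypothetical reference list for a game binary linked on a recent distribution.
refs = [
    "printf@GLIBC_2.2.5",
    "memcpy@GLIBC_2.14",
    "__libc_start_main@GLIBC_2.34",
    "SDL_Init",  # unversioned symbol from another library: ignored
]
print(minimum_glibc(refs))
```

A binary built inside an older SDK would only ever reference version nodes that the SDK's glibc provides, so the result would stay at or below the SDK's glibc version.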

Should the steam runtime start to be built against glibc/libraries without gnu symbol versioning enabled?

No. If we did, that wouldn't solve anything: if game developers are building against non-Steam-Runtime dependencies, then changing the Steam Runtime is not going to affect their binaries.

(Dropping GNU symbol versioning would break existing games, though.)

Should it be considered a significant addition of static linking in the runtime?

We can't link the most important libraries statically, for two main reasons:

sylware commented 2 years ago

I see only one way for games in the future to avoid being abused by non-pertinent planned obsolescence:

Distributed game binary code, shared libs or executables, should be pure ELF and statically load only libdl (as a DT_NEEDED entry in the ELF dynamic section); the libc must not be there. They should then go through dlopen/dlsym/dlclose to use anything from the user's system (even glibc libs would have to be libdl-ed). That is quite a lot more work, and since c++ is c++, compiling/linking with -static-libstdc++ will probably break it, since I don't think gcc/libstdc++ properly libdl-s anything from glibc. -static-libgcc should be safe for now I guess, but I don't know about the gcc startup code object files.

Since in glibc2.34, even ELF->C runtime bridging code was versioned (from suse linux), game code would not have to be linked with [S]crtX.o objects.

I cannot find the intended goal of this abuse of GNU symbol versioning other than blunt planned obsolescence. Maybe the explanation and rationale for this behavior are just very hard to find, lost in the middle of thousands of mailing-list messages.

I should find myself a "System V ABI" document and read the stuff about DT_INIT_ARRAY and DT_FINI_ARRAY, since the glibc stuff would have to be initialized there (I suspect those ELF sections cannot properly initialize all glibc services).

sylware commented 2 years ago

ok, the stuff from the ELF->C runtime bridging code should deal only with statically linked binaries.

nickalcock commented 2 years ago

Doing that requires reimplementing a large part of glibc/elf/, and more: you'd need to make sure that your ld.so replacement was compatible with every version of glibc out there. This is an insane amount of work (as in, the glibc developers have never even considered attempting it outside of idle daydreams) and may well be impossible: ld.so and glibc are tightly tied in all sorts of ways, and you cannot update them independently. What you're suggesting is tantamount to updating them independently every time.

It seems likely to be much easier to use LD_PRELOAD or a library built with -Wl,--auxiliary to override the csu definition of __libc_start_main appropriately for older glibc releases (providing an implementation of __libc_start_main@@GLIBC_2.34 which does what glibc 2.34's does). That way you only have to keep up with changes to 500-odd lines of moderately hairy code called only once rather than thousands of lines of utterly critical very hairy arch-dependent code that changes frequently and has countless subtle dependencies on the rest of glibc.

sylware commented 2 years ago

Yep, I was abstracting away the SDK, which is probably one of the worst things in GNU. Well, after digging deeper, this is what I have in mind: distributed binaries (exe and so) should be pure ELF64 and should libdl everything from the system (including the libc). Namely, the only statically loaded system lib (listed in the proper ELF section) should be libdl and nothing else, with dlopen/dlsym/dlclose as the only undefined external symbols. Additionally, ELF TLS (thread-local storage) symbols should not be used, and pthread TLS should be used instead (because ELF TLS symbol support needs to parse the ELF structures of the binaries in order to get "offsets" and then call the ABI-defined TLS function). c++ being c++, those who chose to use it will have to fork their static c++ runtime (gcc's or clang's, I guess) to ensure it "libdl"s everything it needs from the system (aka from the libc).
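The "libdl everything" pattern described above can be illustrated with a small Python sketch using ctypes, which wraps dlopen and dlsym on Unix-like systems. This only shows the lookup pattern (open a handle, resolve a name at run time, call through the pointer). It says nothing about whether a native game binary could start without libc already loaded. The helper name `lookup` is hypothetical, and a Unix-like host is assumed.

```python
import ctypes

# Equivalent to dlopen(NULL): a handle covering the running program and the
# libraries it already has loaded. Assumes a Unix-like host.
process = ctypes.CDLL(None)


def lookup(handle, name, restype, argtypes):
    """Resolve `name` at run time (the dlsym step) and attach a C prototype to it."""
    func = getattr(handle, name)  # raises AttributeError if the symbol is missing
    func.restype = restype
    func.argtypes = argtypes
    return func


# Nothing here refers to strlen at link time; it is found by name when this runs.
strlen = lookup(process, "strlen", ctypes.c_size_t, [ctypes.c_char_p])
print(strlen(b"steam-runtime"))
```

Note that the handle here refers to the libc that is already mapped into the process, so this sketch does not load a second, independent copy of libc.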

nickalcock commented 2 years ago

sylware: alas that doesn't work either: libdl isn't a real library in any meaningful sense in glibc 2.1+: it's just a pile of incredibly intricate wrappers for machinery actually residing in libc and (mostly) in ld.so; so if you don't have both in place when you call dlopen functions, you're in real trouble and libdl simply won't work (and since ld.so and libc are tightly coupled, unless you've avoided all the ld.so startup machinery and csu/ you'll be constrained to use a single specific version of libc and ld.so that matches the csu/ startup code you linked against, which is far worse than what we have now: and if you have somehow avoided using the glibc csu/ stuff for startup, you can't use either ld.so or libc at all). Also, errno is TLS, and you can't avoid using that! Even if you try, the libc code obviously tries to write to it, and if it's not there, boom. (There's also a lot of locking activity in even simple things, and that involves TLS stuff too IIRC. You'd certainly have to avoid e.g. malloc... and if you're avoiding all of that, I'm wondering why you're trying to use a libc at all).

Also also dlopen and versioned symbols, well, that way lies pain, even with dlvsym. dlopening libc is really really difficult (I've done it, it's awful, it breaks a lot, and I was doing it in an alternate symbol namespace via dlmopen: doing it via dlopen, well, if you try you'll find there is a special case that prevents actually doing a dlopen but just redirects you back to the main libc: if there somehow isn't one, which should be impossible but which you are attempting, I'd expect just a crash).

I think this whole thing needs discussion on libc-alpha and/or libc-help (really, I'd recommend libc-alpha, since this stuff is so difficult that it probably needs some change to glibc to make this obviously important use case work better without terrifying and mind-melting dangerous hacks).

sylware commented 2 years ago

I have to say I forgot about this TLS errno because I usually use direct linux syscalls, use basic glibc functions that do not set errno, or explicitly ignore errno with the glibc functions that use it. Namely, my code does not have the errno location address in the TLS hardware segment (x86_64), because I don't parse the module ELF structures in order to get the errno TLS offset and ask the ABI function __tls_get_addr() for it. This is the underlying idea I was actually pushing forward.
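For readers unfamiliar with errno being thread-local: a rough Python sketch, assuming a Linux host whose libc exports `__errno_location` (glibc does), can show that each thread gets its own errno storage. The function `errno_addresses` is a hypothetical helper written for this illustration.

```python
import ctypes
import threading

libc = ctypes.CDLL(None)

# __errno_location() returns the address of the calling thread's errno.
# Assumption: the host libc exports this symbol (true for glibc on Linux).
errno_location = getattr(libc, "__errno_location")
errno_location.restype = ctypes.c_void_p
errno_location.argtypes = []


def errno_addresses():
    """Return the errno address seen by the calling thread and by a worker thread."""
    result = {}

    def worker():
        result["worker"] = errno_location()

    thread = threading.Thread(target=worker)
    thread.start()
    thread.join()
    # The calling thread is alive for the whole lifetime of the worker,
    # so the two threads cannot be sharing one TLS block.
    result["caller"] = errno_location()
    return result


addresses = errno_addresses()
print(addresses["caller"] != addresses["worker"])
```

Any libc function that reports failure writes through that per-thread address, which is why code that never looks up the TLS location itself can still depend on it being set up.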

nickalcock commented 2 years ago

That doesn't feel very safe to me. If the price of doing things this way is ignoring all errors from glibc... I'd say it's safer to do almost anything else, and this approach should simply never be used. (Among other things, this relies on the set of glibc functions that set errno never changing. Needless to say this is not a guarantee glibc provides!)

sylware commented 2 years ago

Finally, I did have a look: dlsym is TLS safe. So, in the end, no problem there.

The real problem is the static libstdc++ which does not libdl anything from the system and seems to link with some glibc internal symbols. The static libgcc (the one without exception handling) seems to be mostly safe as it is mostly a real leaf utility lib.

All distributed game binaries are to be pure and simple ELF64 binaries (with the least amount of relocation types), with only libdl as a from-the-system statically loaded lib (ELF DT_NEEDED), and with undefined symbols in the ELF dynsym only from those very distributed binaries.

The lesson is that c++ (gcc) is currently, and still(!), NOT binary-only distribution friendly.

Are the clang c++ static libstdc++ and libgcc not broken like the gcc ones?

(I already had an acute negative opinion of c++; now it is even worse.)

nickalcock commented 2 years ago

Of course they are, because this is not brokenness: this is just libstdc++ using perfectly normal, public functions in glibc, some of which are implemented as macros that call differently-named functions inside glibc itself (usually to allow structure versioning, but not always). All these __-implementation-defined internal functions are of course ABI-stable, because if they weren't, almost no program would be able to benefit from glibc's ABI stability guarantees and run on a new system after building on an old one, so calling them is fine: what isn't fine is expecting to be able to build on a new system and run on an old one, which is simply something that has never been expected to work, not with glibc nor with any other shared library in the entire history of Unix. It just so happens that glibc 2.34 made that extra-obvious by bumping the ABI of a symbol used at startup.

LLVM's C++ standard library is going to be doing the same thing as libstdc++ is here because it too wants to call things like assert(), use errno, etc, all of which happen to be implemented using implementation-namespaced symbols. Using some things, like __cxa_atexit, is more or less required by the ABI: see e.g. llvm/libcxxabi/src/cxa_thread_atexit.cpp (heck, libcxxabi is rife with these sorts of things).

sylware commented 1 year ago

Yep, c++ is the problem again.

I guess a fork of the static libstdc++ from gcc and/or clang would be required to generate "clean" and pure elf64 binaries robust against symbol version abuse and more.

That said, the glibc devs have geniuses too: one of them did add a new version of __libc_start_main in glibc 2.34, which is in the main libc shared object.

It "just" means that any game exe linked with 2.34 will refuse to load on any system with a set of glibc libs < 2.34...
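The load-time rule described here can be written down as a simplified Python sketch. It assumes a Unix-like host and uses glibc's `gnu_get_libc_version()` through ctypes where available. The real dynamic loader checks each referenced version node rather than a single number, and the helper names `host_glibc_version` and `can_load` are hypothetical.

```python
import ctypes
import re


def host_glibc_version():
    """Return the running glibc version as (major, minor), or None if the host libc is not glibc."""
    try:
        query = ctypes.CDLL(None).gnu_get_libc_version
    except (AttributeError, OSError, TypeError):
        return None
    query.restype = ctypes.c_char_p
    query.argtypes = []
    match = re.match(r"(\d+)\.(\d+)", query().decode())
    return (int(match.group(1)), int(match.group(2))) if match else None


def can_load(required, host):
    """A binary that needs version node `required` loads only if the host glibc is at least that new."""
    return host is not None and host >= required


# An executable linked against glibc 2.34 picks up __libc_start_main@GLIBC_2.34.
needs = (2, 34)
print(can_load(needs, (2, 31)))  # a host that only has glibc 2.31
print(can_load(needs, host_glibc_version()))  # this machine
```

Building inside an older SDK keeps `needs` low enough that both older and newer hosts satisfy it.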

I wanted to fire all c++ gcc devs, now I want to do the same with the glibc devs.

sylware commented 1 year ago

The unreal engine 5.2 is using clang/llvm, maybe their c++ runtime has the decency to go thru libdl? Not like gcc libstdc++?

nickalcock commented 1 year ago

That said, the glibc devs have geniuses too: one of them did add a new version of __libc_start_main in glibc 2.34, which is in the main libc shared object.

Yes; it's annoying, but the alternative was leaving a known ROP gadget in every single executable (the ELF array constructor code, which literally traverses an array and calls every function in it: you don't really need to do anything special to make that thing a ROP gadget) -- and for non-PIE executables, at a known, constant address too. Making that go away seemed more important. It doesn't prove that glibc people are idiots or sadistic people who hate you, it's just an engineering tradeoff, and "attackers can run arbitrary code in every single binary" is quite bad and really does deserve a fix.

It's not like other things don't break symbol version in every glibc release. This one was just unusual because it's used by every binary (but not shared libraries). I mean this is why Valve uses containers for everything these days anyway, which should work around this without any problems at all.

(an aside: With most symbol version changes, the mythical never-written machinery that would let you build stuff for older glibcs on systems running newer ones (by forcing older symbol versions) might have saved us, but this fix proved the limitations of that approach too -- it changes the actual startup code linked into every binary, so just changing symbol versions would only make things crash differently, not fix anything.)

lostgoat commented 1 year ago

Closing as the relevant answer has already been provided by @smcv.

When targeting the steam runtime, the application's build system must use the corresponding runtime's SDK.