cisco / ChezScheme

Chez Scheme
Apache License 2.0

workaround Clang v15 AArch64 miscompile that affects parallel collection #879

Closed mflatt closed 1 month ago

mflatt commented 1 month ago

This patch avoids a miscompile using Clang v15 on macOS. The default compiler on macOS was recently upgraded to Clang v16, which appears to fix the problem, and I have not been able to replicate the problem with Clang v15 variants that are available in Linux distributions. So, it might be ok to just ignore the problem. But since v15 installations are likely to hang around for a while in other macOS installations, since the workaround is simple, since Racket users who build themselves are affected, and since I spent a lot of time tracking down the problem, I'm inclined to include a workaround.

For details on the miscompile as it affects Chez Scheme, see clang15-miscompile.zip.

mflatt commented 1 month ago

I spent so long tracking this down that I'd like to tell you the long story, even though it doesn't really matter. The miscompile seems like a run-of-the-mill compiler error, but the way it affected Chez Scheme and Racket made it especially difficult to find.

During 2022-2024, I've tried off and on to track down an occasional failure in Racket builds on my macOS M1/M2 laptops. Memory would get mangled late in the build, specifically during documentation rendering for the "math" library, which uses libgmp and libmpfr in multi-threaded mode. Since the problem never happened on x86_64, and since it only happened during parallel documentation rendering, I was pretty sure that I was looking for some sort of race condition exposed by AArch64's weak memory coherence.

Although I discovered that I could provoke a crash by just rebuilding documentation, even that step takes 10 minutes, and the crash would only happen rarely, so getting a crash would take hours. Any little change I made to try to gather information would make the crash go away or become much more difficult to provoke, so hours turned to days.

Meanwhile, users of the Racket main distribution were not running into problems, which I chalked up to the fact that documentation is pre-rendered. Also, maybe more generally libgmp or libmpfr needed to be involved, so maybe it wasn't my problem. In any case, the lack of reports made the problem feel less of an emergency than I would normally consider crashing bugs, especially since I had so much trouble replicating the crash or pinpointing an issue. So, I'd burn a day or three on the issue every few months.

In September 2024, I finally gathered evidence to suspect that the problem was in the GC's parallel mode. And with that suspicion, I was finally able to make a small Chez Scheme program with the right ingredients to crash, showing that the problem was independent of Racket and math libraries. The big difference was being able to provoke a crash within seconds instead of hours, and I found the problem over the next day.

In retrospect, it's clear why the problem was so difficult to find. I was pretty sure I was looking for a memory race, but that turned out to be because only multi-threaded programs could reach the miscompiled code. And only during parallel collections. And only when the collector is looking at specific words within a thread representing virtual registers, which are not something that programs normally use directly. The effect of the miscompile was that a "does this object belong to me?" check would succeed when it shouldn't. That matters only when a thread has an object in its virtual register that was allocated by a different thread, which is an even rarer use of a virtual register. And even when it goes wrong, there's only a small chance that different collector threads will end up looking at the same object at the same time, and even concurrent traversal of the same object will turn out ok a lot of the time! Finally, and most perniciously, the miscompile creates a race that isn't in the source code, and in a code template that is put in place by a macro that is used dozens of times in the output (and compiled ok in all other instances).

Meanwhile, Racket distributions are compiled with Clang v12, which is why it hasn't been a problem for Racket users, even when they run programs with parallelism.

maoif commented 1 month ago

Thanks for the fix and for sharing your experience of tracking down this tricky bug.

ufo5260987423 commented 1 month ago

You are the hero!

glandium commented 1 month ago

#if defined(__arm64__) && defined(__clang__) && (__clang_major__ == 15)

FYI, __clang_major__ from the clang provided by Apple on macOS/Xcode does not match the upstream clang version's __clang_major__. For some reason, Apple decided clang's version was Xcode's, so it doesn't match the LLVM version it's derived from: Xcode 15's clang is based on LLVM 16. So, __clang_major__ == 15 matches an entirely different compiler on clang builds that don't come from Xcode/Apple. You may want to add defined(__apple_build_version__).
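
A minimal sketch of what that suggestion could look like, assuming the guard from the patch is otherwise kept as quoted above. The macro name NEEDS_APPLE_V15_WORKAROUND and both helper functions are hypothetical names for illustration, not taken from the Chez Scheme sources; only __arm64__, __clang__, __clang_major__, and __apple_build_version__ are real predefined macros.

```c
#include <stdio.h>

/* Hypothetical macro name (not from the patch). The extra
   defined(__apple_build_version__) term restricts the check to Apple's
   clang, where major version 15 follows Xcode's numbering. */
#if defined(__arm64__) && defined(__clang__) \
    && defined(__apple_build_version__) && (__clang_major__ == 15)
# define NEEDS_APPLE_V15_WORKAROUND 1
#else
# define NEEDS_APPLE_V15_WORKAROUND 0
#endif

/* Hypothetical helper: reports at run time what the guard decided
   at compile time. */
int needs_apple_v15_workaround(void) {
  return NEEDS_APPLE_V15_WORKAROUND;
}

/* Hypothetical helper: prints the toolchain facts the guard depends on. */
void describe_toolchain(void) {
#if defined(__clang__)
  printf("clang major: %d\n", __clang_major__);
#else
  printf("not clang\n");
#endif
#if defined(__apple_build_version__)
  printf("Apple build: yes\n");
#else
  printf("Apple build: no\n");
#endif
  printf("workaround: %s\n", needs_apple_v15_workaround() ? "on" : "off");
}
```

Calling describe_toolchain() from a small test program on each build machine would show whether the guard fires there. With the extra term, an upstream clang 15 on an AArch64 Linux machine should leave the workaround off, while Apple's v15 tools on an M1/M2 machine should turn it on.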

mflatt commented 3 weeks ago

Update and correction: The problem appears to be a linker bug, not a compiler bug.

I tried building different versions of Clang from https://github.com/swiftlang/llvm-project, and no version that I tried produced a crashing program on my machine with Apple's v16 tools. All versions that I tried produced a crash on my machine with v15 tools. With that hint, I found that copying object files between the machines also leads to a crash when they're linked on the machine with v15 tools, independent of the machine/compiler used to generate the object files.

A linker problem makes sense; it just didn't occur to me before. It's a more clear explanation of why the problem is macOS-specific. It also means that the workaround is indirect — using the compiler version as a proxy for the linker that will be used — but still seems good enough as a workaround for older tools.

Maybe a more specific explanation could be pinned down by building different linker versions from the sources at https://github.com/apple-oss-distributions, but I don't have or know the right setup for that.

jryans commented 3 weeks ago

Xcode 15 was the first version that included a new linker, so perhaps these issues were related to bugs in that new codebase. A few linking issues were fixed in 15.1. There may also be bug fixes that didn't make it into the release notes.