Closed wknapik closed 2 years ago
every instance is out of memory. swap increase or RAM increase should solve this.
In my experience swapping in CI is a bad idea. Kills performance. IMO we should add more RAM.
@wknapik https://downloadmoreram.com/
Let's download and add some more RAM.
OOM might not be the issue [1], but a bug in lld
or some other memory-related bug.
malloc_report
is trying to report something, but to see the data we should set environment variable MallocDebugReport=crash
before building and look at the additional output when a failure occurs.
also it might be good to dump system events log after the crash, maybe something interesting will be there.
malloc_report is trying to report something, but to see the data we should set environment variable MallocDebugReport=crash before building and look at the additional output when a failure occurs.
Adding that in https://github.com/brave/devops/pull/8228
also it might be good to dump system events log after the crash, maybe something interesting will be there.
Got a command for this?
Got a command for this?
dump syslog:
cat /var/log/system.log
dump last 10 minutes of Apple System Log:
log show --last 10m
@goodov I was hoping for a way to get the exact info we need
mac-mini1-ky:~ jenkins$ log show --last 10m|wc -l
53094
mac-mini1-ky:~ jenkins$ wc -l /var/log/system.log
76 /var/log/system.log
mac-mini1-ky:~ jenkins$
so something between the 70 and 50k lines the two commands produce ;]
We could pass some extra arguments to log show
and/or grep the output, but I don't know what patterns we're looking for...
Worst case we can just dump the 50k lines, but if you know of a way to trim this down a bit, that would be nice. For instance knowing which service would log the info would be enough to narrow it down to only logs from that service.
but I don't know what patterns we're looking for...
me neither :) if you can run the ASL dump right after the npm run build
failure, then we can trim ASL to 1 minute. should be good enough.
@goodov got a new hit - https://ci.brave.com/job/brave-browser-build-macos-arm64/1456/execution/node/103/log/#-33526383000f97e22-c981-4d1e-8d5b-6c0b7408d959. Here's the last 5min of logs dumped after the failure https://ci.brave.com/job/brave-browser-build-macos-arm64/1456/execution/node/110/log/?consoleFull.
after adding export MallocDebugReport=crash
we got this:
20:12:46 ld64.lld(2483,0x70000dc21000) malloc: *** error for object 0x6000c26fd110: pointer being freed was not allocated
20:12:46 ld64.lld(2483,0x70000dc21000) malloc: *** set a breakpoint in malloc_error_break to debug
20:12:46 PLEASE submit a bug report to https://crbug.com/ and run tools/clang/scripts/process_crashreports.py (only works inside Google) which will upload a report and include the crash backtrace.
20:12:46 Stack dump without symbol names (ensure you have llvm-symbolizer in your PATH or set the environment var `LLVM_SYMBOLIZER_PATH` to point to it):
20:12:46 0 lld 0x000000010a6ffefb llvm::cl::opt<bool, false, llvm::cl::parser<bool>>::printOptionValue(unsigned long, bool) const + 443419
20:12:46 1 lld 0x000000010a7002c4 llvm::cl::opt<bool, false, llvm::cl::parser<bool>>::printOptionValue(unsigned long, bool) const + 444388
20:12:46 2 libsystem_platform.dylib 0x00007ff806895e2d _sigtramp + 29
20:12:46 3 lld 0x000000010a6172f3 llvm::OuterAnalysisManagerProxy<llvm::AnalysisManager<llvm::Module>, llvm::Function>::run(llvm::Function&, llvm::AnalysisManager<llvm::Function>&) + 44099
20:12:46 4 libsystem_c.dylib 0x00007ff8067ccd10 abort + 123
20:12:46 5 libsystem_malloc.dylib 0x00007ff8066aa3e2 has_default_zone0 + 0
20:12:46 6 libsystem_malloc.dylib 0x00007ff8066ad5ed malloc_report + 151
20:12:46 7 lld 0x000000010b7b44de llvm::Expected<llvm::ArrayRef<llvm::object::XCOFFRelocation32>> llvm::object::XCOFFObjectFile::relocations<llvm::object::XCOFFSectionHeader32, llvm::object::XCOFFRelocation32>(llvm::object::XCOFFSectionHeader32 const&) const + 56862
it looks like a memory bug in lld
.
Looking into possible crashdumps now and further discussion with Chromium. tools/clang/crashreports
doesn't have anything to upload.
Another hit (build log, syslog).
I see lld being mentioned
20:26:06 2022-09-05 11:25:41.892499-0700 0x3b2c59 Default 0x0 0 0 kernel: lld[38257] Corpse allowed 1 of 5
20:26:06 2022-09-05 11:25:41.893206-0700 0x3b2c36 Default 0x0 0 0 kernel: Failed to send exception EXC_CORPSE_NOTIFY. error code: 5 for pid 38257
20:26:06 2022-09-05 11:25:42.206520-0700 0x68 Default 0x0 0 0 kernel: Corpse released, count at 0
got this during one of the reproduce runs, but it’s unclear if this is the initial culprit or an aftermath of a crash. this is a fatal error that aborts the link process, but it didn't appear in the output before.
LLVM ERROR: Failed to rename temporary file thinlto-cache/Thin-09cb92.tmp.o to thinlto-cache/llvmcache-33702F878EB43F8E74A4416DA83A72AE95819344: No such file or directory
We enabled developer mode on mac-imac3-ky via
sudo DevToolsSecurity -enable
sudo dseditgroup -o edit -a jenkins -t user _developer
This should be reverted after debugging is done.
mac-imac3-ky
.lld
crash is pretty rare (around 1 in ~100 links) and only happens in lld
Release/RelWithDebInfo
configurations.lld
with ASan hides the crash, i.e. no sanitizer errors and no crash. Rebuilding lld
in Debug configuration also hides the crash. lld
with UBSan revealed some unaligned memory access errors because of reinterpret_casts, but fixing them hasn't solve the issue.lld
with TSan didn't show any issues.Properly symbolized crashes look like this:
ld64.lld(58975,0x70000f179000) malloc: *** error for object 0x6002a0484000: pointer being freed was not allocated
ld64.lld(58975,0x70000f179000) malloc: *** set a breakpoint in malloc_error_break to debug
0. Running pass 'Function Pass Manager' on module 'Users/jenkins/jenkins/workspace/goodov-linking-errors-nightly/src/out/Release/obj/third_party/vulkan-deps/spirv-tools/src/libspvtools_opt.aamd_ext_to_khr.o6886248'.
1. Running pass 'Debug Variable Analysis' on function '@_ZN8spvtools5utils14AppendToVectorERKNSt2Cr12basic_stringIcNS1_11char_traitsIcEENS1_9allocatorIcEEEEPNS1_6vectorIjNS5_IjEEEE'
Stack dump without symbol names (ensure you have llvm-symbolizer in your PATH or set the environment var `LLVM_SYMBOLIZER_PATH` to point to it):
0 lld 0x000000010d11fcee llvm::sys::PrintStackTrace(llvm::raw_ostream&, int) + 46
1 lld 0x000000010d11ee45 llvm::sys::RunSignalHandlers() + 85
2 lld 0x000000010d120340 SignalHandler(int) + 272
3 libsystem_platform.dylib 0x00007ff809a3ae2d _sigtramp + 29
4 libsystem_platform.dylib 0x00000000000000e0 _sigtramp + 18446603370419213008
5 libsystem_c.dylib 0x00007ff809971d10 abort + 123
6 libsystem_malloc.dylib 0x00007ff80984f3e2 has_default_zone0 + 0
7 libsystem_malloc.dylib 0x00007ff8098525ed malloc_report + 151
8 lld 0x000000010e0aafe8 llvm::LexicalScopes::~LexicalScopes() + 344
9 lld 0x000000010e5b3183 (anonymous namespace)::LDVImpl::runOnMachineFunction(llvm::MachineFunction&, bool) + 15091
10 lld 0x000000010e67205d llvm::MachineFunctionPass::runOnFunction(llvm::Function&) + 349
11 lld 0x000000010f97be9f llvm::FPPassManager::runOnFunction(llvm::Function&) + 1055
12 lld 0x000000010f982571 llvm::FPPassManager::runOnModule(llvm::Module&) + 65
0. Running pass 'Function Pass Manager' on module 'Users/jenkins/jenkins/workspace/goodov-linking-errors-nightly/src/out/Release/obj/third_party/vulkan-deps/spirv-tools/src/libspvtools_opt.ainstrument_pass.o242484496'.
1. Running pass 'Debug Variable Analysis' on function '@_ZN8spvtools3opt14InstrumentPass14GetVec4FloatIdEv'
Stack dump without symbol names (ensure you have llvm-symbolizer in your PATH or set the environment var `LLVM_SYMBOLIZER_PATH` to point to it):
0 lld 0x000000011025316b llvm::sys::PrintStackTrace(llvm::raw_ostream&, int) + 43
1 lld 0x0000000110253534 SignalHandler(int) + 244
2 libsystem_platform.dylib 0x00007ff80d96be2d _sigtramp + 29
3 libsystem_malloc.dylib 0x00007ff80d7732a9 nanov2_size + 20
4 libsystem_c.dylib 0x00007ff80d8a2d10 abort + 123
5 libsystem_malloc.dylib 0x00007ff80d7803e2 has_default_zone0 + 0
6 libsystem_malloc.dylib 0x00007ff80d7835ed malloc_report + 151
7 lld 0x000000010f692a95 llvm::LexicalScopes::~LexicalScopes() + 133
8 lld 0x000000010f8a44eb (anonymous namespace)::LDVImpl::runOnMachineFunction(llvm::MachineFunction&, bool) + 2555
9 lld 0x000000010f9255e4 llvm::MachineFunctionPass::runOnFunction(llvm::Function&) + 244
The crash happens on exit from LDVImpl::computeIntervals()
in LexicalScopes::~LexicalScopes
destructor: some unordered_map
items fail to deallocate.
Chromium has hit a similar bug recently on Windows, but it doesn't look like our case - they've got a stable reproducible assert()
.
I did debug the failed lld instance in lldb
, but at the point of failure it's absolutely unclear what exactly and at what point has failed. The code needs to be profiled carefully to pinpoint the exact place when things go downhill, it requires a lot of time.
linker_driver.py
once in a case of failure when lto-enabled link is detected.Change src/tools/clang/scripts/build.py
to enable sanitizer:
@@ -852,7 +854,7 @@ def main():
# PGO needs libclang_rt.profile but none of the other compiler-rt stuff.
bootstrap_args.extend([
'-D' + f
- for f in compiler_rt_cmake_flags(sanitizers=False, profile=args.pgo)
+ for f in compiler_rt_cmake_flags(sanitizers=True, profile=args.pgo)
])
if sys.platform == 'darwin':
bootstrap_args.extend([
@@ -1197,6 +1205,7 @@ def main():
cmake_args.append('-DRUNTIMES_' + triple + '_' + arg)
cmake_args.append('-DLLVM_BUILTIN_TARGETS=' + all_triples)
cmake_args.append('-DLLVM_RUNTIME_TARGETS=' + all_triples)
+ cmake_args.append('-DLLVM_USE_SANITIZER=Address')
if os.path.exists(LLVM_BUILD_DIR):
RmTree(LLVM_BUILD_DIR)
./build.py --bootstrap --without-android --without-fuchsia
If no one objects, I will add a temporary fix to restart the linker inside linker_driver.py once in a case of failure when lto-enabled link is detected.
sgtm!
Thanks for taking the time to debug this @goodov! Even though the failures may be rare, we build frequently enough that it translates into a lot of failed builds requiring manual reruns. Given our limited macOS resources, it's all the more important to fix, or at least work around this.
Please lmk when we can go ahead with the wrap-up steps from the first post in this issue:
- [ ] Disable developer mode on mac-imac3-ky
- [ ] Revoke dev access to mac-imac3-ky
- [ ] Possibly lower timeouts in macos pipelines
The PR with a workaround is ready to be merged, but I'd like to see another failure on nightly
before merging in case the issue has been fixed in upstream llvm.
@goodov I don't think I've seen any more linking errors recently (which is great!). Should we proceed with the wrap up steps for this issue?
yes, let's wrap up. If it will reappear, we'll return to the workaround.
@goodov do we know how this got fixed?
@goodov do we know how this got fixed?
no. I tried to look into the build history to figure out when the crashes stopped to maybe map to a Chromium update, but the available history is pretty short.
the history I'm talking about is here: https://ci.brave.com/job/brave-browser-build-macos-x64/ and here: https://ci.brave.com/job/brave-browser-build-macos-arm64/
Alright, thanks again for looking into it. Will proceed with the wrap up steps and close the issue.
Before closing this issue
We keep seeing these errors in public builds on macOS arm64 and x64 for a while. Rebuilds succeed.