brave / brave-browser

Brave browser for Android, iOS, Linux, macOS, Windows.
https://brave.com
Mozilla Public License 2.0
17.79k stars 2.32k forks source link

Repeating macOS linking errors #23734

Closed wknapik closed 2 years ago

wknapik commented 2 years ago

Before closing this issue


We keep seeing these errors in public builds on macOS arm64 and x64 for a while. Rebuilds succeed.

FAILED: libVkLayer_khronos_validation.dylib libVkLayer_khronos_validation.dylib.TOC libVkLayer_khronos_validation.dylib.dSYM/Contents/Info.plist libVkLayer_khronos_validation.dylib.dSYM/Contents/Resources/DWARF/libVkLayer_khronos_validation.dylib 
FAILED: libGLESv2.dylib libGLESv2.dylib.TOC libGLESv2.dylib.dSYM/Contents/Info.plist libGLESv2.dylib.dSYM/Contents/Resources/DWARF/libGLESv2.dylib 
FAILED: character_data_generator character_data_generator.dSYM/Contents/Info.plist character_data_generator.dSYM/Contents/Resources/DWARF/character_data_generator 
FAILED: v8_context_snapshot_generator v8_context_snapshot_generator.dSYM/Contents/Info.plist v8_context_snapshot_generator.dSYM/Contents/Resources/DWARF/v8_context_snapshot_generator 
FAILED: chrome_crashpad_handler chrome_crashpad_handler.dSYM/Contents/Info.plist chrome_crashpad_handler.dSYM/Contents/Resources/DWARF/chrome_crashpad_handler 
FAILED: torque torque.dSYM/Contents/Info.plist torque.dSYM/Contents/Resources/DWARF/torque 
FAILED: obj/chrome/chrome_framework_shared_library/Brave Browser Nightly Framework obj/chrome/chrome_framework_shared_library/Brave Browser Nightly Framework.TOC Brave Browser Nightly Framework.dSYM/Contents/Info.plist Brave Browser Nightly Framework.dSYM/Contents/Resources/DWARF/Brave Browser Nightly Framework
FAILED: mksnapshot mksnapshot.dSYM/Contents/Info.plist mksnapshot.dSYM/Contents/Resources/DWARF/mksnapshot 
or
FAILED: clang_x64_v8_arm64/mksnapshot clang_x64_v8_arm64/mksnapshot.dSYM/Contents/Info.plist clang_x64_v8_arm64/mksnapshot.dSYM/Contents/Resources/DWARF/mksnapshot
FAILED: protoc protoc.dSYM/Contents/Info.plist protoc.dSYM/Contents/Resources/DWARF/protoc 
FAILED: generate_colors_info generate_colors_info.dSYM/Contents/Info.plist generate_colors_info.dSYM/Contents/Resources/DWARF/generate_colors_info 
FAILED: libvk_swiftshader.dylib libvk_swiftshader.dylib.TOC libvk_swiftshader.dylib.dSYM/Contents/Info.plist libvk_swiftshader.dylib.dSYM/Contents/Resources/DWARF/libvk_swiftshader.dylib 
FAILED: app_mode_loader app_mode_loader.dSYM/Contents/Info.plist app_mode_loader.dSYM/Contents/Resources/DWARF/app_mode_loader 
FAILED: ipc_plugin ipc_plugin.dSYM/Contents/Info.plist ipc_plugin.dSYM/Contents/Resources/DWARF/ipc_plugin
goodov commented 2 years ago

every instance is out of memory. swap increase or RAM increase should solve this.

wknapik commented 2 years ago

In my experience swapping in CI is a bad idea. Kills performance. IMO we should add more RAM.

rebron commented 2 years ago

@wknapik https://downloadmoreram.com/ Let's download and add some more RAM.

goodov commented 2 years ago

OOM might not be the issue [1], but a bug in lld or some other memory-related bug.

malloc_report is trying to report something, but to see the data we should set environment variable MallocDebugReport=crash before building and look at the additional output when a failure occurs.

also it might be good to dump system events log after the crash, maybe something interesting will be there.

  1. https://github.com/brave/devops/issues/8156#issuecomment-1230784189
wknapik commented 2 years ago

malloc_report is trying to report something, but to see the data we should set environment variable MallocDebugReport=crash before building and look at the additional output when a failure occurs.

Adding that in https://github.com/brave/devops/pull/8228

also it might be good to dump system events log after the crash, maybe something interesting will be there.

Got a command for this?

goodov commented 2 years ago

Got a command for this?

dump syslog: cat /var/log/system.log dump last 10 minutes of Apple System Log: log show --last 10m

wknapik commented 2 years ago

@goodov I was hoping for a way to get the exact info we need

mac-mini1-ky:~ jenkins$ log show --last 10m|wc -l
   53094
mac-mini1-ky:~ jenkins$ wc -l /var/log/system.log
      76 /var/log/system.log
mac-mini1-ky:~ jenkins$ 

so something between the 70 and 50k lines the two commands produce ;] We could pass some extra arguments to log show and/or grep the output, but I don't know what patterns we're looking for...

Worst case we can just dump the 50k lines, but if you know of a way to trim this down a bit, that would be nice. For instance knowing which service would log the info would be enough to narrow it down to only logs from that service.

goodov commented 2 years ago

but I don't know what patterns we're looking for...

me neither :) if you can run the ASL dump right after the npm run build failure, then we can trim ASL to 1 minute. should be good enough.

wknapik commented 2 years ago

@goodov got a new hit - https://ci.brave.com/job/brave-browser-build-macos-arm64/1456/execution/node/103/log/#-33526383000f97e22-c981-4d1e-8d5b-6c0b7408d959. Here's the last 5min of logs dumped after the failure https://ci.brave.com/job/brave-browser-build-macos-arm64/1456/execution/node/110/log/?consoleFull.

goodov commented 2 years ago

after adding export MallocDebugReport=crash we got this:

20:12:46  ld64.lld(2483,0x70000dc21000) malloc: *** error for object 0x6000c26fd110: pointer being freed was not allocated
20:12:46  ld64.lld(2483,0x70000dc21000) malloc: *** set a breakpoint in malloc_error_break to debug
20:12:46  PLEASE submit a bug report to https://crbug.com/ and run tools/clang/scripts/process_crashreports.py (only works inside Google) which will upload a report and include the crash backtrace.
20:12:46  Stack dump without symbol names (ensure you have llvm-symbolizer in your PATH or set the environment var `LLVM_SYMBOLIZER_PATH` to point to it):
20:12:46  0  lld                      0x000000010a6ffefb llvm::cl::opt<bool, false, llvm::cl::parser<bool>>::printOptionValue(unsigned long, bool) const + 443419
20:12:46  1  lld                      0x000000010a7002c4 llvm::cl::opt<bool, false, llvm::cl::parser<bool>>::printOptionValue(unsigned long, bool) const + 444388
20:12:46  2  libsystem_platform.dylib 0x00007ff806895e2d _sigtramp + 29
20:12:46  3  lld                      0x000000010a6172f3 llvm::OuterAnalysisManagerProxy<llvm::AnalysisManager<llvm::Module>, llvm::Function>::run(llvm::Function&, llvm::AnalysisManager<llvm::Function>&) + 44099
20:12:46  4  libsystem_c.dylib        0x00007ff8067ccd10 abort + 123
20:12:46  5  libsystem_malloc.dylib   0x00007ff8066aa3e2 has_default_zone0 + 0
20:12:46  6  libsystem_malloc.dylib   0x00007ff8066ad5ed malloc_report + 151
20:12:46  7  lld                      0x000000010b7b44de llvm::Expected<llvm::ArrayRef<llvm::object::XCOFFRelocation32>> llvm::object::XCOFFObjectFile::relocations<llvm::object::XCOFFSectionHeader32, llvm::object::XCOFFRelocation32>(llvm::object::XCOFFSectionHeader32 const&) const + 56862

it looks like a memory bug in lld.

Looking into possible crashdumps now and further discussion with Chromium. tools/clang/crashreports doesn't have anything to upload.

wknapik commented 2 years ago

Another hit (build log, syslog).

I see lld being mentioned

20:26:06  2022-09-05 11:25:41.892499-0700 0x3b2c59   Default     0x0                  0      0    kernel: lld[38257] Corpse allowed 1 of 5
20:26:06  2022-09-05 11:25:41.893206-0700 0x3b2c36   Default     0x0                  0      0    kernel: Failed to send exception EXC_CORPSE_NOTIFY. error code: 5 for pid 38257
20:26:06  2022-09-05 11:25:42.206520-0700 0x68       Default     0x0                  0      0    kernel: Corpse released, count at 0
goodov commented 2 years ago

got this during one of the reproduce runs, but it’s unclear if this is the initial culprit or an aftermath of a crash. this is a fatal error that aborts the link process, but it didn't appear in the output before.

LLVM ERROR: Failed to rename temporary file thinlto-cache/Thin-09cb92.tmp.o to thinlto-cache/llvmcache-33702F878EB43F8E74A4416DA83A72AE95819344: No such file or directory
wknapik commented 2 years ago

We enabled developer mode on mac-imac3-ky via

sudo DevToolsSecurity -enable
sudo dseditgroup -o edit -a jenkins -t user _developer

This should be reverted after debugging is done.

goodov commented 2 years ago

Some info after experiments on mac-imac3-ky.

  1. The lld crash is pretty rare (around 1 in ~100 links) and only happens in lld Release/RelWithDebInfo configurations.
  2. Rebuilding lld with ASan hides the crash, i.e. no sanitizer errors and no crash. Rebuilding lld in Debug configuration also hides the crash.
  3. Rebuilding lld with UBSan revealed some unaligned memory access errors because of reinterpret_casts, but fixing them hasn't solve the issue.
  4. Rebuilding lld with TSan didn't show any issues.

Properly symbolized crashes look like this:

ld64.lld(58975,0x70000f179000) malloc: *** error for object 0x6002a0484000: pointer being freed was not allocated
ld64.lld(58975,0x70000f179000) malloc: *** set a breakpoint in malloc_error_break to debug
0.  Running pass 'Function Pass Manager' on module 'Users/jenkins/jenkins/workspace/goodov-linking-errors-nightly/src/out/Release/obj/third_party/vulkan-deps/spirv-tools/src/libspvtools_opt.aamd_ext_to_khr.o6886248'.
1.  Running pass 'Debug Variable Analysis' on function '@_ZN8spvtools5utils14AppendToVectorERKNSt2Cr12basic_stringIcNS1_11char_traitsIcEENS1_9allocatorIcEEEEPNS1_6vectorIjNS5_IjEEEE'
Stack dump without symbol names (ensure you have llvm-symbolizer in your PATH or set the environment var `LLVM_SYMBOLIZER_PATH` to point to it):
0  lld                      0x000000010d11fcee llvm::sys::PrintStackTrace(llvm::raw_ostream&, int) + 46
1  lld                      0x000000010d11ee45 llvm::sys::RunSignalHandlers() + 85
2  lld                      0x000000010d120340 SignalHandler(int) + 272
3  libsystem_platform.dylib 0x00007ff809a3ae2d _sigtramp + 29
4  libsystem_platform.dylib 0x00000000000000e0 _sigtramp + 18446603370419213008
5  libsystem_c.dylib        0x00007ff809971d10 abort + 123
6  libsystem_malloc.dylib   0x00007ff80984f3e2 has_default_zone0 + 0
7  libsystem_malloc.dylib   0x00007ff8098525ed malloc_report + 151
8  lld                      0x000000010e0aafe8 llvm::LexicalScopes::~LexicalScopes() + 344
9  lld                      0x000000010e5b3183 (anonymous namespace)::LDVImpl::runOnMachineFunction(llvm::MachineFunction&, bool) + 15091
10 lld                      0x000000010e67205d llvm::MachineFunctionPass::runOnFunction(llvm::Function&) + 349
11 lld                      0x000000010f97be9f llvm::FPPassManager::runOnFunction(llvm::Function&) + 1055
12 lld                      0x000000010f982571 llvm::FPPassManager::runOnModule(llvm::Module&) + 65
0.  Running pass 'Function Pass Manager' on module 'Users/jenkins/jenkins/workspace/goodov-linking-errors-nightly/src/out/Release/obj/third_party/vulkan-deps/spirv-tools/src/libspvtools_opt.ainstrument_pass.o242484496'.
1.  Running pass 'Debug Variable Analysis' on function '@_ZN8spvtools3opt14InstrumentPass14GetVec4FloatIdEv'
Stack dump without symbol names (ensure you have llvm-symbolizer in your PATH or set the environment var `LLVM_SYMBOLIZER_PATH` to point to it):
0  lld                      0x000000011025316b llvm::sys::PrintStackTrace(llvm::raw_ostream&, int) + 43
1  lld                      0x0000000110253534 SignalHandler(int) + 244
2  libsystem_platform.dylib 0x00007ff80d96be2d _sigtramp + 29
3  libsystem_malloc.dylib   0x00007ff80d7732a9 nanov2_size + 20
4  libsystem_c.dylib        0x00007ff80d8a2d10 abort + 123
5  libsystem_malloc.dylib   0x00007ff80d7803e2 has_default_zone0 + 0
6  libsystem_malloc.dylib   0x00007ff80d7835ed malloc_report + 151
7  lld                      0x000000010f692a95 llvm::LexicalScopes::~LexicalScopes() + 133
8  lld                      0x000000010f8a44eb (anonymous namespace)::LDVImpl::runOnMachineFunction(llvm::MachineFunction&, bool) + 2555
9  lld                      0x000000010f9255e4 llvm::MachineFunctionPass::runOnFunction(llvm::Function&) + 244

The crash happens on exit from LDVImpl::computeIntervals() in LexicalScopes::~LexicalScopes destructor: some unordered_map items fail to deallocate.

Chromium has hit a similar bug recently on Windows, but it doesn't look like our case - they've got a stable reproducible assert().

I did debug the failed lld instance in lldb, but at the point of failure it's absolutely unclear what exactly and at what point has failed. The code needs to be profiled carefully to pinpoint the exact place when things go downhill, it requires a lot of time.

Next steps

  1. Report this issue to llvm, but I don't expect a quick or easy fix for that. https://github.com/llvm/llvm-project/issues/58056
  2. If no one objects, I will add a temporary fix to restart the linker inside linker_driver.py once in a case of failure when lto-enabled link is detected.

The process of building llvm with sanitizers

  1. Change src/tools/clang/scripts/build.py to enable sanitizer:

    @@ -852,7 +854,7 @@ def main():
     # PGO needs libclang_rt.profile but none of the other compiler-rt stuff.
     bootstrap_args.extend([
         '-D' + f
    -        for f in compiler_rt_cmake_flags(sanitizers=False, profile=args.pgo)
    +        for f in compiler_rt_cmake_flags(sanitizers=True, profile=args.pgo)
     ])
     if sys.platform == 'darwin':
       bootstrap_args.extend([
    @@ -1197,6 +1205,7 @@ def main():
         cmake_args.append('-DRUNTIMES_' + triple + '_' + arg)
    cmake_args.append('-DLLVM_BUILTIN_TARGETS=' + all_triples)
    cmake_args.append('-DLLVM_RUNTIME_TARGETS=' + all_triples)
    +  cmake_args.append('-DLLVM_USE_SANITIZER=Address')
    
    if os.path.exists(LLVM_BUILD_DIR):
     RmTree(LLVM_BUILD_DIR)
  2. Run ./build.py --bootstrap --without-android --without-fuchsia
wknapik commented 2 years ago

If no one objects, I will add a temporary fix to restart the linker inside linker_driver.py once in a case of failure when lto-enabled link is detected.

sgtm!

wknapik commented 2 years ago

Thanks for taking the time to debug this @goodov! Even though the failures may be rare, we build frequently enough that it translates into a lot of failed builds requiring manual reruns. Given our limited macOS resources, it's all the more important to fix, or at least work around this.

Please lmk when we can go ahead with the wrap-up steps from the first post in this issue:

goodov commented 2 years ago

The PR with a workaround is ready to be merged, but I'd like to see another failure on nightly before merging in case the issue has been fixed in upstream llvm.

wknapik commented 2 years ago

@goodov I don't think I've seen any more linking errors recently (which is great!). Should we proceed with the wrap up steps for this issue?

goodov commented 2 years ago

yes, let's wrap up. If it will reappear, we'll return to the workaround.

wknapik commented 2 years ago

@goodov do we know how this got fixed?

goodov commented 2 years ago

@goodov do we know how this got fixed?

no. I tried to look into the build history to figure out when the crashes stopped to maybe map to a Chromium update, but the available history is pretty short.

the history I'm talking about is here: https://ci.brave.com/job/brave-browser-build-macos-x64/ and here: https://ci.brave.com/job/brave-browser-build-macos-arm64/

wknapik commented 2 years ago

Alright, thanks again for looking into it. Will proceed with the wrap up steps and close the issue.