TraceMachina / nativelink

NativeLink is an open source high-performance build cache and remote execution server, compatible with Bazel, Buck2, Reclient, and other RBE-compatible build systems. It offers drastically faster builds, reduced test flakiness, and specialized hardware.
https://nativelink.com
Apache License 2.0

Apparently creating too many threads for tracing-subscriber #1288

Open cormacrelf opened 2 weeks ago

cormacrelf commented 2 weeks ago

I'm getting a lot of these panics shortly after startup when running a clean build, and they cause failures all over the place.

thread 'tokio-runtime-worker' panicked at /nix/store/eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee-vendor-cargo-deps/c19b7c6f923b580ac259164a89f2577984ad5ab09ee9d583b888f934adbbe8d0/sharded-slab-0.1.7/src/tid.rs:163:21:
creating a new thread ID (8374) would exceed the maximum number of thread ID bits specified in sharded_slab::cfg::DefaultConfig (8191)
  2024-08-28T08:11:17.344013Z ERROR nativelink_store::filesystem_store: Failed to delete file, file_path: "/nativelink/data/tmp_path-worker_cas/...", err: Error { code: Internal, ... it panicked, basically }
    at nativelink-store/src/filesystem_store.rs:124
    in nativelink_store::filesystem_store::filesystem_delete_file
    in nativelink_store::filesystem_store::filesystem_store_emplace_file
    in nativelink_worker::local_worker::worker_start_action
    in nativelink::worker with name: "worker_0"
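
For reference, here is a minimal sketch of how this panic arises, assuming only that a Registry-based tracing-subscriber is installed (the default for tracing_subscriber::fmt). Every thread that creates a span claims a thread ID from sharded-slab, and DefaultConfig caps those at 8191 live IDs. The counts below are illustrative, not nativelink's code, and the spawn loop may hit OS thread limits before reaching the panic:

    use std::thread;

    fn main() {
        // fmt's default subscriber is built on tracing_subscriber::Registry,
        // which stores per-thread span data in a sharded-slab.
        tracing_subscriber::fmt::init();

        // Hold every thread alive so its slab thread ID stays claimed.
        // Around the 8192nd live thread, creating a span panics with
        // "creating a new thread ID (...) would exceed the maximum number
        // of thread ID bits specified in sharded_slab::cfg::DefaultConfig (8191)".
        let _handles: Vec<_> = (0..9000)
            .map(|i| {
                thread::spawn(move || {
                    let _guard = tracing::info_span!("worker", i).entered();
                    thread::park(); // keep the thread (and its ID) alive
                })
            })
            .collect();
        thread::park(); // keep main alive; the panic prints from a worker thread
    }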

The implication here is that nativelink is creating 8000+ threads. It can apparently recover if you restart the build, which is nice. The only crate in the dependency graph that depends on sharded-slab is tracing-subscriber, so I assume that's the code using the default limits. I think it's weird that nativelink would create 8000+ threads; 8000 sounds like a perfectly sane limit.

NativeLink version 0.5.1 from GitHub, running in Docker.

allada commented 2 weeks ago

Hmmm, is this specific to tracing-subscriber? From what I see in nativelink we limit blocking threads to ~5k by default (math is 10 * config.global_cfg.max_open_files).

We do 10x because of an edge case that can happen when limiting max open files: in some cases a single open-file slot can need more than one descriptor.
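
To sketch that relationship (hypothetical wiring using tokio's public Builder API, not nativelink's actual code): with the 10x factor, any max_open_files above ~819 pushes the blocking-thread cap past the 8191 thread IDs that tracing-subscriber's sharded-slab can hand out.

    use tokio::runtime::{Builder, Runtime};

    // Hypothetical: derive the blocking-thread cap from max_open_files
    // with the 10x safety factor described above.
    fn build_runtime(max_open_files: usize) -> std::io::Result<Runtime> {
        Builder::new_multi_thread()
            .enable_all()
            // 10 * max_open_files blocking threads; only safe while
            // max_open_files <= 819, since 10 * 820 > 8191 thread IDs.
            .max_blocking_threads(10 * max_open_files)
            .build()
    }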

cormacrelf commented 2 weeks ago

Hm, maybe max_open_files is set too high in my config. What happens if you try to schedule more work than fits within the max_open_files limit? Does it fail in a similar way to hitting a ulimit? Or does the scheduler avoid it and things just queue up? If it's the latter, I can fix this by dropping max_open_files back to a reasonable number.
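
For what it's worth, the queue-up behaviour is what you would get if file opens are gated behind a semaphore sized to max_open_files. A minimal sketch of that pattern (hypothetical, not necessarily nativelink's mechanism):

    use std::sync::Arc;
    use tokio::sync::{OwnedSemaphorePermit, Semaphore};

    // A file that keeps its open-file permit for as long as it is alive.
    #[allow(dead_code)]
    struct LimitedFile {
        file: tokio::fs::File,
        _permit: OwnedSemaphorePermit, // returned to the pool on drop
    }

    async fn open_with_limit(limit: Arc<Semaphore>, path: &str) -> std::io::Result<LimitedFile> {
        // Callers past the limit wait here instead of erroring out, so
        // excess work queues up rather than failing like a hit ulimit.
        let permit = limit.acquire_owned().await.expect("semaphore closed");
        let file = tokio::fs::File::open(path).await?;
        Ok(LimitedFile { file, _permit: permit })
    }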

cormacrelf commented 2 weeks ago

Yeah, that fixed it. I think this means max_open_files should absolutely never be more than ~800 with the current 10x-ing thread limit behaviour: 10 * max_open_files has to stay under sharded-slab's 8191 thread-ID cap, so max_open_files tops out at 819. Not that I actually read the docs when I set it way too high, but it may be worth adding to them.

aaronmondal commented 2 weeks ago

Sounds like adding this info to the docs is a good first issue :relaxed:

cormacrelf commented 2 weeks ago

Hmmm... it did fix this problem, but now nativelink seems to be running only 1-2 actions at a time. This started happening pretty late in a big build; initially it was fine, but it seems to have run out of open files. I set max_open_files to 512, and it has about 600 threads running (not 5000). Sounds to me like something is keeping files open, or failing to decrement the open-files count, or something. The build graph was about 2000 nodes.

; ls /proc/$(pgrep nativelink)/fdinfo/ | wc -l
543

Edit: basically this is filesystem_cas.json with a local worker also defined. You guys probably don't test this configuration that often. It may be necessary to split the filesystem store & worker into two separate nativelink processes (i.e. containers), one for the worker, so the filesystem code doesn't eat into the open-file budget that the worker needs.

cormacrelf commented 2 weeks ago

Actually, late in the build graph you have actions with many dependencies, and those dep lists are just long lists of object files, especially for actions that link binaries and pull in the full link graph. So in this weird way, it may make sense that we hit open-file limits more as dependency counts grow. It would still be a bit odd to consume an open file handle per input for the whole time the action executes.
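
If input staging works anything like the sketch below (hypothetical, reusing the semaphore idea from earlier, not nativelink's actual code), that would explain the numbers: a link action with ~600 object-file inputs would pin ~600 permits for its whole duration, which matches the fd count above.

    use std::sync::Arc;
    use tokio::sync::{OwnedSemaphorePermit, Semaphore};

    // Hypothetical: stage all of an action's inputs up front, holding one
    // open-file permit per input until the action finishes. A link step
    // with hundreds of object-file inputs pins hundreds of permits at once.
    async fn run_action(limit: Arc<Semaphore>, inputs: &[&str]) -> std::io::Result<()> {
        let mut staged: Vec<(tokio::fs::File, OwnedSemaphorePermit)> =
            Vec::with_capacity(inputs.len());
        for path in inputs {
            let permit = limit.clone().acquire_owned().await.expect("semaphore closed");
            let file = tokio::fs::File::open(path).await?;
            staged.push((file, permit)); // held until the action completes
        }
        // ... execute the action against the staged inputs ...
        drop(staged); // files and permits released only after execution
        Ok(())
    }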