evmar / n2

n2 ("into"), a ninja compatible build system
Apache License 2.0

n2 leaks fds on macOS #14

Open nico opened 2 years ago

nico commented 2 years ago

Command::spawn() in the rust stdlib unconditionally calls anon_pipe here: https://github.com/rust-lang/rust/blob/master/library/std/src/sys/unix/process/process_unix.rs#L62

anon_pipe on Linux calls pipe2 to set CLOEXEC on the pipe atomically: https://github.com/rust-lang/rust/blob/521734787ecf80ff12df7ca5998f7ec0b3b7b2c9/library/std/src/sys/unix/pipe.rs#L18

But macOS has no pipe2, so here the stdlib instead calls pipe() followed by set_cloexec: https://github.com/rust-lang/rust/blob/521734787ecf80ff12df7ca5998f7ec0b3b7b2c9/library/std/src/sys/unix/pipe.rs#L35

This means there's a window where the pipe is created but cloexec isn't set on the pipe's FDs yet. If a different thread forks in that window, the pipe's fds get leaked.

(Not that it matters, but set_cloexec is here https://github.com/rust-lang/rust/blob/master/library/std/src/sys/unix/fd.rs#L204 . On macOS and a few other OSs, it calls ioctl with FIOCLEX, which is just 1 syscall. On Linux and a few others, it calls fcntl with F_GETFD/F_SETFD which is 2 syscalls, but we don't need to call set_cloexec at all in this path on Linux.)

This is likely why n2 -j250 runs out of FDs on macOS, while ninja -j250 works fine.

Possible fixes:

Spawning subprocesses takes some time (we made it async in chromium at some point for that reason), so moving it to the main thread would add some amount of work onto the critical path of n2.

evmar commented 2 years ago

Having thought about this a bit, I am curious about the "spawning subprocesses takes some time" part. If it's a skip-a-frame kind of latency, like up to 100ms occasionally, then it's probably fine to serialize it in the main process, though I guess that would interact poorly with -j1000. If it's a multi second kind of thing then definitely it needs a different approach.

I'd probably just go with a mutex around calling Command::spawn (instead of ::output) for now, with an eye towards rewriting to ppoll eventually (or tokio maybe, but I'm kinda skeptical of async Rust).

piscisaureus commented 2 years ago

Command::spawn() in the rust stdlib unconditionally calls anon_pipe here

I think it is unlikely that FDs from this pipe are leaked.

Both ends are short-lived and explicitly closed: the write end is closed immediately after fork() returns, and the read end is closed as soon as the child has called exec().

So the only situation it could cause a problem is when the process literally has 100s of forked-but-not-yet-exec'ed children. Theoretically possible, but it sounds like a stretch to me.

bnoordhuis commented 2 years ago

Just throwing it out there: libstd uses posix_spawn() under the right conditions (instead of fork + execve) and macOS has a POSIX_SPAWN_CLOEXEC_DEFAULT attribute that does what its name suggests. Teaching libstd about it probably isn't too hard.

we made it async in chromium at some point for that reason

@nico I'm interested, can you say more? My understanding is that as long as you're using fork, there's no getting away from paying the cost of copying the working set.

nico commented 2 years ago

I think it is unlikely that FDs from this pipe are leaked.

It's possible this theory is wrong. Background is that n2 needs way more FDs than ninja (ninja -j250 works fine on my mac, n2 -j250 dies with "Too many open files (os error 24)"), and I compared fs_usage logs.

Python script to generate a build.ninja file that repros the problem:

```
% cat fdtest.py
import textwrap

def write(f):
    f.write(textwrap.dedent('''\
        rule b
          command = sleep 300; touch $out
        '''))
    for i in range(1000):
        f.write(f'build foo{i}: b\n')
    # n2 needs an explicit default target:
    f.write('default')
    for i in range(1000):
        f.write(f' foo{i}')
    f.write('\n')

with open('build.ninja', 'w', encoding='utf-8') as f:
    write(f)
% python3 fdtest.py
```

I ran something like `sudo fs_usage -w -f filesys | grep n2 > fs-n2.txt` and then ran `n2 -j50` for the build file generated by that python script. Then I did the same with ninja, and compared the two txt files.

This was the main difference that stood out, and there is a race there. So it seems plausible to me, but it's quite possible this is not the root cause. (It does fit with things seemingly working fine on Linux -- but Linux also has a higher FD limit.)


@nico I'm interested, can you say more?

Turns out I was confusing things! We didn't do this in chromium, we were talking about doing this in ninja, here: https://bugs.chromium.org/p/chromium/issues/detail?id=829710#c3 (but as far as I know didn't end up merging it)

IIRC the background there was that someone was running ninja on a Windows system with something installed that made process creation very slow. But if you're in that situation, your build is going to be slow anyways and it's imho not necessarily something tools should go out of their way to mitigate.

So it's probably fine to just start processes on the main thread.

nico commented 2 years ago

#25 fixes this locally for me, so it does look like that pipe-creation race is indeed the cause of the fd leak.

With that patch applied, and the build.ninja file generated by the script in https://github.com/evmar/n2/issues/14#issuecomment-1084646807, I can run n2 -j250 and it works just as well as ninja.

I can also build llvm with n2 -j250 using goma, just like with ninja.

nico commented 2 years ago

(I also filed an upstream issue.)

nico commented 1 year ago

I think 9c2104a3ca3b6225b126ac193213c2642e1a7147 regressed this. The test in https://github.com/evmar/n2/issues/14#issuecomment-1084646807 repros this again.

(I don't have permissions to reopen this issue here.)