JuliaLang / libuv

Cross-platform asynchronous I/O
http://libuv.org/
MIT License
9 stars 14 forks source link

usage of vfork() is non-compliant #22

Open green-nsk opened 2 years ago

green-nsk commented 2 years ago

Julia version of libuv uses vfork() for uv_spawn() implementation, but contrary to vfork() manpage it manages file descriptors and signal handlers prior to calling exec(). This might work fine with vanilla kernel/libc, but breaks in our environment. We use custom code for accelerated network sockets, which breaks when the child process shares address space with the parent process.

https://github.com/JuliaLang/libuv/blob/julia-uv2-1.44.1/src/unix/process.c#L990

Standard description
       (From POSIX.1) The vfork() function has the same effect as
       fork(2), except that the behavior is undefined if the process
       created by vfork() either modifies any data other than a variable
       of type pid_t used to store the return value from vfork(), or
       returns from the function in which vfork() was called, or calls
       any other function before successfully calling _exit(2) or one of
       the exec(3) family of functions.
Keno commented 2 years ago

The code in the linux backend is assuming linux/glibc semantics not POSIX. It's unlikely we'll move away from this model. Can you elaborate what breaks? We may be able to suggest workarounds.

green-nsk commented 2 years ago

There're a lot of things that break, so I think the simplest workaround for us would be to patch uv_spawn() and replace vfork() with fork() (haven't tried it yet, but I expect it should work, with maybe some minor modifications).

In broad strokes, the implementation of accelerated sockets relies on keeping extra information about file descriptors and signals in userspace (and intercepting some system calls). When the child process shares userspace with the parent, those don't work as expected.

green-nsk commented 2 years ago

Take for example signal()/sigaction() calls. Supplied signal handler is stored in user-space. When a parent has a signal handler stored, and child process attempts to reset signal handler back to SIG_DFL here, it ends up "resetting" parent signal handler as well. I haven't investigated other calls in much detail, but I expect the general scenario to be similar.

We're hitting this problem with exasock network stack from Cisco/ex-Exablaze. Here's relevant signal()/sigaction() code. But there're others with a similar architecture that may be hit by similar problems, e.g. openonload from Xilinx/ex-SolarFlare

Keno commented 2 years ago

Yes, I'm familiar with those techniques - it's usually a huge pile of hacks, that make assumptions that don't hold in any even reasonably complicated code base. Our recommendation to vendors we've worked with on these kinds of issues is to make sure to have an API that does not rely on any interception tricks that runtime systems that manage more complicated state can opt into.

ancapdev commented 2 years ago

Yes, I'm familiar with those techniques - it's usually a huge pile of hacks, that make assumptions that don't hold in any even reasonably complicated code base. Our recommendation to vendors we've worked with on these kinds of issues is to make sure to have an API that does not rely on any interception tricks that runtime systems that manage more complicated state can opt into.

Hi @Keno. Agreed on the ideal state and what ought to be. However, vendors cater to customers, and a large fraction of customers want a drop in preload library that transparently runs their network stack faster. This has been the de facto solution to user space networking for quite some time, and while it often comes with some gotchas and limitations, I've also seen plenty complicated code bases run perfectly fine over it.

Out of interest, for this specific issue, what was the rationale for using vfork on linux?

Keno commented 2 years ago

Out of interest, for this specific issue, what was the rationale for using vfork on linux?

vfork is significantly faster, particularly on applications that have a lot of memory mappings (as julia applications often do), because the page tables need not be copied.

ancapdev commented 2 years ago

vfork is significantly faster, particularly on applications that have a lot of memory mappings (as julia applications often do), because the page tables need not be copied.

Thanks, that makes sense.

vtjnash commented 2 years ago

I have some code that switches to using posix_spawn from glibc, but note that also requires vfork, and the implementation looks pretty similar to our implementation here. But perhaps it would avoid the symbol imposition from happening with the signal code?

green-nsk commented 2 years ago

I would expect system calls happening within posix_spawn() won't be intercepted by the exasock. That way, the parent process memory will not be mutated. In that sense, using posix_spawn() would be better for our case.

That said, the state of signals/file descriptors tracked by exasock inside the child process may diverge from their "true" state. This may backfire in more subtle ways. but realistically I don't expect that to be a problem.

In the meanwhile, switching from vfork() to fork() has indeed fixed all the issues we're seeing.

rshpount commented 1 year ago

I think some of the process initialization code looks unstable in relation to vfork. Specifically, https://github.com/JuliaLang/libuv/blob/julia-uv2-1.44.2/src/unix/process.c#L450 and https://github.com/JuliaLang/libuv/blob/julia-uv2-1.44.2/src/unix/process.c#L452. These two lines modify the pipes array in the parent process memory and replace the file handles with new handles created by F_DUPFD in the child process. These handles do not exist in the parent space, so uv__process_open_stream will fail when attempting to close the non-existing handle in https://github.com/JuliaLang/libuv/blob/julia-uv2-1.44.2/src/unix/process.c#L221.

Keno commented 1 year ago

I agree, that looks like a bug