Open green-nsk opened 2 years ago
The code in the linux backend is assuming linux/glibc semantics not POSIX. It's unlikely we'll move away from this model. Can you elaborate what breaks? We may be able to suggest workarounds.
There're a lot of things that break, so I think the simplest workaround for us would be to patch uv_spawn() and replace vfork()
with fork()
(haven't tried it yet, but I expect it should work, with maybe some minor modifications).
In broad strokes, the implementation of accelerated sockets relies on keeping extra information about file descriptors and signals in userspace (and intercepting some system calls). When the child process shares userspace with the parent, those don't work as expected.
Take for example signal()
/sigaction()
calls. Supplied signal handler is stored in user-space. When a parent has a signal handler stored, and child process attempts to reset signal handler back to SIG_DFL
here, it ends up "resetting" parent signal handler as well. I haven't investigated other calls in much detail, but I expect the general scenario to be similar.
We're hitting this problem with exasock network stack from Cisco/ex-Exablaze. Here's relevant signal()/sigaction() code. But there're others with a similar architecture that may be hit by similar problems, e.g. openonload from Xilinx/ex-SolarFlare
Yes, I'm familiar with those techniques - it's usually a huge pile of hacks, that make assumptions that don't hold in any even reasonably complicated code base. Our recommendation to vendors we've worked with on these kinds of issues is to make sure to have an API that does not rely on any interception tricks that runtime systems that manage more complicated state can opt into.
Yes, I'm familiar with those techniques - it's usually a huge pile of hacks, that make assumptions that don't hold in any even reasonably complicated code base. Our recommendation to vendors we've worked with on these kinds of issues is to make sure to have an API that does not rely on any interception tricks that runtime systems that manage more complicated state can opt into.
Hi @Keno. Agreed on the ideal state and what ought to be. However, vendors cater to customers, and a large fraction of customers want a drop in preload library that transparently runs their network stack faster. This has been the de facto solution to user space networking for quite some time, and while it often comes with some gotchas and limitations, I've also seen plenty complicated code bases run perfectly fine over it.
Out of interest, for this specific issue, what was the rationale for using vfork
on linux?
Out of interest, for this specific issue, what was the rationale for using
vfork
on linux?
vfork
is significantly faster, particularly on applications that have a lot of memory mappings (as julia applications often do), because the page tables need not be copied.
vfork
is significantly faster, particularly on applications that have a lot of memory mappings (as julia applications often do), because the page tables need not be copied.
Thanks, that makes sense.
I have some code that switches to using posix_spawn from glibc, but note that also requires vfork, and the implementation looks pretty similar to our implementation here. But perhaps it would avoid the symbol imposition from happening with the signal code?
I would expect system calls happening within posix_spawn() won't be intercepted by the exasock
. That way, the parent process memory will not be mutated. In that sense, using posix_spawn() would be better for our case.
That said, the state of signals/file descriptors tracked by exasock inside the child process may diverge from their "true" state. This may backfire in more subtle ways. but realistically I don't expect that to be a problem.
In the meanwhile, switching from vfork()
to fork()
has indeed fixed all the issues we're seeing.
I think some of the process initialization code looks unstable in relation to vfork. Specifically, https://github.com/JuliaLang/libuv/blob/julia-uv2-1.44.2/src/unix/process.c#L450 and https://github.com/JuliaLang/libuv/blob/julia-uv2-1.44.2/src/unix/process.c#L452. These two lines modify the pipes array in the parent process memory and replace the file handles with new handles created by F_DUPFD in the child process. These handles do not exist in the parent space, so uv__process_open_stream will fail when attempting to close the non-existing handle in https://github.com/JuliaLang/libuv/blob/julia-uv2-1.44.2/src/unix/process.c#L221.
I agree, that looks like a bug
Julia version of libuv uses
vfork()
foruv_spawn()
implementation, but contrary tovfork()
manpage it manages file descriptors and signal handlers prior to callingexec()
. This might work fine with vanilla kernel/libc, but breaks in our environment. We use custom code for accelerated network sockets, which breaks when the child process shares address space with the parent process.https://github.com/JuliaLang/libuv/blob/julia-uv2-1.44.1/src/unix/process.c#L990