brettwooldridge / NuProcess

Low-overhead, non-blocking I/O, external Process implementation for Java
Apache License 2.0

Fix race condition if one thread closes an FD while another is about to read/write #48

Closed bhamiltoncx closed 7 years ago

bhamiltoncx commented 9 years ago

POSIX requires that operations which return a new file descriptor (open() / dup() / etc.) return the lowest-available descriptor.

This means there's a tricky race in BasePosixProcess:

Processor Thread (epoll/kevent):
  int stdinFd = stdin.get();
  (gets de-scheduled before calling LibC.write(stdinFd))
Thread 1:
  NuProcess.closeStdin(true)
  LibC.close(stdin.getAndSet(-1))
  (the kernel now marks stdinFd as free to re-use)
Any Thread Except Processor Thread:
  fd = open() (or dup(), etc.)
  (kernel will return the old stdin fd, to which the processor thread is about to write)
Processor Thread:
  (gets rescheduled)
  LibC.write(stdinFd)
  (writes garbage to someone else's open file descriptor)

This fixes the race the same way the JRE's own AbstractPlainSocketImpl does, by putting a reference count around any read, write, or close operations. We'll delay the close until any pending reads or writes complete.
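To illustrate the approach, here is a rough sketch of reference-counted FD guarding (the class and method names here are illustrative, not NuProcess's actual internals; in NuProcess the `doClose()` body would call `LibC.close(fd)`):

```java
// Sketch of a reference-counted FD guard, loosely modeled on the
// fdUseCount idea in the JRE's AbstractPlainSocketImpl. A close
// requested while reads/writes are in flight is deferred until the
// last in-flight operation releases its reference.
final class GuardedFd {
    private int fd;               // -1 once actually closed
    private int useCount;         // number of in-flight reads/writes
    private boolean closePending; // close() was requested while busy

    GuardedFd(int fd) { this.fd = fd; }

    // Called before LibC.read()/LibC.write(); returns -1 if the fd
    // is already closed (or a close is pending), so the caller bails out.
    synchronized int acquire() {
        if (fd == -1 || closePending) return -1;
        useCount++;
        return fd;
    }

    // Called after the read/write returns; performs any deferred close
    // once the last in-flight operation completes.
    synchronized void release() {
        if (--useCount == 0 && closePending) doClose();
    }

    // Called by closeStdin(); closes immediately only when no I/O is
    // in flight, otherwise defers.
    synchronized void close() {
        if (fd == -1) return;
        if (useCount > 0) closePending = true;
        else doClose();
    }

    private void doClose() {
        // In NuProcess this is where LibC.close(fd) would happen.
        fd = -1;
        closePending = false;
    }

    synchronized boolean isClosed() { return fd == -1; }
}
```

Because the kernel never sees `close()` until the count drops to zero, the FD number cannot be recycled out from under a thread that has already acquired it.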

Currently, stdout and stderr are serially read and closed on the same thread as far as I can tell, but to future-proof this code, I applied the same fix to all three FDs.

We could also fix this by always requiring stdin closing to be "soft", so we ensure it's always closed on the event processor thread (serial with any writes). That could be a less invasive change, so let me know what you prefer.

Tested on OS X with mvn test. Haven't gotten to test on Linux yet.

brettwooldridge commented 9 years ago

Wouldn't it be easier (in the case of NuProcess) to simply block closeStdin() until a pending write completes?
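That blocking-close alternative could look something like this sketch (hypothetical names, not NuProcess code; a monitor makes `close()` wait until in-flight writes drain):

```java
// Sketch of a blocking close: instead of deferring the close, the
// closing thread waits on a monitor until no writes are in flight.
final class BlockingCloseFd {
    private final Object lock = new Object();
    private int fd;              // -1 once closed
    private int inFlightWrites;

    BlockingCloseFd(int fd) { this.fd = fd; }

    // Called before LibC.write(); returns -1 if already closed.
    int beginWrite() {
        synchronized (lock) {
            if (fd == -1) return -1;
            inFlightWrites++;
            return fd;
        }
    }

    // Called after the write returns; wakes a blocked closer.
    void endWrite() {
        synchronized (lock) {
            if (--inFlightWrites == 0) lock.notifyAll();
        }
    }

    // Blocks until pending writes complete, then closes.
    void close() throws InterruptedException {
        synchronized (lock) {
            while (inFlightWrites > 0) lock.wait();
            if (fd != -1) {
                // In NuProcess this is where LibC.close(fd) would happen.
                fd = -1;
            }
        }
    }
}
```

The trade-off versus reference counting is that `closeStdin()` becomes a potentially blocking call on the caller's thread.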

brettwooldridge commented 9 years ago

Btw, the former fix wasn't merely about Atomic variables; those actually make little/no difference. It was about closing handles after we're sure they won't be accessed again.

Originally the issue was discovered with a failed spawn, where the OS would promptly give the handles to the next process, and the failed process thread would then proceed to close them in cleanup.

bhamiltoncx commented 9 years ago

It might be OK to block close if (for whatever reason) read and write are themselves not blocked.

brettwooldridge commented 9 years ago

@bhamiltoncx Before we make a fix for a theoretical race condition in NuProcess, can we make it an actual race condition? By which I mean, can you come up with a unit test that causes the failure? Then it would be easy to test different strategies for addressing the race in the simplest way.

For example, do something like this:

process.writeStdin(buffer);
process.writeStdin(buffer);
process.writeStdin(buffer);
process.writeStdin(buffer);
process.closeStdin(true);

on multiple threads?

bhamiltoncx commented 9 years ago

For sure.

brettwooldridge commented 8 years ago

Hi Ben, I was just reviewing pull requests and wondered if this was still relevant?

bhamiltoncx commented 8 years ago

It should be! I didn't have time to make a repro, but I am pretty certain this issue can happen if clients close on one thread while reading or writing on another.

rdicroce commented 7 years ago

I just got bitten by a slightly different version of this. For me, it happens fairly consistently if I call writeStdin(data) and then immediately closeStdin(false). The result is this abbreviated stack trace:

Caused by: java.lang.RuntimeException: Unable to register new event to epoll queue
    at com.zaxxer.nuprocess.linux.ProcessEpoll.queueWrite(ProcessEpoll.java:135) ~[nuprocess-1.1.0.jar:?]
    at com.zaxxer.nuprocess.linux.ProcessEpoll.queueWrite(ProcessEpoll.java:40) ~[nuprocess-1.1.0.jar:?]
    at com.zaxxer.nuprocess.internal.BasePosixProcess.closeStdin(BasePosixProcess.java:323) ~[nuprocess-1.1.0.jar:?]

The reason, presumably, is that the processing thread reaches the tombstone and closes the FD after my thread fetches the stdin FD but before it reaches the epoll_ctl calls, so epoll_ctl fails because the FD is no longer valid.

For now, I'm working around the issue by sleeping while hasPendingWrites() and then doing closeStdin(true) to avoid the race.
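That polling workaround can be sketched like this (the `BooleanSupplier` and `Runnable` here stand in for the real `NuProcess.hasPendingWrites()` and `closeStdin(true)` calls, so the helper itself is just an assumption-free illustration):

```java
import java.util.function.BooleanSupplier;

// Sketch of the "drain, then force-close" workaround: poll until no
// writes are pending, then perform the close. In real code,
// hasPendingWrites would be process::hasPendingWrites and forceClose
// would be () -> process.closeStdin(true).
public final class DrainThenClose {
    public static void awaitThenClose(BooleanSupplier hasPendingWrites,
                                      Runnable forceClose) throws InterruptedException {
        while (hasPendingWrites.getAsBoolean()) {
            Thread.sleep(10); // back off briefly between polls
        }
        forceClose.run();
    }
}
```

Note that this only narrows the window rather than eliminating it: if another thread can still enqueue writes between the last poll and the close, the race remains, so it is really only safe when one thread owns stdin.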