bhamiltoncx closed this pull request 7 years ago.
Wouldn't it be easier (in the case of NuProcess) to simply block `closeStdin()` until a pending write completes?
Btw, the former fix wasn't merely about Atomic variables; they actually make little/no difference. It was about closing handles after we're sure they won't be accessed again.
Originally the issue was discovered with a failed spawn, where the OS would promptly give the handles to the next process, and the failed process thread would then proceed to close them in cleanup.
It might be OK to block close if (for whatever reason) read and write are themselves not blocked.
@bhamiltoncx Before we make a fix for a theoretical race condition in NuProcess, can we make it an actual race condition? By which I mean, can you come up with a unit test that causes the failure? Then it would be easy to test different strategies and address the race in the simplest way.
For example, do something like this:
```java
process.writeStdin(buffer);
process.writeStdin(buffer);
process.writeStdin(buffer);
process.writeStdin(buffer);
process.closeStdin(true);
```
on multiple threads?
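A repro along those lines might look like the sketch below. It's untested; the `cat` command, the iteration count, and the empty handler are my own placeholder choices, not anything from this thread. The idea is just to race a writer thread against `closeStdin()` often enough to hit the narrow FD-reuse window.

```java
import com.zaxxer.nuprocess.NuAbstractProcessHandler;
import com.zaxxer.nuprocess.NuProcess;
import com.zaxxer.nuprocess.NuProcessBuilder;

import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.concurrent.TimeUnit;

public class CloseStdinRaceTest {
    public static void main(String[] args) throws Exception {
        // Repeat many times: the race window is narrow, so a single run
        // rarely catches it.
        for (int iter = 0; iter < 1000; iter++) {
            NuProcessBuilder pb = new NuProcessBuilder(Arrays.asList("cat"));
            pb.setProcessListener(new NuAbstractProcessHandler() {});
            NuProcess process = pb.start();

            Thread writer = new Thread(() -> {
                try {
                    ByteBuffer buffer = ByteBuffer.wrap("x\n".getBytes(StandardCharsets.UTF_8));
                    for (int i = 0; i < 4; i++) {
                        process.writeStdin(buffer.duplicate());
                    }
                } catch (RuntimeException expected) {
                    // A write losing the race against the close is exactly
                    // the failure mode we're trying to provoke.
                }
            });
            writer.start();
            process.closeStdin(true);   // races against the pending writes
            writer.join();
            process.waitFor(5, TimeUnit.SECONDS);
        }
    }
}
```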
For sure.
Hi Ben, I was just reviewing pull requests and wondered if this was still relevant?
It should be! I didn't have time to make a repro, but I am pretty certain this issue can happen if clients close on one thread while reading or writing on another.
I just got bitten by a slightly different version of this. For me, it happens fairly consistently if I `writeStdin(data)` and then immediately `closeStdin(false)`. The result is this abbreviated stack trace:
```
Caused by: java.lang.RuntimeException: Unable to register new event to epoll queue
    at com.zaxxer.nuprocess.linux.ProcessEpoll.queueWrite(ProcessEpoll.java:135) ~[nuprocess-1.1.0.jar:?]
    at com.zaxxer.nuprocess.linux.ProcessEpoll.queueWrite(ProcessEpoll.java:40) ~[nuprocess-1.1.0.jar:?]
    at com.zaxxer.nuprocess.internal.BasePosixProcess.closeStdin(BasePosixProcess.java:323) ~[nuprocess-1.1.0.jar:?]
```
The reason, presumably, is that the processing thread is reaching the tombstone and closing the FD after my thread fetches the stdin FD but before it reaches the `epoll_ctl` calls, so `epoll_ctl` fails because the FD is no longer valid.
For now, I'm working around the issue by sleeping while `hasPendingWrites()` and then doing `closeStdin(true)` to avoid the race.
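In case it helps anyone else, the workaround amounts to something like this sketch (the helper name and the poll interval are arbitrary choices of mine):

```java
import java.util.concurrent.TimeUnit;

import com.zaxxer.nuprocess.NuProcess;

final class StdinWorkaround {
    // Poll until queued writes drain, then force-close stdin so the close
    // never overlaps an in-flight write.
    static void drainAndCloseStdin(NuProcess process) throws InterruptedException {
        while (process.hasPendingWrites()) {
            TimeUnit.MILLISECONDS.sleep(1); // arbitrary poll interval
        }
        process.closeStdin(true);
    }
}
```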
POSIX requires that operations which return a new file descriptor (`open()`, `dup()`, etc.) return the lowest-numbered available descriptor. This means there's a tricky race in `BasePosixProcess`: if one thread closes a file descriptor while another thread is about to read from or write to it, the kernel can hand the same descriptor number to an unrelated file, and the pending read or write lands on the wrong target.

This fixes the race the same way the JRE's own `AbstractPlainSocketImpl` does, by putting a reference count around any read, write, or close operations. We'll delay the close until any pending reads or writes complete. Currently, `stdout` and `stderr` are serially read and closed on the same thread as far as I can tell, but to future-proof this code, I applied the same fix to all three FDs.

We could also fix this by always requiring `stdin` closing to be "soft", so we ensure it's always closed on the event-processor thread (serial with any writes). That could be a less invasive change, so let me know what you prefer.

Tested on OS X with `mvn test`. Haven't gotten to test on Linux yet.
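For reference, a minimal sketch of the reference-counting idea, modeled loosely on `AbstractPlainSocketImpl`'s use-count/close-pending scheme. The names and structure here are illustrative, not the actual patch:

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

// Illustrative only: a guard that delays close() until in-flight reads and
// writes have released their references, so the FD number can't be reused
// out from under them.
final class FdGuard {
    private final AtomicInteger useCount = new AtomicInteger();
    private final AtomicBoolean closed = new AtomicBoolean();
    private volatile boolean closePending;

    // Call before any read/write; returns false if the FD is going away.
    boolean acquire() {
        if (closePending) {
            return false;
        }
        useCount.incrementAndGet();
        if (closePending) {      // lost a race with close(); back out
            release();
            return false;
        }
        return true;
    }

    // Call after the read/write completes; the last user performs a
    // deferred close.
    void release() {
        if (useCount.decrementAndGet() == 0 && closePending) {
            reallyClose();
        }
    }

    // Instead of closing immediately, mark the FD as doomed and close it
    // only once no reader or writer still holds a reference.
    void close() {
        closePending = true;
        if (useCount.get() == 0) {
            reallyClose();
        }
    }

    private void reallyClose() {
        if (closed.compareAndSet(false, true)) {
            // perform the real close(fd) exactly once here
        }
    }
}
```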