brettwooldridge / NuProcess

Low-overhead, non-blocking I/O, external Process implementation for Java
Apache License 2.0
710 stars 84 forks source link

Race condition between registration and onStart causes NPE on Linux #149

Closed avrecko closed 5 months ago

avrecko commented 5 months ago

Noticed on linux in very rare cases with very fast exiting processes. Sometimes a NPE is thrown because the EPoll thread already called exit and set the handler to null before the caller thread had a chance to call the onStart.

I think it makes sense to first call the onStart before doing the registration.

gagarski commented 5 months ago

@brettwooldridge Can you please review the change and if it's OK, release the changes? I am a bit tired of seeing NPEs in my log, which are hard to suppress :)

avrecko commented 5 months ago

@gagarski Out of curiosity, do you spawn processes that exit really fast? I think the only time I saw this NPE in non unit test is when there is a misconfiguration to the program and it exits right away with invalid arguments.

Doing some unit tests where I spawn a bunch of echo foos. Seeing the NPE very rarely but it does happen. Maybe 1x per 1000 calls.

gagarski commented 5 months ago

@avrecko I've seen it throwing from startup of ps ax process, which is quite short-running but should run a bit longer than echo foo. I believe this also happened during the startup of dozens of other processes. CPU pressure from their own (the processes themselves) work might be also a factor.

I've also never seen it before we've upgraded from quite an ancient OS (with Linux 3.0), so something might've changed it Linux since then.

gagarski commented 5 months ago

@brettwooldridge I can see the release being performed in the commit log, but cannot find it on Maven Central. Perhaps you forgot to promote staging repo?

brettwooldridge commented 5 months ago

The publication failed due to an issue with the signing certificate. Before I can publish, I will need to generate and register a new certificate. I will try to get around to it sometime this week

wkritzinger-atlassian commented 3 months ago

@brettwooldridge We were looking to upgrade to this version that contains this fix as we saw some of our customers and ourselves potentially hit this case.

It seems that you didn't get around to releasing this version yet, is there a way we can assist with the release?

bturner commented 3 months ago

Hello @wkritzinger-atlassian. Unless @brettwooldridge indicates otherwise, I can take a stab at doing this release tomorrow. I've done releases more recently than he has, so hopefully my release environment still works. (It'll be a good forcing function for me to also follow up on some other pull requests here I've been neglecting.)

wkritzinger-atlassian commented 3 months ago

Hi @bturner! Good seeing you are around these parts still! We did an internal release and our Linux tests keep failing, we are still investigating what the cause is, but until we get to the bottom of it I think we can hold off trying to do an official release of 2.0.7.

On Linux the com.zaxxer.nuprocess.RunTest hangs indefinitely with these exceptions thrown

com.zaxxer.nuprocess.internal.BasePosixProcess callStart
-- | --
WARNING: Exception thrown from handler
java.lang.NullPointerException
 at com.zaxxer.nuprocess.internal.BasePosixProcess.wantWrite(BasePosixProcess.java:189)
 at com.zaxxer.nuprocess.RunTest$LottaProcessListener.onStart(RunTest.java:411)
 at com.zaxxer.nuprocess.internal.BasePosixProcess.callStart(BasePosixProcess.java:587)
 at com.zaxxer.nuprocess.linux.LinuxProcess.run(LinuxProcess.java:108)
 at com.zaxxer.nuprocess.linux.LinProcessFactory.runProcess(LinProcessFactory.java:50)
 at com.zaxxer.nuprocess.NuProcessBuilder.run(NuProcessBuilder.java:275)
 at com.zaxxer.nuprocess.RunTest$1.run(RunTest.java:86)
wkritzinger-atlassian commented 3 months ago

I just confirmed that this change breaks NuProcess tests on Linux.

I think it makes sense to first call the onStart before doing the registration.

This causes a common usage pattern to break, it's also used by tests. Calling wantWrite() within onStart() will now cause an NPE. wantWrite() requires the event processor to be set, but with this change it's only set after onStart() is called.

I think this change should be reverted.

bturner commented 3 months ago

@brettwooldridge & @wkritzinger-atlassian, I'm going to revert merging this pull request and see if I can find another solution. If you want to handle it, Brett, just let me know. It's midnight for me at the moment, so I won't actually start making changes for ~18 hours or more (bedtime and work tomorrow before I can get to this).

bturner commented 3 months ago

Nevermind; I went ahead and did it because I'm still awake.

One thing I can't fix is that the CircleCI build appears to be unhappy. It looks like whatever SSH setup it was configured to use to clone the repository no longer works. I can see that Brett changed the POMs to use an HTTPS URL (which I'd prefer to revert, if I'm honest; I'd rather use SSH), but that doesn't change the CircleCI configuration. I don't have any sort of permissions to the CircleCI setup, though, so there's nothing I can do about it but observe the failure:

Writing SSH key for checkout to "/tmp/.circleci-ssh815780219/id_rsa"
Writing SSH public key for checkout to "/tmp/.circleci-ssh815780219/id_rsa.pub"
Cloning git repository
Cloning into '.'...
git@github.com: Permission denied (publickey).
fatal: Could not read from remote repository.

Please make sure you have the correct access rights
and the repository exists.
error running git clone "git@github.com:brettwooldridge/NuProcess.git": exit status 128

error running git clone "git@github.com:brettwooldridge/NuProcess.git": exit status 128
avrecko commented 3 months ago

I think revert is a good idea as well. Like they say do it right or do it twice. I optimized for my use case (non writing).

I think we all see there is a race condition going on. Just a question of how to go about it.

What do you guys think of moving the responsibility of calling the onStart handler method from the caller thread to the event processing thread?

The way I am understanding the code:

interface NuProcessHandler

onPreStart -> called from caller thread
onStart -> called from caller thread --- change this to processor thread?
onExit -> called from processor thread / exceptionally called from caller thread / called from jvm shutdown hook
onStdout -> called from processor thread
onStderr -> called from processor thread
onStdinReady -> called from processor thread

aside: Looking at the case where epoll registration fails. Do we need to call destroy? It looks like onExit(Integer.MIN_VALUE) is called and null is returned. But not seeing any destroy. Is this intended? Never observed any issues with epoll registration.