`NuProcessBuilder.start` does not return `null`, if process cannot be started

brettwooldridge / NuProcess

Low-overhead, non-blocking I/O, external Process implementation for Java

Apache License 2.0

710 stars 84 forks source link

`NuProcessBuilder.start` does not return `null`, if process cannot be started #114

Open thraidh opened 4 years ago

thraidh commented 4 years ago

According to the documentation NuProcessBuilder.start should return null, if there is an immediately detectable launch failure.

Unfortunately (at least on Linux) it does not.

The problem is LinProcessFactory.createProcess, where

      LinuxProcess process = new LinuxProcess(processListener);
      synchronized (LinProcessFactory.class) {
         process.start(commands, env, cwd);
      }
      return process;

should be

      LinuxProcess process = new LinuxProcess(processListener);
      synchronized (LinProcessFactory.class) {
         return process.start(commands, env, cwd);
      }

as process.start correctly returns null or the actual process instance. The factory on the other hand always returns the process instance, even if it could not be started.

thraidh commented 4 years ago

Just found that #66 and #67 would probably be fixed by changing LinProcessFactory to do what the documentation says. Also the unit test should probably be fixed, too. Since the executable should not be there, process should be null. The unit test assumes it is not null and tries to call isRunning() on it.

thraidh commented 3 years ago

I'm not sure if this should be fixed, because it might break programs, that depend on NuProcessBuilder.start to never return null, even if the documentation says otherwise.

sschuberth commented 3 years ago

So, what is the correct way then to check if the process has been correctly started? Calling isRunning() might not work if the process already (successfully) finished. Should we compare the exit code to Integer.MIN_VALUE?

sschuberth commented 3 years ago

Also see https://github.com/brettwooldridge/NuProcess/issues/67, BTW.

bturner commented 3 years ago

Per @brettwooldridge on #66, the way to tell it wasn't started is to have your NuProcessHandler implementation check for onExit being called without onStart having been called first. That's what I've ended up doing for my handlers.

thraidh commented 3 years ago

The correct way depends on the definition or view of "correct".

Architecturally the best way would be to fix this as I wrote above. Unfortunately in the real world that might break existing programs, which means that it can't or should not be done. Without any changes the best/only way to do it is described in #66, but this is actually a workaround, because the information that we actually need, was just thrown away and we have to jump through hoops to reconstruct it.

By now, I believe the best way to fix this problem is to change the documentation to reflect the current behavior and add another property to ProcessFactory that indicates that process start failed. That property would be true if LinuxProcess.start returns null.

bturner commented 3 years ago

"Correct" means the way the library author has indicated it should be done, regardless of whether we think that's the best way or not. To that definition, there is only one correct way.

Returning null is a way that could be "correct" if the library was changed to function differently, but it's obviously not "correct" for the way the library is implemented now. Personally, if the library was going to change, I'd much prefer to see it throw an exception if the process can't be started, consistent with how ProcessBuilder.start works, rather than returning null.

Adding a property on ProcessFactory is a non-starter, though, because the factory can be used to create multiple processes and can be used to do so concurrently. That createProcess internally blocks doesn't mean the method can't have concurrent callers who see near-concurrent returns. That means checking the flag would be impossible to do safely. I'll add that runProcess, which has the same issue around detecting whether the process could be started or not, couldn't be "fixed" with a null return (since it returns void, because whether the process starts or doesn't it's always complete and no longer running when runProcess returns). It could be fixed to throw an exception, though. And since runProcess also doesn't block, a property on the ProcessFactory to detect failures to start would be even more racy for that than it would be for createProcess.

thraidh commented 3 years ago

I agree. ProcessFactory is obviously the wrong place. Apparently I rolled a natural one on Computer Science.

I disagree about what the "correct" way is, though. The author has stated his intention in the documentation, which clearly states that NuProcessBuilder.start should return null, but since that does not work, he has given a workaround. Unfortunately that is currently also the only possible way, but that is not necessarily correct. To give a stupid example: consider an online form, that asks for your age and give the options "under 30" and "31 or older". If you are exactly 30, there is no correct way to fill out the form, but you can work around the problem by lying and pretending you are 31. Like the form, this library contains an error and the correct thing is to do is to fix the error. Until that is done, you can tell people how to work around the problem, but that should not be considered the "correct" way. But in the end this is a philosophical problem and of no real relevance.

To get technical again: if NuProcessBuilder.start would throw an exception, that would cause the same problems as it returning null. Existing programs will not expect that and might stop working for cases that do work now. If everything stays the same, just NuProcess (instead of ProcessFactory) gets a new property/accessor, so you can query it for startup failure, then every existing program will continue to work and future programs can make use of the new property and do not have to do a workaround anymore.

brettwooldridge commented 3 years ago

@thraidh Looking at the code, it appears that only the Windows implementation does not return null from start(). Is that the issue you are reporting? If that is the case, that is due to a limitation of Windows, and a rather infamous limitation at that.

From the Windows createProcessA documentation:

Note that the function returns before the process has finished initialization. If a required DLL cannot be located or fails to initialize, the process is terminated. To get the termination status of a process, call GetExitCodeProcess.

This means that createProcessA will return "successful launch!" only to have the process fail some tens or hundreds of milliseconds later due to a DLL failure. And there is no API to tell you "initialization has completed".

So, you would have to poll "GetExitCodeProcess" in start(), for some arbitrary amount of time. So, how long do you wait? 100ms? Initialization could still fail after that. One second? Still no guarantee that DLL initialization is complete.

Whatever you can conceive of that would allow a return of null from start() on Windows, would place a ludicrous limitation on scalability. NuProcess is meant to be a non-blocking paradigm, blocking in start() is, to make a pun, a non-starter.

This means there is no cross-platform way to either return null or throw an exception.

The decision was made long ago to return the NuProcess instance so that, for example, the user can poll the isRunning() method or call waitFor() with some timeout. But as noted in #66, the best way to detect failure is the onExit() method in the callback.

The JavaDoc is incorrect, and I will fix it.

thraidh commented 3 years ago

The problem is, that NuProcessBuilder.start's documentation says, that it will return null on a failed start. OSX' and Linux' NuProcess implementations also support that, but LinProcessFactory is discarding that piece of information and always returns the process, even if process.start actually did return null. Since Windows does not support detection of startup failure I'm happy with the current behavior in the Windows case.

Therefore making it work like the documentation says, would be very easy. The only problem is, that it could break existing programs, that are ignoring the documentation and assume, it would always work. Interestingly this includes the unit tests.

In my opinion the way to go forward is to fix the documentation and add another feature to NuProcess, where you can query for startup failure, so the information that is available for Linux and OSX is not discarded and needs to be reconstructed with some workaround.

That would be just in LinProcessFactory:

   @Override
   public NuProcess createProcess(List<String> commands, String[] env, NuProcessHandler processListener, Path cwd)
   {
      LinuxProcess process = new LinuxProcess(processListener);
      synchronized (LinProcessFactory.class) {
-         process.start(commands, env, cwd);
+         process.setStartupFailed( process.start(commands, env, cwd) == null);
      }
      return process;
   }

and of course the addition of a single boolean property to NuProcess and its implementations.

bturner commented 3 years ago

There's not really any reason for the factory to get involved in setting such a boolean; it's calling process.start, which can (and, on Linux and macOS, already does) detect and handle startup failures internally--otherwise it couldn't be returning null. That means it can set the flag itself, internally, and not expose that mutator above its API, be it public or internal API.

That might fix your "pet hate" and your use case, but there are still some pretty significant problems with it:

It does nothing to facilitate startup detection for runProcess (i.e. NuProcessBuilder.run) callers, where the NuProcess is never returned
It presents an API on NuProcess that does not work the same on every OS NuProcess supports, whereas the onStart/onExit approach does work the same on every supported OS (although, see my edit below)

It's not my library, but I'd prefer a documentation change to remove the incorrect assertion that NuProcessBuilder.start returns null on startup failure to one that adds an API that only works on 2/3s of the supported platforms and only works for half the ways the library can be used.

Edit: Actually, I suspect for the sort of delayed initialization error described in the CreateProcessA/CreateProcessW documentation, the onStart/onExit check wouldn't work any differently from a null return, or a new flag, or throwing an exception. If CreateProcessW doesn't return 0, WindowsProcess always calls onStart.

bturner commented 3 years ago

This means that createProcessA will return "successful launch!" only to have the process fail some tens or hundreds of milliseconds later due to a DLL failure. And there is no API to tell you "initialization has completed".

That's true, but there are other failure modes where startup detection does work--and that NuProcess's WindowsProcess already detects. For example, if the path doesn't exist, CreateProcessW returns 0 and WindowsProcess.prepare throws an exception which start and run both catch and log. For those types of failures, WindowsProcess.start could return null; like LinuxProcess and OsxProcess both already do.

Edit: Given WindowsProcess always calls onStart if CreateProcessW returns non-zero, that implies there's no possible way to detect a delayed initialization failure on Windows even using the NuProcessHandler; it's always going to report the process was started.

bturner commented 3 years ago

So, you would have to poll "GetExitCodeProcess" in start(), for some arbitrary amount of time. So, how long do you wait? 100ms? Initialization could still fail after that. One second? Still no guarantee that DLL initialization is complete.

This got me curious, because I doubted process start on Linux handles this case either, so I wrote a simple test and ran it on Linux to see what would happen.

First, I created an incredibly basic shared library:

[bturner@auvry ~]$ cat shared.h 
extern void say_hello();
[bturner@auvry ~]$ cat shared.c
#include <stdio.h>

void say_hello() {
    printf("Hello, World!");
}

[bturner@auvry ~]$ cat test.c 
#include "shared.h"

int main(int argc, char *argv[]) {
    say_hello();

    return 0;
}

[bturner@auvry ~]$ gcc -shared -fPIC -o libshared.so shared.c
[bturner@auvry ~]$ gcc -L. -lshared test.c

Running that without ~ in my LD_LIBRARY_PATH produces this:

bturner@auvry ~]$ ./a.out 
./a.out: error while loading shared libraries: libshared.so: cannot open shared object file: No such file or directory

If I run that same process using NuProcess, as I expected:

LinuxProcess.start (note: not LinProcessFactory.createProcess--the LinuxProcess.start method it calls) returns non-null
My NuProcessHandler.onStart is invoked
My NuProcessHandler.onStderr is invoked with: "/home/bturner/a.out: error while loading shared libraries: libshared.so: cannot open shared object file: No such file or directory\n"
My NuProcessHandler.onExit is invoked with Integer.MIN_VALUE

So the sort of case the CreateProcessA/CreateProcessW documentation is describing is not detected on Linux either. If a shared object for a given executable can't be found, the fork+exec (or, more exactly, the JVM's ProcessImpl.forkAndExec (or UNIXProcess.forkAndExec prior to Java 10), does not detect the failure either.

To me, this implies there's no appreciable difference between NuProcess's ability to detect startup failure on Windows vs. any other OS. That further implies that there should be a mechanism for relaying failure to start for the cases that can be detected that works on all platforms NuProcess supports and, ideally, that would work for both start and run--not just start.

brettwooldridge commented 3 years ago

@bturner @thraidh First, we're all agreed that the JavaDoc does not reflect the current behavior and should be fixed. My proposal is this:

Add a method on NuProcess to the effect of isStartFailure() or didStartFail(). Suggestions welcome.
Add a default interface method to NuProcessHandler, onStartFail(int).

This provides a handling route for direct interrogation or callback-based notification.

In both cases, the setting of state or callback notification will only occur in the case of synchronous failure, i.e. executable not found. The rub is that neither can be done in the case of post-launch initialization failure, such as missing shared library.

From the standpoint of my company's use-case, this notification is not particularly informative. A non-zero exit code is the only thing we care about, or from understanding our own expectation of process interaction, the premature exit of a process whether non-zero exit or not.

EDIT: If desired, the NuProcess method can return an int rather than a boolean.

bturner commented 3 years ago

Adding a default method on the interface would require making a binary-incompatible change to raise the minimum JDK version for NuProcess from 7 to 8. I don't personally have any objections to that, to be clear; it just seems worth considering.

My personal preference is for a behavior change: Throw an exception if the process doesn't start. The NuProcess code already does this, so it's mostly "just" a matter of letting that exception through.

Of course, similar to raising the minimum JDK version, that behavior change would be semantics-breaking. My proposal is to make it opt-in:

Add a new method on NuProcessBuilder, something like checkStarted, requireStart, throwOnStartFailure, something along that line
- By default, the new setting would use whatever value preserves the existing behavior of not throwing
- This means there are no semantics-breaking changes
Update the *Process classes to accept and apply the new flag

This way, for use cases that don't need to differentiate failure to start there's no change in behavior, and for use cases that do they can opt into receiving an exception. (Perhaps an IllegalStateException? Or some new strongly typed (but unchecked) exception.)

One reason I like this proposal is because it works just as well for NuProcessBuilder.run as it does for NuProcessBuilder.start. Since the library supports both synchronous and asynchronous usage, I very much prefer a solution that treats both usages equally. I also like that it doesn't require the indirection of going through the NuProcessHandler to get the information--it's provided directly to the caller. Otherwise the caller needs to retain a reference to their NuProcessHandler so they can check some secondary piece of state after start (or run) returns.

That's my 2 cents, for whatever it's worth.

brettwooldridge commented 3 years ago

@bturner Yes, I understand regarding binary compatibility. A change in the major digit of the semantic version is required. As Java 7 is past end of life, I have no problem with that.

Thanks for the feedback, I want some time to consider all of this.

My overall reservation is this. While it seems like solution, it is only partially so, because it doesn't cover (and can't cover) all "launch failures". Most users, including myself, likely consider failure due to a missing shared library as a semantically identical lunch failure to a missing executable. I posit that executables that depend upon shared libraries, whether OS or internal, far outnumber those that are fully statically linked, in the wild.

The end result is that the user ultimately has to handle both failure modes, as handling only failures such as missing executables by catching an exception would basically be a bug in the user's code -- missing an entire class of launch failures.

Given that fact, is it not the case that the existing recommended pattern covers both cases? Handling failures in onExit() covers all cases, yes? As stated above, this change would have the user handling semantically identical launch failures in two places in their code. And even while semantically identical, the uniqueness of each failure mode can still be distinguished by exit code in onExit().

bturner commented 3 years ago

I'm not sure that I agree that a binary that exists but is corrupted in a way that prevents it from starting (e.g. missing some dynamic libraries) is semantically the same as a binary which does not exist at all. If it were a web page, you'd expect a 404 for the missing URL (no such binary), but a different error (5XX probably) for it being in an invalid state. With the current NuProcess code it's not really possible to differentiate between the two.

Personally, to be clear, I don't really have a significant investment here. My handlers are already coded to check for onExit without onStart and understand that it means the program couldn't start. But there are a range of launch failures that I think could be trivially messaged to the caller, without requiring the indirection of dealing with the NuProcessHandler and having to interrogate its state, and I think there is some value in that.

Put differently, I think there's value in NuProcess having a mode that's consistent with ProcessBuilder. Launch failures like missing binaries throw exceptions, with ProcessBuilder.start, and so that's a model with which developers likely already have experience and are already comfortable.

sschuberth commented 3 years ago

My personal preference is for a behavior change: Throw an exception if the process doesn't start. The NuProcess code already does this, so it's mostly "just" a matter of letting that exception through.

To chime in from the perspective of another (future) user of this library: I'd agree here. As a user I don't really care why the process didn't start, i.e. whether it's due to a synchronous error (executable not found) or an asynchronous error (DLL / shared object not found after launch). While the difference might be valuable information to the end-user (i.e. the user of my app using NuProcess) to debug the issue, just forwarding the underlying exception message probably gives enough clue to the end-user. IMO there's no need for NuProcess to explicitly expose the difference between the synchronous and asynchronous error cases when a process starts.

Most users, including myself, likely consider failure due to a missing shared library as a semantically identical lunch failure to a missing executable.

Agreed, same here.

I'm not sure that I agree that a binary that exists but is corrupted in a way that prevents it from starting (e.g. missing some dynamic libraries) is semantically the same as a binary which does not exist at all.

It's semantically equivalent on a higher level. Think of it like running (JUnit) tests: There are test failures and test errors. Test failures mean that the test was successfully run, but the actual results did not match the expected results. Test errors mean that there was a more fundamental problem in running the test framework itself, like a misconfiguration or a bug in JUnit. These are the two semantically different error / failure cases I'd like to distinguish on my end.

So again from a fresh / new users perspective, I was puzzled to see that there are cases where onExit() is being called while onStart() was not. You cannot exit from something that hasn't even started yet.

Bottom line, in my personal opinion there should be a contract saying that onExit() will only ever get called if also onStart() was called before. And to distinguish between the error / failure semantic I described above, any (a)synchronous launch error should throw an exception (only), and onExit() should not be called in such cases. Note that it might be though (if not possible to implement otherwise on Windows) that onStart() gets called, but onExit() not, as an exception gets thrown in between because "a required DLL cannot be located or fails to initialize".

brettwooldridge commented 3 years ago

@sschuberth I think you've missed a key point of the discussion. That being that a failure caused by, for example, a missing shared library, cannot be thrown as an exception from start(), because the native "create process" function returns without error. Only a "synchronous" failure, in which create process itself fails, is feasible to throw as an exception from start().

NuProcess started life as an entirely asynchronous framework. The only "notification" of any activity was through the callback API. Some users lobbied to fully support single-threaded execution, and thus this functionality was basically "bolted on" after the fact, and an attempt made to disrupt the code as little as possible in so doing.

The invocation of the onStart() callback implies that create process succeeded, there was a PID generated by the operating system, and at least as far as start() is concerned, the process is off and running. In the case of a missing executable, onStart() cannot (should not) be called, because the process entirely and knowably failed to start.

Therefore, in the asynchronous handler model, it makes perfect sense for there to exists scenarios in which the onStart() callback is never invoked, i.e. no PID was ever generated, but because the user needs to know this fact, the onExit(exitcode) callback is used to notifiy the failure. An argument could be made that onExit() should have been named more genercially onComplete(), or something with less implication that there was a launch and therefore now exit, but that ship sailed long ago.

Of course, one could introduce a separate callback, onLaunchFailure(int), for that specific case. However, given that the signature of the onExit() and onLaunchFailure() methods would be identical, and that a delayed failure cannot be detected during launch and thus onLaunchFailure() cannot be called in that case, such a method to this engineer is superfluous.

In point of fact, a "delayed launch failure" (e.g. missing shared library) is indistinguishable from any other process exit, including normal process exit. Only the user can interpret the exitCode parameter in their particular use-case to determine whether the exit was normal or abnormal.

The only debate resolves around whether certain launch failures, i.e. the synchronous variety (no executable), should be immediately reported by throwing an exception from start(). I am onboard with Brian's suggestion of backward compatible NuProcessBuilder.throwOnStartFailure() method that would create this behavior for users who desire it.

My point in all of my debate on this topic is that because the user still has to handle delayed launch failures via a callback, the prospect of handling failures in two places in the code rather than a single place, is less desirable. Again, at least in the opinion of this engineer. But again I am amenable to implementing a non-breaking change in behavior.

thraidh commented 3 years ago

In my use case users can configure a script that is called as a callback for certain actions. Since users can configure the called programs themselves, it is a very common situation that the called program either does not exist, is misspelled or not executable.

Today all these are caught with the "on(Pre)Start-not-called" method, but that is not very discoverable. Therefore I'd like to have a method like failedToStart() on NuProcess as you suggested above. I think if (process.failedToStart()) sounds nicer than if (process.didStartFail()), but I won't fight to the death for that name.

With that method an optional exception can be as easy as:

process=builder.start();
if (process.failedToStart()) throw new MyStartupFailedException();

but I see the allure of the library providing that, and you just need

process=builder.throwOnStartFailure().start();

Anyway, the exception should be in addition to the failedToStart method.

Regarding the handler, it seems to be a little bit more complicated. If you do not want to break compatibility, onExit needs to be called in addition to onLaunchFailure and not instead. Anyway, that would make things even more complicated, because if you wanted to handle launch failures you'd have to set a flag or something to communicate to onExit that this exit was already handled. I see two solutions:

Have onLaunchFailure return a boolean whether onExit should be called or not (the default would be yes).
Add a second handler interface inheriting from the NuProcessHandler. That interface would just add onLaunchFailure, but if it is used, it would trigger a different behavior, so onLaunchFailure is called instead of onExit, if there was a launch failure.

The second option would also not require Java 8, because you do not need a default method in the interface.