Open NorfairKing opened 3 years ago
I tried reproducing this problem with process
(without typed-process
) but could not reproduce the problem like this:
#!/usr/bin/env stack
-- stack --resolver lts-16.31 script
{-# LANGUAGE OverloadedStrings #-}
import Control.Monad
import qualified Data.ByteString as SB
import System.IO
import System.Mem
import System.Process
main :: IO ()
main =
forM_ [1 .. 10000] $ \i -> do
withCreateProcess ((proc "cat" []) {std_in = CreatePipe, std_out = CreatePipe, std_err = Inherit}) $ \(Just inHandle) (Just outHandle) _ ph -> do
putStrLn $ "Iteration: " <> show i
let contents = "Hello world\n"
SB.hPut inHandle contents
hFlush inHandle
output <- SB.hGet outHandle (SB.length contents)
when (output /= contents) $ fail "failed to read."
This is pretty easy to turn into a failing test. I wouldn't mind making a PR for that if it would help!
I'm not sure it's relevant to this test case failing, but the approach you're taking here is, generally, broken. You cannot write to stdin
in the same thread that reads from stdout
, since that can lead to deadlock on buffers filling up. Unlikely to be relevant here, but worth mentioning.
Reproducing case would be appreciated. I'll see if something jumps out at me in the code.
Also, it makes sense you can't reproduce with that process
example, since it doesn't try to terminate the process afterwards (presumably where the error message is coming from).
Strangely enough, making this seemingly innocuous change seems to fix the problem on my system, can you confirm?
@@ -768,6 +768,7 @@ startProcess pConfig'@ProcessConfig {..} = liftIO $ do
-- then call waitForProcess ourselves
Left _ -> do
eres <- try $ P.terminateProcess pHandle
+ P.getProcessExitCode pHandle
ec <-
case eres of
Left e
My theory: there's a subtle race condition in process
where it's not fulfilling its responsibilities around waitForProcess
being idempotent in some cases. Calling getProcessExitCode
first forces the MVar
to end up in a valid state, making the later call to waitForProcess
succeed reliably.
The bug is definitely in process
, and it's an async exception bug. It's exactly the kind of bug that typed-process
is trying to work around. And unfortunately there's no obvious solution to it without breaking the goals of process
.
Issue: we want waitForProcess
to be interruptible. If it gets interrupted, we never get a chance to see the exit code. But if the waitpid
system call succeeds, then the process is removed from the process table and can never be queried again.
In our case, the polling thread in typed-process is calling waitForProcess
and gets canceled. It has to be canceled to avoid a separate race condition: terminateProcess getting called on the wrong process! However, because waitForProcess
isn't actually async-exception-safe, none of this works.
I'm open to suggestions on best behavior here, but this is probably just going to remain a bug in some case no matter what, with some potential workarounds in place for common situations. The right solution is for process
to handle async exceptions correctly, but I'm not sure if that's possible while allowing it to be interruptible.
I'm not sure it's relevant to this test case failing, but the approach you're taking here is, generally, broken. You cannot write to stdin in the same thread that reads from stdout, since that can lead to deadlock on buffers filling up. Unlikely to be relevant here, but worth mentioning.
OH, good point! I hope this isn't an issue in this simple case. I just needed some example of interacting with an external process so I figured cat
was the simplest example but I'm open to other suggestions.
Also, it makes sense you can't reproduce with that process example, since it doesn't try to terminate the process afterwards (presumably where the error message is coming from).
Wait what? I just had another look at the source and withCreateProcess
calls cleanupProcess
in a bracket
(http://hackage.haskell.org/package/process-1.6.11.0/docs/src/System.Process.html#withCreateProcess) and cleanupProcess
calls terminateProcess
(http://hackage.haskell.org/package/process-1.6.11.0/docs/src/System.Process.html#cleanupProcess).
Am I missing something?
My theory: there's a subtle race condition in process where it's not fulfilling its responsibilities around waitForProcess being idempotent in some cases.
Yes that seems to match my intuition as well.
The bug is definitely in process, and it's an async exception bug. It's exactly the kind of bug that typed-process is trying to work around. And unfortunately there's no obvious solution to it without breaking the goals of process. I'm open to suggestions on best behavior here, but this is probably just going to remain a bug in some case no matter what, with some potential workarounds in place for common situations. The right solution is for process to handle async exceptions correctly, but I'm not sure if that's possible while allowing it to be interruptible.
Ah shoot, that's what I was afraid of. To be quite honest, I'm not convinced that I fully grasp what's going on here, but I definitely expected my example to "just work" or at least not crash. (I'm using a fancy withX
function afterall..)
Whatever we do to solve this issue, could we please make sure to have a regression test for this behaviour if we decide that this example shouldn't crash? I'd be happy to write the necessary tests.
(P.S. I still don't understand why my process
example doesn't exhibit the problem.)
Side note: This
output <- SB.hGet outHandle (SB.length contents) when (output /= contents) $ fail "failed to read."
need not work. As far as I can tell, hGet
reads "up to n
" bytes, so output
may well be Hello wo
if the kernel decides so.
You need SB.hGetContents
if you want to loop-read until the end.
Also nice repro!
I'm not sure I fully understand the problem, but would this idea work?
When running a process, immediately do an internal forkIO
that just calls the waitpid
system call. Then this mini-thread will never be interrupted directly with an async exception (and it can be internally interacted with via STM or MVars)
See my comment above
In our case, the polling thread in typed-process is calling waitForProcess and gets canceled. It has to be canceled to avoid a separate race condition: terminateProcess getting called on the wrong process! However, because waitForProcess isn't actually async-exception-safe, none of this works.
This is the same as #69.
What's happening here @NorfairKing with your original example is that:
hGet
returns with some, but not all of the output (as @nh2 already pointed out)when (output /= contents) $ fail "failed to read."
will result in an exceptionpCleanup
, erroneously calling waitForProcess
twice. This results in a second exception does not exist (No child processes)
, which masks the first one, resulting in the bug.Or in other words: The bug is not that you get an exception in the first place, the bug is that you get the wrong exception.
A minimal reproducing example is at: https://github.com/fpco/typed-process/issues/69#issue-1705467406
Exactly how we mess up in pCleanup
I described in great detail here: https://github.com/fpco/typed-process/pull/70#issue-1705495438
However, because waitForProcess isn't actually async-exception-safe, none of this works.
@snoyberg we essentially have the following code (ignoring the fallback code for the single-threaded runtime):
ec <- unmask $ do
P.waitForProcess pHandle
-- Do we have any guarantee that we are not interrupted right here?
atomically $ putTMVar pExitCode ec
Do we have anything that guarantees us that we are not interrupted right after waitForProcess
returns? If not, then I think even if waitForProcess
were perfectly async-exception-safe, we would still have a bug, as at this point we lost the exit code: We successfully retrieved it from waitForProcess
, but we failed to put it into pExitCode
. Leaving pExitCode
empty.
If not, then I think even if
waitForProcess
were perfectly async-exception-safe, we would still have a bug
Actually no, if waitForProcess
were async-exception-safe and idempotent then the current code would probably work (modulo handling of delegate_ctlc
, which I think would still be broken). Somehow this slipped my attention. I naively assumed that waitForProcess
behaves like waitpid
here.
So, as I understand it, we have two issues:
waitForProcess
is not well behaved when it gets interrupted. It is documented to be idempotent, but it loses that property when subjected to async-exceptions.pExitCode :: TMVar ExitCode
I think it's evident that we have another bug. With this type waitExitCode
can only ever return a value, never propagate an exception. Now, what value should waitExitCode
return when there is no value? When waitForProcess
fails with an exception (as it does when delegate_ctlc = True
on ctrl-c
)? It will diverge, a.k.a. deadlock.For (1) is there an important reason that we have to interrupt waitForProcess
in the first place? If yes, for what exact reason? Can we name it?
Otherwise, how about simply not to interrupt waitForProcess
and leave all of the heavy lifting to async
:
terminateProcess
at which point waitForProcess
will return on its ownAsync.wait waitingThread
to get the exit codeFor (2), we could change the type of pExitCode
to TMVar (Either SomeException ExitCode)
to allow waitExitCode
to propagate exceptions (specifically UserInterupt
when delegate_ctlc = True
) instead of deadlocking. But then again, async
already provides exactly that functionality. So how about relying on async
yet again?
I'm not sure if this is supposed to go here or in
process
, but this code fails about 1 iteration in 200:I couldn't reproduce the failure without the interaction with the standard streams, so I may very well be doing something wrong, but every 200 iterations or so I see this: