Race condition in `waitForProcess`

NorfairKing commented 3 years ago

I'm not sure if this is supposed to go here or in process, but this code fails about 1 iteration in 200:

#!/usr/bin/env stack
-- stack --resolver lts-16.31 script

{-# LANGUAGE OverloadedStrings #-}

import Control.Monad
import qualified Data.ByteString as SB
import System.IO
import System.Mem
import System.Process.Typed

main :: IO ()
main =
  forM_ [1 .. 10000] $ \i -> do
    withProcessTerm (setStdin createPipe $ setStdout createPipe $ setStderr inherit $ proc "cat" []) $ \ph -> do
      putStrLn $ "Iteration: " <> show i
      let inHandle = getStdin ph
      let outHandle = getStdout ph
      let contents = "Hello world\n"
      SB.hPut inHandle contents
      hFlush inHandle
      output <- SB.hGet outHandle (SB.length contents)
      when (output /= contents) $ fail "failed to read."

I couldn't reproduce the failure without the interaction with the standard streams, so I may very well be doing something wrong, but every 200 iterations or so I see this:

Iteration: 134
Iteration: 135
Iteration: 136
repro.hs: waitForProcess: does not exist (No child processes)

NorfairKing commented 3 years ago

I tried reproducing this problem with process (without typed-process) but could not reproduce the problem like this:

#!/usr/bin/env stack
-- stack --resolver lts-16.31 script

{-# LANGUAGE OverloadedStrings #-}

import Control.Monad
import qualified Data.ByteString as SB
import System.IO
import System.Mem
import System.Process

main :: IO ()
main =
  forM_ [1 .. 10000] $ \i -> do
    withCreateProcess ((proc "cat" []) {std_in = CreatePipe, std_out = CreatePipe, std_err = Inherit}) $ \(Just inHandle) (Just outHandle) _ ph -> do
      putStrLn $ "Iteration: " <> show i
      let contents = "Hello world\n"
      SB.hPut inHandle contents
      hFlush inHandle
      output <- SB.hGet outHandle (SB.length contents)
      when (output /= contents) $ fail "failed to read."

NorfairKing commented 3 years ago

This is pretty easy to turn into a failing test. I wouldn't mind making a PR for that if it would help!

snoyberg commented 3 years ago

I'm not sure it's relevant to this test case failing, but the approach you're taking here is, generally, broken. You cannot write to stdin in the same thread that reads from stdout, since that can lead to deadlock on buffers filling up. Unlikely to be relevant here, but worth mentioning.

Reproducing case would be appreciated. I'll see if something jumps out at me in the code.

snoyberg commented 3 years ago

Also, it makes sense you can't reproduce with that process example, since it doesn't try to terminate the process afterwards (presumably where the error message is coming from).

snoyberg commented 3 years ago

Strangely enough, making this seemingly innocuous change seems to fix the problem on my system, can you confirm?

@@ -768,6 +768,7 @@ startProcess pConfig'@ProcessConfig {..} = liftIO $ do
                       -- then call waitForProcess ourselves
                       Left _ -> do
                           eres <- try $ P.terminateProcess pHandle
+                          P.getProcessExitCode pHandle
                           ec <-
                             case eres of
                               Left e

My theory: there's a subtle race condition in process where it's not fulfilling its responsibilities around waitForProcess being idempotent in some cases. Calling getProcessExitCode first forces the MVar to end up in a valid state, making the later call to waitForProcess succeed reliably.

snoyberg commented 3 years ago

The bug is definitely in process, and it's an async exception bug. It's exactly the kind of bug that typed-process is trying to work around. And unfortunately there's no obvious solution to it without breaking the goals of process.

Issue: we want waitForProcess to be interruptible. If it gets interrupted, we never get a chance to see the exit code. But if the waitpid system call succeeds, then the process is removed from the process table and can never be queried again.

In our case, the polling thread in typed-process is calling waitForProcess and gets canceled. It has to be canceled to avoid a separate race condition: terminateProcess getting called on the wrong process! However, because waitForProcess isn't actually async-exception-safe, none of this works.

I'm open to suggestions on best behavior here, but this is probably just going to remain a bug in some case no matter what, with some potential workarounds in place for common situations. The right solution is for process to handle async exceptions correctly, but I'm not sure if that's possible while allowing it to be interruptible.

NorfairKing commented 3 years ago

I'm not sure it's relevant to this test case failing, but the approach you're taking here is, generally, broken. You cannot write to stdin in the same thread that reads from stdout, since that can lead to deadlock on buffers filling up. Unlikely to be relevant here, but worth mentioning.

OH, good point! I hope this isn't an issue in this simple case. I just needed some example of interacting with an external process so I figured cat was the simplest example but I'm open to other suggestions.

Also, it makes sense you can't reproduce with that process example, since it doesn't try to terminate the process afterwards (presumably where the error message is coming from).

Wait what? I just had another look at the source and withCreateProcess calls cleanupProcess in a bracket (http://hackage.haskell.org/package/process-1.6.11.0/docs/src/System.Process.html#withCreateProcess) and cleanupProcess calls terminateProcess (http://hackage.haskell.org/package/process-1.6.11.0/docs/src/System.Process.html#cleanupProcess). Am I missing something?

My theory: there's a subtle race condition in process where it's not fulfilling its responsibilities around waitForProcess being idempotent in some cases.

Yes that seems to match my intuition as well.

The bug is definitely in process, and it's an async exception bug. It's exactly the kind of bug that typed-process is trying to work around. And unfortunately there's no obvious solution to it without breaking the goals of process. I'm open to suggestions on best behavior here, but this is probably just going to remain a bug in some case no matter what, with some potential workarounds in place for common situations. The right solution is for process to handle async exceptions correctly, but I'm not sure if that's possible while allowing it to be interruptible.

Ah shoot, that's what I was afraid of. To be quite honest, I'm not convinced that I fully grasp what's going on here, but I definitely expected my example to "just work" or at least not crash. (I'm using a fancy withX function afterall..)

Whatever we do to solve this issue, could we please make sure to have a regression test for this behaviour if we decide that this example shouldn't crash? I'd be happy to write the necessary tests.

(P.S. I still don't understand why my process example doesn't exhibit the problem.)

nh2 commented 3 years ago

Side note: This

      output <- SB.hGet outHandle (SB.length contents)
      when (output /= contents) $ fail "failed to read."

need not work. As far as I can tell, hGet reads "up to n" bytes, so output may well be Hello wo if the kernel decides so.

You need SB.hGetContents if you want to loop-read until the end.

Also nice repro!

bitc commented 2 years ago

I'm not sure I fully understand the problem, but would this idea work?

When running a process, immediately do an internal forkIO that just calls the waitpid system call. Then this mini-thread will never be interrupted directly with an async exception (and it can be internally interacted with via STM or MVars)

snoyberg commented 2 years ago

See my comment above

In our case, the polling thread in typed-process is calling waitForProcess and gets canceled. It has to be canceled to avoid a separate race condition: terminateProcess getting called on the wrong process! However, because waitForProcess isn't actually async-exception-safe, none of this works.

sol commented 1 year ago

This is the same as #69.

What's happening here @NorfairKing with your original example is that:

hGet returns with some, but not all of the output (as @nh2 already pointed out)
when (output /= contents) $ fail "failed to read." will result in an exception
We mess up in pCleanup, erroneously calling waitForProcess twice. This results in a second exception does not exist (No child processes), which masks the first one, resulting in the bug.

Or in other words: The bug is not that you get an exception in the first place, the bug is that you get the wrong exception.

A minimal reproducing example is at: https://github.com/fpco/typed-process/issues/69#issue-1705467406

Exactly how we mess up in pCleanup I described in great detail here: https://github.com/fpco/typed-process/pull/70#issue-1705495438

sol commented 1 year ago

However, because waitForProcess isn't actually async-exception-safe, none of this works.

@snoyberg we essentially have the following code (ignoring the fallback code for the single-threaded runtime):

ec <- unmask $ do
    P.waitForProcess pHandle
    -- Do we have any guarantee that we are not interrupted right here?
atomically $ putTMVar pExitCode ec

Do we have anything that guarantees us that we are not interrupted right after waitForProcess returns? If not, then I think even if waitForProcess were perfectly async-exception-safe, we would still have a bug, as at this point we lost the exit code: We successfully retrieved it from waitForProcess, but we failed to put it into pExitCode. Leaving pExitCode empty.

sol commented 1 year ago

If not, then I think even if waitForProcess were perfectly async-exception-safe, we would still have a bug

Actually no, if waitForProcess were async-exception-safe and idempotent then the current code would probably work (modulo handling of delegate_ctlc, which I think would still be broken). Somehow this slipped my attention. I naively assumed that waitForProcess behaves like waitpid here.

So, as I understand it, we have two issues:

waitForProcess is not well behaved when it gets interrupted. It is documented to be idempotent, but it loses that property when subjected to async-exceptions.
Just by looking at the type of pExitCode :: TMVar ExitCode I think it's evident that we have another bug. With this type waitExitCode can only ever return a value, never propagate an exception. Now, what value should waitExitCode return when there is no value? When waitForProcess fails with an exception (as it does when delegate_ctlc = True on ctrl-c)? It will diverge, a.k.a. deadlock.

For (1) is there an important reason that we have to interrupt waitForProcess in the first place? If yes, for what exact reason? Can we name it?

Otherwise, how about simply not to interrupt waitForProcess and leave all of the heavy lifting to async:

terminate the process with terminateProcess at which point waitForProcess will return on its own
use Async.wait waitingThread to get the exit code

For (2), we could change the type of pExitCode to TMVar (Either SomeException ExitCode) to allow waitExitCode to propagate exceptions (specifically UserInterupt when delegate_ctlc = True) instead of deadlocking. But then again, async already provides exactly that functionality. So how about relying on async yet again?

70 addresses only (1)
74 addresses both (2) on top of #70

fpco / typed-process

Race condition in `waitForProcess` #38

70 addresses only (1)

74 addresses both (2) on top of #70