fpco / unliftio

The MonadUnliftIO typeclass for unlifting monads to IO
Other
151 stars 51 forks source link

Hang in `withSystemTempDirectory` #40

Closed BardurArantsson closed 5 years ago

BardurArantsson commented 5 years ago

This is a bit of a weird one... apologies that I can't provide more information.

I'm seeing hangs in withSystemTempDirectory where the CPU remains pegged at 100%. Unfortunately, I haven't been able to reproduce using a small program, so if possible I'd like advice on how to debug further.

I've got an strace snippet from the call to withSystemTempDirectory:

stat("/tmp", {st_mode=S_IFDIR|S_ISVTX|0777, st_size=760, ...}) = 0
lstat("/tmp", {st_mode=S_IFDIR|S_ISVTX|0777, st_size=760, ...}) = 0
--- SIGVTALRM {si_signo=SIGVTALRM, si_code=SI_TIMER, si_timerid=0, si_overrun=0, si_value={int=0, ptr=NULL}} ---
rt_sigreturn({mask=[]})                 = 1
--- SIGVTALRM {si_signo=SIGVTALRM, si_code=SI_TIMER, si_timerid=0, si_overrun=0, si_value={int=0, ptr=NULL}} ---
rt_sigreturn({mask=[]})                 = 1
--- SIGVTALRM {si_signo=SIGVTALRM, si_code=SI_TIMER, si_timerid=0, si_overrun=0, si_value={int=0, ptr=NULL}} ---
rt_sigreturn({mask=[]})                 = 1
--- SIGVTALRM {si_signo=SIGVTALRM, si_code=SI_TIMER, si_timerid=0, si_overrun=0, si_value={int=0, ptr=NULL}} ---
rt_sigreturn({mask=[]})                 = 1
--- SIGVTALRM {si_signo=SIGVTALRM, si_code=SI_TIMER, si_timerid=0, si_overrun=0, si_value={int=0, ptr=NULL}} ---
rt_sigreturn({mask=[]})                 = 1
[...]
timer_settime(0, 0, {it_interval={tv_sec=0, tv_nsec=0}, it_value={tv_sec=0, tv_nsec=0}}, NULL) = 0
rt_sigreturn({mask=[]})                 = 1

AFAICT the SIGVTALRM stuff is from the GHC RTS and not really related to withSystemTempDirecotry.

Any ideas on how to get more debug info that could be of use tracking this down? The problem doesn't reproduce when I try a trivial "main".

(Aside: Would it not be better to use a random number instead of the PID when choosing candidate file names? Given the current implementation a machine-local DoS would be pretty trivial.)

snoyberg commented 5 years ago

The code in unliftio was lifted from the temporary package. Can you check if you have the same problems there?

BardurArantsson commented 5 years ago

So I tried that and unfortunately it still hangs. Here's the relevant part of the strace output:

stat("/tmp", {st_mode=S_IFDIR|S_ISVTX|0777, st_size=700, ...}) = 0
lstat("/tmp", {st_mode=S_IFDIR|S_ISVTX|0777, st_size=700, ...}) = 0
select(4, [3], [], NULL, {tv_sec=29, tv_usec=997764}) = ? ERESTARTNOHAND (To be restarted if no handler)
--- SIGVTALRM {si_signo=SIGVTALRM, si_code=SI_TIMER, si_timerid=0, si_overrun=0, si_value={int=0, ptr=NULL}} ---
rt_sigreturn({mask=[]})                 = -1 EINTR (Interrupted system call)
select(4, [3], [], NULL, {tv_sec=29, tv_usec=989203}) = ? ERESTARTNOHAND (To be restarted if no handler)
--- SIGVTALRM {si_signo=SIGVTALRM, si_code=SI_TIMER, si_timerid=0, si_overrun=0, si_value={int=0, ptr=NULL}} ---
rt_sigreturn({mask=[]})                 = -1 EINTR (Interrupted system call)
select(4, [3], [], NULL, {tv_sec=29, tv_usec=979323}) = ? ERESTARTNOHAND (To be restarted if no handler)
--- SIGVTALRM {si_signo=SIGVTALRM, si_code=SI_TIMER, si_timerid=0, si_overrun=0, si_value={int=0, ptr=NULL}} ---
[...]
--- SIGVTALRM {si_signo=SIGVTALRM, si_code=SI_TIMER, si_timerid=0, si_overrun=0, si_value={int=0, ptr=NULL}} ---
timer_settime(0, 0, {it_interval={tv_sec=0, tv_nsec=0}, it_value={tv_sec=0, tv_nsec=0}}, NULL) = 0
rt_sigreturn({mask=[]})                 = -1 EINTR (Interrupted system call)
select(4, [3], [], NULL, {tv_sec=29, tv_usec=683398}

I notice that there's now a call to select and this time there's no 100% CPU usage. (Though the call to select could be a red herring, of course.)

Very strange. I suppose I might have to run with a patched 'temporary' or unliftio with a few putStrLn sprinkled in to figure out where it's going wrong.

BardurArantsson commented 5 years ago

Ok, so I've tried with an 'instrumented' temporary-1.3, where I have

withSystemTempDirectory :: (MonadIO m, MC.MonadMask m) =>
                           String   -- ^ Directory name template
                        -> (FilePath -> m a) -- ^ Callback that can use the directory
                        -> m a
withSystemTempDirectory template action = do
  liftIO $ putStrLn "withSystemTempDirectory!"
  tmpDir <- liftIO getCanonicalTemporaryDirectory
  liftIO $ putStrLn $ "tmpDir: " ++ show tmpDir
  withTempDirectory tmpDir template action

withTempDirectory :: (MC.MonadMask m, MonadIO m) =>
                     FilePath -- ^ Parent directory to create the directory in
                  -> String   -- ^ Directory name template
                  -> (FilePath -> m a) -- ^ Callback that can use the directory
                  -> m a
withTempDirectory targetDir template f = do
  liftIO $ putStrLn "withTempDirectory called!"
  MC.bracket
    (liftIO (do { putStrLn "OPEN!"; createTempDirectory targetDir template }))
    (liftIO . ignoringIOErrors . removeDirectoryRecursive)
    f

Here's the output I get:

withSystemTempDirectory!
tmpDir: "/tmp"
withTempDirectory called!

so it appears that the "open" bit of the bracket actually never gets called. It hangs somewhere in between the putStrLn "withTempDirectory called!" and the "open" of the bracket.

I'll try with a vendored unliftio some time in the next few days.

(Btw, the createTempDirectory function in temporary-1.3 does use random numbers for the file name, so it might be worth updating the implementation in unliftio.)

snoyberg commented 5 years ago

What Monad are you running this in? Does it occur in plain IO? From what I can tell in this output, it's spinning inside of bracket itself.

BardurArantsson commented 5 years ago

It's a simple newtype wrapper around a ReaderT, but I am using DerivingVia (which is rather new...) to derive all the instances. I'll try to write the instances manually instead (should be trivial) to see if that changes anything.

As I say, I haven't been able to reproduce using simple examples, but rewriting my code to use IO would be... cumbersome.

BardurArantsson commented 5 years ago

This is quite interesting. My next experiment was:

... and suddenly everything works again! Next, I'll try to see if I can isolate the issue to any particular DerivingVia instance.

(It's increasingly looking like a problem with DerivingVia.)

BardurArantsson commented 5 years ago

OK, so I've found the issue and it's a bit embarrassing, honestly.

In my eagerness to use DerivingVia, I had the following code:

newtype EnvironmentT e m a = EnvironmentT { unEnvironmentT :: ReaderT e m a }
  deriving MonadTrans via ReaderT e
  deriving (Functor, Applicative, Monad) via ReaderT e m
  deriving MonadIO via ReaderT e m
  deriving MonadUnliftIO via EnvironmentT e m

The astute reader will notice that one of those deriving lines is not quite like the other. That last line just derives the MonadUnliftIO instance for EnvironmentT from... itself -- hence it just loops indefinitely at runtime. D'oh!

If I fix it to ReaderT I get the following error:

? Couldn't match representation of type ?m (Control.Monad.IO.Unlift.UnliftIO
                                              (EnvironmentT e m))?
                           with that of ?m (Control.Monad.IO.Unlift.UnliftIO (ReaderT e m))?
    arising from the coercion of the method ?askUnliftIO?
      from type ?ReaderT
                   e m (Control.Monad.IO.Unlift.UnliftIO (ReaderT e m))?
           to type ?EnvironmentT
                   e m (Control.Monad.IO.Unlift.UnliftIO (EnvironmentT e m))?
  NB: We cannot know what roles the parameters to ?m? have;
    we must assume that the role is nominal
? When deriving the instance for (MonadUnliftIO (EnvironmentT e m))

which I take to mean that the instance cannot be derived automatically (perhaps due to the way askUnliftIO uses its parameters?). I'll write the instance manually for now. (I wonder if it could be made to work, though.)

Sorry to have wasted your time.

snoyberg commented 5 years ago

No worries, I'm glad it worked out. This isn't the first time I've seen a bug pop up because of something like this :)

snoyberg commented 5 years ago

Thinking about this a bit more, the fact that this compiled without a warning or error seems like a limitation in the DerivingVia implementation. Would you consider opening up a GHC issue for this?

BardurArantsson commented 5 years ago

I actually briefly considered it, but that was more in the context where I thought it might actually be a bona fide bug in DerivingVia.

It seems like such an obvious thing to warn on a self-recursive definition, but I'm guessing it actually gets really complicated when various extensions are involved. My layman's guess would be that it's actually undecidable whether there's self-recursion. Regardless, I'll try to find time during the week to raise it on the GHC issue tracker as a potential issue. That way, we can get all the crazy smart people thinking about it, at least.

snoyberg commented 5 years ago

I like nerd sniping the smart people like that ;)