FFI::Temp race among parallel programs

ppisar commented 3 years ago

I observed a test failure like this:

    # Failed test 'name'
    # at t/ffi_build.t line 119.
    # Caught exception in subtest: Error in tempdir() using /builddir/build/BUILD/FFI-Platypus-1.50/.tmp/XXXXXXXXXX: Parent directory (/builddir/build/BUILD/FFI-Platypus-1.50/.tmp) does not exist at /builddir/build/BUILD/FFI-Platypus-1.50/blib/lib/FFI/Temp.pm line 59.
# Failed test 'build'
# at t/ffi_build.t line 122.

I perform FFI-Platypus tests in parallel and that mean that a test can perform FFI::Temp->newdir() while another test performs FFI::Temp END block. There seems to be a race between creating and deleting ./tmp directory:

A process enters File::Temp::_root() and observes the directory already exists and is paused by a CPU scheduler right after the check for the directory, but before creating a lock file:

sub _root
{
  my $root = File::Spec->rel2abs(File::Spec->catdir(".tmp"));
  unless(-d $root)
  {
    mkdir $root or die "unable to create temp root $!";
  }
→ SLEEP HERE
  # TODO: doesn't account for fork...
  my $lock = File::Spec->catfile($root, "l$$");
  unless(-f $lock)
  {
    open my $fh, '>', $lock;
    close $fh;
  }
  $root{$root} = 1;
  $root;
}

Then another process enters END block, deletes all lock files, attempt delete the directory, that succeeds (becaus it is empty) and returns back:

END {
  foreach my $root (keys %root)
  {
    my $lock = File::Spec->catfile($root, "l$$");
    unlink $lock;
    # try to delete if possible.
    # if not possible then punt
    rmdir $root if -d $root;
  }
}
→ REACH HERE

Finally, the first process continues, it attempts to create a lock file, it fails (because the directory does not exist) unnoticed, register the lock file and returns back to FFI::Temp->newdir() which then calls File::Temp->newdir() which reports the missing ./tmp directory specified as DIR argument:

sub newdir
{
  my $class = shift;
  croak "uneven" if @_ % 2;
→ File::Temp->newdir(DIR => _root, @_);
}

I could patch the tests to run serially, but that does not fix the problem in the FFI::Temp library which can happen in any FFI::Platypus user.

An immediate fix is to raise an error on close() failure in _root(). That will prevent from the race, but it won't be robust enough for a smooth user experience. Applications would have to retry FFI::Platypus. Modern operating systems provides a way of creating files relative to a opened directory descriptor. But that would require File::Temp to use it. Once could maybe use file locks as synchronization primitive. And maybe the easiest and portable approach would be retry creating the directory if creating the file lock fails because of a missing directory. With a loop inside _root().

plicease commented 3 years ago

Wasn't initially able to reproduce this, though I do agree there is a race here.

https://github.com/PerlFFI/FFI-Platypus/pull/348

Implements a retry mode, which I think would be the most appropriate fix. Using proper file system locks would almost certainly be the most reliable option, but not all filesystems implement them properly or at all.

There still is a race because we only retry a finite number of times in this PR. The likelihood of a fail is reduced with even a few retries. I don't think there is a reliable way to know why File::Temp failed, otherwise I could key off that.

I am releasing this as dev version FFI-Platypus-1.54_01.tar.gz, I appreciate your feedback on this since I wasn't able to quickly reproduce this race myself.

ppisar commented 3 years ago

Thanks for the quick fix. I'm also unable to reproduce it at will. I tested FFI-Platypus-1.54_01 and it looks good so far.

plicease commented 3 years ago

I'm going to close this for now, but we can re-open if it pops up again.

PerlFFI / FFI-Platypus

FFI::Temp race among parallel programs #344