Quuxplusone / LLVMBugzillaTest

0 stars 0 forks source link

Windows llvm::sys::fs::remove algorithm is *VERY* subtly incorrect, resulting in unpredictable "directory not empty" errors #27385

Open Quuxplusone opened 8 years ago

Quuxplusone commented 8 years ago
Bugzilla Link PR27386
Status NEW
Importance P normal
Reported by Alexander Riccio (test35965@gmail.com)
Reported on 2016-04-16 15:53:15 -0700
Last modified on 2016-08-22 05:37:49 -0700
Version trunk
Hardware PC Windows NT
CC aaron@aaronballman.com, benny.kra@gmail.com, compnerd@compnerd.org, douglas_yung@playstation.sony.com, hfinkel@anl.gov, llvm-bugs@lists.llvm.org, paul_robinson@playstation.sony.com, rnk@google.com, test35965@gmail.com, vsk@apple.com, yaron.keren@gmail.com, zturner@google.com
Fixed by commit(s)
Attachments file_27386.txt (4987 bytes, text/plain)
llvm_poor_file_deletion.PNG (54324 bytes, image/png)
file_27386.txt (908 bytes, text/plain)
Blocks
Blocked by
See also
Created attachment 16224
The MSBuild output of two tests failing because of this bug. There's a third
failure (tools/llvm-symbolizer/ppc64.test), unrelated to this bug.

Under heavy IO load, removing directories on Windows with llvm::sys::fs:remove
will fail unpredictably.

I've just observed the TestWriteWithSystemEncoding test in
./llvm/unittests/Support/ProgramTest.cpp failing because of this. That test
creates a kinda-unique directory named "program-test-%%%%%%" (where % is a
pseudorandom alphanumeric), creates a file named "international-file.txt" in
it, and roundtrip-tests some text (writes some text with one encoding, reads it
with another, checks if it's not garbled). After that, it deletes the file
(llvm::sys::fs::remove(file_pathname.str())) and then the directory
(llvm::sys::fs::remove(TestDirectory.str())). Essentially, when Windows
"Deletes" a file, it doesn't actually Delete the file; when the system is under
load, "international-file.txt" may STILL be in the directory by the time of
llvm::sys::fs::remove(TestDirectory.str()).

The details of WHY this will fail are very counter-intuitive, and you should
see "CppCon 2015: Niall Douglas “Racing The File System"
(https://youtu.be/uhRWMGBjlO8?t=7m35s) for *specifically* why the code is
wrong. Lots of programs get this wrong, Douglas mentions CPython, and I've seen
it in Chromium.

Sidenote: The documentation for RemoveDirectory (https://msdn.microsoft.com/en-
us/library/windows/desktop/aa365488.aspx) is kinda crappy, because it says
"RemoveDirectory removes a directory junction, even if the contents of the
target are not empty; the function removes directory junctions regardless of
the state of the target object.", but "Creating and Deleting Directories"
(https://msdn.microsoft.com/en-us/library/windows/desktop/aa363872.aspx) says:
"To delete an existing directory, use the RemoveDirectory or
RemoveDirectoryTransacted function. Before removing a directory, you must
ensure that the directory is empty and that you have the delete access
privilege for the directory. To do the latter, call the GetSecurityInfo
function."
Quuxplusone commented 8 years ago

Attached file_27386.txt (4987 bytes, text/plain): The MSBuild output of two tests failing because of this bug. There's a third failure (tools/llvm-symbolizer/ppc64.test), unrelated to this bug.

Quuxplusone commented 8 years ago

Attached llvm_poor_file_deletion.PNG (54324 bytes, image/png): By the time I check the folder, it is empty. Windows "marks" files for deletion, and then later actually unlinks/deletes them. Here, the contents of this directory weren't actually deleted by the time that a unittest tried to delete the directory.

Quuxplusone commented 8 years ago

Note that the function with the bug is in ./llvm/lib/Support/Windows/Path.inc, NOT in ./llvm/lib/Support/Unix/Path.inc

Sidenote: maybe "System Libraries" is the proper component?

Quuxplusone commented 8 years ago

Attached file_27386.txt (908 bytes, text/plain): This is the source code of llvm::sys::fs::remove, for your convenience.

Quuxplusone commented 8 years ago

Maybe bash rm on Windows has the same problem?

http://lists.llvm.org/pipermail/cfe-commits/Week-of-Mon-20150608/130437.html

Quuxplusone commented 8 years ago
(In reply to comment #4)
> Maybe bash rm on Windows has the same problem?
>
> http://lists.llvm.org/pipermail/cfe-commits/Week-of-Mon-20150608/130437.html

Heh, I wouldn't be surprised. I'm a bit rusty on my shell scripting - what do
the changed lines in that commit do?
Quuxplusone commented 8 years ago
(In reply to comment #4)
> Maybe bash rm on Windows has the same problem?
>
> http://lists.llvm.org/pipermail/cfe-commits/Week-of-Mon-20150608/130437.html

Heh, I wouldn't be surprised. I'm a bit rusty on my shell scripting - what do
the changed lines in that commit do?
Quuxplusone commented 8 years ago

The test had similar problem to what you describe: a file deleted by rm still existed when the next command is run.

At first some workarounds were tried to avoid removing the file and immediately using it but it still often failed and was disabled on Windows:

http://lists.llvm.org/pipermail/cfe-commits/Week-of-Mon-20150608/130450.html

The rm command may have the same problem you found in llvm::sys::fs:remove and the other programs mentioned.

Quuxplusone commented 8 years ago

So, is this a bug in remove or in the users?

Should we just change ProgramTest.cpp to not use a directory and document the restriction on remove? Anyone that really needs to delete directories then should use rename + remove.

Quuxplusone commented 8 years ago
If I understand correctly, the problem happens even when erasing one file.
When the system is loaded Windows may defer the actual file deletion.
This may lead to all sorts of problem beside erasing the containing directory
since remove() user expects the file be gone upon return.

Just below Windows\Path.inc remove(), rename() waits in a loop (with sleep of
1ms):

  // Retry while we see recoverable errors.
  // System scanners (eg. indexer) might open the source file when it is written
  // and closed.

  bool TryReplace = true;

  for (int i = 0; i < 2000; i++) {
    if (i > 0)
      ::Sleep(1);

    if (TryReplace) {
...

so remove() may do the same, waiting for the file/path to be actually deleted
before return. Or it can do the rename/delete combo which makes more sense to
avoid the wait loop.
Quuxplusone commented 8 years ago
(In reply to comment #8)
> So, is this a bug in remove or in the users?
>
> Should we just change ProgramTest.cpp to not use a directory and document
> the restriction on remove? Anyone that really needs to delete directories
> then should use rename + remove.

I think I'll try and write a patch to fix this (in llvm::sys::fs::remove). It's
better for the function to just do the right thing, because this issue is
subtle enough to begin with that it's unlikely someone will get it right in
their own implementation.
Quuxplusone commented 8 years ago
So here's what I'm thinking about the patch:

It'll use the same basic algorithm that Niall Douglas used in Boost AFIO v1
(see:
https://github.com/BoostGSoC13/boost.afio/blob/b35ed5b9783a4dfde9b3d611dfb940739f303c97/include/boost/afio/v2/detail/impl/afio_iocp.ipp#L744),
except using the Win32 API instead of the NT kernel/native API.

(separating into two comments so that my text doesn't evaporate)

AFIO:
1. Opens the file with NtCreateFile (with ShareAccess set to FILE_SHARE_DELETE,
CreateOptions set to FILE_DELETE_ON_CLOSE).
2. then (in random_string)  uses the NT kernel/native API RtlGenRandom to
generate a cryptographically random bytebuffer(in random_fill)
3. converts the bytebuffer to a std::string (in to_hex_string)
4. renames the file to the random name with NtSetInformationFile (with
FileInformationClass set to FileRenameInformation)
5. deletes the file by calling NtSetInformationFile with
FileDispositionInformation (where FILE_DISPOSITION_INFORMATION.DeleteFile is
set to true)
Quuxplusone commented 8 years ago
(part 2, what I think the patch will do)

Since LLVM doesn't use the NT kernel/native API, we'll need to do it a little
bit differently.

The biggest difference is that we will use CryptGenRandom, the Win32 public
facing API for generating random bytebuffers - internally it calls RtlGenRandom
-  and it's already used in LLVM (see ./llvm/lib/Support/Windows/Process.inc
function Process::GetRandomNumber).

We will:
1. Open the file with CreateFileW (with dwShareMode set to FILE_SHARE_DELETE,
dwFlagsAndAttributes set to FILE_FLAG_DELETE_ON_CLOSE)
2. Create a 64 character SmallVector<char>
3. Get a handle to a crypto provider with CryptAcquireContextW, store it in a
ScopedCryptContext, and pass that to CryptGenRandom to fill the SmallVector
with random data
4. Rename the file with MoveFileW, passing the SmallVector.data() casted to a
PCWSTR (??)
5. Delete the file with DeleteFileW (not sure why AFIO explicitly deletes the
file, after opening it with FILE_DELETE_ON_CLOSE, I'm looking into this)
6. Close the file handle.

Questions remaining:
1. Should we use  llvm::sys::fs::rename instead of MoveFileW?
2. Should we handle paths longer than "MAX_PATH"? (Which is totally supported,
with a funky syntax)
3. What's going on in AFIO's to_hex_string? All the bit shifting is giving me a
headache. I'm going to ask Niall, the docs say that the bitshifting is part of
packing bytes into the string, but I'm not sure why the bitshifting is
necessary?
4. Do we need to call RemoveDirectoryW to delete an empty directory, or will
DeleteFile do?
Quuxplusone commented 8 years ago

Minor comments, the code needs to keep the UTF16 (widenPath) support, see stuff at end of WindowsSupport.h and maybe sys::fs::createTemporaryFile (or the lower level createUniqueEntity) could be reused to create the temporary filename.

Quuxplusone commented 8 years ago
Maybe I'm missing something, but how about just using the Windows function
whose entire purpose is to recursively remove a directory and all the files in
it?

SHFILEOPSTRUCT FileOp = {0};
FileOp.hwnd = nullptr;
FileOp.wFunc = FO_DELETE;
FileOp.pFrom = <path of directory to delete with double null termination>;
FileOp.pTo = nullptr;
FileOp.fFlags = FOF_NO_UI;
SHFileOperation(&FileOp);
Quuxplusone commented 8 years ago

If we don't care about supporting pre-Vista, then the "recommended" way to do this is through the IFileOperation interface, btw.

Note this is the same code explorer uses when you select a directory and hit Delete, so it definitely is robust under system load.

Quuxplusone commented 8 years ago
(In reply to comment #15)
> If we don't care about supporting pre-Vista, then the "recommended" way to
> do this is through the IFileOperation interface, btw.

We don't care about supporting pre-Vista. We only care about Windows 7 and
later (see WindowsSupport.h:30).
Quuxplusone commented 8 years ago
(In reply to comment #14)
> Maybe I'm missing something, but how about just using the Windows function
> whose entire purpose is to recursively remove a directory and all the files
> in it?
>
> SHFILEOPSTRUCT FileOp = {0};
> FileOp.hwnd = nullptr;
> FileOp.wFunc = FO_DELETE;
> FileOp.pFrom = <path of directory to delete with double null termination>;
> FileOp.pTo = nullptr;
> FileOp.fFlags = FOF_NO_UI;
> SHFileOperation(&FileOp);

Mainly, It's horrifically slow. I'm also not confident that it's robust just
because explorer uses it... I've had issues deleting directory trees in
explorer under heavy load in the past.
Quuxplusone commented 8 years ago

Do you have benchmarks comparing its performance to the existing implementation? I have had probelms deleting from explorer too, but the problems I've had have not been of a racy nature. I've never experienced a problem where if I try to delete something N times it fails the first N-1 times and succeeds the N'th time. The problems have always been "this file appears to be locked for no reason and there's nothing I can to delete it".

Given that IFileOperation is basically the workhorse of file deletion and is what you get every time you press Delete in windows explorer, I have to imagine that Microsoft has tried very hard to optimize it to the best of their ability, and that if it's slower than some other hand-rolled mechanism, there is a good reason for it. Otherwise why wouldn't they use the faster mechanism?

I'm not saying it's impossible they have a poor implementation of IFileOperation, but I think it would be worth doing due diligence and producing some benchmarks to solidify this belief. There's point committing complicated code if simple code will do the job.

Quuxplusone commented 8 years ago
(In reply to comment #18)
> Do you have benchmarks comparing its performance to the existing
> implementation?

Actually, no. Right there, that's enough of a reason for me to agree.

> Given that IFileOperation is basically the workhorse of file deletion and is
> what you get every time you press Delete in windows explorer, I have to
> imagine that Microsoft has tried very hard to optimize it to the best of
> their ability, and that if it's slower than some other hand-rolled
> mechanism, there is a good reason for it.  Otherwise why wouldn't they use
> the faster mechanism?
>
> I'm not saying it's impossible they have a poor implementation of
> IFileOperation, but I think it would be worth doing due diligence and
> producing some benchmarks to solidify this belief.  There's point committing
> complicated code if simple code will do the job.