Quuxplusone / LLVMBugzillaTest

0 stars 0 forks source link

TempFile leaks temporary files on Windows #37589

Open Quuxplusone opened 6 years ago

Quuxplusone commented 6 years ago
Bugzilla Link PR38616
Status NEW
Importance P normal
Reported by Robbert Haarman (llvm@inglorion.net)
Reported on 2018-08-17 12:00:46 -0700
Last modified on 2021-01-22 08:03:46 -0800
Version trunk
Hardware PC Windows NT
CC llvm-bugs@lists.llvm.org, llvm-dev@redking.me.uk, rnk@google.com
Fixed by commit(s)
Attachments
Blocks
Blocked by
See also PR38467

Files created by TempFile::create() are supposed to be removed automatically. On Windows, this does not always happen. This can be reproduced by performing a large ThinLTO or CFI build and exiting it with Control+C during linking, or with a small test program that spawns 100 threads that each create a TempFile and then use keep() to rename it to the same final name, and then having the process exit (regular exit will do, but also report_fatal_error, signals, ...) before all threads have finished.

Quuxplusone commented 6 years ago

The problem here is that there are two race conditions in TempFile on Windows. The first occurs between the time the file is created and the time we set the DeleteFile flag. The second occurs when in keep(), we clear the DeleteFile flag, then do a potentially time-consuming rename operation. My test program hits both of these races.

Quuxplusone commented 6 years ago

Short summary: I propose to improve this by making TempFile::keep() do the rename first, and only after that clearing the DeleteFile flag. Longer explanation of findings and why I think this approach is appropriate follows.

There are two race conditions in TempFile that cause us to leak temporary files.

The first race exists between the time we create the file and the time we set the DeleteFile flag in the file disposition info. We set this flag before we write any bytes to the file, so any files leaked this way will have 0 size.

The second race happens when we call TempFile::keep(), which clears the DeleteFile flag, then does a potentially time-consuming rename operation. If we exit before the rename completes, or the rename fails but we exit before setting the DeleteFile flag again, we leak the tempfile. This happens after bytes have been written to the file, so these files have nonzero size.

In both cases, the problem is that there is an operation which we want to be atomic, but there is no way to make it so.

To avoid the open/DeleteFile race, we could specify FILE_DELETE_ON_CLOSE when opening a file. The issue is that this flag cannot be cleared, so we cannot then decide to keep the file instead. One possible way around that is to create a hard link instead of performing a rename - in that case, the tempfile still gets deleted, but the contents are preserved under the target file name. Hard links are documented to only work on NTFS, and I prefer a filesystem-agnostic solution as long as it's one we can live with. My take is that the 0-size files aren't where the biggest part of the problem is. I do have some additional thoughts, but I'd like to focus on the nonzero size files first.

For the DeleteFile/rename race, we could do what we do on non-Windows: rename first, then stop the deletion. This doesn't completely eliminate the race on Windows, but it does narrow the window from a loop that does I/O to just a few memory operations. It also converts the result of losing the race from leaking a tempfile to having the target file disappear. That last bit sounds bad, but I actually think it is a better scenario than we have now. For starters, due to the way rename is implemented, we can already end up in this situation even with the current code. In some cases, this isn't really a big problem at all. For example, in the ThinLTO cache where we're seeing the tempfile leak being a problem, the target file not existing means that it will be rebuilt, so the consequence of losing the race will be that a build takes a bit longer - and only in a situation where we have already paid the price of unnecessarily rebuilding the file (otherwise the target file wouldn't already exist). The other major use case for TempFile is when we attempt to rename a final build product. In that case, losing the race is much less likely, because unlike the ThinLTO cache case, we don't generally have a lot of threads that might cause the process to suddenly exit. We could still lose the raise due to, say, someone pressing Control+C at just the right time - but, again, that's already true with the rename implementation we have.

In summary, if we make the change I'm proposing, we will stop leaking gigabytes of temporary files, in exchange for an small window for missing cache entries we already have a small window of missing, and an expected tiny window for missing final build products that we also already have an expected tiny window for missing.

Quuxplusone commented 6 years ago

Pasting some thoughts, on the run...

The last time we revisited our temp file management strategy on Windows, we considered the case of a thread racily crashing (or exiting) the process during the window between un-marking the file delete on close and the rename succeeding, and declared that that problem was out of scope.

I think we underestimated how long the window is because of the complexity of Windows file renaming and LLVM's complex attempts to move existing files out of the way before doing the rename.

We also considered threaded temp file management to be a bit of an exotic use case, but that's exactly what LLVM ThinLTO does.

Quuxplusone commented 6 years ago

It turns out my idea of moving the rename before the clearing of the DeleteFile flag does not work. The reason is that a file that has the DeleteFile flag set cannot be opened without setting the FILE_SHARE_DELETE mode (attempting to do so results in ERROR_ACCESS_DENIED). This breaks large parts of the LLVM test suite, and presumably also out-of-tree clients.

Quuxplusone commented 3 years ago

Hans, assigning to you since you were recently working on this.

Quuxplusone commented 3 years ago

I'm not planning any work here after https://reviews.llvm.org/D94962