HenrikBengtsson / Wishlist-for-R

Features and tweaks to R that I and others would love to see - feel free to add yours!
https://github.com/HenrikBengtsson/Wishlist-for-R/issues
GNU Lesser General Public License v3.0

WISH: Atomic writing to file #20

Open HenrikBengtsson opened 8 years ago

HenrikBengtsson commented 8 years ago

(Adopted from Wiki entry)

Background

When writing to a file, there is always a risk that the process is interrupted, which may leave an incomplete file behind. Depending on the file format, it can be extremely hard, or even impossible, to detect that the file is incomplete. For instance, when writing a data frame with 100,000 rows to a comma-delimited file using write.csv(), the write may, if we're unlucky, be interrupted exactly at the end of a row, e.g. when 98,953 complete rows have been written. If so, data <- read.csv() will happily read the 98,953 rows and there is no way for us to know that the file is incomplete. Even when it is possible to detect incomplete and/or corrupt files, it can be extremely tedious to identify them.
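The problem can be reproduced on a small scale. The sketch below simulates an interrupted write by copying only the first lines of a complete CSV file; the 10-row data frame and the cut-off point are made up for illustration.

```r
# A small, complete CSV file (stand-in for the 100,000-row example)
data <- data.frame(id = 1:10, value = letters[1:10])
full <- tempfile(fileext = ".csv")
write.csv(data, file = full, row.names = FALSE)

# Simulate an interrupt that happens exactly at the end of a row:
# keep the header plus the first 7 complete rows
lines <- readLines(full)
truncated <- tempfile(fileext = ".csv")
writeLines(lines[1:8], truncated)

# read.csv() reads the truncated file without any error or warning
partial <- read.csv(truncated)
nrow(partial)   # fewer rows than nrow(data), with nothing to indicate it
```

Nothing in the truncated file itself reveals that rows are missing; only a comparison against the original data would.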

This is a real problem when generating a large number of files, and especially large files, since the risk of being hit by an interrupt grows with the time it takes to write each file.

Suggestion / Wish

If files are written atomically, that is, either the complete file is there at the end or there is no file at all, then the problem of knowing whether a file is complete or not would not exist. One approach for writing files atomically is to write using a temporary file name and then rename on completion.

Prototype / example

Assume we save the file using saveRDS(x, file="foo.rds", atomic=TRUE). This could in principle be done as:

  1. saveRDS(x, file="foo.rds.tmp")
  2. file.rename("foo.rds.tmp", "foo.rds")

If there is an interrupt, there will be a left-over *.rds.tmp file, but not the final *.rds file. There could be options for automatically cleaning up incomplete files, or renaming the temporary file to, say, *.rds.error if an error was thrown while writing the file.
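The two steps above, together with the clean-up option, could be sketched as a small helper. This is only a sketch: save_rds_atomic() is a hypothetical name, not an existing R function, and it assumes that file.rename() within the same directory behaves atomically on the platform and file system in use.

```r
# Hypothetical helper: write to '<file>.tmp', then rename to '<file>'
save_rds_atomic <- function(object, file, ...) {
  tmp <- paste0(file, ".tmp")
  ok <- FALSE

  # If anything below fails or is interrupted, remove the incomplete
  # temporary file so that it is not mistaken for a finished one
  on.exit(if (!ok && file.exists(tmp)) file.remove(tmp), add = TRUE)

  # Step 1: write everything to the temporary file
  saveRDS(object, file = tmp, ...)

  # Step 2: rename only after the write has completed
  if (!file.rename(tmp, file)) {
    stop("Failed to rename ", tmp, " to ", file)
  }

  ok <- TRUE
  invisible(file)
}

# Usage
x <- list(a = 1:3, b = "hello")
target <- file.path(tempdir(), "foo.rds")
save_rds_atomic(x, file = target)
y <- readRDS(target)
```

The temporary file is placed next to the final file on purpose: a rename across file systems may be carried out as a copy, which would reintroduce the risk of a partial file.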

HenrikBengtsson commented 8 years ago

On 2016-04-01, @gaborcsardi proposed that one could maybe design a specific connection type that does write-to-temporary-file-name-and-rename-when-done for us, e.g.

saveRDS(x, file=atomic("foo.rds"))
write.csv(data, file=atomic("data.csv"))
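No atomic() connection exists in R today, so the calls above are a wish. A similar effect can be approximated without a new connection type by passing the writing step as a function. In the sketch below, write_atomic() and its 'writer' argument are hypothetical names introduced for illustration only.

```r
# Hypothetical wrapper: 'writer' is a function that takes a path and
# writes the complete output to it
write_atomic <- function(file, writer) {
  # Temporary file in the same directory as the final file
  tmp <- tempfile(pattern = paste0(basename(file), "."),
                  tmpdir = dirname(file), fileext = ".tmp")

  # After a successful rename 'tmp' no longer exists, so this only
  # removes leftovers from a failed or interrupted write
  on.exit(if (file.exists(tmp)) file.remove(tmp), add = TRUE)

  writer(tmp)

  if (!file.rename(tmp, file)) {
    stop("Failed to rename ", tmp, " to ", file)
  }
  invisible(file)
}

# Usage with the two writers from the comment above
x <- 1:5
data <- data.frame(id = 1:3, value = c("a", "b", "c"))
rds_file <- file.path(tempdir(), "foo.rds")
csv_file <- file.path(tempdir(), "data.csv")

write_atomic(rds_file, function(path) saveRDS(x, file = path))
write_atomic(csv_file, function(path) write.csv(data, file = path, row.names = FALSE))
```

A dedicated connection type would be more convenient, since existing calls would only need their file argument changed, but the wrapper works with any function that writes to a file path.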