edwindj / ffbase

Basic (statistical) functionality for R package ff
github.com/edwindj/ffbase/wiki

Cannot overwrite using save.ffdf() #15

Open nalimilan opened 10 years ago

nalimilan commented 10 years ago

If you try to save a new version of a ffdf to a directory where you already saved it, you get a cryptic error:

save.ffdf(df, dir="df")
Error in `filename<-.ff`(`*tmp*`, value = "./df$var.ff") : 
  changing ff filename from '/tmp/RtmpfGCMu8/ffdf6f52333067a6.ff' to '/home/milan/df/df$var.ff' failed

I think it would be nice to add an overwrite parameter that would allow forcing this behavior. It could even default to TRUE, since save() already works that way.

Though there's still the issue that if some variables are missing for the new version of the ffdf, some files will be left in the directory. So maybe the parameter should be called overwrite.dir or erase.dir so that we can simply remove all the contents of the directory and start with a clean one.

edwindj commented 10 years ago

I will look into this one and try to fix it, but it is not as simple as it looks:

  1. If the directory does not exist, then there is no problem.
  2. If the directory does exist, it may be that one of the ff vectors to be stored is already stored in that directory. This happens for example when you add a new column to an already saved ffdf and save it again in the same directory.

The solution is to a) save the ffdf in a directory "dirname_ffbase_tmp", b) remove directory "dirname" and c) rename "dirname_ffbase_tmp" to "dirname". However, it is possible that b) fails because not all files in that directory are closed. In that case I think ffbase should warn that the result has been saved in "dirname_ffbase_tmp".

Any suggestions for improvement?
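
The a), b), c) steps above can be sketched in base R as follows. This is only an illustration, not ffbase code: replace_dir and its write_fun argument are hypothetical names, and write_fun stands in for whatever actually writes the ff files.

```r
# Hypothetical helper: replace the contents of `dir` with freshly written files.
# `write_fun` is any function that writes files into the directory it is given.
replace_dir <- function(dir, write_fun) {
  dir <- sub("[/\\\\]+$", "", dir)            # drop trailing path separators
  tmp <- paste0(dir, "_ffbase_tmp")
  if (dir.exists(tmp)) unlink(tmp, recursive = TRUE, force = TRUE)
  dir.create(tmp, recursive = TRUE)

  write_fun(tmp)                              # a) write into the temporary sibling

  if (dir.exists(dir)) {
    unlink(dir, recursive = TRUE, force = TRUE)   # b) remove the old directory
    if (dir.exists(dir)) {
      # Removal failed, e.g. because some files are still open.
      warning("could not remove '", dir, "'; result saved in '", tmp, "'")
      return(invisible(tmp))
    }
  }

  if (!file.rename(tmp, dir)) {               # c) rename the temporary directory
    warning("could not rename '", tmp, "'; result saved there")
    return(invisible(tmp))
  }
  invisible(dir)
}
```

The function returns the directory that actually holds the result, so a caller can tell whether the swap succeeded.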

nalimilan commented 10 years ago

Good catch. How does it work currently when one of the ff vectors is already saved under the same name in the directory? Does ff handle this automatically?

A possibly more robust solution would be to move files instead of the whole directory. I.e. save the ffdf to a temporary directory, then move files one by one to the destination dir, and remove the files that were present there before (and were not replaced during the move). That way, if some files are open and thus cannot be removed, you only issue a warning listing the files, but the saved ffdf is guaranteed to work fine.
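
A minimal sketch of that file-by-file approach, using only base R. The helper name sync_files is made up for illustration and nothing here is part of ff or ffbase; it assumes the new version has already been written to a temporary directory.

```r
# Hypothetical helper: move every file from `tmp_dir` into `dest_dir`, then
# remove files in `dest_dir` that were not replaced by the move.
sync_files <- function(tmp_dir, dest_dir) {
  dir.create(dest_dir, recursive = TRUE, showWarnings = FALSE)
  # all.files = TRUE also picks up hidden files such as .RData
  new_files <- list.files(tmp_dir, all.files = TRUE, no.. = TRUE)
  old_files <- list.files(dest_dir, all.files = TRUE, no.. = TRUE)

  for (f in new_files) {
    from <- file.path(tmp_dir, f)
    to <- file.path(dest_dir, f)
    # Copy then delete, so the move does not depend on rename semantics.
    if (!file.copy(from, to, overwrite = TRUE)) {
      stop("could not write '", to, "'")
    }
    unlink(from)
  }

  # Files that belonged to the previous version only.
  stale <- setdiff(old_files, new_files)
  stale_paths <- file.path(dest_dir, stale)
  unlink(stale_paths)
  left <- stale[file.exists(stale_paths)]
  if (length(left) > 0) {
    warning("could not remove: ", paste(left, collapse = ", "))
  }
  invisible(left)   # names of stale files that are still present
}
```

Leftover files only trigger a warning, because the files of the new version are already in place by then.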

This leaves the problem of what happens when trying to overwrite an ff vector that is still open, but since this works currently it should continue to work (does it?).

nalimilan commented 10 years ago

Why did you close this issue? Don't you think we can do something about this?

edwindj commented 10 years ago

I made a fix for issue #30 and uploaded a new version to CRAN (ffbase 0.11.1). I think it solves this issue too, but please correct me if I'm wrong.

nalimilan commented 10 years ago

I just tested using latest Github code and it still fails...

edwindj commented 10 years ago

Do you have a small testscript so I can (automagically) check this issue? It seems that the file in the directory is still open, but from your script I cannot detect why this is the case.

nalimilan commented 10 years ago

That's actually very easy:

library(ffbase)
data(iris)
a <- as.ffdf(iris)
b <- as.ffdf(iris)
c <- rbind(a, b)
save.ffdf(c, dir="test/ff/")
c <- rbind(a, b)
save.ffdf(c, dir="test/ff/")
Error in `filename<-.ff`(`*tmp*`, value = "./c$Sepal.Length.ff") : 
  changing ff filename from '/tmp/RtmpflTF3K/ffdf3e162c215982.ff' to '/home/milan/test/ff/c$Sepal.Length.ff' failed

edwindj commented 10 years ago

Hm, weird, this runs on (one of) my machines without problems (note the param overwrite=T)

unlink("test/ff", recursive=T, force=T)
library(ffbase)
data(iris)
a <- as.ffdf(iris)
b <- as.ffdf(iris)
c <- rbind(a, b)
save.ffdf(c, dir="test/ff/", overwrite=T)
c <- rbind(a, b)
save.ffdf(c, dir="test/ff/", overwrite=T)

nalimilan commented 10 years ago

Even with overwrite = TRUE, it fails here. Apparently, this is due to the fact that my temp directory and the directory where I try to save to are on different partitions. If I change my working directory to the temp dir, it works.
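
If that diagnosis is right, one possible way around it is to fall back to copy-and-delete when the rename fails, since file.rename() is not guaranteed to work across file systems. A minimal base R sketch (move_file is a hypothetical helper, not an ff or ffbase function):

```r
# Hypothetical helper: move a file, falling back to copy + delete when a
# plain rename is not possible (for example across partitions).
move_file <- function(from, to) {
  # file.rename() warns and returns FALSE when the rename fails.
  ok <- suppressWarnings(file.rename(from, to))
  if (!ok) {
    ok <- file.copy(from, to, overwrite = TRUE)
    if (ok) unlink(from)   # only delete the source once the copy succeeded
  }
  invisible(ok)
}
```

Note that this only helps with closed files; it does nothing about files that are still held open by the R session.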

donboyd5 commented 10 years ago

Hi,

I am new to this, so please excuse if I have not provided enough information. If you tell me what I did wrong, I'll fix it. I think I am having the same problem. I have been using ff and ffbase for about a week. I am really impressed with both.

I have created a large ffdf (500+ million rows) with multiple calls to ffdfappend. My working directory, where it seems to save the associated files, is on a fast but relatively small SSD (240gb) drive. I want to save the permanent version on my larger slower cheaper permanent SATA drive (2tb). I have done that with save.ffdf(...,overwrite=TRUE) and it works fine. Then I add a variable or two, and want to save again with save.ffdf(...,overwrite=TRUE) but it will not let me save. I tried to trick it by deleting the directory on the SATA drive from the operating system but it will not let me delete. I tried to remove or "free up" the directory from within R, guessing at delete and also close.ffdf but neither deletes the directory from within R and neither allows me to delete the directory from within the operating system.

Can you advise how I can either overwrite an existing ffdf in this situation (where temporary directory is on one drive and persistent files will be stored on another drive), or advise on a possible workaround (such as a way to delete the directory when it seems to be locked by the operating system)? (If you need more information or a scaled-down version of nonworking code I would be happy to provide it, although I think it would look much like the example above except in my case I would create an ffdf (with files being saved to C: disk by default - SSD), save.ffdf to E: disk (SATA), add variable, try save.ffdf again (to E: -- SATA), and fail.)

Many thanks in advance.

Don

edwindj commented 10 years ago

Hi Don,

Sorry to hear about your problem: I haven't been able to reproduce your problem (yet), so a reproducible script would be nice. What OS are you using? (I have Ubuntu and Windows 32/64 machines, but no OSX). I'm not sure whether it is a ff error or a ffbase error.

Possible workarounds:

- Try saving it in another directory :-), unlink the original directory and rename the directory...
- Or use the pack.ffdf and unpack.ffdf functions to store your ffdf data on the slower disk. They will let you restore the file on the faster disk.

Good luck!

donboyd5 commented 10 years ago

Hi Edwin,

I misdiagnosed the problem. After careful testing, I see that the problem has nothing to do with the 2 separate drives. The problem - regardless of which drive I use - is that I cannot save.ffdf a second time, AFTER having done a sort. Unlinking and/or cloning does not seem to resolve the problem. I tried to read the save.ffdf documentation carefully but did not see anything in there on this, so I apologize if it is there and I am missing it.

I copy below barebones code that causes the problem on my machine, followed by R output from running the code. I begin with the directory c:\test existing, but with nothing in it. The error message after sorting and then trying to save is:

"Error in `filename<-.ff`(`*tmp*`, value = "./c1$Sepal.Length.ff") : changing ff filename from 'C:/Users/Don/AppData/Local/Temp/RtmpyUoCbp/ffdf238450333589.ff' "

If you have further advice I'd appreciate it. Thanks.

Don

library(ffbase)
data(iris)
a <- as.ffdf(iris)
b <- as.ffdf(iris)
getOption("fftempdir")

# save on the c: drive
c1 <- rbind(a, b)
save.ffdf(c1, dir="c:\\test\\ff\\", overwrite=T)
save.ffdf(c1, dir="c:\\test\\ff\\", overwrite=T) # saving a second time after doing nothing works
idx <- ffdforder(c1[c("Sepal.Length","Species")])
c1 <- c1[idx, ]
save.ffdf(c1, dir="c:\\test\\ff\\", overwrite=T) # saving fails after doing a sort
save.ffdf(c1, dir="c:\\test\\ff\\", overwrite=T, clone=TRUE) # cloning fails, too
unlink("c:\\test\\ff", recursive=T, force=T)
save.ffdf(c1, dir="c:\\test\\ff\\", overwrite=T) # still fails

data(iris)
a <- as.ffdf(iris)
b <- as.ffdf(iris)
getOption("fftempdir")
[1] "C:/Users/Don/AppData/Local/Temp/RtmpyUoCbp"

# save on the c: drive
c1 <- rbind(a, b)
save.ffdf(c1, dir="c:\\test\\ff\\", overwrite=T)
save.ffdf(c1, dir="c:\\test\\ff\\", overwrite=T) # saving a second time after doing nothing works
idx<-ffdforder(c1[c("Sepal.Length","Species")])
opening ff c:/test/ff/c1$Species.ff
opening ff c:/test/ff/c1$Sepal.Length.ff
c1<-c1[idx, ]
opening ff c:/test/ff/c1$Sepal.Width.ff
opening ff c:/test/ff/c1$Petal.Length.ff
opening ff c:/test/ff/c1$Petal.Width.ff
save.ffdf(c1, dir="c:\\test\\ff\\", overwrite=T) # saving fails after doing a sort
Error in `filename<-.ff`(`*tmp*`, value = "./c1$Sepal.Length.ff") : 
  changing ff filename from 'C:/Users/Don/AppData/Local/Temp/RtmpyUoCbp/ffdf238450333589.ff' to 'c:/test/ff/c1$Sepal.Length.ff' failed
save.ffdf(c1, dir="c:\\test\\ff\\", overwrite=T, clone=TRUE) # cloning fails, too
Error in `filename<-.ff`(`*tmp*`, value = "./c1$Sepal.Length.ff") : 
  changing ff filename from 'C:/Users/Don/AppData/Local/Temp/RtmpyUoCbp/ffdf2384725e665e.ff' to 'c:/test/ff/c1$Sepal.Length.ff' failed
unlink("c:\\test\\ff", recursive=T, force=T)
save.ffdf(c1, dir="c:\\test\\ff\\", overwrite=T) # still fails
Error in `filename<-.ff`(`*tmp*`, value = "./c1$Sepal.Length.ff") : 
  changing ff filename from 'C:/Users/Don/AppData/Local/Temp/RtmpyUoCbp/ffdf238450333589.ff' to 'c:/test/ff/c1$Sepal.Length.ff' failed

donboyd5 commented 10 years ago

Forgot to answer your question: I am using Windows 7 64-bit Professional.

Don

Stageexp commented 10 years ago

I have the exact same issue, but it is not deterministic. Sometimes, I cannot save, sometimes, I can. I add a new column to an existing ffdf. Maybe this is a timing issue, because I normally get the error when I work for a long time without saving.

I am working on two separate systems, Windows 7 and Windows Server 2008 (both 64 bit). I experience the problem on both sides. Adding clone=TRUE is returning the same error message. I cannot just delete the file my_ffdf$newColumn.ff in my explorer because it is being used by RStudio. But if I close RStudio, the temp file is gone :-)

Using "filename" to change the path of the column isn't working either.

Please help us!!! By the way, you are doing a fantastic job, I don't know how I could handle my data without ffbase :-)

Stageexp commented 10 years ago

Oh, and one more comment: Every time I call save.ffdf, the error message names a different .ff file in the temp directory which it wants to move to my desired location.

If I look into the temp file in my explorer, I can see that all the columns from my ffdf are being saved under a new name each time I execute save.ffdf, but it is saved in the temp directory and not moved.

edwindj commented 10 years ago

@donboyd5 Thanks for your reproducible script: I've minimized it to the following:

library(ffbase)
data(iris)
a <- as.ffdf(iris)
save.ffdf(a, dir="c:/test/ff/", overwrite=T)
idx <- ff(1L) # we will just select the first row
a <- a[idx,] # this creates a new ffdf frame in the temp dir, but the original files are still open 
save.ffdf(a, dir="c:/test/ff/", overwrite=T) # this fails 

It fails because the original a (call it a_old) is in the twilight zone of existence: nothing refers to it any more, but it has not been garbage collected, so its files are still open. It can be fixed by calling gc() before save.ffdf. @Stageexp: this might also explain why it is not deterministic: it depends on when R decides to call gc() internally.

library(ffbase)
data(iris)
a <- as.ffdf(iris)
save.ffdf(a, dir="c:/test/ff/", overwrite=T)
idx <- ff(1L) # we will just select the first row
a <- a[idx,] # this creates a new ffdf frame in the temp dir, but the original files are still open 
gc() # garbage collect memory and close old a files.
save.ffdf(a, dir="c:/test/ff/", overwrite=T) # this works!

I will add gc() to save.ffdf so this hack won't be necessary in future scripts. Hope this fix will solve your problems!
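
The effect can be illustrated without ff at all. In this sketch a plain environment stands in for the ff object and a finalizer stands in for closing the file; both are assumptions made for illustration, not how ff is actually implemented.

```r
# `state` records whether the stand-in "file" has been closed.
state <- new.env()
assign("closed", FALSE, envir = state)
mark_closed <- function(handle) assign("closed", TRUE, envir = state)

a <- new.env()                  # stand-in for the saved ffdf
reg.finalizer(a, mark_closed)   # runs only when the object is collected

a <- new.env()                  # like a <- a[idx,]: the old object is orphaned
closed_before_gc <- get("closed", envir = state)  # usually still FALSE here

invisible(gc())                 # force a collection, which runs the finalizer
closed_after_gc <- get("closed", envir = state)
```

Between the rebinding of a and the call to gc(), the old object is unreachable but its cleanup has not happened yet, which matches the window in which save.ffdf fails.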

Stageexp commented 10 years ago

Hi, I have two separate ffdfs in my workspace at the moment, and had the error message for both. After calling gc(), I could save my younger ffdf, but still not the older one. But I guess that problem will be gone once I am using the new version!

donboyd5 commented 10 years ago

Edwin,

1) Your script is better than mine. I thought I had minimized it. Obviously not. If I have another issue I'll try to pare it down more.
2) Thanks. I never would have figured that out. Having a simple solution like this will save me a lot of work.

Don

nalimilan commented 10 years ago

Interesting, but the test case I posted above still fails even when calling gc()... :-/

edwindj commented 10 years ago

@nalimilan Sorry to hear that: I'm currently out of clues on the cause of this error and cannot reproduce it...

nalimilan commented 10 years ago

@edwindj Have you tried when saving across different filesystems?

Stageexp commented 10 years ago

Edwin,

I found a reproducible issue for sorting. Using your toy example:

library(ffbase)
data(iris)
a <- as.ffdf(iris)
save.ffdf(a, dir="c:/test/ff", overwrite=T)
rm(a)
load.ffdf("c:/test/ff")
idx <- ffdforder(a[c("Sepal.Length","Sepal.Width")])
a <- a[idx,] # this creates a new ffdf frame in the temp dir, but the original files are still open
gc() # garbage collect memory and close old a files.
save.ffdf(a, dir="c:/test/ff", overwrite=T) # this doesn't work :-(

edwindj commented 10 years ago

@Stageexp Thanks for finding that!

At the moment I cannot fix it: a <- a[idx,] results in a call to ffdfindexget, and this ff function does not close the files and keeps the pointers in memory. So it is an ff error and not an ffbase error. I'm not sure what the best fix would be.

You can work around the issue in the following way (and I agree that this is just a workaround):

library(ffbase)
data(iris)
a <- as.ffdf(iris)
save.ffdf(a, dir="c:/test/ff", overwrite=T)
rm(a)
load.ffdf("c:/test/ff")
idx <- ffdforder(a[c("Sepal.Length","Sepal.Width")])
b <- a # create b that points to the same files!
a <- a[idx,] # this creates a new ffdf frame in the temp dir, but the original files are still open 
close(b) # so close the original files
save.ffdf(a, dir="c:/test/ff", overwrite=T) # this does work.

nalimilan commented 10 years ago

Should we ask ff developers about this? Jens has been very helpful when I reported problems to him.

edwindj commented 10 years ago

Good idea! I'm currently a bit busy, so if you have the opportunity, please do so.

A code example without ffbase

library(ff)
data(iris)
a <- as.ffdf(iris)
dir.create("test")
pattern(a) <- "test/"
a <- a[ff(1L), ]
unlink("test", recursive=T, force=T) # fails

nalimilan commented 10 years ago

@edwindj The code above works fine here, even though the ffbase examples fail. Did you forget a step which should make it fail?