Open nalimilan opened 10 years ago
I will look into this one and try to fix it, but it is not as simple as it looks: save.ffdf fails when one of the ff vectors to be stored is already stored in that directory. This happens, for example, when you add a new column to an already saved ffdf and save it again in the same directory. A solution is to a) save the ffdf in a directory "dirname_ffbase_tmp", b) remove directory "dirname" and c) rename "dirname_ffbase_tmp" into "dirname".
However, it is possible that b) fails because not all files in that directory are closed. In that case I think ffbase should warn that the result has been saved in "dirname_ffbase_tmp".
Any suggestions for improvement?
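The a) save, b) remove, c) rename scheme above could be sketched in base R roughly as follows (the helper name and warning text are hypothetical, not the actual ffbase code; the real implementation would also have to update each ff vector's filename):

```r
# Hypothetical sketch of the a) save, b) remove, c) rename scheme.
save_via_tmpdir <- function(save_fun, dirname) {
  tmpdir <- paste0(dirname, "_ffbase_tmp")
  save_fun(tmpdir)                      # a) save the ffdf into a temporary directory
  unlink(dirname, recursive = TRUE)     # b) try to remove the old directory
  if (!file.exists(dirname)) {
    file.rename(tmpdir, dirname)        # c) rename the temporary directory
  } else {
    # b) failed: some files in 'dirname' were apparently still open
    warning("could not remove '", dirname,
            "'; the result has been saved in '", tmpdir, "'")
  }
}
```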
Good catch. How does it work currently when one of the ff vectors is already saved under the same name in the directory? Does ff handle this automatically?
A possibly more robust solution would be to move files instead of the whole directory. I.e. save the ffdf to a temporary directory, then move files one by one to the destination dir, and remove the files that were present there before (and were not replaced during the move). That way, if some files are open and thus cannot be removed, you only issue a warning listing the files, but the saved ffdf is guaranteed to work fine.
This leaves the problem of what happens when trying to overwrite an ff vector that is still open, but since this currently works, it should continue to work (does it?).
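The file-by-file strategy could look roughly like this (a hypothetical helper, assuming the ffdf was already saved into a temporary directory; the real save.ffdf would additionally need to update the ff filenames):

```r
# Hypothetical sketch of the move-files-one-by-one strategy:
# move each file from the temp dir into the destination, then remove
# leftovers, warning instead of failing when a file cannot be removed.
replace_dir_contents <- function(tmpdir, destdir) {
  old_files <- list.files(destdir)
  new_files <- list.files(tmpdir)
  for (f in new_files) {
    dest <- file.path(destdir, f)
    unlink(dest)  # remove any old copy first; rename may not overwrite on Windows
    file.rename(file.path(tmpdir, f), dest)
  }
  # remove files that were present before and were not replaced by the move
  failed <- character(0)
  for (f in setdiff(old_files, new_files)) {
    if (unlink(file.path(destdir, f)) != 0) failed <- c(failed, f)
  }
  if (length(failed) > 0) {
    warning("could not remove open files: ", paste(failed, collapse = ", "))
  }
}
```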
Why did you close this issue? Don't you think we can do something about this?
I made a fix for issue #30 and uploaded a new version to CRAN (ffbase 0.11.1). I think it solves this issue too, but please correct me if I'm wrong.
I just tested using latest Github code and it still fails...
Do you have a small test script so I can (automagically) check this issue? It seems that the file in the directory is still open, but from your script I cannot detect why that is the case.
That's actually very easy:
library(ffbase)
data(iris)
a <- as.ffdf(iris)
b <- as.ffdf(iris)
c <- rbind(a, b)
save.ffdf(c, dir="test/ff/")
c <- rbind(a, b)
save.ffdf(c, dir="test/ff/")
Error in `filename<-.ff`(`*tmp*`, value = "./c$Sepal.Length.ff") :
changing ff filename from '/tmp/RtmpflTF3K/ffdf3e162c215982.ff' to '/home/milan/test/ff/c$Sepal.Length.ff' failed
Hm, weird, this runs on (one of) my machines without problems (note the param overwrite=T)
unlink("test/ff", recursive=T, force=T)
library(ffbase)
data(iris)
a <- as.ffdf(iris)
b <- as.ffdf(iris)
c <- rbind(a, b)
save.ffdf(c, dir="test/ff/", overwrite=T)
c <- rbind(a, b)
save.ffdf(c, dir="test/ff/", overwrite=T)
Even with overwrite = TRUE, it fails here. Apparently, this is because my temp directory and the directory where I try to save to are on different partitions. If I change my working directory to the temp dir, it works.
Hi,
I am new to this, so please excuse if I have not provided enough information. If you tell me what I did wrong, I'll fix it. I think I am having the same problem. I have been using ff and ffbase for about a week. I am really impressed with both.
I have created a large ffdf (500+ million rows) with multiple calls to ffdfappend. My working directory, where it seems to save the associated files, is on a fast but relatively small SSD (240 GB) drive. I want to save the permanent version on my larger, slower, cheaper permanent SATA drive (2 TB). I have done that with save.ffdf(..., overwrite=TRUE) and it works fine. Then I add a variable or two and want to save again with save.ffdf(..., overwrite=TRUE), but it will not let me save. I tried to trick it by deleting the directory on the SATA drive from the operating system, but it will not let me delete it. I tried to remove or "free up" the directory from within R, guessing at delete and also close.ffdf, but neither deletes the directory from within R, and neither allows me to delete the directory from within the operating system.
Can you advise how I can either overwrite an existing ffdf in this situation (where the temporary directory is on one drive and persistent files will be stored on another drive), or suggest a possible workaround (such as a way to delete the directory when it seems to be locked by the operating system)? If you need more information or a scaled-down version of non-working code I would be happy to provide it, although I think it would look much like the example above, except in my case I would create an ffdf (with files being saved to the C: disk by default, the SSD), save.ffdf to the E: disk (SATA), add a variable, try save.ffdf again (to E:, the SATA drive), and fail.
Many thanks in advance.
Don
Hi Don,
Sorry to hear about your problem: I haven't been able to reproduce it (yet), so a reproducible script would be nice. What OS are you using? (I have Ubuntu and Windows 32/64 machines, but no OS X.)
I'm not sure whether it is an ff error or an ffbase error.
Possible workarounds:
- Try saving it in another directory :-), unlink the original directory and rename the new directory...
- Or, use the pack.ffdf and unpack.ffdf functions to store your ffdf data on the slower disk. They will let you restore the files on the faster disk.
- Or, use clone=TRUE in save.ffdf: this will first clone the loaded ffdf and then save it.
Good luck!
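For example, assuming pack.ffdf takes the archive file name first and unpack.ffdf restores the packed objects into the current environment (the archive path here is purely illustrative):

```r
library(ffbase)
data(iris)
a <- as.ffdf(iris)
pack.ffdf("e:/archive/iris.pack", a)  # store a packed copy on the slow SATA disk
rm(a); gc()
unpack.ffdf("e:/archive/iris.pack")   # restores 'a'; its ff files go to the fast temp dir
```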
Hi Edwin,
I misdiagnosed the problem. After careful testing, I see that the problem has nothing to do with the 2 separate drives. The problem - regardless of which drive I use - is that I cannot save.ffdf a second time, AFTER having done a sort. Unlinking and/or cloning does not seem to resolve the problem. I tried to read the save.ffdf documentation carefully but did not see anything in there on this, so I apologize if it is there and I am missing it.
I copy below barebones code that causes the problem on my machine, followed by R output from running the code. I begin with the directory c:\test existing, but with nothing in it. The error message after sorting and then trying to save is:
"Error in `filename<-.ff`(`*tmp*`, value = "./c1$Sepal.Length.ff") :
changing ff filename from 'C:/Users/Don/AppData/Local/Temp/RtmpyUoCbp/ffdf238450333589.ff' "
If you have further advice I'd appreciate it. Thanks.
Don
library(ffbase)
data(iris)
a <- as.ffdf(iris)
b <- as.ffdf(iris)
getOption("fftempdir")
c1 <- rbind(a, b)
save.ffdf(c1, dir="c:/test/ff/", overwrite=T)
save.ffdf(c1, dir="c:/test/ff/", overwrite=T) # saving a second time after doing nothing works
idx <- ffdforder(c1[c("Sepal.Length","Species")])
c1 <- c1[idx, ]
save.ffdf(c1, dir="c:/test/ff/", overwrite=T) # saving fails after doing a sort
save.ffdf(c1, dir="c:/test/ff/", overwrite=T, clone=TRUE) # cloning fails, too
unlink("c:/test/ff", recursive=T, force=T)
save.ffdf(c1, dir="c:/test/ff/", overwrite=T) # still fails
data(iris)
a <- as.ffdf(iris)
b <- as.ffdf(iris)
getOption("fftempdir")
[1] "C:/Users/Don/AppData/Local/Temp/RtmpyUoCbp"
# save on the c: drive
c1 <- rbind(a, b)
save.ffdf(c1, dir="c:/test/ff/", overwrite=T)
save.ffdf(c1, dir="c:/test/ff/", overwrite=T) # saving a second time after doing nothing works
idx <- ffdforder(c1[c("Sepal.Length","Species")])
opening ff c:/test/ff/c1$Species.ff
opening ff c:/test/ff/c1$Sepal.Length.ff
c1 <- c1[idx, ]
opening ff c:/test/ff/c1$Sepal.Width.ff
opening ff c:/test/ff/c1$Petal.Length.ff
opening ff c:/test/ff/c1$Petal.Width.ff
save.ffdf(c1, dir="c:/test/ff/", overwrite=T) # saving fails after doing a sort
Error in `filename<-.ff`(`*tmp*`, value = "./c1$Sepal.Length.ff") :
  changing ff filename from 'C:/Users/Don/AppData/Local/Temp/RtmpyUoCbp/ffdf238450333589.ff' to 'c:/test/ff/c1$Sepal.Length.ff' failed
save.ffdf(c1, dir="c:/test/ff/", overwrite=T, clone=TRUE) # cloning fails, too
Error in `filename<-.ff`(`*tmp*`, value = "./c1$Sepal.Length.ff") :
  changing ff filename from 'C:/Users/Don/AppData/Local/Temp/RtmpyUoCbp/ffdf2384725e665e.ff' to 'c:/test/ff/c1$Sepal.Length.ff' failed
unlink("c:/test/ff", recursive=T, force=T)
save.ffdf(c1, dir="c:/test/ff/", overwrite=T) # still fails
Error in `filename<-.ff`(`*tmp*`, value = "./c1$Sepal.Length.ff") :
  changing ff filename from 'C:/Users/Don/AppData/Local/Temp/RtmpyUoCbp/ffdf238450333589.ff' to 'c:/test/ff/c1$Sepal.Length.ff' failed
Forgot to answer your question: I am using Windows 7 64-bit Professional.
Don
I have the exact same issue, but it is not deterministic. Sometimes, I cannot save, sometimes, I can. I add a new column to an existing ffdf. Maybe this is a timing issue, because I normally get the error when I work for a long time without saving.
I am working on two separate systems, Windows 7 and Windows Server 2008 (both 64 bit). I experience the problem on both sides. Adding clone=TRUE is returning the same error message. I cannot just delete the file my_ffdf$newColumn.ff in my explorer because it is being used by RStudio. But if I close RStudio, the temp file is gone :-)
Using "filename" to change the path of the column isn't working either.
Please help us!!! By the way, you are doing a fantastic job, I don't know how I could handle my data without ffbase :-)
Oh, and one more comment: every time I call save.ffdf, the error message names a different .ff file in the temp directory which it wants to move to my desired location.
If I look into the temp directory in my explorer, I can see that all the columns from my ffdf are saved under a new name each time I execute save.ffdf, but they stay in the temp directory and are not moved.
@donboyd5 Thanks for your reproducible script: I've minimized it to the following:
library(ffbase)
data(iris)
a <- as.ffdf(iris)
save.ffdf(a, dir="c:/test/ff/", overwrite=T)
idx <- ff(1L) # we will just select the first row
a <- a[idx,] # this creates a new ffdf frame in the temp dir, but the original files are still open
save.ffdf(a, dir="c:/test/ff/", overwrite=T) # this fails
It fails because a_old is in the twilight zone of existence (it has not been garbage collected). It can be fixed by adding a gc() before the save.ffdf. @Stageexp: this might also explain why it is not deterministic: it depends on when R decides to call gc() internally.
library(ffbase)
data(iris)
a <- as.ffdf(iris)
save.ffdf(a, dir="c:/test/ff/", overwrite=T)
idx <- ff(1L) # we will just select the first row
a <- a[idx,] # this creates a new ffdf frame in the temp dir, but the original files are still open
gc() # garbage collect memory and close old a files.
save.ffdf(a, dir="c:/test/ff/", overwrite=T) # this works!
I will add a gc() to save.ffdf so this hack won't be necessary in future scripts.
Hope this fix will solve your problems!
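Until that version is on CRAN, a small wrapper can apply the hack automatically (save.ffdf2 is a hypothetical name, not part of ffbase):

```r
# Hypothetical wrapper: garbage collect before saving, so that ff files
# belonging to replaced, no-longer-reachable ffdf objects get closed first.
save.ffdf2 <- function(..., dir, overwrite = FALSE) {
  gc()  # closes ff files of ffdf objects that are no longer reachable
  ffbase::save.ffdf(..., dir = dir, overwrite = overwrite)
}
```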
Hi, I have two separate ffdfs in my workspace at the moment, and had the error message for both. After calling gc(), I could save my younger ffdf, but still not the older one. But I guess that problem will be gone once I am using the new version!
Edwin,
1) Your script is better than mine. I thought I had minimized it; obviously not. If I have another issue I'll try to pare it down more. 2) Thanks, I never would have figured that out. Having a simple solution like this will save me a lot of work.
Don
Interesting, but the test case I posted above still fails even when calling gc()... :-/
@nalimilan Sorry to hear that: I'm currently out of clues on the cause of this error and cannot reproduce it...
@edwindj Have you tried when saving across different filesystems?
Edwin,
I found a reproducible issue for sorting. Using your toy example:
library(ffbase)
data(iris)
a <- as.ffdf(iris)
save.ffdf(a, dir="c:/test/ff", overwrite=T)
rm(a)
load.ffdf("c:/test/ff")
idx <- ffdforder(a[c("Sepal.Length","Sepal.Width")])
a <- a[idx,] # this creates a new ffdf frame in the temp dir, but the original files are still open
gc() # garbage collect memory and close old a files.
save.ffdf(a, dir="c:/test/ff", overwrite=T) # this doesn't work :-(
@Stageexp Thanks for finding that!
At the moment I cannot fix it: a <- a[idx] results in a call to ffdfindexget, and this ff function does not close the files and keeps the pointers in memory. So it is an ff error and not an ffbase error.
I'm not sure what the best fix would be.
You can work around the issue in the following way (and I agree that this is just a workaround):
library(ffbase)
data(iris)
a <- as.ffdf(iris)
save.ffdf(a, dir="c:/test/ff", overwrite=T)
rm(a)
load.ffdf("c:/test/ff")
idx <- ffdforder(a[c("Sepal.Length","Sepal.Width")])
b <- a # create b that points to the same files!
a <- a[idx,] # this creates a new ffdf frame in the temp dir, but the original files are still open
close(b) # so close the original files
save.ffdf(a, dir="c:/test/ff", overwrite=T) # this does work.
Should we ask ff developers about this? Jens has been very helpful when I reported problems to him.
Good idea! I'm currently a bit busy, so if you have the opportunity, please do so.
A code example without ffbase:
library(ff)
data(iris)
a <- as.ffdf(iris)
dir.create("test")
pattern(a) <- "test/"
a <- a[ff(1L), ]
unlink("test", recursive=T, force=T) # fails
@edwindj The code above works fine here, even though the ffbase examples fail. Did you forget a step which should make it fail?
If you try to save a new version of an ffdf to a directory where you already saved it, you get a cryptic error.
I think it would be nice to add an overwrite parameter that would allow forcing this behavior. It could even default to TRUE, since save() already works that way. Though there's still the issue that if some variables are missing from the new version of the ffdf, some files will be left in the directory. So maybe the parameter should be called overwrite.dir or erase.dir, so that we can simply remove all the contents of the directory and start with a clean one.