edwindj / ffbase

Basic (statistical) functionality for R package ff
github.com/edwindj/ffbase/wiki

ffdfdply throws error on example code #37

Closed schuemie closed 10 years ago

schuemie commented 10 years ago

When I run the example code provided for ffdfdply, it throws an error:

data(iris)
ffiris <- as.ffdf(iris)

youraggregatorFUN <- function(x){
  dup <- duplicated(x[c("Species", "Petal.Width")])
  o <- order(x$Petal.Width)
  lowest_pw <- x[rev(o),][!dup,]
  highest_pw <- x[o,][!dup,]
  lowest_pw$group <- factor("lowest", levels=c("lowest", "highest"))
  highest_pw$group <- factor("highest", levels=c("lowest", "highest"))
  rbind(lowest_pw, highest_pw)
}
result <- ffdfdply(x = ffiris, split = ffiris$Species,
                   FUN = function(x) youraggregatorFUN(x),
                   BATCHBYTES = 5000, trace = TRUE)

Output:

2014-07-03 04:40:55, calculating split sizes
2014-07-03 04:40:55, building up split locations
2014-07-03 04:40:55, working on split 1/2, extracting data in RAM of 2 split elements, totalling, 0 GB, while max specified data specified using BATCHBYTES is 0 GB
Error in ffindexorder(index, os$b) : 
  cannot allocate memory block of size 67108864 Tb
In addition: Warning message:
In bbatch(length, BATCHBYTES/(recvalbytes + 2 * recindbytes)) :
  NAs introduced by coercion

I have the latest version of R, ff, and ffbase installed:

sessionInfo()
R version 3.1.0 (2014-04-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)

locale:
[1] LC_COLLATE=English_United States.1252 
[2] LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] ffbase_0.11.3 ff_2.2-13     bit_1.1-12   

loaded via a namespace (and not attached):
[1] fastmatch_1.0-4 tools_3.1.0    

A potential cause is the amount of RAM in my system: 163790MB, which is probably more than most people have. On another machine with only 32GB of RAM but otherwise the same configuration, the problem does not occur.

edwindj commented 10 years ago

Thanks for reporting! I will look into it coming week.

Best regards,

Edwin

schuemie commented 10 years ago

Some more testing seems to confirm the problem is the available memory. If I start R with

Rgui --max-mem-size=50M

the example code runs just fine. I also found a similar problem with the ffdfindexget function. Running this (without restricting memory)

  myVec = ff(1:5)
  another = ff(10:14)
  littleFrame = ffdf(myVec, another)
  posVec = ff(c(2, 4), vmode = 'integer')
  ffdfindexget(littleFrame, posVec)

generated the following error:

Error in if (any(B < 1)) stop("B too small") : 
  missing value where TRUE/FALSE needed
In addition: Warning message:
In bbatch(n, as.integer(BATCHBYTES/theobytes)) : NAs introduced by coercion

Again, the problem goes away when I restrict the memory through the command line.

schuemie commented 10 years ago

I managed to trace the problem to the bbatch function in the bit package, which attempts to convert B to an integer:

B <- as.integer(B)

but on my machine B is too big to fit in an integer, because in the function ffindexordersize in ff (called from ffindexorder):

ffindexordersize <- function (length, vmode, BATCHBYTES = getOption("ffmaxbytes")) 
{
    recvalbytes <- .rambytes[vmode]
    recindbytes <- .rambytes["integer"]
    bbatch(length, BATCHBYTES/(recvalbytes + 2 * recindbytes))
}

B is set to BATCHBYTES/(recvalbytes + 2 * recindbytes), and BATCHBYTES defaults to getOption("ffmaxbytes"), which on my machine is 85,873,131,520.
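The arithmetic can be checked in plain R without loading ff or bit. This is a minimal sketch that assumes 4 bytes per value and 4 bytes per index (the same factor of 12 used in the workaround); the variable names are illustrative only and are not taken from either package.

```r
# Value reported by getOption("ffmaxbytes") on the affected machine
ffmaxbytes <- 85873131520

# Assumed record size: recvalbytes + 2 * recindbytes with 4-byte integers
bytes_per_record <- 4 + 2 * 4

# This is the B that ends up being passed to bbatch
B <- ffmaxbytes / bytes_per_record

# B is larger than the biggest value an R integer can hold
exceeds_int <- B > .Machine$integer.max

# as.integer() cannot represent B, so it returns NA (with a coercion warning)
B_int <- suppressWarnings(as.integer(B))

# Comparing NA yields NA, which is what an `if (any(B < 1))` check trips over
cond <- any(B_int < 1)

print(c(B = B, integer_max = .Machine$integer.max))
print(B_int)
print(cond)
```

An NA batch size would account for both symptoms: the "NAs introduced by coercion" warning, and the "missing value where TRUE/FALSE needed" error from the `if (any(B < 1))` check.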

I now run all my code by starting with

options(ffmaxbytes = min(getOption("ffmaxbytes"),.Machine$integer.max * 12))

and that makes the problem go away. It would still be nice to solve the problem in the package, but I guess the right place to fix it would be in the bbatch function, which will work just fine if B is converted to a numeric instead of an integer. However, that's the bit package, not yours.
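As a rough check of why the cap helps, and of what a guard on the batch size could look like, here is a small sketch in base R. `clamp_batch_size` is a hypothetical helper written for illustration; it is not a function in bit or ff, and it clamps the value rather than switching to numeric as suggested above.

```r
# Cap from the workaround, using the ffmaxbytes value seen on the affected machine
ffmaxbytes <- 85873131520
capped <- min(ffmaxbytes, .Machine$integer.max * 12)

# With the cap, the per-record batch size fits in an integer again
B_capped <- as.integer(capped / 12)

# Hypothetical guard (an assumption, not package code): clamp B to the
# integer range before converting, so the conversion cannot produce NA
clamp_batch_size <- function(B) {
  as.integer(min(B, .Machine$integer.max))
}

B_clamped <- clamp_batch_size(ffmaxbytes / 12)

print(B_capped)
print(B_clamped)
```

Both approaches yield a batch size of at most `.Machine$integer.max`. For value types wider than 4 bytes the divisor would be larger than 12, so the same cap should still keep B in range.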

Sorry for bothering you!