edwindj / ffbase

Basic (statistical) functionality for R package ff
github.com/edwindj/ffbase/wiki

ffdf takes more RAM than R data.frame #45

Closed kindlychung closed 9 years ago

kindlychung commented 9 years ago

Here is an example:

require(ffbase)
ffiris = as.ffdf(iris)
save.ffdf(ffiris, dir = "~/Desktop/iris", overwrite = TRUE)
for(x in paste("a", 1:4e2, sep = "")) {
    ffiris[[x]] = ff(rep(314, nrow(iris)))
}
save.ffdf(ffiris, dir="~/Desktop/iris", overwrite=TRUE)
object.size(ffiris)
dim(ffiris)
iris2 = as.ram(ffiris)
dim(iris2)
object.size(iris2)

Output in R:

>     ffiris = as.ffdf(iris)
>     save.ffdf(ffiris, dir = "~/Desktop/iris", overwrite = TRUE)
>     for(x in paste("a", 1:4e2, sep = "")) {
+         ffiris[[x]] = ff(rep(314, nrow(iris)))
+     }
>     save.ffdf(ffiris, dir="~/Desktop/iris", overwrite=TRUE)
>     object.size(ffiris)
1241016 bytes
>     dim(ffiris)
[1] 150 405
>     iris2 = as.ram(ffiris)
>     dim(iris2)
[1] 150 405
>     object.size(iris2)
528672 bytes

The ffdf object takes more than twice as much RAM as the R data.frame (1241016 vs. 528672 bytes). Why? Did I do something wrong?
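For reference, here is a rough sketch of where the data.frame figure comes from, using base R only (no ff needed). It rebuilds the same 150 x 405 table in memory and splits `object.size` into the raw numeric payload (404 double columns x 150 rows x 8 bytes = 484800 bytes) and everything else. The variable names are just illustrative, and the exact overhead may differ between R versions and platforms.

```r
# Rebuild the in-memory table from the example above with base R only.
iris2 <- iris
for (x in paste0("a", 1:400)) {
  iris2[[x]] <- rep(314, nrow(iris))
}

# Count the double columns: the 4 numeric iris columns plus the 400 added ones.
n_double <- sum(vapply(iris2, is.double, logical(1)))

# Raw payload of those columns: 8 bytes per double value.
payload <- n_double * nrow(iris2) * 8

# Whatever object.size reports beyond the payload is the factor column,
# vector headers, column names and other attributes.
total <- as.numeric(object.size(iris2))
overhead <- total - payload

cat("double columns:", n_double, "\n")
cat("payload bytes: ", payload, "\n")
cat("total bytes:   ", total, "\n")
cat("overhead bytes:", overhead, "\n")
```

So most of the data.frame's size is the data itself, and only a small remainder is bookkeeping. That is the baseline the ffdf number should be compared against.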

kindlychung commented 9 years ago

object.size is said to force all the on-disk data to be loaded into memory, so I also watched the R process in the task manager (the memory readings are in the comments):

# Starting 48M
require(ffbase)
ffiris = as.ffdf(iris)
# 53M

save.ffdf(ffiris, dir = "~/Desktop/iris", overwrite = TRUE)
for(x in paste("a", 1:4e2, sep = "")) {
    ffiris[[x]] = ff(rep(314, nrow(iris)))
}
# 65M

save.ffdf(ffiris, dir="~/Desktop/iris", overwrite=TRUE)
dim(ffiris)
# 66M

iris2 = iris
for(x in paste("a", 1:4e2, sep = "")) {
    iris2[, x] = rep(314, nrow(iris))
}
dim(iris2)
# 71M

iris3 = as.ram(ffiris)
dim(iris3)
# 77M

edwindj commented 9 years ago

Your example is biased because the iris dataset is very small (150 records), so fixed overhead dominates. As the number of rows increases, the in-memory data.frame takes more memory (see the code below).

However, the memory consumption of the ffdf is still considerable. I'm not sure why this is the case (I will try to find out; note that I'm not the author of ff).

require(ffbase)

# note that iris is only 150 records, so overhead is bigger.
object.size(iris)
ffiris = as.ffdf(iris)
object.size(ffiris)

iris_big <- iris[sample(nrow(iris), 1e4, replace = TRUE), ]
object.size(iris_big)
ffiris_big <- as.ffdf(iris_big)
object.size(ffiris_big)
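One caveat when reading the data.frame side of this comparison: sampling rows with `replace = TRUE` produces duplicated row indices, and base R then makes the row names unique by turning them into character strings (e.g. "12.1"). Those strings count towards `object.size`. The sketch below uses base R only and shows the effect. The fixed seed and the variable names are my own additions, and whether `as.ffdf` carries such row names over is not checked here.

```r
# Sample 1e4 rows with replacement, as in the code above (seed added for repeatability).
set.seed(1)
idx <- sample(nrow(iris), 1e4, replace = TRUE)
iris_big_named <- iris[idx, ]

# Duplicated indices make R store unique character row names.
has_char_names <- is.character(attr(iris_big_named, "row.names"))
size_named <- as.numeric(object.size(iris_big_named))

# Dropping the row names restores the compact integer form.
iris_big <- iris_big_named
rownames(iris_big) <- NULL
size_compact <- as.numeric(object.size(iris_big))

cat("character row names:", has_char_names, "\n")
cat("bytes with names:   ", size_named, "\n")
cat("bytes without names:", size_compact, "\n")
```

Resetting the row names before measuring gives a cleaner estimate of how the data.frame itself scales with the number of rows.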