edwindj / ffbase

Basic (statistical) functionality for R package ff
github.com/edwindj/ffbase/wiki

Undesired coercion of vmodes when using ffdfdply or ffdfappend #31

Closed: ghost closed 10 years ago

ghost commented 10 years ago

When using ffdfdply, with certain kinds of FUN functions, a variable that was originally of vmode double will be coerced to vmode integer and turned into a factor whose levels are character strings of the numerical output.

For me, this happens when the input ff is of limited precision (e.g. 3 digits), but the output of the FUN is of higher precision (e.g. 3.1 / 3 = 1.0333333333...).

I believe this is due to the call to ffdfappend; I have reproduced this result using ffdfappend by itself, importing data from a text file into an ffdf and then appending it to an existing ffdf in a for-loop.

Here is a simple example, with different variations:

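library(ff)
library(ffbase)

n <- 10000
# ff cannot store character vectors, so make the grouping column a factor explicitly
fake.data <- data.frame(a=factor(rep(LETTERS,n)))

# put in low precision numerical values
fake.data$b <- sample(1:9,n*26,replace=TRUE) + 
  0.1*sample(1:9,n*26,replace=TRUE)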

fake.data.ff <- as.ffdf(fake.data)

vmode(fake.data.ff)
# a         b 
# "integer"  "double" 

fake.data.fun <- function(x) {
  aggregate(b ~ a, data=x,FUN=mean)
}

# works fine as data.frame
out.df <- fake.data.fun(fake.data)

# make BATCHBYTES small, so it's forced to iterate
out.ff.bad <- ffdfdply(x =fake.data.ff, split=fake.data.ff$a, 
                       FUN= function(x) fake.data.fun(x), BATCHBYTES=256)
vmode(out.ff.bad)
# a         b 
# "integer" "integer"

# when it doesn't iterate, it's fine
out.ff.ok <- ffdfdply(x =fake.data.ff, split=fake.data.ff$a, FUN= function(x) fake.data.fun(x))
vmode(out.ff.ok)    
# a         b 
# "integer"  "double"

I have found a tedious workaround when using ffdfappend: readjusting the significant digits of both the existing ffdf and the appended ffdf using signif.ff. This solution won't work for ffdfdply, though, and the problem may be more general than just an incompatibility of numerical precisions.
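If a column has already been turned into a factor as described above, the numeric values should still be recoverable from the level labels. The following is only a minimal base R sketch on an ordinary in-memory factor with made-up values; it assumes the affected column has first been read into RAM as a regular factor, and it is not specific to ff or ffbase.

```r
# Hypothetical factor of the kind described above: its levels are the
# character representations of the numeric results.
b_factor <- factor(c("5.53", "5.54", "5.49", "5.54"))

# as.numeric() applied directly to a factor returns the internal
# integer codes, not the values stored in the level labels.
codes <- as.numeric(b_factor)

# Convert the level labels to numbers once, then index them by the factor.
b_double <- as.numeric(levels(b_factor))[b_factor]

typeof(b_double)
b_double
```

Converting the levels rather than every element keeps the number of string-to-number conversions equal to the number of distinct values.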

jwijffels commented 10 years ago

Thanks for reporting this. This is bizarre. When I'm running this I get

vmode(out.ff.bad)
# a         b 
# "integer"  "double"

Are you running the latest version of ffbase?

library(devtools)
install_github("edwindj/ffbase", subdir="pkg")

I think this is the same issue as #19, which has been solved.

jwijffels commented 10 years ago

Just double-checked this, and it is indeed issue #19 (the ffdfappend issue), which is already solved. Maybe we should put the new version on CRAN...

In the new version of ffdfappend we get

x <- as.ffdf(data.frame(a = factor("A", levels = LETTERS), b = 5.53))
vmode(x)
# a         b 
# "integer"  "double"
x <- ffdfappend(x, data.frame(a = factor("B", levels = LETTERS), b = 5.54))
vmode(x)
# a         b 
# "integer"  "double"
x <- ffdfappend(x, data.frame(a = factor("C", levels = LETTERS), b = 5.49))
vmode(x)
# a         b 
# "integer"  "double"

But in the old version of ffdfappend we got

x <- as.ffdf(data.frame(a = factor("A", levels = LETTERS), b = 5.53))
vmode(x)
# a         b 
# "integer"  "double"
x <- ffdfappend(x, data.frame(a = factor("B", levels = LETTERS), b = 5.54))
vmode(x)
# a         b 
# "integer"  "double"
x <- ffdfappend(x, data.frame(a = factor("C", levels = LETTERS), b = 5.49))
vmode(x)
# a         b 
# "integer" "integer"

This was basically due to an issue in the ff package.

edwindj commented 10 years ago

@jwijffels Thanks! And we should put a new version in CRAN

edwindj commented 10 years ago

Should be on CRAN this evening...