edwindj / ffbase

Basic (statistical) functionality for R package ff
github.com/edwindj/ffbase/wiki

ffdfdply returns NA if only one split exists in the data #54

Closed: chrisvwn closed this issue 6 years ago

chrisvwn commented 6 years ago

When I apply a function to an ffdf with ffdfdply and the split parameter has only one level, the split level comes back as NA, and the function appears to receive all of the data as NA. I would expect ffdfdply to work with a single split as well as with multiple splits. For example:

rastVals <- 1:10000
zoneVals <- rep(1, 10000)

vals <- ff::ff(initdata = rastVals, finalizer = "delete", overwrite = TRUE)
zones <- ff::ff(initdata = zoneVals, finalizer = "delete", overwrite = TRUE)
rDT <- ff::ffdf(zones, vals)

result <- ffbase::ffdfdply(x=rDT,
                           split=as.character(zones),
                           trace=TRUE,
                           BATCHBYTES = 80.85*2^20,
                           FUN = function(dta){
                             ## This happens in RAM, on a chunk containing **several** split
                             ## elements, so here we can use data.table, which works
                             ## fine for in-RAM computing
                             dta <- data.table::as.data.table(dta)

                             # calculate aggregations: sum and mean of each column, per zone
                             result <- dta[, as.list(unlist(lapply(.SD, function(x) {
                               list(sum = sum(x, na.rm = TRUE), mean = mean(x, na.rm = TRUE))
                             }))), by = zones]

                             as.data.frame(result)
                           })

This returns the following result:

> result
ffdf (all open) dim=c(1,3), dimorder=c(1,2) row.names=NULL
ffdf virtual mapping
          PhysicalName VirtualVmode PhysicalVmode  AsIs VirtualIsMatrix PhysicalIsMatrix PhysicalElementNo PhysicalFirstCol
zones            zones       double        double FALSE           FALSE            FALSE                 1                1
vals.sum      vals.sum       double        double FALSE           FALSE            FALSE                 2                1
vals.mean    vals.mean       double        double FALSE           FALSE            FALSE                 3                1
          PhysicalLastCol PhysicalIsOpen
zones                   1           TRUE
vals.sum                1           TRUE
vals.mean               1           TRUE
ffdf data
  zones vals.sum vals.mean
1    NA        0       NaN

This is easy to confirm by adding a second split level to the zones. For example, changing zoneVals to zoneVals <- c(rep(1, 5000), rep(2, 5000)) gives the following result:

> result
ffdf (all open) dim=c(2,3), dimorder=c(1,2) row.names=NULL
ffdf virtual mapping
          PhysicalName VirtualVmode PhysicalVmode  AsIs VirtualIsMatrix PhysicalIsMatrix PhysicalElementNo PhysicalFirstCol
zones            zones       double        double FALSE           FALSE            FALSE                 1                1
vals.sum      vals.sum       double        double FALSE           FALSE            FALSE                 2                1
vals.mean    vals.mean       double        double FALSE           FALSE            FALSE                 3                1
          PhysicalLastCol PhysicalIsOpen
zones                   1           TRUE
vals.sum                1           TRUE
vals.mean               1           TRUE
ffdf data
       zones   vals.sum  vals.mean
1        1.0 12502500.0     2500.5
2        2.0 37502500.0     7500.5

Did I miss something in how ffdfdply works? I can of course calculate a single split directly, but when the aggregation is dynamic it makes sense for ffdfdply to handle both single and multiple splits.
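For reference, the aggregates can be cross-checked in plain base R, without ff or ffbase. This is only a sketch for sanity-checking the numbers: zone_stats is a hypothetical helper written for this comparison, not part of any package, and it assumes the data fits in RAM.

```r
# Base R cross-check of the per-zone aggregates (no ff or ffbase involved).
rastVals <- 1:10000

# Hypothetical helper, not part of ffbase: sum and mean of vals for each zone.
zone_stats <- function(vals, zones) {
  data.frame(
    zones     = sort(unique(zones)),  # same ordering tapply uses for its groups
    vals.sum  = as.numeric(tapply(vals, zones, sum, na.rm = TRUE)),
    vals.mean = as.numeric(tapply(vals, zones, mean, na.rm = TRUE))
  )
}

# Single split level: the case that returned NA / 0 / NaN from ffdfdply.
one_zone <- zone_stats(rastVals, rep(1, 10000))

# Two split levels: the case that already worked.
two_zones <- zone_stats(rastVals, c(rep(1, 5000), rep(2, 5000)))

print(one_zone)
print(two_zones)
```

The two-zone output matches the ffdfdply result above (12502500 / 2500.5 and 37502500 / 7500.5). For the single zone, the sum of 1:10000 is 50005000 and the mean is 5000.5, which is what I would expect ffdfdply to return instead of NA, 0 and NaN.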

jwijffels commented 6 years ago

You are completely right. Thanks for reporting. I've updated the package with a small fix so that a single split level is also allowed. ffdfdply is intended for cases with many more split levels, where it brings groups of split levels into RAM, but it now also works correctly when there is only 1 split level. Thanks again.