DiskFrame / disk.frame

Fast Disk-Based Parallelized Data Manipulation Framework for Larger-than-RAM Data
https://diskframe.com

Data.table scope issue in functions #342

Open DavideMessinaARS opened 3 years ago

DavideMessinaARS commented 3 years ago

I'm new to disk.frame so maybe I'm misunderstanding how it works with data.table.

I'm running disk.frame version 0.50 and data.table version 1.14.0.

library(disk.frame)
library(data.table)
setup_disk.frame()

test_dt = as.disk.frame(data.table(x = seq_len(10)), outdir = file.path(tempdir(), "test"), overwrite = TRUE)

test_fun <- function(fun_dt) {
  col_vect <- "x"
  print(fun_dt[, max(get(col_vect))])
}

col_vect <- "x"

test_fun(test_dt)
# prints [1]  5 10 (relies on the global col_vect defined above)

rm(col_vect)

test_fun(test_dt)
# fails with: object 'col_vect' not found

The traceback for the error is:

Error in get(col_vect) : object 'col_vect' not found 
13. stop(condition) 
12. signalConditions(obj, exclude = getOption("future.relay.immediate", "immediateCondition"),
      resignal = resignal, ...) 
11. signalConditionsASAP(obj, resignal = FALSE, pos = ii) 
10. resolve.list(y, result = TRUE, stdout = stdout, signal = signal, force = TRUE) 
9. resolve(y, result = TRUE, stdout = stdout, signal = signal, force = TRUE) 
8. value.list(fs) 
7. value(fs) 
6. future_xapply(FUN = FUN, nX = nX, chunk_args = X, args = list(...),
    get_chunk = `[`, expr = expr, envir = envir, future.globals = future.globals,
    future.packages = future.packages, future.scheduling = future.scheduling,
    future.chunk.size = future.chunk.size, future.stdout = future.stdout,  ... 
5. future.apply::future_lapply(get_chunk_ids(df, strip_extension = FALSE), 
    function(chunk_id) {
        chunk = get_chunk(df, chunk_id, keep = keep_for_future)
        data.table::setDT(chunk) ... 
4. `[.disk.frame`(fun_dt, , max(get(col_vect))) 
3. fun_dt[, max(get(col_vect))] 
2. print(fun_dt[, max(get(col_vect))]) 
1. test_fun(test_dt)

xiaodaigh commented 3 years ago

There's an issue with disk.frame where it doesn't work within functions. It has to do with the global scope and non-standard evaluation (NSE). I am designing a revamp of how disk.frame handles NSE, but the caveat is that functions are unlikely to compose well.

So this is a "known" issue.
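
For anyone landing here, a rough base-R sketch of the general mechanism (this is not disk.frame's actual code; `eval_like_worker` and `eval_with_caller` are hypothetical names made up for illustration). The assumption is that the expression is captured unevaluated and later evaluated somewhere that can only see the data and the global environment, so variables local to the calling function are not found:

```r
# Hypothetical stand-in for a chunk-wise evaluator; NOT disk.frame's real code.
# The expression is captured unevaluated and evaluated against the data,
# with only the global environment as a fallback for other names.
eval_like_worker <- function(data, expr) {
  captured <- substitute(expr)
  eval(captured, envir = data, enclos = globalenv())
}

# Same idea, but names missing from the data are looked up in the caller's frame.
eval_with_caller <- function(data, expr) {
  captured <- substitute(expr)
  eval(captured, envir = data, enclos = parent.frame())
}

test_fun_global_only <- function(data) {
  col_vect <- "x"  # local variable, invisible to eval_like_worker
  eval_like_worker(data, max(get(col_vect)))
}

test_fun_caller <- function(data) {
  col_vect <- "x"  # local variable, reachable through parent.frame()
  eval_with_caller(data, max(get(col_vect)))
}

chunk <- list(x = 1:10)

# Fails unless a global col_vect happens to exist; keep the message instead of stopping.
failed <- tryCatch(test_fun_global_only(chunk), error = function(e) conditionMessage(e))
fixed <- test_fun_caller(chunk)

print(failed)
print(fixed)
```

The first call reproduces the "works only when `col_vect` is global" behaviour from the report, and the second shows why looking names up in the caller's environment avoids it. Whether the revamp takes this route is not stated in this thread.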

DavideMessinaARS commented 3 years ago

I found a workaround for the scope issue: assigning the objects to the global environment.

test_fun <- function(fun_dt) {
  col_vect <<- "x"
  print(fun_dt[, max(get(col_vect))])
}

(or by using assign())
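
A minimal sketch of the assign() variant, in case it helps someone. The `on.exit()` cleanup is my own addition and not part of the workaround above. It assumes the `[` call finishes before the function returns, and it will overwrite and then remove any global `col_vect` that already exists:

```r
# Variant of the workaround using assign() instead of `<<-`.
# Assumption: the evaluation of `[` completes before test_fun returns,
# so removing the global in on.exit() is safe.
test_fun <- function(fun_dt) {
  # Publish the value where the NSE machinery can find it.
  assign("col_vect", "x", envir = globalenv())
  # Remove it again when the function exits, even if an error is raised.
  # Caveat: this also deletes a pre-existing global named col_vect.
  on.exit(rm("col_vect", envir = globalenv()), add = TRUE)
  print(fun_dt[, max(get(col_vect))])
}
```

Called as `test_fun(test_dt)` with the disk.frame from the first post, this should behave like the `<<-` version while not leaving `col_vect` behind in the session.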

The problem is that I can't modify the function I'm using, so I'll need to either wait for a fix in disk.frame or write a stopgap solution myself.

In any case, thanks for your help.