DiskFrame / disk.frame

Fast Disk-Based Parallelized Data Manipulation Framework for Larger-than-RAM Data
https://diskframe.com

Problem with non-standard evaluation in disk.frame objects using data.table syntax #369

Closed by entjos 2 years ago

entjos commented 2 years ago

Hi there!

I recently ran into an error with non-standard evaluation in a disk.frame call, and I'm not sure whether it is a bug or necessary behavior. I posted the error as a question on StackOverflow last week and will copy my post below:

Problem

I'm currently trying to write a function that filters rows of a disk.frame object using regular expressions. Unfortunately, I ran into some issues with the evaluation of my search string in the filter call. My idea was to pass a regular expression as a string into a function argument (e.g. storm_name) and then use that argument in my filtering call. I used the %like% function included in {data.table} for filtering rows.

My problem is that the storm_name object gets evaluated inside the disk.frame. However, since storm_name exists only in the function environment and not as a column of the disk.frame object, I get the following error:

Error in .checkTypos(e, names_x) : 
  Object 'storm_name' not found amongst name, year, month, day, hour and 8 more

I already tried to evaluate the storm_name object in the parent frame using eval(storm_name, env = parent.env()), but that didn't help either. Interestingly, this problem only occurs with {disk.frame} objects but not with {data.table} objects.

For now I found a solution using {dplyr} instead. However, I would be grateful for any ideas on how this problem could be solved with {data.table}.
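To make the scoping problem described above concrete, here is a minimal base R sketch of how a function that captures an expression with non-standard evaluation can lose sight of a caller's local variables. The helpers filter_rows_broken, filter_rows_fixed, and grep_name, as well as the storms_small data, are hypothetical names invented for illustration; this is not disk.frame's actual implementation, only an assumption about the general kind of mistake that produces an "object not found" error like the one above.

```r
# Hypothetical stand-ins for an NSE-based filter; NOT disk.frame's real code.

# Broken variant: symbols that are not columns are looked up starting from
# the global environment, so variables local to the caller are invisible.
filter_rows_broken <- function(data, expr) {
  captured <- substitute(expr)              # capture the unevaluated expression
  rows <- eval(captured, data, globalenv()) # columns first, then global env
  data[rows, , drop = FALSE]
}

# Fixed variant: symbols that are not columns are looked up in the
# environment the function was called from.
filter_rows_fixed <- function(data, expr) {
  captured <- substitute(expr)
  caller_env <- parent.frame()              # environment of the calling function
  rows <- eval(captured, data, caller_env)
  data[rows, , drop = FALSE]
}

# Small made-up data set (assumption: not the real storms data)
storms_small <- data.frame(
  name = c("Amy", "Bob", "Ana"),
  year = c(1975, 1979, 1979),
  stringsAsFactors = FALSE
)

# Wrapper mirroring the structure of grep_storm_name(): the pattern lives
# only in this function's environment.
grep_name <- function(filter_fun, dfr, storm_name) {
  filter_fun(dfr, grepl(storm_name, name))
}

# The broken variant cannot see storm_name; keep the error message instead.
broken <- tryCatch(
  grep_name(filter_rows_broken, storms_small, "^A"),
  error = function(e) conditionMessage(e)
)

# The fixed variant resolves storm_name in the caller's environment.
fixed <- grep_name(filter_rows_fixed, storms_small, "^A")
```

In this sketch, broken holds an error message about storm_name, while fixed contains only the rows whose name starts with "A". Whether the same cause applies inside {disk.frame} is an assumption; the sketch only shows why a pattern passed as a function argument can be invisible to code that evaluates a captured expression in a different environment.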

Reproducible Example

# Load packages
library(data.table)
library(disk.frame)

# Create data table and diskframe object of storm data
storms_df <- as.disk.frame(storms)
storms_dt <- as.data.table(storms)

# Create search function
grep_storm_name <- function(dfr, storm_name){

  dfr[name %like% storm_name]

}

# Check function with data.table object
grep_storm_name(storms_dt, "^A")

# Check function with diskframe object
grep_storm_name(storms_df, "^A")

Session Info

R version 4.1.0 (2021-05-18)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19043)

Matrix products: default

locale:
[1] LC_COLLATE=English_Sweden.1252  LC_CTYPE=English_Sweden.1252    LC_MONETARY=English_Sweden.1252
[4] LC_NUMERIC=C                    LC_TIME=English_Sweden.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] disk.frame_0.5.0  purrr_0.3.4       dplyr_1.0.7       data.table_1.14.0

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.7            benchmarkmeData_1.0.4 pryr_0.1.4            pillar_1.6.4         
 [5] compiler_4.1.0        iterators_1.0.13      tools_4.1.0           digest_0.6.27        
 [9] bit_4.0.4             jsonlite_1.7.2        lifecycle_1.0.1       tibble_3.1.6         
[13] lattice_0.20-44       pkgconfig_2.0.3       rlang_0.4.12          Matrix_1.3-3         
[17] foreach_1.5.1         rstudioapi_0.13       DBI_1.1.1             parallel_4.1.0       
[21] bigassertr_0.1.4      bigreadr_0.2.4        httr_1.4.2            stringr_1.4.0        
[25] globals_0.14.0        generics_0.1.1        fs_1.5.0              vctrs_0.3.8          
[29] bit64_4.0.5           grid_4.1.0            tidyselect_1.1.1      glue_1.6.0           
[33] listenv_0.8.0         R6_2.5.1              future.apply_1.7.0    parallelly_1.25.0    
[37] fansi_1.0.0           magrittr_2.0.1        codetools_0.2-18      ellipsis_0.3.2       
[41] fst_0.9.4             assertthat_0.2.1      future_1.21.0         benchmarkme_1.0.7    
[45] utf8_1.2.2            stringi_1.7.6         doParallel_1.0.16     crayon_1.4.2

xiaodaigh commented 2 years ago

Thanks for the bug report. I think the issue is with my poor NSE coding. See the extended MWE below, which shows that the value passed into the function is not being detected properly.

library(data.table)
library(disk.frame)

# Create data table and diskframe object of storm data
storms_df <- as.disk.frame(storms)
storms_dt <- as.data.table(storms)

# Create search function
grep_storm_name <- function(dfr, storm_name){

  dfr[name %like% storm_name]

}

# Check function with data.table object
grep_storm_name(storms_dt, "^A")

# Check function with diskframe object
grep_storm_name(storms_df, "^A")

storms_df[name %like% "^A"]

storm_name_outside_function="^A"
storms_df[name %like% storm_name_outside_function]

grep_storm_name(storms_df, "^A")

xiaodaigh commented 2 years ago

Fixed by #370.