etiennebacher / tidypolars

Get the power of polars with the syntax of the tidyverse
https://tidypolars.etiennebacher.com

parallel processing with furrr breaks on dplyr verbs #91

Closed apsteinmetz closed 4 months ago

apsteinmetz commented 4 months ago
library(tidyverse)
library(furrr)
#> Loading required package: future
library(tidypolars)
plan(multisession)

do_stuff <- function(n){
   cars <- cbind(model = row.names(mtcars),mtcars) |> as_polars_df()
   cars |>
      group_by(cyl) |>
      summarize(mean_hp = mean(hp)) |>
      arrange(cyl)

}

n <- 100
future_map(1:n, do_stuff) |>
      bind_rows_polars()
#> Error:
#> ℹ In index: 1.
#> Caused by error in `UseMethod()`:
#> ! no applicable method for 'group_by' applied to an object of class "RPolarsDataFrame"

If I omit group_by(), it chokes on summarize() instead.

If I change the plan to "sequential" with future::plan(), which in effect runs everything in the main R session, there is no error.

plan(sequential)

future_map(1:n, do_stuff) |>
      bind_rows_polars()
#> shape: (300, 2)
#> ┌─────┬────────────┐
#> │ cyl ┆ mean_hp    │
#> │ --- ┆ ---        │
#> │ f64 ┆ f64        │
#> ╞═════╪════════════╡
#> │ 4.0 ┆ 82.636364  │
#> │ 6.0 ┆ 122.285714 │
#> │ 8.0 ┆ 209.214286 │
#> │ 4.0 ┆ 82.636364  │
#> │ 6.0 ┆ 122.285714 │
#> │ …   ┆ …          │
#> │ 6.0 ┆ 122.285714 │
#> │ 8.0 ┆ 209.214286 │
#> │ 4.0 ┆ 82.636364  │
#> │ 6.0 ┆ 122.285714 │
#> │ 8.0 ┆ 209.214286 │
#> └─────┴────────────┘
sessionInfo()
#> R version 4.3.0 (2023-04-21 ucrt)
#> Platform: x86_64-w64-mingw32/x64 (64-bit)
#> Running under: Windows 11 x64 (build 22631)
#> 
#> Matrix products: default
#> 
#> 
#> locale:
#> [1] LC_COLLATE=English_United States.utf8 
#> [2] LC_CTYPE=English_United States.utf8   
#> [3] LC_MONETARY=English_United States.utf8
#> [4] LC_NUMERIC=C                          
#> [5] LC_TIME=English_United States.utf8    
#> 
#> time zone: America/New_York
#> tzcode source: internal
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#>  [1] tidypolars_0.5.0.9000 furrr_0.3.1           future_1.33.1        
#>  [4] lubridate_1.9.3       forcats_1.0.0         stringr_1.5.1        
#>  [7] dplyr_1.1.4           purrr_1.0.2           readr_2.1.5          
#> [10] tidyr_1.3.0           tibble_3.2.1          ggplot2_3.4.4        
#> [13] tidyverse_2.0.0      
#> 
#> loaded via a namespace (and not attached):
#>  [1] styler_1.10.2     utf8_1.2.4        generics_0.1.3    stringi_1.8.3    
#>  [5] listenv_0.9.0     hms_1.1.3         digest_0.6.34     magrittr_2.0.3   
#>  [9] evaluate_0.23     grid_4.3.0        timechange_0.2.0  fastmap_1.1.1    
#> [13] R.oo_1.25.0       R.cache_0.16.0    R.utils_2.12.3    fansi_1.0.6      
#> [17] scales_1.3.0      codetools_0.2-19  cli_3.6.2         rlang_1.1.3      
#> [21] R.methodsS3_1.8.2 parallelly_1.36.0 munsell_0.5.0     reprex_2.1.0     
#> [25] withr_3.0.0       yaml_2.3.8        tools_4.3.0       parallel_4.3.0   
#> [29] tzdb_0.4.0        colorspace_2.1-0  globals_0.16.2    vctrs_0.6.5      
#> [33] R6_2.5.1          lifecycle_1.0.4   fs_1.6.3          pkgconfig_2.0.3  
#> [37] pillar_1.9.0      gtable_0.3.4      glue_1.7.0        xfun_0.41        
#> [41] tidyselect_1.2.0  rstudioapi_0.15.0 knitr_1.45        htmltools_0.5.7  
#> [45] rmarkdown_2.25    compiler_4.3.0    polars_0.14.1

etiennebacher commented 4 months ago

I can reproduce this, thanks for the reprex. I'll look into it.

etiennebacher commented 4 months ago

Looks like this is exactly the issue described in this section of the future docs: https://future.futureverse.org/articles/future-4-issues.html#missing-packages-false-negatives

Using furrr_options() to specify the packages used in the function should work:

future_map(1:n, do_stuff, .options = furrr_options(packages = "tidypolars"))

However, there's another issue in polars that crashes R when handling a list of DataFrames created with future. I reported it here: https://github.com/pola-rs/r-polars/issues/851

There's nothing more I can do in tidypolars to fix this, so I'm closing the issue. Thanks for the report.

apsteinmetz commented 4 months ago

Alas, when I explicitly attach the packages with future_map(1:n, do_stuff, .options = furrr_options(packages = c("tidypolars", "polars"))), things get even more interesting: R crashes with the "R Session Aborted" bomb. But when I omit library(tidyverse), so the only dplyr-ish verbs come from tidypolars, I am back to the ! could not find function "arrange" error. I recognize this is probably a future package issue, but it's interesting.

etiennebacher commented 4 months ago

> when I omit library(tidyverse), so the only dplyr-ish verbs come from tidypolars, I am back to the ! could not find function "arrange" error

This isn't related to future; it also happens with a simpler example:

library(tidypolars)

mtcars |>
  as_polars_df() |> 
  arrange(mpg)
#> Error in arrange(as_polars_df(mtcars), mpg): could not find function "arrange"

tidypolars doesn't reexport tidyverse functions, so you need to load the relevant tidyverse packages (here dplyr, which provides the arrange() generic) to use tidypolars. This is because, when I started tidypolars, I didn't necessarily want to import dplyr and tidyr. I have since changed my mind, so I suppose I should reexport their functions so that users only need to load tidypolars, just like tidytable does, for instance.
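
As a minimal illustration of the point above, attaching dplyr before calling the verb should make the simpler example run. This is only a sketch, and it assumes dplyr is installed alongside tidypolars; the variable name `sorted` is just for illustration:

```r
library(dplyr)       # provides the arrange() generic
library(tidypolars)  # provides the method for polars DataFrames

# Same pipeline as in the failing example, now with the generic available
sorted <- mtcars |>
  as_polars_df() |>
  arrange(mpg)
```

Loading the whole tidyverse, as in the original reprex, has the same effect because it attaches dplyr.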

etiennebacher commented 4 months ago

@apsteinmetz It is actually expected that polars (and therefore tidypolars) does not work with future and plan(multisession). Basically, future creates multiple R sessions to run the computation and then exports the results from each session to the "main" one (from which it was called). However, future cannot export external pointers (see this section of the docs).

Since polars calls Rust code in the background and therefore relies on external pointers, it cannot work with future with this type of plan. It probably shouldn't crash if this is the case, but you can also set options(future.globals.onReference = "error") at the top of your script to abort early when future detects external pointers in its output:

library(tidyverse)
library(furrr)
#> Loading required package: future
library(tidypolars)
plan(multisession)

### Without this, the session would crash
options(future.globals.onReference = "error")

do_stuff <- function(n){
  cars <- cbind(model = row.names(mtcars),mtcars) |> as_polars_df()
  cars |>
    group_by(cyl) |>
    summarize(mean_hp = mean(hp)) |>
    arrange(cyl)

}

n <- 2
future_map(1:n, do_stuff, .options = furrr_options(packages = "tidypolars")) |> 
  bind_rows_polars()
#> Error: Detected a non-exportable reference ('externalptr' of class 'tidypolars') in the value (of class 'list') of the resolved future
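
One possible workaround, which is not proposed in this thread, is to convert the result to a plain R data.frame inside each worker so that no external pointer has to be exported. The sketch below assumes that as.data.frame() has a method for polars DataFrames and that the summarized result is small enough to copy between sessions; the function name `do_stuff_df` is hypothetical.

```r
library(dplyr)
library(furrr)
library(tidypolars)

plan(multisession, workers = 2)

# Hypothetical variant of do_stuff(): the polars work happens in the worker,
# but the value returned to the main session is an ordinary data.frame.
do_stuff_df <- function(n) {
  mtcars |>
    as_polars_df() |>
    group_by(cyl) |>
    summarize(mean_hp = mean(hp)) |>
    arrange(cyl) |>
    as.data.frame()  # assumption: drops the external pointer before export
}

res <- future_map(
  1:2, do_stuff_df,
  .options = furrr_options(packages = c("dplyr", "tidypolars"))
)

# Combine the plain data.frames, then convert back to polars in the main session
combined <- bind_rows(res) |> as_polars_df()

plan(sequential)
```

The trade-off is that every result is copied out of polars and back in, so this probably only makes sense when each worker returns a small, already aggregated table.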

etiennebacher commented 4 months ago

Final update on this: starting with polars 0.15.0, this raises an error instead of crashing, even if the option future.globals.onReference is not set:

library(tidyverse)
library(furrr)
#> Loading required package: future
library(tidypolars)
plan(multisession)

options(polars.do_not_repeat_call = TRUE)

do_stuff <- function(n){
  cars <- cbind(model = row.names(mtcars),mtcars) |> as_polars_df()
  cars |>
    summarize(mean_hp = mean(hp)) 
}

n <- 2
future_map(1:n, do_stuff, .options = furrr_options(packages = "tidypolars")) |> 
  bind_rows_polars()
#> Error: Execution halted with the following contexts
#>    0: In R: in pl$concat()
#>    1: The argument [l] caused an error
#>    2: Possibly because element no. [1] 
#>    3: Expected a value of type [r_polars::lazy::dataframe::RPolarsLazyFrame]
#>    4: Got value [ExternalPtr.set_class(["tidypolars", "RPolarsDataFrame"]]
#>    5: This Polars object is not valid. Execute `rm(<object>)` to remove the object or restart the R session.