apache / arrow

Apache Arrow is the universal columnar format and multi-language toolbox for fast data interchange and in-memory analytics
https://arrow.apache.org/
Apache License 2.0

[R] Arrow filter crashes (R aborted session) #29329

Closed asfimport closed 3 years ago

asfimport commented 3 years ago

Hi,

I get a fatal error with the new version of the arrow R package (5.0.0) that did not occur with the older version (4.0.1). After calling open_dataset(), I filter and collect the data into a data frame, and then RStudio crashes:

library(dplyr)  # provides %>%, filter(), select() and collect()
ds <- arrow::open_dataset(sources = "XXXX", partitioning = c("XX", "YY", "ZZ"))
# list_identifiers is a large object defined elsewhere in the session
df <- ds %>%
  filter(year >= 2014 & year <= 2020 & type %in% c("XX", "YY") & sector == "ABC" & identifier %in% list_identifiers & type == "LE" & val == "M") %>%
  select(period, obs_value) %>%
  collect()

 

If I run the code above without "filter", I do not have any problem. I guess there is something wrong in the filtering expression.

 

Unfortunately, I can share neither the exact code nor a reproducible example. The dataset is very large, and I have not been able to identify the precise source of the error. All I know is that RStudio crashes and that this code worked perfectly with the older version of the package.

Also, please note that I disabled multithreading with:

options(arrow.use_threads = FALSE)

Environment:

RStudio Version

1.4.1103

Session Information

R version 4.0.4 (2021-02-15)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 18363)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252 LC_MONETARY=English_United States.1252 LC_NUMERIC=C
[5] LC_TIME=English_United States.1252

attached base packages:
[1] grid stats graphics grDevices utils datasets methods base

other attached packages:
[1] readxl_1.3.1 RJDBC_0.2-8 rJava_1.0-4 tibbletime_0.1.6 arrow_4.0.0.1
[6] rdbnomics_0.6.4 rstudioapi_0.13 scales_1.1.1 tidyquant_1.0.3 quantmod_0.4.18
[11] TTR_0.24.2 PerformanceAnalytics_2.0.4 xts_0.12.1 zoo_1.8-9 skimr_2.1.3
[16] janitor_2.1.0 DBI_1.1.1 R.utils_2.10.1 R.oo_1.24.0 R.methodsS3_1.8.1
[21] devtools_2.4.2 usethis_2.0.1 R.cache_0.15.0 rmarkdown_2.10 kableExtra_1.3.4
[26] knitr_1.33 plotly_4.9.4.1 RColorBrewer_1.1-2 ggpubr_0.4.0 ggrepel_0.9.1
[31] ggExtra_0.9 haven_2.4.3 sas7bdat_0.5 data.table_1.14.0 lubridate_1.7.10
[36] forcats_0.5.1 stringr_1.4.0 dplyr_1.0.7 purrr_0.3.4 readr_2.0.1
[41] tidyr_1.1.3 tibble_3.1.3 ggplot2_3.3.5 tidyverse_1.3.1

loaded via a namespace (and not attached):
[1] colorspace_2.0-2 ggsignif_0.6.2 ellipsis_0.3.2 rio_0.5.27 rprojroot_2.0.2 snakecase_0.11.0 base64enc_0.1-3 fs_1.5.0
[9] remotes_2.4.0 bit64_4.0.5 fansi_0.5.0 xml2_1.3.2 cachem_1.0.5 pkgload_1.2.1 jsonlite_1.7.2 broom_0.7.9
[17] dbplyr_2.1.1 shiny_1.6.0 compiler_4.0.4 httr_1.4.2 backports_1.2.1 assertthat_0.2.1 fastmap_1.1.0 lazyeval_0.2.2
[25] cli_3.0.1 later_1.2.0 htmltools_0.5.1.1 prettyunits_1.1.1 tools_4.0.4 gtable_0.3.0 glue_1.4.2 Rcpp_1.0.7
[33] carData_3.0-4 cellranger_1.1.0 vctrs_0.3.8 svglite_2.0.0 xfun_0.25 ps_1.6.0 openxlsx_4.2.4 testthat_3.0.4
[41] rvest_1.0.1 mime_0.11 miniUI_0.1.1.1 lifecycle_1.0.0 rstatix_0.7.0 hms_1.1.0 promises_1.2.0.1 curl_4.3.2
[49] memoise_2.0.0 stringi_1.7.3 desc_1.3.0 pkgbuild_1.2.0 zip_2.2.0 repr_1.1.3 rlang_0.4.11 pkgconfig_2.0.3
[57] systemfonts_1.0.2 lattice_0.20-41 evaluate_0.14 htmlwidgets_1.5.3 bit_4.0.4 tidyselect_1.1.1 processx_3.5.2 magrittr_2.0.1
[65] R6_2.5.1 generics_0.1.0 pillar_1.6.2 foreign_0.8-81 withr_2.4.2 abind_1.4-5 modelr_0.1.8 crayon_1.4.1
[73] car_3.0-11 Quandl_2.11.0 utf8_1.2.2 tzdb_0.1.2 callr_3.7.0 reprex_2.0.1 digest_0.6.27 webshot_0.5.2
[81] xtable_1.8-4 httpuv_1.6.1 munsell_0.5.0 viridisLite_0.4.0 quadprog_1.5-8 sessioninfo_1.1.1

System Information

sysname : Windows
release : 10 x64
version : build 18363
machine : x86-64

Platform Information

OS.type : windows
file.sep : /
dynlib.ext : .dll
GUI : RStudio
endian : little
pkgType : win.binary
path.sep : ;
r_arch : x64

R Version

platform : x86_64-w64-mingw32
arch : x86_64
os : mingw32
system : x86_64, mingw32
status :
major : 4
minor : 0.4
year : 2021
month : 02
day : 15
svn rev : 80002
language : R
version.string : R version 4.0.4 (2021-02-15)
nickname : Lost Library Book

Reporter: Pal

Note: This issue was originally created as ARROW-13694. Please see the migration documentation for further details.

asfimport commented 3 years ago

Neal Richardson / @nealrichardson: It will be hard for us to debug this if you can't isolate a reproducible example. The only other thing I can suggest is to try:

set_cpu_count(1)
set_io_thread_count(1)

It seems there are other places that have separate multithreading controls. See if setting one of these makes the crash stop; that might give us a clue.
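
To confirm that these settings took effect before rerunning the query, the thread pool sizes can be read back. This is a minimal sketch, assuming the arrow package is installed; the variable name thread_settings is illustrative and does not come from the thread.

```r
# Sketch: apply the single-thread settings discussed above and read them back.
# thread_settings is a hypothetical name; it stays NULL if arrow is missing.
thread_settings <- NULL
if (requireNamespace("arrow", quietly = TRUE)) {
  options(arrow.use_threads = FALSE)  # the reporter's original setting
  arrow::set_cpu_count(1)             # size of the CPU thread pool
  arrow::set_io_thread_count(1)       # size of the I/O thread pool
  thread_settings <- c(cpu = arrow::cpu_count(), io = arrow::io_thread_count())
  print(thread_settings)
}
```

If both values read back as 1 and the crash persists, threading is a less likely cause.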

asfimport commented 3 years ago

Pal: Many thanks @nealrichardson for your prompt reply. I would be delighted if I could replicate the issue, but unfortunately the dataset I use is very large and I cannot share it.

I tested with your settings, but nothing changes and RStudio keeps crashing. If I keep only some of the filtering expressions, for instance if I only run

df <- ds %>%
  filter(year >= 2014 & year <= 2020 & type %in% c("XX", "YY") & type == "LE" & val == "M") %>%
  select(period, obs_value) %>%
  collect()

this works without error. However, if I add the following to the filter

identifier %in% list_identifiers & sector %in% list_sectors

then it crashes. Please note that list_identifiers is a large list (approx. 168 MB), but list_sectors is not. With version 4.0.1, the code ran smoothly.
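
Because the failure is a hard crash rather than an R error, tryCatch() cannot intercept it. One way to test filter variants without losing the interactive session is to run each variant in a separate R process and look at its exit status. The following is a minimal sketch using base R only; run_isolated() and the stand-in snippets are hypothetical and not part of the original report.

```r
# Sketch: run each candidate snippet in a fresh Rscript process so that a hard
# crash only kills the child process, not the interactive session.
# run_isolated() and the snippets below are hypothetical, for illustration.
run_isolated <- function(code) {
  script <- tempfile(fileext = ".R")
  on.exit(unlink(script))
  writeLines(code, script)
  rscript <- file.path(R.home("bin"), "Rscript")
  # system2() returns the child's exit status; 0 means it finished normally.
  as.integer(system2(rscript, shQuote(script), stdout = FALSE, stderr = FALSE))
}

# Stand-in snippets; in practice each one would open the dataset and apply a
# single filter variant followed by collect().
candidates <- c(
  finishes = "invisible(1 + 1)",
  fails    = "quit(save = \"no\", status = 3)"
)
statuses <- vapply(candidates, run_isolated, integer(1))
print(statuses)
```

A nonzero status for a given variant would point at the filter expression it contains, which narrows down what to put in a reproducible example.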

Is there a way I can get the RStudio log (I have a basic user account)?

asfimport commented 3 years ago

Neal Richardson / @nealrichardson: When does it crash: when you call filter() or when you call collect()? Does it require both identifier %in% and sector %in% to crash (i.e. if you add one or the other it is fine, but it crashes with both)?

Also, it looks like you're running 32-bit R. Is there a reason you can't run the 64-bit R? I would guess that it works on 64-bit.

asfimport commented 3 years ago

Pal: Many thanks @nealrichardson.

Regarding your first question on whether the crash occurs on filter() or collect(), it looks like it happens when collect() is invoked. If I only run

df <- ds %>%
  filter(year >= 2014 & year <= 2020 & type %in% c("XX", "YY") & type == "LE" & val == "M")

I do not have a crash. However, if I add collect(), then the crash occurs.

On your second point, the crash does not require both filtering expressions (identifier %in% or sector %in%). I've tested all combinations, with no change in behaviour.

 

Concerning your third point, I confirm that I am running 64-bit R, as follows:


> Sys.info()[["machine"]]

[1] "x86-64"
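
Sys.info() describes the machine, so a check against the R build itself can serve as a second confirmation. A minimal sketch using base R only; the variable name pointer_bits is illustrative.

```r
# Sketch: check the bitness of the running R build rather than the machine.
# pointer_bits is a hypothetical name: 64 on a 64-bit build, 32 on a 32-bit one.
pointer_bits <- .Machine$sizeof.pointer * 8L
cat(sprintf("R build: %s, %d-bit pointers\n", R.version$arch, pointer_bits))
```

A value of 64 here agrees with the "x86_64-w64-mingw32/x64 (64-bit)" platform line in the session information.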

 

Finally, I've tested the same code and the same dataset on a 64-bit Mac. I have no crash there.

Here is what I found in the Security and Maintenance Control Panel (after R/RStudio crashes):


Faulting Module Name : C:\Users\Public\R\4.0\RPackages\arrow\libs\x64\arrow.dll
Exception Code : 0xc0000005

 

asfimport commented 3 years ago

Weston Pace / @westonpace: Potentially related: ARROW-13761

asfimport commented 3 years ago

Pal: Thanks @westonpace. I can reproduce the issue described on ARROW-13761

That one also crashes on my Mac, whereas the original issue only crashes on Windows (to the best of my knowledge).

asfimport commented 3 years ago

Pal: Can anyone help me debug and solve this, please?

asfimport commented 3 years ago

Pal: The issue seems to be resolved by ARROW-13761. I no longer encounter the problem after installing the latest version of the package (nightly build of Sept. 23rd, 2021). If I go back to the CRAN version, the fatal crash occurs again. I guess the issue can be closed. Many thanks for the work done.