apache / arrow

Apache Arrow is the universal columnar format and multi-language toolbox for fast data interchange and in-memory analytics
https://arrow.apache.org/
Apache License 2.0

[R] "open_dataset(f) %>% filter(id %in% myvec) %>% collect" causes cpp11::unwind_exception, crashes R #33487

Open asfimport opened 1 year ago

asfimport commented 1 year ago

This is running on a Windows environment, arrow 10.0.0 (see arrow_info() output below). The data size is large, which may be relevant.

I issued two calls:


ft <- path_to_dataset1
fa <- path_to_dataset2

#1)

tic()
d2 <- ft %>% open_dataset %>% filter( pis %in% mypis ) %>% collect
toc()
927.11 sec elapsed

# returned a dataset with 44 obs, 38 columns; took an abnormally long time (16 min)

#2)

tic()
d3 <- fa %>% open_dataset %>% filter( pis %in% mypis ) %>% collect
terminate called after throwing an instance of 'cpp11::unwind_exception'

Then I got an error that crashpad_handler.exe had stopped working. R froze, and after a while R crashed as well.

image-2022-11-11-14-59-30-132.png


arrow_info()
Arrow package version: 10.0.0

Capabilities:

  dataset    TRUE
  substrait FALSE
  parquet    TRUE
  json       TRUE
  s3         TRUE
  gcs        TRUE
  utf8proc   TRUE
  re2        TRUE
  snappy     TRUE
  gzip       TRUE
  brotli     TRUE
  zstd       TRUE
  lz4        TRUE
  lz4_frame  TRUE
  lzo       FALSE
  bz2        TRUE
  jemalloc  FALSE
  mimalloc   TRUE

Arrow options():

  arrow.use_threads FALSE

Memory:

  Allocator mimalloc
  Current    0 bytes
  Max        0 bytes

Runtime:

  SIMD Level          avx2
  Detected SIMD Level avx2

Build:

  C++ Library Version  10.0.0
  C++ Compiler            GNU
  C++ Compiler Version 10.3.0
  Git ID               aa7118b6e5f49b354fa8a93d9cf363c9ebe9a3f0


Reporter: Lucas Mation / @lucasmation

Original Issue Attachments:

Note: This issue was originally created as ARROW-18314. Please see the migration documentation for further details.

asfimport commented 1 year ago

Nicola Crane / @thisisnic: Hi @lucasmation, thanks for reporting this. I notice that the code above uses collect(), which pulls the data into memory, so you could be correct that the data size is the issue here. What is the size of each of those datasets, and how much memory do you have on this machine?

What are the values of ft %>% open_dataset() %>% nrow() and ft %>% open_dataset() %>% filter(pis %in% mypis) %>% nrow(), so we can see how much of this data is then being read?
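A sketch of those two diagnostic calls, using the ft path and mypis vector from the report (both are placeholders defined in the reporter's environment):

```r
library(arrow)
library(dplyr)

# Total row count of the dataset; nrow() on an opened dataset can
# usually be answered from file metadata without scanning the data.
total_rows <- ft %>% open_dataset() %>% nrow()

# Row count after the filter; this forces a scan, but only the 'pis'
# column needs to be read to evaluate the predicate.
matching_rows <- ft %>%
  open_dataset() %>%
  filter(pis %in% mypis) %>%
  nrow()

total_rows
matching_rows
```

Comparing the two counts shows how selective the filter is, and therefore how much data collect() would actually have to materialize.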

asfimport commented 1 year ago

Lucas Mation / @lucasmation: @thisisnic, the filtered dataset is tiny: 44 obs and 38 cols. The original dataset is huge: 801 million obs (801,435,094). The server is large, with 512 GB of RAM. There are other users sharing the server, but I haven't seen it error due to maxing out the RAM.

asfimport commented 1 year ago

Nicola Crane / @thisisnic: Hmm, not sure what to suggest here, though I wonder if this has similar causes to ARROW-18313.
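As a possible workaround while the crash is investigated (not confirmed as a fix, and assuming the mypis vector fits comfortably in memory), the %in% filter can sometimes be rewritten as an inner join against a small local tibble, which the arrow query engine evaluates as a hash join rather than a long membership expression:

```r
library(arrow)
library(dplyr)

# Hypothetical rewrite of filter(pis %in% mypis): join the dataset
# against a one-column tibble holding the wanted keys.
keys <- tibble::tibble(pis = mypis)

d3 <- fa %>%
  open_dataset() %>%
  inner_join(keys, by = "pis") %>%
  collect()
```

Joins between an Arrow dataset and a local data frame are supported by the arrow dplyr backend; whether this avoids the cpp11::unwind_exception here is untested.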