Open asfimport opened 1 year ago
Nicola Crane / @thisisnic:
Hi @lucasmation, thanks for reporting this. I notice that the code above uses `collect()` - this pulls the data into memory, so you could be correct that the data size is the issue here. What is the size of each of those datasets, and how much memory do you have on this machine?
What are the values of `ft %>% open_dataset() %>% nrow()` and `ft %>% open_dataset() %>% filter(pis %in% mypis) %>% nrow()`, so we can see how much of this data is actually being read?
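As a sketch, the two row counts being asked for could be checked like this (the dataset path in `ft` and the contents of `mypis` are hypothetical placeholders; the pipeline itself is taken from the thread):

```r
library(arrow)
library(dplyr)

# Hypothetical path to the Parquet dataset directory.
ft <- "path/to/ft_dataset"

# Hypothetical vector of IDs to filter on.
mypis <- c("00000000001", "00000000002")

# Total row count - evaluated by Arrow without materialising the data in R.
ft %>% open_dataset() %>% nrow()

# Row count after the filter - indicates how much data a collect() would pull in.
ft %>% open_dataset() %>% filter(pis %in% mypis) %>% nrow()
```

Comparing the two counts shows whether the filter is pushed down effectively before any `collect()` brings rows into R's memory.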
Lucas Mation / @lucasmation: @thisisnic, the filtered dataset is tiny: 44 rows and 38 columns. The original dataset is huge: 801 million rows (801,435,094). The server is large, with 512 GB of RAM. There are other users sharing the server, but I haven't seen it error out due to maxing out the RAM.
Nicola Crane / @thisisnic: Hmm, not sure what to suggest here, though I wonder if this has similar causes to ARROW-18313.
This is running on a Windows environment with arrow 10.0.0 (see `arrow_info()` output below). The data size is large, which may be the issue.
I issued two calls.
Then I got an error saying that crashpad_handler.exe had stopped working. R became frozen and, after a while, R crashed too.
```
arrow_info()
Arrow package version: 10.0.0

Capabilities:

dataset    TRUE
substrait FALSE
parquet    TRUE
json       TRUE
s3         TRUE
gcs        TRUE
utf8proc   TRUE
re2        TRUE
snappy     TRUE
gzip       TRUE
brotli     TRUE
zstd       TRUE
lz4        TRUE
lz4_frame  TRUE
lzo       FALSE
bz2        TRUE
jemalloc  FALSE
mimalloc   TRUE

Arrow options():
arrow.use_threads FALSE

Memory:
Allocator mimalloc
Current    0 bytes
Max        0 bytes

Runtime:
SIMD Level          avx2
Detected SIMD Level avx2

Build:
C++ Library Version                                    10.0.0
C++ Compiler                                              GNU
C++ Compiler Version                                   10.3.0
Git ID               aa7118b6e5f49b354fa8a93d9cf363c9ebe9a3f0
```
Reporter: Lucas Mation / @lucasmation
Original Issue Attachments:
Note: This issue was originally created as ARROW-18314. Please see the migration documentation for further details.