apache / arrow

Apache Arrow is a multi-language toolbox for accelerated data interchange and in-memory processing
https://arrow.apache.org/
Apache License 2.0

[R] memory allocation crash #34487

Open r2evans opened 1 year ago

r2evans commented 1 year ago

Describe the bug, including details regarding any error messages, version, and platform.

Motivated by https://stackoverflow.com/questions/75657380/readr-vs-data-table-different-results-on-fedora, I downloaded its sample data (https://www.usitc.gov/data/gravity/itpd_e/itpd_e_r02.zip) and read the CSV with various functions. I was able to read the file successfully (albeit slowly for most) using utils::read.csv, readr::read_csv, data.table::fread, and arrow::open_dataset(., format="csv"), but when I tried arrow::read_csv_arrow, my R session crashed:

packageVersion("arrow")
# [1] '10.0.1'
obj3 <- arrow::read_csv_arrow("~/Downloads/ITPD_E_R02.csv")
# D:/a/rtools-packages/rtools-packages/mingw-w64-arrow/src/apache-arrow-10.0.1/cpp/src/arrow/result.cc:28: ValueOrDie called on an error: Out of memory: malloc of size 262144 failed
# Process R:3 exited abnormally with code 9 at Tue Mar  7 08:30:27 2023

(FYI, I do not have a D: drive, so that path must be compiled into the symbols.)

I tried it again on the same computer with the same file in a fresh R process, and got a different error:

obj3 <- arrow::read_csv_arrow("~/Downloads/ITPD_E_R02.csv")
# terminate called after throwing an instance of 'cpp11::unwind_exception'
#   what():  std::exception
# Process R:3 exited abnormally with code 9 at Tue Mar  7 08:39:53 2023

I tried upgrading arrow and it still fails:

packageVersion("arrow")
# [1] '11.0.0.2'
obj3 <- arrow::read_csv_arrow("~/Downloads/ITPD_E_R02.csv")
# terminate called after throwing an instance of 'cpp11::unwind_exception'
#   what():  std::exception
# Process R:3 exited abnormally with code 9 at Tue Mar  7 08:43:51 2023

The CSV file itself is 6.8GB and, once read into R, typically consumes 7GB+ of RAM. My system is Win11 22H2 (OS Build 22621.1265) with 64GB of RAM, running R inside emacs/ess.

For perspective, the data does not appear to contain anything cosmic:

obj3 <- arrow::open_dataset("~/Downloads/ITPD_E_R02.csv", format="csv")
dat <- head(obj3) %>%
  collect()
dat
# # A tibble: 6 × 13
#   export…¹ expor…² expor…³ impor…⁴ impor…⁵ impor…⁶ broad…⁷ indus…⁸ indus…⁹  year
#   <chr>    <chr>   <chr>   <chr>   <chr>   <chr>   <chr>     <int> <chr>   <int>
# 1 SVU      SVU     Soviet… AFG     AFG     Afghan… Agricu…       1 Wheat    1986
# 2 SVU      SVU     Soviet… AFG     AFG     Afghan… Agricu…       1 Wheat    1987
# 3 AUS      AUS     Austra… AFG     AFG     Afghan… Agricu…       1 Wheat    1989
# 4 FIN      FIN     Finland AFG     AFG     Afghan… Agricu…       1 Wheat    1989
# 5 IND      IND     India   AFG     AFG     Afghan… Agricu…       1 Wheat    1990
# 6 BLX      BLX     Belgiu… AFG     AFG     Afghan… Agricu…       1 Wheat    1990
# # … with 3 more variables: trade <dbl>, flag_mirror <int>, flag_zero <chr>, and
# #   abbreviated variable names ¹​exporter_iso3, ²​exporter_dynamic_code,
# #   ³​exporter_name, ⁴​importer_iso3, ⁵​importer_dynamic_code, ⁶​importer_name,
# #   ⁷​broad_sector, ⁸​industry_id, ⁹​industry_descr
# # ℹ Use `colnames()` to see all variable names
dput(dat)
# structure(list(exporter_iso3 = c("SVU", "SVU", "AUS", "FIN", 
# "IND", "BLX"), exporter_dynamic_code = c("SVU", "SVU", "AUS", 
# "FIN", "IND", "BLX"), exporter_name = c("Soviet Union", "Soviet Union", 
# "Australia", "Finland", "India", "Belgium-Luxembourg"), importer_iso3 = c("AFG", 
# "AFG", "AFG", "AFG", "AFG", "AFG"), importer_dynamic_code = c("AFG", 
# "AFG", "AFG", "AFG", "AFG", "AFG"), importer_name = c("Afghanistan", 
# "Afghanistan", "Afghanistan", "Afghanistan", "Afghanistan", "Afghanistan"
# ), broad_sector = c("Agriculture", "Agriculture", "Agriculture", 
# "Agriculture", "Agriculture", "Agriculture"), industry_id = c(1L, 
# 1L, 1L, 1L, 1L, 1L), industry_descr = c("Wheat", "Wheat", "Wheat", 
# "Wheat", "Wheat", "Wheat"), year = c(1986L, 1987L, 1989L, 1989L, 
# 1990L, 1990L), trade = c(14.761, 1.98, 0.191, 0.175, 0.553, 0.36
# ), flag_mirror = c(1L, 1L, 1L, 1L, 1L, 1L), flag_zero = c("p", 
# "p", "p", "p", "p", "p")), class = c("tbl_df", "tbl", "data.frame"
# ), row.names = c(NA, -6L))

(I recognize that data of this size should be (at least) opened lazily using open_dataset or converted to a better storage format; that's not the point of this issue.)
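
For readers who only need the workaround, here is a minimal sketch of the lazy route mentioned above: open the CSV with open_dataset, rewrite it as Parquet, and query the Parquet copy. It does not address the crash itself. The small temporary CSV and the pq_dir directory are stand-ins invented for illustration so the sketch runs anywhere; with the real data, csv_path would point at ITPD_E_R02.csv instead.

```r
library(arrow)
library(dplyr)

# Hypothetical stand-in data: a tiny CSV with three of the real column names.
# Replace csv_path with the path to the real file when trying this for real.
csv_path <- tempfile(fileext = ".csv")
write.csv(
  data.frame(
    exporter_iso3 = c("SVU", "SVU", "AUS"),
    year          = c(1986L, 1987L, 1989L),
    trade         = c(14.761, 1.98, 0.191)
  ),
  csv_path,
  row.names = FALSE
)

# Open lazily: the rows are not materialized in R at this point.
ds <- open_dataset(csv_path, format = "csv")

# Rewrite as Parquet in a scratch directory (pq_dir is an assumed location).
pq_dir <- tempfile("itpd_parquet_")
write_dataset(ds, pq_dir, format = "parquet")

# Query the Parquet copy; only the filtered rows are pulled into R by collect().
pq <- open_dataset(pq_dir)
result <- pq %>%
  filter(year >= 1987L) %>%
  select(exporter_iso3, year, trade) %>%
  collect()

result
```

Only the rows that survive the filter are converted to an R data frame, which is what keeps memory use low for a file of this size.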


Session info:

sessioninfo::session_info()
# ─ Session info ───────────────────────────────────────────────────────────────
#  setting  value
#  version  R version 4.2.2 (2022-10-31 ucrt)
#  os       Windows 10 x64 (build 22621)
#  system   x86_64, mingw32
#  ui       RTerm
#  language (EN)
#  collate  English_United States.utf8
#  ctype    English_United States.utf8
#  tz       America/New_York
#  date     2023-03-07
#  pandoc   2.17.1.1 @ C:/Users/r2/AppData/Local/Pandoc/ (via rmarkdown)
# ─ Packages ───────────────────────────────────────────────────────────────────
#  package     * version date (UTC) lib source
#  cli           3.4.1   2022-09-23 [1] RSPM (R 4.2.0)
#  digest        0.6.31  2022-12-11 [1] RSPM (R 4.2.0)
#  evaluate      0.19    2022-12-13 [2] CRAN (R 4.2.2)
#  fastmap       1.1.0   2021-01-25 [2] CRAN (R 4.2.2)
#  htmltools     0.5.4   2022-12-07 [1] RSPM (R 4.2.0)
#  knitr         1.41    2022-11-18 [1] RSPM (R 4.2.0)
#  r2          * 0.9.15  2022-12-14 [1] local
#  rlang         1.0.6   2022-09-24 [1] RSPM (R 4.2.0)
#  rmarkdown     2.18    2022-11-09 [1] RSPM (R 4.2.0)
#  sessioninfo   1.2.2   2021-12-06 [1] RSPM (R 4.2.0)
#  xfun          0.35    2022-11-16 [1] RSPM (R 4.2.0)
#  [1] C:/Users/r2/AppData/Local/R/win-library/4.2
#  [2] C:/R/R-4.2.2/library
# ──────────────────────────────────────────────────────────────────────────────

Component(s)

R

eitsupi commented 1 year ago

I tried it with R on Ubuntu 22.04, with arrow installed from the RSPM binary, and was able to read the CSV successfully (10GB of RAM used). Is it possible that this bug is related to how arrow is installed, or to the OS?

R version 4.2.2 (2022-10-31) -- "Innocent and Trusting"
Platform: x86_64-pc-linux-gnu (64-bit)

> obj3 <- arrow::read_csv_arrow("ITPD_E_R02.csv", as_data_frame = FALSE)

> obj3
Table
72534869 rows x 13 columns
$exporter_iso3 <string>
$exporter_dynamic_code <string>
$exporter_name <string>
$importer_iso3 <string>
$importer_dynamic_code <string>
$importer_name <string>
$broad_sector <string>
$industry_id <int64>
$industry_descr <string>
$year <int64>
$trade <double>
$flag_mirror <int64>
$flag_zero <string>
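
One way to compare the two installations might be to print the build details on each machine. This is only a suggested diagnostic sketch, not a confirmed cause: it assumes that differences in the build or in the memory allocator backend are worth ruling out.

```r
library(arrow)

# Summary of the installed arrow build: versions, capabilities, memory pool.
info <- arrow_info()
print(info)

# Name of the allocator backing the default memory pool on this build.
pool <- default_memory_pool()
pool$backend_name
```

Posting this output from both the Windows and the Ubuntu machine would make any difference between the two builds easy to spot.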