Rdatatable / data.table

R's data.table package extends data.frame:
http://r-datatable.com
Mozilla Public License 2.0

fread crashes R (bus error) for large gzipped csv file when specifying a large number of nrows (but whole file works) #5882

Open dbgoodman opened 5 months ago

dbgoodman commented 5 months ago

I have a large gzipped CSV file with ~40 million rows. I can read the whole file into R without problems:

filename <- 'myfile.csv'
table_whole <- fread(filename)

I can also load the first 5e6 rows without problems:

filename <- 'myfile.csv'
table_first5e6rows <- fread(filename, nrows=5e6)

But when I ask for 1e7 rows, R crashes:

filename <- 'myfile.csv'
table_first1e7rows <- fread(filename, nrows=1e7)
 *** caught bus error ***
address 0x20000180e, cause 'invalid alignment'
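
For anyone who wants to try the same pattern of calls without the original data, here is a minimal sketch on a small synthetic file. The file contents, the temporary path, and the variable names are made up for illustration. It is not expected to trigger the crash, since that seems to depend on the real file and a large nrows. It assumes the R.utils package is installed, which fread needs to read .gz files.

```r
library(data.table)

# Hypothetical stand-in for the real file: a small gzipped CSV in a temp location.
tmp <- tempfile(fileext = ".csv.gz")
dt <- data.table(id = 1:1000, value = as.numeric(1:1000) / 10)

# With the default compress = "auto", fwrite gzips output when the path ends in .gz.
fwrite(dt, tmp)

# Same two kinds of call as in the report: the whole file, then a row-limited read.
table_whole <- fread(tmp)
table_head  <- fread(tmp, nrows = 100)

print(nrow(table_whole))
print(nrow(table_head))
```

Scaling the synthetic table up, or adding verbose=TRUE to the row-limited call, may help narrow down whether the problem is tied to file size or to this particular file.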

Here is the actual output with the actual file, with verbose on:

> library(data.table)
data.table 1.14.10 using 8 threads (see ?getDTthreads).  Latest news: r-datatable.com
> fread('/Users/dbg/Library/CloudStorage/Box-Box/tcsl/ngs_data/2023.09.12.illumina_28/out/merged_df.csv.gz', nrows=1e7, verbose=T)
  OpenMP version (_OPENMP)       202011
  omp_get_num_procs()            16
  R_DATATABLE_NUM_PROCS_PERCENT  unset (default 50)
  R_DATATABLE_NUM_THREADS        unset
  R_DATATABLE_THROTTLE           unset (default 1024)
  omp_get_thread_limit()         2147483647
  omp_get_max_threads()          16
  OMP_THREAD_LIMIT               unset
  OMP_NUM_THREADS                unset
  RestoreAfterFork               true
  data.table is using 8 threads with throttle==1024. See ?setDTthreads.
Input contains no \n. Taking this to be a filename to open
[01] Check arguments
  Using 8 threads (omp_get_max_threads()=16, nth=8)
  NAstrings = [<<NA>>]
  None of the NAstrings look like numbers.
  show progress = 1
  0/1 column will be read as integer
[02] Opening the file
  Opening file /var/folders/r9/rhdgkfwx0msg9ldkp_6ltypr0000gn/T//RtmpkxLujJ/file1018d1112866
  File opened, size = 10.67GB (11454682052 bytes).
  Memory mapped ok
[03] Detect and skip BOM
[04] Arrange mmap to be \0 terminated
  \n has been found in the input and different lines can end with different line endings (e.g. mixed \n and \r\n in one file). This is common and ideal.
[05] Skipping initial rows if needed
  Positioned on line 1 starting: <<itam_bc_o,itam_umi_o,costim_bc>>
[06] Detect separator, quoting rule, and ncolumns
  Detecting sep automatically ...
  sep=','  with 100 lines of 94 fields using quote rule 0
  Detected 94 columns on line 1. This line is either column names or first data row. Line starts as: <<itam_bc_o,itam_umi_o,costim_bc>>
  Quote rule picked = 0
  fill=false and the most number of columns found is 94
[07] Detect column types, good nrow estimate and whether first row is column names
  Number of sampling jump points = 100 because nrow limit (10000000) supplied
  Type codes (jump 000)    : CCCC755555555555555555555555555555555555555555555555555555555555555555555555555555555555555555  Quote rule 0
  'header' determined to be true due to column 5 containing a string on row 1 and a lower type (float64) in the rest of the 100 sample rows
  All rows were sampled since file is small so we know nrow=100 exactly
[08] Assign column names
[09] Apply user overrides on column types
  After 0 type and 0 drop user overrides : CCCC755555555555555555555555555555555555555555555555555555555555555555555555555555555555555555
[10] Allocate memory for the datatable
  Allocating 94 column slots (94 - 0 dropped) with 100 rows
[11] Read the data
  jumps=[0..1), chunk_size=1048576, total_size=11454676303
  Too few rows allocated. Allocating additional 11999900 rows (now nrows=10000000) and continue reading from jump 0
  jumps=[0..1), chunk_size=1048576, total_size=11454676303

 *** caught bus error ***
address 0x20000180e, cause 'invalid alignment'

Traceback:
 1: fread("/Users/dbg/Library/CloudStorage/Box-Box/tcsl/ngs_data/2023.09.12.illumina_28/out/merged_df.csv.gz",     nrows = 1e+07, verbose = T)

I have other similarly sized gzipped CSVs (including some that are 20% larger) that work fine, so the crash seems to depend on the combination of this particular file and a large value of nrows.

Here is a link to the file on Box (700 MB).

Here is my session info:

> sessionInfo()
R version 4.3.2 (2023-10-31)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS Sonoma 14.1

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRblas.0.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.11.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: America/Los_Angeles
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
[1] compiler_4.3.2

tdhock commented 5 months ago

Did you try the development version of data.table? You can install it with:

data.table::update_dev_pkg()
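
After updating and restarting R, it may be worth confirming that a newer build is actually loaded before retrying the failing call. A minimal sketch, assuming the hypothetical helper name is_newer_than_report (not part of data.table) and using 1.14.10 from the verbose output in the report as the baseline:

```r
# Hypothetical helper: TRUE if `installed` is a newer version than the one in the report.
# By default it looks up the data.table version installed in the current library.
is_newer_than_report <- function(installed = utils::packageVersion("data.table"),
                                 reported = "1.14.10") {
  as.package_version(installed) > as.package_version(reported)
}

# Explicit versions can be passed to see how the comparison behaves.
print(is_newer_than_report("1.15.0"))
print(is_newer_than_report("1.14.10"))
```

If is_newer_than_report() with no arguments returns FALSE after the update, the old build is probably still the one being loaded, and the fread call with nrows=1e7 would be testing the same code as before.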