Rdatatable / data.table

R's data.table package extends data.frame:
http://r-datatable.com
Mozilla Public License 2.0

Segfault in pushBuffer() with a dataset having many columns and few rows #4257

Open jthiltges opened 4 years ago

jthiltges commented 4 years ago

Using R 3.6.1 and data.table 1.12.8, we see a segfault when fread()ing a dataset of 4000000 columns by 300 rows (3.4GB in size). pushBuffer() reads str[c], which points to unallocated memory.

This seems to be a different issue than #3369, as it's an invalid memory read during the fread(), rather than a write. It also seems unusual that chunk_size is significantly larger than the total_size.
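
For reference, here is how the numbers in question relate, using the values printed at steps [07] and [11] of the verbose output further down. Within rounding, chunk_size is 1000 times the reported mean line length, so it may be derived from line length rather than from file size. That reading is an inference from the log only and has not been checked against fread.c.

```python
# Arithmetic on values copied from the verbose log; not code from data.table.
CHUNK_SIZE = 12_000_001_986  # chunk_size, step [11]
TOTAL_SIZE = 3_600_000_598   # total_size, step [11]
MEAN_LINE = 12_000_001.99    # mean line length as printed (rounded), step [07]

# How many mean-length lines fit in one chunk.
lines_per_chunk = CHUNK_SIZE / MEAN_LINE
# How much larger the chunk is than the whole data section.
chunk_to_total = CHUNK_SIZE / TOTAL_SIZE

print(round(lines_per_chunk, 3))
print(round(chunk_to_total, 2))
```

With only about 300 lines in the file, a chunk sized for roughly 1000 lines would exceed the total, which would account for the single jump (jumps=[0..1)).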

The issue can be reproduced with the following TSV data. The first row is a header; each following row starts with a string field, followed by ints.

H0000000 H0000001 ... H3999999
ABC 10 ... 10
... ... ... ...
ABC 10 ... 10

Reproducible example

library(data.table)
a = fread("example.csv", verbose=TRUE)

Running under gdb

$ R -d gdb
...
(gdb) run
...
R version 3.6.1 (2019-07-05) -- "Action of the Toes"
...
> library(data.table)
data.table 1.12.8 using 4 threads (see ?getDTthreads).  Latest news: r-datatable.com
> a = fread("example.csv", verbose=TRUE)
  omp_get_num_procs()            8
  R_DATATABLE_NUM_PROCS_PERCENT  unset (default 50)
  R_DATATABLE_NUM_THREADS        unset
  omp_get_thread_limit()         2147483647
  omp_get_max_threads()          8
  OMP_THREAD_LIMIT               unset
  OMP_NUM_THREADS                unset
  RestoreAfterFork               true
  data.table is using 4 threads. See ?setDTthreads.
Input contains no \n. Taking this to be a filename to open
[01] Check arguments
  Using 4 threads (omp_get_max_threads()=8, nth=4)
  NAstrings = [<<NA>>]
  None of the NAstrings look like numbers.
  show progress = 1
  0/1 column will be read as integer
[02] Opening the file
  Opening file example.csv
  File opened, size = 3.386GB (3636000601 bytes).
  Memory mapped ok
[03] Detect and skip BOM
[04] Arrange mmap to be \0 terminated
  \n has been found in the input and different lines can end with different line endings (e.g. mixed \n and \r\n in one file). This is common and ideal.
[05] Skipping initial rows if needed
  Positioned on line 1 starting: <<H0000000 H0000001    H0000002    H00>>
[06] Detect separator, quoting rule, and ncolumns
  Detecting sep automatically ...
  sep=0x9  with 100 lines of 4000000 fields using quote rule 0
  Detected 4000000 columns on line 1. This line is either column names or first data row. Line starts as: <<H0000000    H0000001    H0000002    H00>>
  Quote rule picked = 0
  fill=false and the most number of columns found is 4000000
[07] Detect column types, good nrow estimate and whether first row is column names
  Number of sampling jump points = 1 because (3636000599 bytes from row 1 to eof) / (2 * 1224000199 jump0size) == 1
  Type codes (jump 000)    : A5555555555555555555555555555555555555555555555555555555555555555555555555555555...5555555555  Quote rule 0
  Type codes (jump 001)    : A5555555555555555555555555555555555555555555555555555555555555555555555555555555...5555555555  Quote rule 0
  'header' determined to be true due to column 2 containing a string on row 1 and a lower type (int32) in the rest of the 150 sample rows
  =====
  Sampled 150 rows (handled \n inside quoted fields) at 2 jump points
  Bytes from first data row on line 2 to the end of last row: 3600000598
  Line length: mean=12000001.99 sd=-nan min=12000000 max=12000002
  Estimated number of rows: 3600000598 / 12000001.99 = 301
  Initial alloc = 331 rows (301 + 9%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
  =====
[08] Assign column names
[09] Apply user overrides on column types
  After 0 type and 0 drop user overrides : A5555555555555555555555555555555555555555555555555555555555555555555555555555555...5555555555
[10] Allocate memory for the datatable
  Allocating 4000000 column slots (4000000 - 0 dropped) with 331 rows
[11] Read the data
  jumps=[0..1), chunk_size=12000001986, total_size=3600000598

Thread 1 "R" received signal SIGSEGV, Segmentation fault.
0x00002aaac55b0a49 in pushBuffer (ctx=0x7fffffff8060) at freadR.c:536
536 freadR.c: No such file or directory.
(gdb) bt
#0  0x00002aaac55b0a49 in pushBuffer (ctx=0x7fffffff8060) at freadR.c:536
#1  0x00002aaac55ade47 in freadMain._omp_fn.0 () at fread.c:2313
#2  0x00002aaaad580e92 in GOMP_parallel (fn=0x2aaac55ac7e2 <freadMain._omp_fn.0>, data=0x7fffffff8b60, num_threads=1, flags=0) at /home/nwani/m3/conda-bld/compilers_linux-64_1560109574129/work/.build/x86_64-conda_cos6-linux-gnu/src/gcc/libgomp/parallel.c:171
#3  0x00002aaac55ab87c in freadMain (_args=...) at fread.c:1994
...

(gdb) print c
$1 = 0
(gdb) print strLen
$2 = 3
(gdb) print str[c]
Cannot access memory at address 0x2aaa48c2e367
(gdb) print anchor
$3 = 0x2aaac8bb0101 "ABC\t10\t10\t10\t10\t10\t10\t10\t10\t10\t10\t10\t10\t10\t10\t10\t10"...
(gdb) print source->off
$4 = -2146966938
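
The gdb values above are consistent with the string offset overflowing a signed 32-bit integer. The sketch below only checks that arithmetic. It assumes that source->off is a 32-bit signed integer and that str is computed as anchor + source->off; both assumptions are inferred from the printed values, not confirmed against freadR.c. wrap_int32 is a hypothetical helper, not a data.table function.

```python
# Arithmetic check of the gdb output; not code from data.table.
LINE_BYTES = 12_000_002      # max line length from the verbose log
ANCHOR = 0x2AAAC8BB0101      # `anchor` as printed by gdb
FAULT_ADDR = 0x2AAA48C2E367  # address gdb could not access for str[c]
OBSERVED_OFF = -2146966938   # `source->off` as printed by gdb


def wrap_int32(n: int) -> int:
    """Hypothetical helper: reinterpret n as a two's-complement signed 32-bit int."""
    return (n + 2**31) % 2**32 - 2**31


# First data row (0-based) that starts at or beyond 2**31 bytes from the anchor.
first_bad_row = -(-2**31 // LINE_BYTES)  # ceiling division
true_off = first_bad_row * LINE_BYTES    # offset of that row's "ABC" field
wrapped_off = wrap_int32(true_off)

print(first_bad_row)
print(wrapped_off == OBSERVED_OFF)
print(hex(ANCHOR + wrapped_off) == hex(FAULT_ADDR))
```

If those assumptions hold, the read fails at the 180th data row, the first one whose leading string field starts more than 2^31 - 1 bytes past the anchor.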

Python to generate the example CSV

import csv

rows = 300
cols = 4000000

with open('example.csv', 'w', newline='') as csvfile:
    writer = csv.writer(csvfile, delimiter="\t")

    header = []
    for i in range(0, cols):
        header.append('H%07d' % i)

    writer.writerow(header)

    for i in range(0, rows):
        row = [ 'ABC'] + [10] * (cols - 1)
        writer.writerow(row)

Output of sessionInfo()

R version 3.6.1 (2019-07-05)
Platform: x86_64-conda_cos6-linux-gnu (64-bit)
Running under: Scientific Linux release 6.10 (Carbon)

Matrix products: default
BLAS/LAPACK: /util/opt/anaconda/deployed-conda-envs/packages/r/envs/r-3.6.1/lib/libopenblasp-r0.3.7.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] data.table_1.12.8

loaded via a namespace (and not attached):
[1] compiler_3.6.1
MichaelChirico commented 4 years ago

Confirming segfault on macOS

```
R version 3.6.0 (2019-04-26)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Mojave 10.14.6

Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib

locale:
[1] C/UTF-8/C/C/C/C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

loaded via a namespace (and not attached):
[1] compiler_3.6.0
```

snowpong commented 3 years ago

I see almost exactly the same error. I am reading a wide .csv file of 3.2GB (around 450000 columns, a few hundred rows). The R session aborts with a fatal error, and no error message of any kind is printed.

Attaching to R with WinDbg I get the following error:

(1004.1168): Access violation - code c0000005 (first chance)
First chance exceptions are reported before any exception handling.
This exception may be expected and handled.
datatable!pushBuffer+0x3d0:
00000000`6932aa20 803800          cmp     byte ptr [rax],0 ds:00000000`00d10133=??

and the following stack trace:

datatable!pushBuffer+0x3d0
datatable!wallclock+0x196d
datatable!dim+0x8512
datatable!freadMain+0x48ec
datatable!freadR+0x728
R!Rf_NewFrameConfirm+0x77c7
R!Rf_NewFrameConfirm+0x8694
R!R_initAssignSymbols+0x7969
R!Rf_eval+0x331
R!R_cmpfun1+0x508
R!Rf_applyClosure+0x16f
R!Rf_eval+0x2f2
R!Rf_ReplIteration+0x26c
R!Rf_ReplIteration+0x5c2
R!run_Rmainloop+0x52
Rterm+0x171c
Rterm+0x155a
Rterm+0x13e8
Rterm+0x151b
KERNEL32!BaseThreadInitThunk+0x14

My sessionInfo() is:

> sessionInfo()
R version 3.6.3 (2020-02-29)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 18363)

Matrix products: default

locale:
[1] LC_COLLATE=Norwegian Bokmål_Norway.1252
[2] LC_CTYPE=Norwegian Bokmål_Norway.1252
[3] LC_MONETARY=Norwegian Bokmål_Norway.1252
[4] LC_NUMERIC=C
[5] LC_TIME=Norwegian Bokmål_Norway.1252

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] data.table_1.13.2

loaded via a namespace (and not attached):
[1] compiler_3.6.3 tools_3.6.3

Update: Tried reading the same .csv file with vroom - worked fine.

tpbilton commented 3 years ago

I am having similar issues to @jthiltges with a wide dataset: the first column is character and the rest are numeric (246834 columns in total), with 1353 rows (plus the header).

Here is a reproducible example:

## generate data (not the most efficient way, but it works)
simv = matrix(rnorm(333965049), nrow=1353)  # 1353 rows x 246833 numeric columns
id = sort(paste0("A", sample(10000:90000, size=1353)))  # unique character IDs
mat = data.frame(id, simv)
colnames(mat) = c("ID", paste0("Col", 2:ncol(mat)))
write.csv(mat, file="testdata_fread.csv", row.names=FALSE, quote=FALSE)

Now, when I try to read it in using fread:

> dat = as.data.frame(fread("testdata_fread.csv", header=T, nThread=1, verbose=T))
Input contains no \n. Taking this to be a filename to open
[01] Check arguments
  Using 1 threads (omp_get_max_threads()=4, nth=1)
  NAstrings = [<<NA>>]
  None of the NAstrings look like numbers.
  show progress = 1
  0/1 column will be read as integer
[02] Opening the file
  Opening file /home/biltont/test/testdata_fread.csv
  File opened, size = 5.651GB (6067197263 bytes).
  Memory mapped ok
[03] Detect and skip BOM
[04] Arrange mmap to be \0 terminated
  \n has been found in the input and different lines can end with different line endings (e.g. mixed \n and \r\n in one file). This is common and ideal.
[05] Skipping initial rows if needed
  Positioned on line 1 starting: <<id,Tag2,Tag3,Tag4,Tag5,Tag6,Ta>>
[06] Detect separator, quoting rule, and ncolumns
  Detecting sep automatically ...
  sep=','  with 100 lines of 246834 fields using quote rule 0
  Detected 246834 columns on line 1. This line is either column names or first data row. Line starts as: <<id,Tag2,Tag3,Tag4,Tag5,Tag6,Ta>>
  Quote rule picked = 0
  fill=false and the most number of columns found is 246834
[07] Detect column types, good nrow estimate and whether first row is column names
  'header' changed by user from 'auto' to true
  Number of sampling jump points = 1 because nrow limit (600) supplied
  Type codes (jump 000)    : C7777777777777777777777777777777777777777777777777777777777777777777777777777777...7777777777  Quote rule 0
  All rows were sampled since file is small so we know nrow=100 exactly
[08] Assign column names
[09] Apply user overrides on column types
  After 0 type and 0 drop user overrides : C7777777777777777777777777777777777777777777777777777777777777777777777777777777...7777777777
[10] Allocate memory for the datatable
  Allocating 246834 column slots (246834 - 0 dropped) with 100 rows
[11] Read the data
  jumps=[0..1), chunk_size=1048576, total_size=6064840029
  Too few rows allocated. Allocating additional 1024 rows (now nrows=600) and continue reading from jump 0
  jumps=[0..1), chunk_size=1048576, total_size=6064840029

 *** caught segfault ***
address 0x7f0a7af7d595, cause 'memory not mapped'

Interestingly, it works if I read in a subset of the rows that is not too large,

dat = as.data.frame(fread("testdata_fread.csv", header=T, nThread=1, verbose=T, nrows=400))

but it seems that as soon as the dataset becomes too large, I get the segfault.
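
A rough estimate of where "too large" might begin. It assumes (not confirmed) that this crash has the same cause as the original report suggests, i.e. a string offset that no longer fits in a signed 32-bit integer. It uses total_size from the verbose output and the 1353 rows mentioned above; first_row_past_int32 is a hypothetical helper.

```python
import math

# Values copied from this comment; the mechanism itself is an assumption.
TOTAL_BYTES = 6_064_840_029  # total_size from the verbose output
N_ROWS = 1353                # data rows in the file
INT32_LIMIT = 2**31          # first offset that does not fit in a signed 32-bit int


def first_row_past_int32(total_bytes: int, n_rows: int) -> int:
    """Hypothetical helper: 0-based index of the first row whose start
    offset reaches INT32_LIMIT, assuming rows of roughly equal length."""
    mean_line = total_bytes / n_rows
    return math.ceil(INT32_LIMIT / mean_line)


threshold = first_row_past_int32(TOTAL_BYTES, N_ROWS)
print(threshold)
```

Under that assumption, nrows=400 stays below the estimated threshold while the full 1353 rows do not, which would match the behaviour described here.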

Session Info (Note: I'm using a conda environment):

> sessionInfo()
R version 4.0.3 (2020-10-10)
Platform: x86_64-conda-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)

Matrix products: default
BLAS/LAPACK: /home/biltont/conda-envs/testfread/lib/libopenblasp-r0.3.12.so

Random number generation:
 RNG:     Mersenne-Twister
 Normal:  Inversion
 Sample:  Rounding

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] data.table_1.13.4

loaded via a namespace (and not attached):
[1] compiler_4.0.3

Does anyone know if there is a bug in fread or if this might be related to the hardware?

aitap commented 4 weeks ago

Looks like it's the same problem as in #5882 and #5311.