Open jthiltges opened 4 years ago
Confirming segfault on macOS
I see almost exactly the same error. I am reading a wide 3.2GB .csv file (around 450000 columns, a few hundred rows). The R session aborts with a fatal error, without printing any error message.
Attaching to R with WinDbg, I get the following error:
(1004.1168): Access violation - code c0000005 (first chance)
First chance exceptions are reported before any exception handling.
This exception may be expected and handled.
datatable!pushBuffer+0x3d0:
00000000`6932aa20 803800 cmp byte ptr [rax],0 ds:00000000`00d10133=??
and the following stack trace:
datatable!pushBuffer+0x3d0
datatable!wallclock+0x196d
datatable!dim+0x8512
datatable!freadMain+0x48ec
datatable!freadR+0x728
R!Rf_NewFrameConfirm+0x77c7
R!Rf_NewFrameConfirm+0x8694
R!R_initAssignSymbols+0x7969
R!Rf_eval+0x331
R!R_cmpfun1+0x508
R!Rf_applyClosure+0x16f
R!Rf_eval+0x2f2
R!Rf_ReplIteration+0x26c
R!Rf_ReplIteration+0x5c2
R!run_Rmainloop+0x52
Rterm+0x171c
Rterm+0x155a
Rterm+0x13e8
Rterm+0x151b
KERNEL32!BaseThreadInitThunk+0x14
My sessionInfo() is:
> sessionInfo()
R version 3.6.3 (2020-02-29)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 18363)
Matrix products: default
locale:
[1] LC_COLLATE=Norwegian Bokmål_Norway.1252
[2] LC_CTYPE=Norwegian Bokmål_Norway.1252
[3] LC_MONETARY=Norwegian Bokmål_Norway.1252
[4] LC_NUMERIC=C
[5] LC_TIME=Norwegian Bokmål_Norway.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] data.table_1.13.2
loaded via a namespace (and not attached):
[1] compiler_3.6.3 tools_3.6.3
Update: Tried reading the same .csv file with vroom - worked fine.
I am having a similar issue to @jthiltges with a wide dataset: the first column is character and the rest are numeric (246834 columns in total), with 1353 rows (+ the header).
Here is a reproducible example:
## generate data: (not the most efficient way but works)
simv = matrix(rnorm(333965049), nrow=1353)
id = sort(paste0("A",sample(10000:90000, size=1353)))
mat = data.frame(id, simv)
colnames(mat) = c("ID", paste0("Col",2:ncol(mat)))
write.csv(mat, file="testdata_fread.csv", row.names=FALSE, quote=FALSE)
Now, when I try to read it in using fread, I get:
> dat = as.data.frame(fread("testdata_fread.csv", header=TRUE, nThread=1, verbose=TRUE))
Input contains no \n. Taking this to be a filename to open
[01] Check arguments
Using 1 threads (omp_get_max_threads()=4, nth=1)
NAstrings = [<<NA>>]
None of the NAstrings look like numbers.
show progress = 1
0/1 column will be read as integer
[02] Opening the file
Opening file /home/biltont/test/testdata_fread.csv
File opened, size = 5.651GB (6067197263 bytes).
Memory mapped ok
[03] Detect and skip BOM
[04] Arrange mmap to be \0 terminated
\n has been found in the input and different lines can end with different line endings (e.g. mixed \n and \r\n in one file). This is common and ideal.
[05] Skipping initial rows if needed
Positioned on line 1 starting: <<id,Tag2,Tag3,Tag4,Tag5,Tag6,Ta>>
[06] Detect separator, quoting rule, and ncolumns
Detecting sep automatically ...
sep=',' with 100 lines of 246834 fields using quote rule 0
Detected 246834 columns on line 1. This line is either column names or first data row. Line starts as: <<id,Tag2,Tag3,Tag4,Tag5,Tag6,Ta>>
Quote rule picked = 0
fill=false and the most number of columns found is 246834
[07] Detect column types, good nrow estimate and whether first row is column names
'header' changed by user from 'auto' to true
Number of sampling jump points = 1 because nrow limit (600) supplied
Type codes (jump 000) : C7777777777777777777777777777777777777777777777777777777777777777777777777777777...7777777777 Quote rule 0
All rows were sampled since file is small so we know nrow=100 exactly
[08] Assign column names
[09] Apply user overrides on column types
After 0 type and 0 drop user overrides : C7777777777777777777777777777777777777777777777777777777777777777777777777777777...7777777777
[10] Allocate memory for the datatable
Allocating 246834 column slots (246834 - 0 dropped) with 100 rows
[11] Read the data
jumps=[0..1), chunk_size=1048576, total_size=6064840029
Too few rows allocated. Allocating additional 1024 rows (now nrows=600) and continue reading from jump 0
jumps=[0..1), chunk_size=1048576, total_size=6064840029
*** caught segfault ***
address 0x7f0a7af7d595, cause 'memory not mapped'
Interestingly, it works if I read in a subset of the rows that is not too large,
dat = as.data.frame(fread("testdata_fread.csv", header=TRUE, nThread=1, verbose=TRUE, nrows=400))
but it seems that as soon as the dataset becomes too large, I get the segfault.
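Since reading a limited number of rows succeeds, one possible workaround is to read the file in row blocks and bind the pieces afterwards. The following is a minimal sketch, not a confirmed fix: `plan_row_blocks()` and `fread_in_blocks()` are hypothetical helper names, the block size of 400 is simply taken from the call above, and it is an assumption that each block stays below whatever size triggers the crash. It relies only on fread's documented `skip`, `nrows`, `header` and `col.names` arguments and on `rbindlist()`.

```r
# Compute (skip, nrows) pairs that cover n_rows data rows after one header line.
plan_row_blocks <- function(n_rows, block_size) {
  starts <- seq(0, n_rows - 1, by = block_size)   # data rows already consumed
  data.frame(
    skip  = starts + 1,                            # +1 skips the header line
    nrows = pmin(block_size, n_rows - starts)
  )
}

# Read a headed CSV block by block and stack the blocks (requires data.table).
# Assumption: the number of data rows (n_rows) is known in advance.
fread_in_blocks <- function(path, n_rows, block_size = 400) {
  col_names <- names(data.table::fread(path, nrows = 0))  # header only
  plan <- plan_row_blocks(n_rows, block_size)
  blocks <- lapply(seq_len(nrow(plan)), function(i) {
    data.table::fread(path, skip = plan$skip[i], nrows = plan$nrows[i],
                      header = FALSE, col.names = col_names)
  })
  data.table::rbindlist(blocks)
}

# Block plan for the 1353-row file described above.
plan <- plan_row_blocks(1353, 400)
print(plan)
```

With the file from the example, the call would look like `dat = as.data.frame(fread_in_blocks("testdata_fread.csv", n_rows = 1353))`. Column types are detected per block, so a column could in principle come back with different types in different blocks before `rbindlist()` combines them.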
Session Info (Note: I'm using a conda environment):
> sessionInfo()
R version 4.0.3 (2020-10-10)
Platform: x86_64-conda-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)
Matrix products: default
BLAS/LAPACK: /home/biltont/conda-envs/testfread/lib/libopenblasp-r0.3.12.so
Random number generation:
RNG: Mersenne-Twister
Normal: Inversion
Sample: Rounding
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] data.table_1.13.4
loaded via a namespace (and not attached):
[1] compiler_4.0.3
Does anyone know if there is a bug in fread or if this might be related to the hardware?
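One thing that can be checked from the verbose output above is how long a single line is compared with the reported `chunk_size`. The sketch below only does the arithmetic, using `total_size` and `chunk_size` from the log and the 1353 rows stated earlier. Whether this relationship has anything to do with the segfault is an assumption that is not verified here.

```r
# Values copied from the verbose fread output and the dataset description.
total_size <- 6064840029   # bytes of data reported at step [11]
chunk_size <- 1048576      # bytes, i.e. 1 MiB
n_rows     <- 1353         # data rows in the file (excluding the header)

bytes_per_row  <- total_size / n_rows        # average length of one line
rows_vs_chunks <- bytes_per_row / chunk_size # how many chunks one line spans

cat(sprintf("average row: %.0f bytes (%.2f MiB)\n",
            bytes_per_row, bytes_per_row / 2^20))
cat(sprintf("one row spans about %.1f chunks of %d bytes\n",
            rows_vs_chunks, chunk_size))
```

With these numbers an average line is about 4.3 MiB, which is longer than one `chunk_size` of 1 MiB.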
Looks like it's the same problem as in #5882 and #5311.
Using R 3.6.1 and data.table 1.12.8, we see a segfault when fread()ing a dataset of 4000000 columns by 300 rows, 3.4GB in size. pushBuffer() accesses str[c], which points to unallocated memory.
This seems to be a different issue from #3369, as it's an invalid memory read during the fread() rather than a write. It also seems unusual that chunk_size is significantly larger than total_size.
The issue can be reproduced with the following TSV data: the first row is a header, and each following row starts with a string field followed by ints.
Reproducible example
Running under gdb
Python to generate the example CSV
Output of sessionInfo()
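The layout described above (a header row, then rows that begin with a string field followed by integers, tab-separated) can be generated at a small scale with a base R sketch like the one below. The function name `make_wide_tsv()`, the column names, the value range and the default sizes are all illustrative assumptions and not the original reproducer. A file this small is not expected to trigger the crash by itself, so the dimensions would need to be scaled up toward those reported.

```r
# Write a wide TSV: one header line, then n_rows lines of "<string>\t<int>\t<int>...".
# Rows are written one at a time so the whole table is never held in memory.
make_wide_tsv <- function(path, n_rows = 5, n_cols = 1000) {
  con <- file(path, open = "w")
  on.exit(close(con))
  # Header: one name for the string column, then one per integer column.
  writeLines(paste(c("id", paste0("V", seq_len(n_cols - 1))), collapse = "\t"), con)
  for (i in seq_len(n_rows)) {
    ints <- sample.int(9, n_cols - 1, replace = TRUE)  # arbitrary small integers
    writeLines(paste(c(paste0("row", i), ints), collapse = "\t"), con)
  }
  invisible(path)
}

tsv_path <- make_wide_tsv(tempfile(fileext = ".tsv"))
lines <- readLines(tsv_path)
length(lines)                                   # header + data rows
lengths(strsplit(lines, "\t", fixed = TRUE))    # fields per line
```

The generated file can then be passed to `data.table::fread(tsv_path)` to check that a given combination of `n_rows` and `n_cols` reads cleanly.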