Rdatatable / data.table

R's data.table package extends data.frame:
http://r-datatable.com
Mozilla Public License 2.0

fwrite crashes when a long line is encountered out of sample #3289

Open st-pasha opened 5 years ago

st-pasha commented 5 years ago

MRE:

require(data.table)
DT = data.table(A=rep('foo', 10000))
DT[1111, A:=paste(rep('b',1e7), collapse='')]
DT[1112, A:=paste(rep('b',1e7), collapse='')]
fwrite(DT, 'temp.csv', verbose=TRUE)

For this error to occur, a single line has to be over 8MB (the default buffer size), or multiple out-of-sample lines have to have a total length over 8MB. Overall, this is quite unlikely.
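As a rough check of that condition, the sketch below compares the byte length of one of the long fields from the MRE with the default buffer size. It assumes that "8MB" means 8 * 1024^2 bytes, which is not stated in this thread; the variable names are only illustrative.

```r
# Assumption: the default fwrite buffer of 8MB is 8 * 1024^2 bytes.
buffer_bytes <- 8 * 1024^2

# Same long field as in the MRE: 1e7 copies of 'b'.
long_line <- paste(rep('b', 1e7), collapse = '')

# Size of the field in bytes (one byte per ASCII character).
line_bytes <- nchar(long_line, type = "bytes")

# TRUE when a single line cannot fit into one buffer.
exceeds_buffer <- line_bytes > buffer_bytes
print(c(line_bytes = line_bytes, buffer_bytes = buffer_bytes))
print(exceeds_buffer)
```

Under that assumption, a single row of the MRE is already larger than one buffer, so the second long row is not strictly needed to exceed it.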

philippechataignon commented 5 years ago

This case is simpler than https://github.com/Rdatatable/data.table/issues/3290 because the long lines are seen in the initial sample (1000 lines). With PR https://github.com/Rdatatable/data.table/pull/3288, it gives:

> require(data.table)
> DT = data.table(A=rep('foo', 10000))
> DT[1111, A:=paste(rep('b',1e7), collapse='')]
> DT[1112, A:=paste(rep('b',1e7), collapse='')]
> fwrite(DT, 'temp.csv', verbose=TRUE)
omp_get_max_threads() = 4
omp_get_thread_limit() = 2147483647
DTthreads = 0
RestoreAfterFork = true
No list columns are present. Setting sep2='' otherwise quote='auto' would quote fields containing sep2.
Column writers: 11 
maxHeaderLen=2

args.doRowNames=0 args.rowNames=0 doQuote=-128 args.nrow=10000 args.ncol=1 eolLen=1
Error in fwrite(DT, "temp.csv", verbose = TRUE) : 
  Error : max line length is greater than buffer secure limit. Try to increase buffMB option. Example 'buffMB = 20'

st-pasha commented 5 years ago

PR #3288 merely changes the definition of which lines are sampled. In #3288, for a file with nrow=10000, every 11th line will be sampled. Since 1111 = 11*101, line 1111 is now in-sample (or maybe I'm off by 1). So perhaps modify the example to use different row numbers, and then we can verify whether the issue is indeed resolved.
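To pick row numbers that stay out of sample, the arithmetic can be checked with a small sketch like the one below. It assumes, as described above, that every 11th row is sampled, and it tests both a 1-based and a 0-based starting offset to cover the possible off-by-one. The actual sampling rule in #3288 is not verified here.

```r
# Assumption: the sample takes every 11th row (step taken from the comment above).
step <- 11L

# Candidate rows holding the very long strings.
rows <- c(1111L, 1112L, 1234L, 1235L)

# Two possible conventions, to cover the off-by-one uncertainty:
sampled_1based <- rows %% step == 0L          # rows 11, 22, 33, ...
sampled_0based <- (rows - 1L) %% step == 0L   # rows 1, 12, 23, ...

# A row is "possibly sampled" if either convention would hit it.
hit <- sampled_1based | sampled_0based
names(hit) <- rows
print(hit)
```

Under either convention, one of rows 1111 and 1112 is hit, while rows such as 1234 and 1235 are missed by both.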

philippechataignon commented 5 years ago

You're right: the line was sampled, which is why fwrite stopped at an early stage.

In the example below, the long lines are not sampled, as the debug output shows (maxLineLen=7 from sample). fwrite now fails during the write phase, but the buffer overflow is caught (like in https://github.com/Rdatatable/data.table/issues/3290).

> require(data.table)
> DT = data.table(A=rep('foo', 10000))
> DT[1234, A:=paste(rep('b',1e7), collapse='')]
> DT[1235, A:=paste(rep('b',1e7), collapse='')]
> fwrite(DT, 'temp.csv', verbose=TRUE)
omp_get_max_threads() = 4
omp_get_thread_limit() = 2147483647
DTthreads = 0
RestoreAfterFork = true
No list columns are present. Setting sep2='' otherwise quote='auto' would quote fields containing sep2.
Column writers: 11 
maxHeaderLen=2

args.doRowNames=0 args.rowNames=0 doQuote=-128 args.nrow=10000 args.ncol=1 eolLen=1
maxLineLen=7 from sample. Found in 0.000s
Writing column names ... done in 0.000s
Writing 10000 rows in 1 batches of 10000 rows (each buffer size 8MB, showProgress=1, nth=1) ... Error in fwrite(DT, "temp.csv", verbose = TRUE) : 
  Error : one or more threads failed to malloc or buffer was too small. Try to increase buffMB option. Example 'buffMB = 16'
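Following the hint in the error message, one possible workaround is to pass a larger `buffMB` to `fwrite`. The sketch below is untested against the version discussed in this thread; the value 32 is an arbitrary choice large enough to hold both long rows, and the output path is a temporary file rather than 'temp.csv'.

```r
library(data.table)

# Same data as the out-of-sample example above.
DT <- data.table(A = rep('foo', 10000))
DT[1234, A := paste(rep('b', 1e7), collapse = '')]
DT[1235, A := paste(rep('b', 1e7), collapse = '')]

# Assumption: a 32MB buffer is enough for the two ~10MB rows.
out <- tempfile(fileext = ".csv")
fwrite(DT, out, buffMB = 32)

# Inspect the size of the written file in bytes.
print(file.size(out))
```

This only avoids the failure when the long rows are known in advance; it does not address the case where the required buffer size cannot be predicted from the sample.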