Rdatatable / data.table

R's data.table package extends data.frame:
http://r-datatable.com
Mozilla Public License 2.0

Stack Imbalance error from fread #2457

Closed by aadler 6 years ago

aadler commented 6 years ago

In the most recent version of data.table, I am trying to read a large file (all rows and 14 columns from a data table of size 53,880,721 x 37; 6.3GiB) and I get a string of stack imbalance errors which eventually result in RStudio crashing. This seems to be the same symptom as issue #2139. Version 1.10.4.3 works without any problems, albeit orders of magnitude more slowly. It is hard to provide a reproducible example without a huge file; if necessary, I will create one.

> library(data.table)
data.table 1.10.5 IN DEVELOPMENT built 2017-10-31 21:13:04 UTC; travis
  The fastest way to learn (by data.table authors): https://www.datacamp.com/courses/data-analysis-the-data-table-way
  Documentation: ?data.table, example(data.table) and browseVignettes("data.table")
  Release notes, videos and slides: http://r-datatable.com

A <- fread('large_file.csv')
Read 28%. ETA 00:00 Error in fread("large_file.csv") : 
  unprotect_ptr: pointer not found
Warning: stack imbalance in '<-', 28 then 29
Warning: stack imbalance in '$', 34 then 33
Warning: stack imbalance in '$', 19 then 20
Error: unprotect_ptr: pointer not found
Warning: stack imbalance in 'lapply', 126 then 125
Warning: stack imbalance in 'lapply', 113 then 114
Warning: stack imbalance in 'lapply', 98 then 102
Warning: stack imbalance in 'lapply', 84 then 86
Warning: stack imbalance in 'lapply', 72 then 73
Warning: stack imbalance in 'lapply', 61 then 60
Warning: stack imbalance in 'lapply', 50 then 49
Warning: stack imbalance in '$', 49 then 50
Warning: stack imbalance in '$', 34 then 35
Error during wrapup: R_Reprotect: only 20 protected items, can't reprotect index 20

# Output of sessionInfo()

R version 3.4.2 beta (2017-09-17 r73296)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows Server >= 2012 x64 (build 9200)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252    LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C                           LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] data.table_1.10.5

loaded via a namespace (and not attached):
[1] compiler_3.4.2 tools_3.4.2 
mattdowle commented 6 years ago

Thanks! Could you pass verbose=TRUE to that fread call, rerun and post the full output please.

aadler commented 6 years ago

Sure.

This time it stopped at 82% instead of 28%. It continues after the stack-imbalance warnings, and then I get an "RStudio R Session has stopped working" error box.

> library(data.table)
data.table 1.10.5 IN DEVELOPMENT built 2017-10-31 21:13:04 UTC; travis
  The fastest way to learn (by data.table authors): https://www.datacamp.com/courses/data-analysis-the-data-table-way
  Documentation: ?data.table, example(data.table) and browseVignettes("data.table")
  Release notes, videos and slides: http://r-datatable.com
> sessionInfo()
R version 3.4.2 beta (2017-09-17 r73296)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows Server >= 2012 x64 (build 9200)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252    LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C                           LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] data.table_1.10.5

loaded via a namespace (and not attached):
[1] compiler_3.4.2 tools_3.4.2   
> A <- fread('large_file.csv', verbose = TRUE)
Input contains no \n. Taking this to be a filename to open
[01] Check arguments
  Using 40 threads (omp_get_max_threads()=40, nth=40)
  NAstrings = [<<NA>>]
  None of the NAstrings look like numbers.
  show progress = 1
  0/1 column will be read as boolean
[02] Opening the file
  Opening file large_file.csv
  File opened, size = 6.347GB (6814991444 bytes).
  Memory mapping ... ok
[03] Detect and skip BOM
[04] Arrange mmap to be \0 terminated
  \r-only line endings are not allowed because \n is found in the data
[05] Skipping initial rows if needed
  Positioned on line 1 starting: <<Orig_Year,Orig_Qtr,LoanID,Mont>>
[06] Detect separator, quoting rule, and ncolumns
  Detecting sep ...
  sep=','  with 100 lines of 37 fields using quote rule 0
  Detected 37 columns on line 1. This line is either column names or first data row. Line starts as: <<Orig_Year,Orig_Qtr,LoanID,Mont>>
  Quote rule picked = 0
  fill=false and the most number of columns found is 37
[07] Detect column types, good nrow estimate and whether first row is column names
  Number of sampling jump points = 101 because (6814991442 bytes from row 1 to eof) / (2 * 12905 jump0size) == 264044
  'header' determined to be true due to column 1 containing a string on row 1 and a lower type (int32) on row 2
  Type codes (jump 000)    : 51101071551015107111111111111771111177715  Quote rule 0
  Type codes (jump 001)    : 511010715510151071111111111117711111777110  Quote rule 0
  Type codes (jump 013)    : 511010755510151071111111111117711111777110  Quote rule 0
  Type codes (jump 028)    : 5110107555101510711011111111117711111777110  Quote rule 0
  Type codes (jump 042)    : 5510107555101510711011111111117711111777110  Quote rule 0
  Type codes (jump 068)    : 55101075551010510711011111111117711111777110  Quote rule 0
  Type codes (jump 082)    : 55101075551010510711011111111177711111777110  Quote rule 0
  Type codes (jump 091)    : 55101075551010510711055555555777711111777510  Quote rule 0
  Type codes (jump 100)    : 55101075551010510711055555555777711111777510  Quote rule 0
  =====
  Sampled 10049 rows (handled \n inside quoted fields) at 101 jump points
  Bytes from first data row on line 1 to the end of last row: 6814991442
  Line length: mean=126.51 sd=8.07 min=100 max=359
  Estimated number of rows: 6814991442 / 126.51 = 53867158
  Initial alloc = 61748467 rows (53867158 + 14%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
  =====
[08] Assign column names
[09] Apply user overrides on column types
  After 0 type and 0 drop user overrides : 55101075551010510711055555555777711111777510
[10] Allocate memory for the datatable
  Allocating 37 column slots (37 - 0 dropped) with 61748467 rows
[11] Read the data
  jumps=[0..6520), chunk_size=1045244, total_size=6814991083
Read 62%. ETA 00:00 Warning: stack imbalance in '$', 38 then 36
Warning: stack imbalance in '$', 23 then 24
Read 68%. ETA 00:00 Warning: stack imbalance in '$', 23 then 24
Read 82%. ETA 00:00  
mattdowle commented 6 years ago

Thanks for this output @aadler. Could you try once more, this time with showProgress=FALSE, i.e. fread('large_file.csv', verbose=TRUE, showProgress=FALSE). All I can think of currently is that it's related to printing the ETA to the console on Windows. Only thread 0 does that Rprintf, which I thought was safe, but maybe it isn't. If it works with showProgress=FALSE but not with TRUE, and that's repeatable several times, then I know I'm barking up the right tree. Please also use the latest dev 1.10.5 just to be sure.

mattdowle commented 6 years ago

In R's printutils.c, Rprintf calls Rvprintf, which contains this at line 917:

static int printcount = 0;
if (++printcount > 100) {
    R_CheckUserInterrupt();
    printcount = 0 ;
}

and in freadR.c at line 481 there is this comment:

// Had crashes with R_CheckUserInterrupt() even when called only from
// master thread, to overcome.

So barking up this tree looks promising. I'll replace the call to Rprintf() with REprintf() to avoid its call to R_CheckUserInterrupt() and ask you to try again.

It may feel like it's something to do with fread's rereading, but the reread is also when fread runs for longer, so it may just be that more ETA messages are printed, reaching the 100 count in core R. There isn't a reread reported in your output this time, for example. It may also depend on how many lines have been printed to the console before fread is called (printcount's value), which would give rise to the randomness of the crash / stack imbalance.

mattdowle commented 6 years ago

@aadler Change made and looks ok. (It's just a red cross because the progress meter isn't getting test coverage from the smoke tests.) Please go ahead and try very latest dev.

aadler commented 6 years ago
> library(data.table)
data.table 1.10.5 IN DEVELOPMENT built 2017-11-09 04:24:28 UTC; travis
  The fastest way to learn (by data.table authors): https://www.datacamp.com/courses/data-analysis-the-data-table-way
  Documentation: ?data.table, example(data.table) and browseVignettes("data.table")
  Release notes, videos and slides: http://r-datatable.com
> sessionInfo()
R version 3.4.2 beta (2017-09-17 r73296)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows Server >= 2012 x64 (build 9200)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252    LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C                           LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] data.table_1.10.5

loaded via a namespace (and not attached):
[1] compiler_3.4.2 tools_3.4.2

You were correct, it seems!

> A <- fread('large_file.csv', verbose = TRUE, showProgress=FALSE)
Input contains no \n. Taking this to be a filename to open
[01] Check arguments
  Using 40 threads (omp_get_max_threads()=40, nth=40)
  NAstrings = [<<NA>>]
  None of the NAstrings look like numbers.
  show progress = 0
  0/1 column will be read as boolean
[02] Opening the file
  Opening file large_file.csv
  File opened, size = 6.347GB (6814991444 bytes).
  Memory mapping ... ok
[03] Detect and skip BOM
[04] Arrange mmap to be \0 terminated
  \r-only line endings are not allowed because \n is found in the data
[05] Skipping initial rows if needed
  Positioned on line 1 starting: <<Orig_Year,Orig_Qtr,LoanID,Mont>>
[06] Detect separator, quoting rule, and ncolumns
  Detecting sep ...
  sep=','  with 100 lines of 37 fields using quote rule 0
  Detected 37 columns on line 1. This line is either column names or first data row. Line starts as: <<Orig_Year,Orig_Qtr,LoanID,Mont>>
  Quote rule picked = 0
  fill=false and the most number of columns found is 37
[07] Detect column types, good nrow estimate and whether first row is column names
  Number of sampling jump points = 101 because (6814991442 bytes from row 1 to eof) / (2 * 12905 jump0size) == 264044
  'header' determined to be true due to column 1 containing a string on row 1 and a lower type (int32) on row 2
  Type codes (jump 000)    : 51101071551015107111111111111771111177715  Quote rule 0
  Type codes (jump 001)    : 511010715510151071111111111117711111777110  Quote rule 0
  Type codes (jump 013)    : 511010755510151071111111111117711111777110  Quote rule 0
  Type codes (jump 028)    : 5110107555101510711011111111117711111777110  Quote rule 0
  Type codes (jump 042)    : 5510107555101510711011111111117711111777110  Quote rule 0
  Type codes (jump 068)    : 55101075551010510711011111111117711111777110  Quote rule 0
  Type codes (jump 082)    : 55101075551010510711011111111177711111777110  Quote rule 0
  Type codes (jump 091)    : 55101075551010510711055555555777711111777510  Quote rule 0
  Type codes (jump 100)    : 55101075551010510711055555555777711111777510  Quote rule 0
  =====
  Sampled 10049 rows (handled \n inside quoted fields) at 101 jump points
  Bytes from first data row on line 1 to the end of last row: 6814991442
  Line length: mean=126.51 sd=8.07 min=100 max=359
  Estimated number of rows: 6814991442 / 126.51 = 53867158
  Initial alloc = 61748467 rows (53867158 + 14%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
  =====
[08] Assign column names
[09] Apply user overrides on column types
  After 0 type and 0 drop user overrides : 55101075551010510711055555555777711111777510
[10] Allocate memory for the datatable
  Allocating 37 column slots (37 - 0 dropped) with 61748467 rows
[11] Read the data
  jumps=[0..6520), chunk_size=1045244, total_size=6814991083
[12] Finalizing the datatable
Read 53880720 rows x 37 columns from 6.347GB (6814991444 bytes) file in 00:56.763 wall clock time
Thread buffers were grown 0 times (if all 40 threads each grew once, this figure would be 40)
Final type counts
         0 : drop     
         5 : bool8    
         0 : bool8    
         0 : bool8    
         0 : bool8    
        14 : int32    
         0 : int64    
        10 : float64  
         0 : float64  
         0 : float64  
         8 : string   
Rereading 2 columns due to out-of-sample type exceptions.
Column 17 ("XXXX") bumped from 'int32' to 'string' due to <<C>> on row 1040
Column 14 ("XXXX") bumped from 'bool8' to 'float64' due to <<12473.21>> on row 77204
[11] Read the data
  jumps=[0..6520), chunk_size=1045244, total_size=6814991083
[12] Finalizing the datatable
Reread 53880720 rows x 2 columns in 00:20.442
Read 53880720 rows. Exactly what was estimated and allocated up front
=============================
   0.000s (  0%) Memory map 6.347GB file
   0.549s (  1%) sep=',' ncol=37 and header detection
   0.016s (  0%) Column type detection using 10049 sample rows
  15.322s ( 20%) Allocation of 53880720 rows x 37 cols (12.192GB)
  61.318s ( 79%) Reading 6520 chunks of 0.997MB (8261 rows) using 40 threads
   =    0.057s (  0%) Finding first non-embedded \n after each jump
   +    3.851s (  5%) Parse to row-major thread buffers
   +   38.294s ( 50%) Transpose
   +   19.116s ( 25%) Waiting
  20.442s ( 26%) Rereading 2 columns due to out-of-sample type exceptions
  77.204s        Total

and a second time for testing

> cclasses <- c(rep('integer', 2L), 'character', 'Date', 'numeric',
+               rep('integer', 3L), rep('character', 2L),
+               'integer', 'Date', rep('numeric', 2L), 'Date',
+               rep('numeric', 12L), rep('integer', 5),
+               rep('numeric', 3L), 'integer', 'character')
> A <- fread('large_file.csv', verbose = TRUE, showProgress=FALSE, colClasses = cclasses)
Input contains no \n. Taking this to be a filename to open
[01] Check arguments
  Using 40 threads (omp_get_max_threads()=40, nth=40)
  NAstrings = [<<NA>>]
  None of the NAstrings look like numbers.
  show progress = 0
  0/1 column will be read as boolean
[02] Opening the file
  Opening file large_file.csv
  File opened, size = 6.347GB (6814991444 bytes).
  Memory mapping ... ok
[03] Detect and skip BOM
[04] Arrange mmap to be \0 terminated
  \r-only line endings are not allowed because \n is found in the data
[05] Skipping initial rows if needed
  Positioned on line 1 starting: <<Orig_Year,Orig_Qtr,LoanID,Mont>>
[06] Detect separator, quoting rule, and ncolumns
  Detecting sep ...
  sep=','  with 100 lines of 37 fields using quote rule 0
  Detected 37 columns on line 1. This line is either column names or first data row. Line starts as: <<Orig_Year,Orig_Qtr,LoanID,Mont>>
  Quote rule picked = 0
  fill=false and the most number of columns found is 37
[07] Detect column types, good nrow estimate and whether first row is column names
  Number of sampling jump points = 101 because (6814991442 bytes from row 1 to eof) / (2 * 12905 jump0size) == 264044
  'header' determined to be true due to column 1 containing a string on row 1 and a lower type (int32) on row 2
  Type codes (jump 000)    : 51101071551015107111111111111771111177715  Quote rule 0
  Type codes (jump 001)    : 511010715510151071111111111117711111777110  Quote rule 0
  Type codes (jump 013)    : 511010755510151071111111111117711111777110  Quote rule 0
  Type codes (jump 028)    : 5110107555101510711011111111117711111777110  Quote rule 0
  Type codes (jump 042)    : 5510107555101510711011111111117711111777110  Quote rule 0
  Type codes (jump 068)    : 55101075551010510711011111111117711111777110  Quote rule 0
  Type codes (jump 082)    : 55101075551010510711011111111177711111777110  Quote rule 0
  Type codes (jump 091)    : 55101075551010510711055555555777711111777510  Quote rule 0
  Type codes (jump 100)    : 55101075551010510711055555555777711111777510  Quote rule 0
  =====
  Sampled 10049 rows (handled \n inside quoted fields) at 101 jump points
  Bytes from first data row on line 1 to the end of last row: 6814991442
  Line length: mean=126.51 sd=8.07 min=100 max=359
  Estimated number of rows: 6814991442 / 126.51 = 53867158
  Initial alloc = 61748467 rows (53867158 + 14%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
  =====
[08] Assign column names
[09] Apply user overrides on column types
  After 14 type and 0 drop user overrides : 55101075551010510771077777777777755555777510
[10] Allocate memory for the datatable
  Allocating 37 column slots (37 - 0 dropped) with 61748467 rows
[11] Read the data
  jumps=[0..6520), chunk_size=1045244, total_size=6814991083
[12] Finalizing the datatable
Read 53880720 rows x 37 columns from 6.347GB (6814991444 bytes) file in 00:31.641 wall clock time
Thread buffers were grown 0 times (if all 40 threads each grew once, this figure would be 40)
Final type counts
         0 : drop     
         0 : bool8    
         0 : bool8    
         0 : bool8    
         0 : bool8    
        12 : int32    
         0 : int64    
        17 : float64  
         0 : float64  
         0 : float64  
         8 : string   
Rereading 1 columns due to out-of-sample type exceptions.
Column 17 ("XXXX") bumped from 'float64' to 'string' due to <<C>> on row 1040
[11] Read the data
  jumps=[0..6520), chunk_size=1045244, total_size=6814991083
[12] Finalizing the datatable
Reread 53880720 rows x 1 columns in 00:33.233
Read 53880720 rows. Exactly what was estimated and allocated up front
=============================
   0.000s (  0%) Memory map 6.347GB file
   0.165s (  0%) sep=',' ncol=37 and header detection
   0.016s (  0%) Column type detection using 10049 sample rows
  10.867s ( 17%) Allocation of 53880720 rows x 37 cols (14.262GB)
  53.827s ( 83%) Reading 6520 chunks of 0.997MB (8261 rows) using 40 threads
   =    0.004s (  0%) Finding first non-embedded \n after each jump
   +    2.944s (  5%) Parse to row-major thread buffers
   +   18.906s ( 29%) Transpose
   +   31.973s ( 49%) Waiting
  33.233s ( 51%) Rereading 1 columns due to out-of-sample type exceptions
  64.874s        Total
aadler commented 6 years ago

@mattdowle I ran it three times without showProgress=FALSE (i.e. with the progress meter on) and it completed!

> A <- fread('large_file.csv', verbose = TRUE, colClasses = cclasses)
Input contains no \n. Taking this to be a filename to open
[01] Check arguments
  Using 40 threads (omp_get_max_threads()=40, nth=40)
  NAstrings = [<<NA>>]
  None of the NAstrings look like numbers.
  show progress = 1
  0/1 column will be read as boolean
[02] Opening the file
  Opening file large_file.csv
  File opened, size = 6.347GB (6814991444 bytes).
  Memory mapping ... ok
[03] Detect and skip BOM
[04] Arrange mmap to be \0 terminated
  \r-only line endings are not allowed because \n is found in the data
[05] Skipping initial rows if needed
  Positioned on line 1 starting: <<Orig_Year,Orig_Qtr,LoanID,Mont>>
[06] Detect separator, quoting rule, and ncolumns
  Detecting sep ...
  sep=','  with 100 lines of 37 fields using quote rule 0
  Detected 37 columns on line 1. This line is either column names or first data row. Line starts as: <<Orig_Year,Orig_Qtr,LoanID,Mont>>
  Quote rule picked = 0
  fill=false and the most number of columns found is 37
[07] Detect column types, good nrow estimate and whether first row is column names
  Number of sampling jump points = 101 because (6814991442 bytes from row 1 to eof) / (2 * 12905 jump0size) == 264044
  'header' determined to be true due to column 1 containing a string on row 1 and a lower type (int32) on row 2
  Type codes (jump 000)    : 51101071551015107111111111111771111177715  Quote rule 0
  Type codes (jump 001)    : 511010715510151071111111111117711111777110  Quote rule 0
  Type codes (jump 013)    : 511010755510151071111111111117711111777110  Quote rule 0
  Type codes (jump 028)    : 5110107555101510711011111111117711111777110  Quote rule 0
  Type codes (jump 042)    : 5510107555101510711011111111117711111777110  Quote rule 0
  Type codes (jump 068)    : 55101075551010510711011111111117711111777110  Quote rule 0
  Type codes (jump 082)    : 55101075551010510711011111111177711111777110  Quote rule 0
  Type codes (jump 091)    : 55101075551010510711055555555777711111777510  Quote rule 0
  Type codes (jump 100)    : 55101075551010510711055555555777711111777510  Quote rule 0
  =====
  Sampled 10049 rows (handled \n inside quoted fields) at 101 jump points
  Bytes from first data row on line 1 to the end of last row: 6814991442
  Line length: mean=126.51 sd=8.07 min=100 max=359
  Estimated number of rows: 6814991442 / 126.51 = 53867158
  Initial alloc = 61748467 rows (53867158 + 14%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
  =====
[08] Assign column names
[09] Apply user overrides on column types
  After 14 type and 0 drop user overrides : 55101075551010510771077777777777755555777510
[10] Allocate memory for the datatable
  Allocating 37 column slots (37 - 0 dropped) with 61748467 rows
[11] Read the data
  jumps=[0..6520), chunk_size=1045244, total_size=6814991083
Read 99%. ETA 00:00 
[12] Finalizing the datatable
Read 53880720 rows x 37 columns from 6.347GB (6814991444 bytes) file in 00:35.513 wall clock time
Thread buffers were grown 0 times (if all 40 threads each grew once, this figure would be 40)
Final type counts
         0 : drop     
         0 : bool8    
         0 : bool8    
         0 : bool8    
         0 : bool8    
        12 : int32    
         0 : int64    
        17 : float64  
         0 : float64  
         0 : float64  
         8 : string   
Rereading 1 columns due to out-of-sample type exceptions.
Column 17 ("XXXX") bumped from 'float64' to 'string' due to <<C>> on row 1040
[11] Read the data
  jumps=[0..6520), chunk_size=1045244, total_size=6814991083

[12] Finalizing the datatable
Reread 53880720 rows x 1 columns in 00:19.218
Read 53880720 rows. Exactly what was estimated and allocated up front
=============================
   0.006s (  0%) Memory map 6.347GB file
   0.162s (  0%) sep=',' ncol=37 and header detection
   0.016s (  0%) Column type detection using 10049 sample rows
  11.688s ( 21%) Allocation of 53880720 rows x 37 cols (14.262GB)
  42.859s ( 78%) Reading 6520 chunks of 0.997MB (8261 rows) using 40 threads
   =    0.023s (  0%) Finding first non-embedded \n after each jump
   +    3.818s (  7%) Parse to row-major thread buffers
   +   20.892s ( 38%) Transpose
   +   18.127s ( 33%) Waiting
  19.218s ( 35%) Rereading 1 columns due to out-of-sample type exceptions
  54.731s        Total
> A <- fread('large_file.csv', verbose = TRUE)
Input contains no \n. Taking this to be a filename to open
[01] Check arguments
  Using 40 threads (omp_get_max_threads()=40, nth=40)
  NAstrings = [<<NA>>]
  None of the NAstrings look like numbers.
  show progress = 1
  0/1 column will be read as boolean
[02] Opening the file
  Opening file large_file.csv
  File opened, size = 6.347GB (6814991444 bytes).
  Memory mapping ... ok
[03] Detect and skip BOM
[04] Arrange mmap to be \0 terminated
  \r-only line endings are not allowed because \n is found in the data
[05] Skipping initial rows if needed
  Positioned on line 1 starting: <<Orig_Year,Orig_Qtr,LoanID,Mont>>
[06] Detect separator, quoting rule, and ncolumns
  Detecting sep ...
  sep=','  with 100 lines of 37 fields using quote rule 0
  Detected 37 columns on line 1. This line is either column names or first data row. Line starts as: <<Orig_Year,Orig_Qtr,LoanID,Mont>>
  Quote rule picked = 0
  fill=false and the most number of columns found is 37
[07] Detect column types, good nrow estimate and whether first row is column names
  Number of sampling jump points = 101 because (6814991442 bytes from row 1 to eof) / (2 * 12905 jump0size) == 264044
  'header' determined to be true due to column 1 containing a string on row 1 and a lower type (int32) on row 2
  Type codes (jump 000)    : 51101071551015107111111111111771111177715  Quote rule 0
  Type codes (jump 001)    : 511010715510151071111111111117711111777110  Quote rule 0
  Type codes (jump 013)    : 511010755510151071111111111117711111777110  Quote rule 0
  Type codes (jump 028)    : 5110107555101510711011111111117711111777110  Quote rule 0
  Type codes (jump 042)    : 5510107555101510711011111111117711111777110  Quote rule 0
  Type codes (jump 068)    : 55101075551010510711011111111117711111777110  Quote rule 0
  Type codes (jump 082)    : 55101075551010510711011111111177711111777110  Quote rule 0
  Type codes (jump 091)    : 55101075551010510711055555555777711111777510  Quote rule 0
  Type codes (jump 100)    : 55101075551010510711055555555777711111777510  Quote rule 0
  =====
  Sampled 10049 rows (handled \n inside quoted fields) at 101 jump points
  Bytes from first data row on line 1 to the end of last row: 6814991442
  Line length: mean=126.51 sd=8.07 min=100 max=359
  Estimated number of rows: 6814991442 / 126.51 = 53867158
  Initial alloc = 61748467 rows (53867158 + 14%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
  =====
[08] Assign column names
[09] Apply user overrides on column types
  After 0 type and 0 drop user overrides : 55101075551010510711055555555777711111777510
[10] Allocate memory for the datatable
  Allocating 37 column slots (37 - 0 dropped) with 61748467 rows
[11] Read the data
  jumps=[0..6520), chunk_size=1045244, total_size=6814991083
Read 99%. ETA 00:00 
[12] Finalizing the datatable
Read 53880720 rows x 37 columns from 6.347GB (6814991444 bytes) file in 00:23.552 wall clock time
Thread buffers were grown 0 times (if all 40 threads each grew once, this figure would be 40)
Final type counts
         0 : drop     
         5 : bool8    
         0 : bool8    
         0 : bool8    
         0 : bool8    
        14 : int32    
         0 : int64    
        10 : float64  
         0 : float64  
         0 : float64  
         8 : string   
Rereading 2 columns due to out-of-sample type exceptions.
Column 17 ("XXXX") bumped from 'int32' to 'string' due to <<C>> on row 1040
Column 14 ("XXXX") bumped from 'bool8' to 'float64' due to <<12473.21>> on row 77204
[11] Read the data
  jumps=[0..6520), chunk_size=1045244, total_size=6814991083

[12] Finalizing the datatable
Reread 53880720 rows x 2 columns in 00:20.825
Read 53880720 rows. Exactly what was estimated and allocated up front
=============================
   0.000s (  0%) Memory map 6.347GB file
   0.173s (  0%) sep=',' ncol=37 and header detection
   0.016s (  0%) Column type detection using 10049 sample rows
   3.212s (  7%) Allocation of 53880720 rows x 37 cols (12.192GB)
  40.976s ( 92%) Reading 6520 chunks of 0.997MB (8261 rows) using 40 threads
   =    0.015s (  0%) Finding first non-embedded \n after each jump
   +    3.793s (  9%) Parse to row-major thread buffers
   +   18.783s ( 42%) Transpose
   +   18.384s ( 41%) Waiting
  20.825s ( 47%) Rereading 2 columns due to out-of-sample type exceptions
  44.377s        Total
> A <- fread('large_file.csv')
Read 53880720 rows x 37 columns from 6.347GB (6814991444 bytes) file in 00:32.848 wall clock time
Rereading 2 columns due to out-of-sample type exceptions.
Reread 53880720 rows x 2 columns in 00:18.507
mattdowle commented 6 years ago

Relief! Thanks @aadler!