Rdatatable / data.table

R's data.table package extends data.frame:
http://r-datatable.com
Mozilla Public License 2.0

fread error: R character strings are limited to 2^31-1 bytes #5338

Open vronizor opened 2 years ago

vronizor commented 2 years ago

I've been trying to open this large CSV from the New York City Taxi Commission. fread errors out, saying that I have reached the character string limit. I've posted a question on SO, but the workaround does not cover all the problematic lines (i.e., I keep hitting the same error even after discarding the first problematic line). I tried fill = T but I still get incorrect readings. Interestingly, and contrary to the output provided by the answerer on SO, I do not get a "Stopped early on line 2958. Expected 18 fields but found 19." error, but the problem does appear to be related to the number of columns encountered on some lines.

Seems related to #1812, #4130, possibly #5119.

library(data.table) #development version installed
options(timeout=10000)

download.file("https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2010-03.csv",
              destfile = "trip_data.csv", mode = "wb")

dt = fread("trip_data.csv", verbose = T)
#>   OpenMP version (_OPENMP)       201511
#>   omp_get_num_procs()            4
#>   R_DATATABLE_NUM_PROCS_PERCENT  unset (default 50)
#>   R_DATATABLE_NUM_THREADS        unset
#>   R_DATATABLE_THROTTLE           unset (default 1024)
#>   omp_get_thread_limit()         2147483647
#>   omp_get_max_threads()          4
#>   OMP_THREAD_LIMIT               unset
#>   OMP_NUM_THREADS                unset
#>   RestoreAfterFork               true
#>   data.table is using 2 threads with throttle==1024. See ?setDTthreads.
#> freadR.c has been passed a filename: trip_data.csv
#> [01] Check arguments
#>   Using 2 threads (omp_get_max_threads()=4, nth=2)
#>   NAstrings = [<<NA>>]
#>   None of the NAstrings look like numbers.
#>   show progress = 0
#>   0/1 column will be read as integer
#> [02] Opening the file
#>   Opening file trip_data.csv
#>   File opened, size = 2.204GB (2366707460 bytes).
#>   Memory mapped ok
#> [03] Detect and skip BOM
#> [04] Arrange mmap to be \0 terminated
#>   \n has been found in the input and different lines can end with different line endings (e.g. mixed \n and \r\n in one file). This is common and ideal.
#> [05] Skipping initial rows if needed
#>   Positioned on line 1 starting: <<vendor_id,pickup_datetime,drop>>
#> [06] Detect separator, quoting rule, and ncolumns
#>   Detecting sep automatically ...
#>   sep=','  with 98 lines of 18 fields using quote rule 0
#>   Detected 18 columns on line 3. This line is either column names or first data row. Line starts as: <<CMT,2010-03-22 17:05:03,2010-0>>
#>   Quote rule picked = 0
#>   fill=false and the most number of columns found is 18
#> [07] Detect column types, good nrow estimate and whether first row is column names
#>   Number of sampling jump points = 100 because (2366707206 bytes from row 1 to eof) / (2 * 18254 jump0size) == 64827
#>   Type codes (jump 000)    : DCC68886688D888888  Quote rule 0
#>   A line with too-many fields (18/18) was found on line 20 of sample jump 49. Most likely this jump landed awkwardly so type bumps here will be skipped.
#>   A line with too-many fields (18/18) was found on line 10 of sample jump 76. Most likely this jump landed awkwardly so type bumps here will be skipped.
#>   Type codes (jump 100)    : DCC68886688D888888  Quote rule 0
#>   'header' determined to be false because there are some number columns and those columns do not have a string field at the top of them
#>   =====
#>   Sampled 9877 rows (handled \n inside quoted fields) at 101 jump points
#>   Bytes from first data row on line 3 to the end of last row: 2366707034
#>   Line length: mean=183.98 sd=19.19 min=80 max=236
#>   Estimated number of rows: 2366707034 / 183.98 = 12863912
#>   Initial alloc = 16254771 rows (12863912 + 26%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
#>   =====
#> [08] Assign column names
#> [09] Apply user overrides on column types
#>   After 0 type and 0 drop user overrides : DCC68886688D888888
#> [10] Allocate memory for the datatable
#>   Allocating 18 column slots (18 - 0 dropped) with 16254771 rows
#> [11] Read the data
#>   jumps=[0..2258), chunk_size=1048143, total_size=2366707206
#>   Restarting team from jump 0. nSwept==0 quoteRule==1
#>   jumps=[0..2258), chunk_size=1048143, total_size=2366707206
#>   Restarting team from jump 0. nSwept==0 quoteRule==2
#>   jumps=[0..2258), chunk_size=1048143, total_size=2366707206
#>   Restarting team from jump 0. nSwept==0 quoteRule==3
#>   jumps=[0..2258), chunk_size=1048143, total_size=2366707206
#> Read 2955 rows x 18 columns from 2.204GB (2366707460 bytes) file in 00:02.041 wall clock time
#> [12] Finalizing the datatable
#>   Type counts:
#>          3 : int32     '6'
#>         11 : float64   '8'
#>          2 : float64   'C'
#>          2 : string    'D'
#> Error in fread("trip_data.csv", verbose = T): R character strings are limited to 2^31-1 bytes
dt = fread("trip_data.csv", fill = T, verbose = T)
#>   OpenMP version (_OPENMP)       201511
#>   omp_get_num_procs()            4
#>   R_DATATABLE_NUM_PROCS_PERCENT  unset (default 50)
#>   R_DATATABLE_NUM_THREADS        unset
#>   R_DATATABLE_THROTTLE           unset (default 1024)
#>   omp_get_thread_limit()         2147483647
#>   omp_get_max_threads()          4
#>   OMP_THREAD_LIMIT               unset
#>   OMP_NUM_THREADS                unset
#>   RestoreAfterFork               true
#>   data.table is using 2 threads with throttle==1024. See ?setDTthreads.
#> freadR.c has been passed a filename: trip_data.csv
#> Warning in fread("trip_data.csv", fill = T, verbose = T): Previous fread()
#> session was not cleaned up properly. Cleaned up ok at the beginning of this
#> fread() call.
#> [01] Check arguments
#>   Using 2 threads (omp_get_max_threads()=4, nth=2)
#>   NAstrings = [<<NA>>]
#>   None of the NAstrings look like numbers.
#>   show progress = 0
#>   0/1 column will be read as integer
#> [02] Opening the file
#>   Opening file trip_data.csv
#>   File opened, size = 2.204GB (2366707460 bytes).
#>   Memory mapped ok
#> [03] Detect and skip BOM
#> [04] Arrange mmap to be \0 terminated
#>   \n has been found in the input and different lines can end with different line endings (e.g. mixed \n and \r\n in one file). This is common and ideal.
#> [05] Skipping initial rows if needed
#>   Positioned on line 1 starting: <<vendor_id,pickup_datetime,drop>>
#> [06] Detect separator, quoting rule, and ncolumns
#>   Detecting sep automatically ...
#>   sep=','  with 18 fields using quote rule 0
#>   Detected 18 columns on line 1. This line is either column names or first data row. Line starts as: <<vendor_id,pickup_datetime,drop>>
#>   Quote rule picked = 0
#>   fill=true and the most number of columns found is 18
#> [07] Detect column types, good nrow estimate and whether first row is column names
#>   Number of sampling jump points = 100 because (2366707458 bytes from row 1 to eof) / (2 * 18506 jump0size) == 63944
#>   Type codes (jump 000)    : DCC68886688D888888  Quote rule 0
#>   A line with too-many fields (18/18) was found on line 21 of sample jump 49. Most likely this jump landed awkwardly so type bumps here will be skipped.
#>   A line with too-many fields (18/18) was found on line 10 of sample jump 76. Most likely this jump landed awkwardly so type bumps here will be skipped.
#>   Type codes (jump 100)    : DCC68886688D888888  Quote rule 0
#>   'header' determined to be true due to column 2 containing a string on row 1 and a lower type (float64) in the rest of the 9879 sample rows
#>   =====
#>   Sampled 9879 rows (handled \n inside quoted fields) at 101 jump points
#>   Bytes from first data row on line 2 to the end of last row: 2366707207
#>   Line length: mean=183.95 sd=19.28 min=1 max=236
#>   Estimated number of rows: 2366707207 / 183.95 = 12866150
#>   Initial alloc = 16278321 rows (12866150 + 26%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
#>   =====
#> [08] Assign column names
#> [09] Apply user overrides on column types
#>   After 0 type and 0 drop user overrides : DCC68886688D888888
#> [10] Allocate memory for the datatable
#>   Allocating 18 column slots (18 - 0 dropped) with 16278321 rows
#> [11] Read the data
#>   jumps=[0..2258), chunk_size=1048143, total_size=2366707207
#>   Restarting team from jump 0. nSwept==0 quoteRule==1
#>   jumps=[0..2258), chunk_size=1048143, total_size=2366707207
#>   Restarting team from jump 0. nSwept==0 quoteRule==2
#>   jumps=[0..2258), chunk_size=1048143, total_size=2366707207
#>   Restarting team from jump 0. nSwept==0 quoteRule==3
#>   jumps=[0..2258), chunk_size=1048143, total_size=2366707207
#>   1 out-of-sample type bumps: DCC68886688DD88888
#>   jumps=[0..2258), chunk_size=1048143, total_size=2366707207
#> Read 2956 rows x 18 columns from 2.204GB (2366707460 bytes) file in 00:01.500 wall clock time
#> [12] Finalizing the datatable
#>   Type counts:
#>          3 : int32     '6'
#>         10 : float64   '8'
#>          2 : float64   'C'
#>          3 : string    'D'
#> Error in fread("trip_data.csv", fill = T, verbose = T): R character strings are limited to 2^31-1 bytes

dt2955 = fread("trip_data.csv", nrows = 2955, verbose = T)
#>   OpenMP version (_OPENMP)       201511
#>   omp_get_num_procs()            4
#>   R_DATATABLE_NUM_PROCS_PERCENT  unset (default 50)
#>   R_DATATABLE_NUM_THREADS        unset
#>   R_DATATABLE_THROTTLE           unset (default 1024)
#>   omp_get_thread_limit()         2147483647
#>   omp_get_max_threads()          4
#>   OMP_THREAD_LIMIT               unset
#>   OMP_NUM_THREADS                unset
#>   RestoreAfterFork               true
#>   data.table is using 2 threads with throttle==1024. See ?setDTthreads.
#> freadR.c has been passed a filename: trip_data.csv
#> Warning in fread("trip_data.csv", nrows = 2955, verbose = T): Previous fread()
#> session was not cleaned up properly. Cleaned up ok at the beginning of this
#> fread() call.
#> [01] Check arguments
#>   Using 2 threads (omp_get_max_threads()=4, nth=2)
#>   NAstrings = [<<NA>>]
#>   None of the NAstrings look like numbers.
#>   show progress = 0
#>   0/1 column will be read as integer
#> [02] Opening the file
#>   Opening file trip_data.csv
#>   File opened, size = 2.204GB (2366707460 bytes).
#>   Memory mapped ok
#> [03] Detect and skip BOM
#> [04] Arrange mmap to be \0 terminated
#>   \n has been found in the input and different lines can end with different line endings (e.g. mixed \n and \r\n in one file). This is common and ideal.
#> [05] Skipping initial rows if needed
#>   Positioned on line 1 starting: <<vendor_id,pickup_datetime,drop>>
#> [06] Detect separator, quoting rule, and ncolumns
#>   Detecting sep automatically ...
#>   sep=','  with 98 lines of 18 fields using quote rule 0
#>   Detected 18 columns on line 3. This line is either column names or first data row. Line starts as: <<CMT,2010-03-22 17:05:03,2010-0>>
#>   Quote rule picked = 0
#>   fill=false and the most number of columns found is 18
#> [07] Detect column types, good nrow estimate and whether first row is column names
#>   Number of sampling jump points = 100 because nrow limit (2955) supplied
#>   Type codes (jump 000)    : DCC68886688D888888  Quote rule 0
#>   'header' determined to be false because there are some number columns and those columns do not have a string field at the top of them
#>   =====
#>   Sampled 101 rows (handled \n inside quoted fields) at 1 jump points
#>   Bytes from first data row on line 3 to the end of last row: 2366707034
#>   Line length: mean=184.28 sd=26.63 min=98 max=220
#>   Estimated number of rows: 2366707034 / 184.28 = 12843188
#>   Initial alloc = 18062772 rows (12843188 + 40%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
#>   =====
#>   Alloc limited to lower nrows=2955 passed in.
#> [08] Assign column names
#> [09] Apply user overrides on column types
#>   After 0 type and 0 drop user overrides : DCC68886688D888888
#> [10] Allocate memory for the datatable
#>   Allocating 18 column slots (18 - 0 dropped) with 2955 rows
#> [11] Read the data
#>   jumps=[0..1), chunk_size=1048576, total_size=2366707206
#> Read 2955 rows x 18 columns from 2.204GB (2366707460 bytes) file in 00:00.167 wall clock time
#> [12] Finalizing the datatable
#>   Type counts:
#>          3 : int32     '6'
#>         11 : float64   '8'
#>          2 : float64   'C'
#>          2 : string    'D'
#> =============================
#>    0.161s ( 97%) Memory map 2.204GB file
#>    0.001s (  1%) sep=',' ncol=18 and header detection
#>    0.000s (  0%) Column type detection using 101 sample rows
#>    0.000s (  0%) Allocation of 2955 rows x 18 cols (0.000GB) of which 2955 (100%) rows used
#>    0.004s (  2%) Reading 1 chunks (0 swept) of 1.000MB (each chunk 2955 rows) using 1 threads
#>    +    0.003s (  2%) Parse to row-major thread buffers (grown 0 times)
#>    +    0.001s (  0%) Transpose
#>    +    0.000s (  0%) Waiting
#>    0.000s (  0%) Rereading 0 columns due to out-of-sample type exceptions
#>    0.167s        Total

dt2956 = fread("trip_data.csv", skip = 2955, nrows = 10, verbose = T)
#>   OpenMP version (_OPENMP)       201511
#>   omp_get_num_procs()            4
#>   R_DATATABLE_NUM_PROCS_PERCENT  unset (default 50)
#>   R_DATATABLE_NUM_THREADS        unset
#>   R_DATATABLE_THROTTLE           unset (default 1024)
#>   omp_get_thread_limit()         2147483647
#>   omp_get_max_threads()          4
#>   OMP_THREAD_LIMIT               unset
#>   OMP_NUM_THREADS                unset
#>   RestoreAfterFork               true
#>   data.table is using 2 threads with throttle==1024. See ?setDTthreads.
#> freadR.c has been passed a filename: /Users/vinceth/Documents/_Office/Uni/TCD/_research/2_bikes/bicycling-cleaner-cities/2_data/1_raw/taxis/tmp/2010-03_y_keep.csv
#> [01] Check arguments
#>   Using 2 threads (omp_get_max_threads()=4, nth=2)
#>   NAstrings = [<<NA>>]
#>   None of the NAstrings look like numbers.
#>   skip num lines = 2955
#>   show progress = 0
#>   0/1 column will be read as integer
#>   Opening file trip_data.csv
#>   File opened, size = 2.204GB (2366707460 bytes).
#>   Memory mapped ok
#> [03] Detect and skip BOM
#> [04] Arrange mmap to be \0 terminated
#>   \n has been found in the input and different lines can end with different line endings (e.g. mixed \n and \r\n in one file). This is common and ideal.
#> [05] Skipping initial rows if needed
#>   Skipped to line 2956 in the file  Positioned on line 2956 starting: <<CMT,2010-03-07 22:59:01,2010-0>>
#> [06] Detect separator, quoting rule, and ncolumns
#>   Detecting sep automatically ...
#>   sep=','  with 2 lines of 18 fields using quote rule 0
#>   sep=' '  with 10 lines of 3 fields using quote rule 0
#>   Detected 3 columns on line 2956. This line is either column names or first data row. Line starts as: <<CMT,2010-03-07 22:59:01,2010-0>>
#>   Quote rule picked = 0
#>   fill=false and the most number of columns found is 3
#> [07] Detect column types, good nrow estimate and whether first row is column names
#>   Number of sampling jump points = 100 because nrow limit (10) supplied
#>   A line with too-many fields (3/3) was found on line 10 of sample jump 0. 
#>   Type codes (jump 000)    : DDD  Quote rule 0
#>   'header' determined to be true because all columns are type string and a better guess is not possible
#>   All rows were sampled since file is small so we know nrow=9 exactly
#> [08] Assign column names
#> [09] Apply user overrides on column types
#>   After 0 type and 0 drop user overrides : DDD
#> [10] Allocate memory for the datatable
#>   Allocating 3 column slots (3 - 0 dropped) with 9 rows
#> [11] Read the data
#>   jumps=[0..1), chunk_size=1048576, total_size=2366167046
#>   Restarting team from jump 0. nSwept==0 quoteRule==1
#>   jumps=[0..1), chunk_size=1048576, total_size=2366167046
#>   Restarting team from jump 0. nSwept==0 quoteRule==2
#>   jumps=[0..1), chunk_size=1048576, total_size=2366167046
#>   Restarting team from jump 0. nSwept==0 quoteRule==3
#>   jumps=[0..1), chunk_size=1048576, total_size=2366167046
#> Read 9 rows x 3 columns from 2.204GB (2366707460 bytes) file in 00:00.005 wall clock time
#> [12] Finalizing the datatable
#>   Type counts:
#>          3 : string    'D'
#> Error in fread("trip_data.csv", skip = 2955, nrows = 10, verbose = T): R character strings are limited to 2^31-1 bytes

lines = readLines("trip_data.csv")
lines[2955:2965]
#>  [1] "CMT,2010-03-07 18:37:05,2010-03-07 18:41:51,1,1,-73.984211000000002,40.743720000000003,1,0,-73.974515999999994,40.748331,Cre,4.9000000000000004,0,0.5,1.0800000000000001,0,6.4800000000000004"            
#>  [2] "CMT,2010-03-07 22:59:01,2010-03-07 23:01:04,1,0.59999999999999998,-73.992887999999994,40.703017000000003,1,0,-73.992887999999994,40.703017000000003,Cre,3.7000000000000002,0.5,0.5,2,0,6.7000000000000002"
#>  [3] "CMT,2010-03-01 09:31:15,2010-03-01 09:38:48,1,1,-73.992148999999998,40.749791000000002,1,0,-73.992176999999998,40.738518999999997,Cre,6.0999999999999996,0,0.5,1,0,7.5999999999999996"                    
#>  [4] "CMT,2010-03-07 03:46:42,2010-03-07 03:58:31,1,3.6000000000000001,-73.961027000000001,40.796674000000003,1,,,-73.937324000000004,40.839283000000002,Cas,10.9,0.5,0.5,0,0,11.9"                             
#>  [5] "CMT,2010-03-07 01:22:59,2010-03-07 01:27:19,1,0.69999999999999996,-73.982457999999994,40.735827999999998,1,0,-73.988750999999993,40.727192000000002,Cas,4.5,0.5,0.5,0,0,5.5"                              
#>  [6] "CMT,2010-03-06 18:00:42,2010-03-06 18:17:30,1,2.6000000000000001,-73.982532000000006,40.742524000000003,1,0,-73.990739000000005,40.716983999999997,Cre,10.5,0,0.5,2.2000000000000002,0,13.199999999999999"
#>  [7] "CMT,2010-03-07 11:49:52,2010-03-07 12:03:29,1,2.6000000000000001,-73.987218999999996,40.729304999999997,1,0,-73.989705999999998,40.757075,Cas,10.1,0,0.5,0,0,10.6"                                        
#>  [8] "CMT,2010-03-07 09:48:29,2010-03-07 09:52:34,2,1,-73.982061000000002,40.783313,1,0,-73.970888000000002,40.793388999999998,Cre,4.9000000000000004,0,0.5,0.81000000000000005,0,6.21"                         
#>  [9] "CMT,2010-03-06 19:47:56,2010-03-06 20:01:26,1,2.7999999999999998,-73.944688999999997,40.779978,1,0,-73.977885000000001,40.762656,Cas,10.5,0,0.5,0,0,11"                                                   
#> [10] "CMT,2010-03-07 12:04:35,2010-03-07 12:20:07,1,5.0999999999999996,-73.996803999999997,40.737890999999998,1,0,-73.972151999999994,40.794823000000001,Cre,14.9,0,0.5,2.3100000000000001,0,17.710000000000001"
#> [11] "CMT,2010-03-07 14:15:28,2010-03-07 14:20:24,2,1.3,-74.001530000000002,40.751294000000001,1,0,-73.993701000000001,40.767063999999998,Cas,5.7000000000000002,0,0.5,0,0,6.2000000000000002"
# there are more problematic lines down the line
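Rather than inspecting the file one slice at a time with readLines, base R's count.fields() can locate every malformed row in one pass. A minimal sketch on an inline example (for the file above, read "trip_data.csv" and expect 18 fields):

```r
# Count the comma-separated fields on every line; lines whose count differs
# from the expected one are the malformed rows. Demonstrated on an inline
# 3-column example so the snippet is self-contained.
csv <- c("a,b,c", "1,2,3", "1,2,3,4", "1,2,3")
n_fields <- count.fields(textConnection(csv), sep = ",")
bad <- which(n_fields != 3)
bad  # line 3 carries an extra field
```

Applied to the taxi CSV, `which(count.fields("trip_data.csv", sep = ",") != 18)` would list all offending line numbers at once.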

df = read.csv("trip_data.csv")
setDT(df)

Created on 2022-02-17 by the reprex package (v2.0.1)

veritolilo commented 2 years ago

I am experiencing the same problem in my code. It had been working well since last year, but it stopped on 17 February 2022, so that is the last day I could retrieve the data. I have daily exports of data, and I am using fread on a .gz file. It would be nice to get some help regarding this issue.

wlogs <- fread("~/cron_scripts/data/petrel_wlogs_tvd_tvdgl_export_04.08.21.gz")

Error in fread("~/cron_scripts/data/petrel_wlogs_tvd_tvdgl_export_04.08.21.gz") : R character strings are limited to 2^31-1 bytes

veritolilo commented 2 years ago

I solved my problem by changing this option of the data.table library:

library(data.table); setDTthreads(percent = 65)

Maybe this could help you as well.

ben-schwen commented 2 years ago

@veritolilo which R version, which data.table version and which OS are you using (sessionInfo())?

veritolilo commented 2 years ago

Hi @ben-schwen, I am using:
R version 4.1.2 (2021-11-01)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.3 LTS
data.table version 1.14.2

dernapo commented 2 years ago

Hi,

Same issue here with a file of 18M rows and 15 columns; error:

Error in data.table::fread(file = file_path, stringsAsFactors = FALSE,  : 
  R character strings are limited to 2^31-1 bytes
> sessionInfo()
R version 4.1.2 (2021-11-01)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.4 LTS

and data.table_1.14.2

The solution from @veritolilo, setDTthreads(percent = 65), didn't work for me.

Input contains no \n. Taking this to be a filename to open
[01] Check arguments
  Using 10 threads (omp_get_max_threads()=48, nth=10)
  NAstrings = [<<NA>>]
  None of the NAstrings look like numbers.
  show progress = 0
  0/1 column will be read as integer
[02] Opening the file
  Opening file /srv/data/tmp/Rtmpflksnw/file2aea9a7a29d480
  File opened, size = 20.59GB (22104323941 bytes).
  Memory mapped ok
[03] Detect and skip BOM
[04] Arrange mmap to be \0 terminated
  \n has been found in the input and different lines can end with different line endings (e.g. mixed \n and \r\n in one file). This is common and ideal.
  File ends abruptly with '1'. Final end-of-line is missing. Using cow page to write 0 to the last byte.
[05] Skipping initial rows if needed
  Positioned on line 1 starting: <<activityIndex14days,additional>>
[06] Detect separator, quoting rule, and ncolumns
  Detecting sep automatically ...
  sep=','  with 100 lines of 340 fields using quote rule 0
  Detected 340 columns on line 1. This line is either column names or first data row. Line starts as: <<activityIndex14days,additional>>
  Quote rule picked = 0
  fill=false and the most number of columns found is 340
[07] Detect column types, good nrow estimate and whether first row is column names
  'header' changed by user from 'auto' to true
  Number of sampling jump points = 100 because (22104323941 bytes from row 1 to eof) / (2 * 147887 jump0size) == 74733
  Type codes (jump 000)    : 52653233CCCCC3652222252222222222222263CCCC26CC53CCCCC3232233C5322322CCCCCC323256...555CC33333  Quote rule 0
  Type codes (jump 001)    : 52653233CCCCC3652CCC55C2C22C22C222C563CCCC26CC53CCCCC3232C33C5322322CCCCCC323C56...555CC33333  Quote rule 0
  Type codes (jump 002)    : 52653233CCCCC3652CCC55C2C22C22C22CC563CCCC26CC53CCCCC3232C33C53223C2CCCCCC323C56...555CC33333  Quote rule 0
  Type codes (jump 004)    : 52653233CCCCC3652CCC55C2C22C22C22CC563CCCC26CC53CCCCC3232C33C53C23C2CCCCCC323C56...555CC33333  Quote rule 0
  Type codes (jump 006)    : 52653233CCCCC3652CCC55C2C22C22C22CC563CCCC26CC53CCCCC3232C33C53C23C2CCCCCC323C56...555CC33333  Quote rule 0
  Type codes (jump 008)    : 52653233CCCCC3652CCC55C2C22C22C22CC563CCCC26CC53CCCCC3232C33C53C23C2CCCCCC323C56...555CC33333  Quote rule 0
  Type codes (jump 010)    : 52653233CCCCC3652CCC55C2C22C22C22CC563CCCC26CC53CCCCC3232C33C53C23C2CCCCCC323C56...555CC33333  Quote rule 0
  Type codes (jump 020)    : 52653233CCCCC3652CCC55C2C22C22C22CC563CCCC26CC53CCCCC3232C33C53C23C2CCCCCC323C56...555CC33333  Quote rule 0
  Type codes (jump 026)    : 52653233CCCCC3652CCC55C2C22C22C22CC563CCCC26CC53CCCCC3232C33C53C23C2CCCCCC323C56...555CC33333  Quote rule 0
  Type codes (jump 072)    : 52653233CCCCC3652CCC55C2C22C22C22CC563CCCC26CC53CCCCC3232C33C53C23C2CCCCCC323C56...555CC33333  Quote rule 0
  Type codes (jump 090)    : 52653233CCCCC3652CCC55C2C22C22C22CC563CCCC26CC53CCCCC3232C33C53C23CCCCCCCC323C56...555CC33333  Quote rule 0
  A line with too-few fields (144/340) was found on line 40 of sample jump 100. Most likely this jump landed awkwardly so type bumps here will be skipped.
  Type codes (jump 100)    : 52653233CCCCC3652CCC55C2C22C22C22CC563CCCC26CC53CCCCC3232C33C53C23CCCCCCCC323C56...555CC33333  Quote rule 0
  =====
  Sampled 10039 rows (handled \n inside quoted fields) at 101 jump points
  Bytes from first data row on line 2 to the end of last row: 22104317786
  Line length: mean=1586.72 sd=329.40 min=708 max=11787
  Estimated number of rows: 22104317786 / 1586.72 = 13930808
  Initial alloc = 23821120 rows (13930808 + 70%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
  =====
[08] Assign column names
[09] Apply user overrides on column types
  After 1 type and 325 drop user overrides : 00000000000000000000000000000000000000000000000000000000000000000000000000000000...0550000000
[10] Allocate memory for the datatable
  Allocating 15 column slots (340 - 325 dropped) with 23821120 rows
[11] Read the data
  jumps=[0..13930), chunk_size=1586813, total_size=22104317786
  Restarting team from jump 13929. nSwept==0 quoteRule==1
  jumps=[13929..13930), chunk_size=1586813, total_size=22104317786
  Restarting team from jump 13929. nSwept==0 quoteRule==2
  jumps=[13929..13930), chunk_size=1586813, total_size=22104317786
  Restarting team from jump 13929. nSwept==0 quoteRule==3
  jumps=[13929..13930), chunk_size=1586813, total_size=22104317786
  1 out-of-sample type bumps: 00000000000000000000000000000000000000000000000000000000000000000000000000000000...0550000000
  jumps=[0..13930), chunk_size=1586813, total_size=22104317786
Read 0 rows x 15 columns from 20.59GB (22104323941 bytes) file in 01:24.602 wall clock time
[12] Finalizing the datatable
  Type counts:
       325 : drop      '0'
         2 : bool8     '3'
         5 : int32     '5'
         3 : float64   '7'
         5 : string    'C'
Error in data.table::fread(file = file_path, stringsAsFactors = FALSE,  : 
  R character strings are limited to 2^31-1 bytes
tbbarr commented 2 years ago

For me the issue was that the csv file I was reading (| delimited) included my delimiter in rows and hence it appears like certain rows have more than the expected number of columns.

data.table seems to error out in these situations, but I'm not really sure why. The fill=TRUE option should, I think, handle this, but I guess the issue is that data.table expects a given number of columns to start with, and only later in the file does it find a row with more fields than expected, so it doesn't know to fill them?

readr::read_delim and Python's pandas.read_csv() both detect these rows and give you options to handle them. readr simply drops everything that comes after the last expected delimiter, e.g., if it expects 10 columns and finds a delimiter for an 11th, it ignores it and gives you a warning. pandas lets you skip these rows, but it doesn't seem to offer a way to discard only the surplus fields after the last expected delimiter.
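A minimal sketch of the readr behaviour described above, assuming the readr package is installed; the inline two-column, pipe-delimited input is hypothetical:

```r
library(readr)  # assumed installed
# A row with a surplus field is kept, the extra field is dropped, and a
# parsing problem is recorded instead of the whole read aborting.
x <- read_delim(I("a|b\n1|2\n3|4|5\n"), delim = "|", show_col_types = FALSE)
problems(x)  # reports the row that had 3 fields where 2 were expected
nrow(x)      # both data rows survive
```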

yan-foto commented 2 years ago

For me the issue was that the csv file I was reading (| delimited) included my delimiter in rows and hence it appears like certain rows have more than the expected number of columns.

This was the correct cue for me. It seemed that some rows were missing a number of columns confusing data.table. I simply removed those lines (that didn't have 57 columns in my case):

zcat dt.gz | awk -F '|' '(NF==57){print;}'

Waldi73 commented 1 year ago

Another possible occurrence: https://stackoverflow.com/q/75438305/13513328