vronizor opened 2 years ago
I am experiencing the same problem in my code. It had been working well since last year until 17 February 2022. I have daily exports of data and they stopped on 17 February, so that is the last day I could retrieve the data. I am using fread on a .gz file in R. It would be nice to get some help regarding this issue.
wlogs <- fread("~/cron_scripts/data/petrel_wlogs_tvd_tvdgl_export_04.08.21.gz")
Error in fread("~/cron_scripts/data/petrel_wlogs_tvd_tvdgl_export_04.08.21.gz") :
  R character strings are limited to 2^31-1 bytes
I solved my problem by changing this option of the data.table library:
library(data.table); setDTthreads(percent = 65)
Maybe this could help you as well.
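A minimal sketch of this workaround, reusing the file path from my comment above (65 is just the percentage that happened to work for me, not a tuned value):

library(data.table)
setDTthreads(percent = 65)   # cap fread at 65% of the available threads
getDTthreads()               # confirm the effective thread count
wlogs <- fread("~/cron_scripts/data/petrel_wlogs_tvd_tvdgl_export_04.08.21.gz")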
@veritolilo which R version, which data.table version and which OS are you using (sessionInfo())?
Hi @ben-schwen, I am using:
R version 4.1.2 (2021-11-01)
Platform: x86_64-pc-linux-gnu (64-bit)
Ubuntu 20.04.3 LTS
data.table version 1.14.2
Hi,
same issue here with a file of 18M rows, 15 cols; error:
Error in data.table::fread(file = file_path, stringsAsFactors = FALSE, :
  R character strings are limited to 2^31-1 bytes
> sessionInfo()
R version 4.1.2 (2021-11-01)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.4 LTS
and data.table_1.14.2
The solution from @veritolilo, setDTthreads(percent = 65), didn't work for me.
Input contains no \n. Taking this to be a filename to open
[01] Check arguments
Using 10 threads (omp_get_max_threads()=48, nth=10)
NAstrings = [<<NA>>]
None of the NAstrings look like numbers.
show progress = 0
0/1 column will be read as integer
[02] Opening the file
Opening file /srv/data/tmp/Rtmpflksnw/file2aea9a7a29d480
File opened, size = 20.59GB (22104323941 bytes).
Memory mapped ok
[03] Detect and skip BOM
[04] Arrange mmap to be \0 terminated
\n has been found in the input and different lines can end with different line endings (e.g. mixed \n and \r\n in one file). This is common and ideal.
File ends abruptly with '1'. Final end-of-line is missing. Using cow page to write 0 to the last byte.
[05] Skipping initial rows if needed
Positioned on line 1 starting: <<activityIndex14days,additional>>
[06] Detect separator, quoting rule, and ncolumns
Detecting sep automatically ...
sep=',' with 100 lines of 340 fields using quote rule 0
Detected 340 columns on line 1. This line is either column names or first data row. Line starts as: <<activityIndex14days,additional>>
Quote rule picked = 0
fill=false and the most number of columns found is 340
[07] Detect column types, good nrow estimate and whether first row is column names
'header' changed by user from 'auto' to true
Number of sampling jump points = 100 because (22104323941 bytes from row 1 to eof) / (2 * 147887 jump0size) == 74733
Type codes (jump 000) : 52653233CCCCC3652222252222222222222263CCCC26CC53CCCCC3232233C5322322CCCCCC323256...555CC33333 Quote rule 0
Type codes (jump 001) : 52653233CCCCC3652CCC55C2C22C22C222C563CCCC26CC53CCCCC3232C33C5322322CCCCCC323C56...555CC33333 Quote rule 0
Type codes (jump 002) : 52653233CCCCC3652CCC55C2C22C22C22CC563CCCC26CC53CCCCC3232C33C53223C2CCCCCC323C56...555CC33333 Quote rule 0
Type codes (jump 004) : 52653233CCCCC3652CCC55C2C22C22C22CC563CCCC26CC53CCCCC3232C33C53C23C2CCCCCC323C56...555CC33333 Quote rule 0
Type codes (jump 006) : 52653233CCCCC3652CCC55C2C22C22C22CC563CCCC26CC53CCCCC3232C33C53C23C2CCCCCC323C56...555CC33333 Quote rule 0
Type codes (jump 008) : 52653233CCCCC3652CCC55C2C22C22C22CC563CCCC26CC53CCCCC3232C33C53C23C2CCCCCC323C56...555CC33333 Quote rule 0
Type codes (jump 010) : 52653233CCCCC3652CCC55C2C22C22C22CC563CCCC26CC53CCCCC3232C33C53C23C2CCCCCC323C56...555CC33333 Quote rule 0
Type codes (jump 020) : 52653233CCCCC3652CCC55C2C22C22C22CC563CCCC26CC53CCCCC3232C33C53C23C2CCCCCC323C56...555CC33333 Quote rule 0
Type codes (jump 026) : 52653233CCCCC3652CCC55C2C22C22C22CC563CCCC26CC53CCCCC3232C33C53C23C2CCCCCC323C56...555CC33333 Quote rule 0
Type codes (jump 072) : 52653233CCCCC3652CCC55C2C22C22C22CC563CCCC26CC53CCCCC3232C33C53C23C2CCCCCC323C56...555CC33333 Quote rule 0
Type codes (jump 090) : 52653233CCCCC3652CCC55C2C22C22C22CC563CCCC26CC53CCCCC3232C33C53C23CCCCCCCC323C56...555CC33333 Quote rule 0
A line with too-few fields (144/340) was found on line 40 of sample jump 100. Most likely this jump landed awkwardly so type bumps here will be skipped.
Type codes (jump 100) : 52653233CCCCC3652CCC55C2C22C22C22CC563CCCC26CC53CCCCC3232C33C53C23CCCCCCCC323C56...555CC33333 Quote rule 0
=====
Sampled 10039 rows (handled \n inside quoted fields) at 101 jump points
Bytes from first data row on line 2 to the end of last row: 22104317786
Line length: mean=1586.72 sd=329.40 min=708 max=11787
Estimated number of rows: 22104317786 / 1586.72 = 13930808
Initial alloc = 23821120 rows (13930808 + 70%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
=====
[08] Assign column names
[09] Apply user overrides on column types
After 1 type and 325 drop user overrides : 00000000000000000000000000000000000000000000000000000000000000000000000000000000...0550000000
[10] Allocate memory for the datatable
Allocating 15 column slots (340 - 325 dropped) with 23821120 rows
[11] Read the data
jumps=[0..13930), chunk_size=1586813, total_size=22104317786
Restarting team from jump 13929. nSwept==0 quoteRule==1
jumps=[13929..13930), chunk_size=1586813, total_size=22104317786
Restarting team from jump 13929. nSwept==0 quoteRule==2
jumps=[13929..13930), chunk_size=1586813, total_size=22104317786
Restarting team from jump 13929. nSwept==0 quoteRule==3
jumps=[13929..13930), chunk_size=1586813, total_size=22104317786
1 out-of-sample type bumps: 00000000000000000000000000000000000000000000000000000000000000000000000000000000...0550000000
jumps=[0..13930), chunk_size=1586813, total_size=22104317786
Read 0 rows x 15 columns from 20.59GB (22104323941 bytes) file in 01:24.602 wall clock time
[12] Finalizing the datatable
Type counts:
325 : drop '0'
2 : bool8 '3'
5 : int32 '5'
3 : float64 '7'
5 : string 'C'
Error in data.table::fread(file = file_path, stringsAsFactors = FALSE, :
  R character strings are limited to 2^31-1 bytes
For me the issue was that the csv file I was reading (| delimited) included my delimiter inside some rows, so certain rows appear to have more than the expected number of columns. data.table seems to error out in these situations, but I'm not really sure why. The fill=TRUE option should, I think, handle this, but I guess the issue is that data.table expects a given number of columns to start with, and only later in the file finds a row with more fields than expected, so it doesn't know to fill them?
readr::read_delim and Python's pandas.read_csv() both detect these and give you options to handle them. In readr it simply leaves out everything that comes after the last expected delimiter, e.g. if it expects 10 columns and finds a delimiter for an 11th column, it just ignores it and gives you a warning. In pandas it lets you skip these rows, but it doesn't seem like you can "leave out" everything after the last expected delimiter.
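As a hedged sketch of the readr behaviour described above (the file name and delimiter are placeholders, not from this thread):

library(readr)
# read_delim() parses what it can and records ragged rows instead of erroring
df <- read_delim("data.psv", delim = "|", show_col_types = FALSE)
# inspect the rows where the field count didn't match the header
problems(df)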
For me the issue was that the csv file I was reading (| delimited) included my delimiter in rows and hence it appears like certain rows have more than the expected number of columns.
This was the correct cue for me. It seemed that some rows were missing a number of columns, confusing data.table. I simply removed the lines that didn't have 57 columns (in my case):
zcat dt.gz | awk -F '|' '(NF==57){print;}'
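The same check can be done from R before reading; a sketch assuming the same pipe-delimited, 57-column gzipped file (quote = "" disables quote handling, matching the awk approach, so it miscounts if fields are legitimately quoted):

# count the fields on every line of the compressed file
nf <- count.fields(gzfile("dt.gz"), sep = "|", quote = "")
table(nf)        # 57 should dominate; any other count marks a bad row
which(nf != 57)  # line numbers to inspect or drop before fread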
Another possible occurrence: https://stackoverflow.com/q/75438305/13513328
I've been trying to open this large CSV from the New York City Taxi Commission. fread errors out saying that I reached the character strings limit. I've posted a question on SO, but the workaround does not cover all the problematic lines (i.e., I keep bumping into the same error even after discarding the first problematic line). I tried fill = T but I still get incorrect readings. Interestingly, and contrary to the output provided by the answerer on SO, I do not get a "Stopped early on line 2958. Expected 18 fields but found 19." error, but it does appear to be related to the number of columns encountered on some lines. Seems related to #1812, #4130, possibly #5119.
Created on 2022-02-17 by the reprex package (v2.0.1)
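One way to combine the awk filter from earlier in this thread with fread directly is fread's cmd argument; a sketch assuming a comma-delimited file with 18 expected fields (the file name is a placeholder, and NF will miscount if quoted fields contain commas):

library(data.table)
# let a shell pipeline drop ragged rows before fread parses the stream
dt <- fread(cmd = "awk -F ',' 'NF == 18' taxi.csv")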