library(data.table)
filename <- 'myfile.csv'
# nrows=1e7 is the call that crashes R with the bus error below
table_first1e7rows <- fread(filename, nrows=1e7)
*** caught bus error ***
address 0x20000180e, cause 'invalid alignment'
Here is the actual output with the actual file, with verbose on:
> library(data.table)
data.table 1.14.10 using 8 threads (see ?getDTthreads). Latest news: r-datatable.com
> fread('/Users/dbg/Library/CloudStorage/Box-Box/tcsl/ngs_data/2023.09.12.illumina_28/out/merged_df.csv.gz', nrows=1e7, verbose=T)
OpenMP version (_OPENMP) 202011
omp_get_num_procs() 16
R_DATATABLE_NUM_PROCS_PERCENT unset (default 50)
R_DATATABLE_NUM_THREADS unset
R_DATATABLE_THROTTLE unset (default 1024)
omp_get_thread_limit() 2147483647
omp_get_max_threads() 16
OMP_THREAD_LIMIT unset
OMP_NUM_THREADS unset
RestoreAfterFork true
data.table is using 8 threads with throttle==1024. See ?setDTthreads.
Input contains no \n. Taking this to be a filename to open
[01] Check arguments
Using 8 threads (omp_get_max_threads()=16, nth=8)
NAstrings = [<<NA>>]
None of the NAstrings look like numbers.
show progress = 1
0/1 column will be read as integer
[02] Opening the file
Opening file /var/folders/r9/rhdgkfwx0msg9ldkp_6ltypr0000gn/T//RtmpkxLujJ/file1018d1112866
File opened, size = 10.67GB (11454682052 bytes).
Memory mapped ok
[03] Detect and skip BOM
[04] Arrange mmap to be \0 terminated
\n has been found in the input and different lines can end with different line endings (e.g. mixed \n and \r\n in one file). This is common and ideal.
[05] Skipping initial rows if needed
Positioned on line 1 starting: <<itam_bc_o,itam_umi_o,costim_bc>>
[06] Detect separator, quoting rule, and ncolumns
Detecting sep automatically ...
sep=',' with 100 lines of 94 fields using quote rule 0
Detected 94 columns on line 1. This line is either column names or first data row. Line starts as: <<itam_bc_o,itam_umi_o,costim_bc>>
Quote rule picked = 0
fill=false and the most number of columns found is 94
[07] Detect column types, good nrow estimate and whether first row is column names
Number of sampling jump points = 100 because nrow limit (10000000) supplied
Type codes (jump 000) : CCCC755555555555555555555555555555555555555555555555555555555555555555555555555555555555555555 Quote rule 0
'header' determined to be true due to column 5 containing a string on row 1 and a lower type (float64) in the rest of the 100 sample rows
All rows were sampled since file is small so we know nrow=100 exactly
[08] Assign column names
[09] Apply user overrides on column types
After 0 type and 0 drop user overrides : CCCC755555555555555555555555555555555555555555555555555555555555555555555555555555555555555555
[10] Allocate memory for the datatable
Allocating 94 column slots (94 - 0 dropped) with 100 rows
[11] Read the data
jumps=[0..1), chunk_size=1048576, total_size=11454676303
Too few rows allocated. Allocating additional 11999900 rows (now nrows=10000000) and continue reading from jump 0
jumps=[0..1), chunk_size=1048576, total_size=11454676303
*** caught bus error ***
address 0x20000180e, cause 'invalid alignment'
Traceback:
1: fread("/Users/dbg/Library/CloudStorage/Box-Box/tcsl/ngs_data/2023.09.12.illumina_28/out/merged_df.csv.gz", nrows = 1e+07, verbose = T)
I have other similarly sized gzipped CSVs (including some that are 20% larger) that work fine, so the crash seems to come from the combination of this particular file and asking nrows for a large number of rows.
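A possible way to narrow this down, sketched under assumptions rather than tested against this file: the verbose output shows fread opening a 10.67GB temporary file instead of the .gz itself, so decompressing the file separately and reading the plain CSV on a single thread would take fread's own decompression step and its multi-threading out of the picture. The helper names decompress_gz and fread_head_plain are hypothetical and not part of data.table; the only data.table call used is fread with its nrows and nThread arguments.

```r
library(data.table)

# Hypothetical helper (not part of data.table): decompress a .gz file to a
# temporary file in fixed-size chunks using base R connections.
decompress_gz <- function(gz_path, chunk_bytes = 64 * 1024^2) {
  out_path <- tempfile(fileext = ".csv")
  con_in <- gzfile(gz_path, open = "rb")
  on.exit(close(con_in), add = TRUE)
  con_out <- file(out_path, open = "wb")
  on.exit(close(con_out), add = TRUE)
  repeat {
    buf <- readBin(con_in, what = "raw", n = chunk_bytes)
    if (length(buf) == 0L) break  # end of the compressed stream
    writeBin(buf, con_out)
  }
  out_path
}

# Hypothetical helper: read the first n rows from an already decompressed
# copy, on a single thread, so neither fread's own decompression step nor
# its multi-threading is involved.
fread_head_plain <- function(gz_path, n) {
  plain_path <- decompress_gz(gz_path)
  on.exit(unlink(plain_path), add = TRUE)  # remove the temporary copy
  fread(plain_path, nrows = n, nThread = 1L)
}
```

Usage would be something like fread_head_plain(filename, 1e7). If that call still crashes, decompression and threading are probably not the cause; if it succeeds, repeating it without nThread = 1L would separate the two.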
I have a large gzipped CSV file with ~40 million rows. I can read the whole file into R fine.
I can also load about 5e6 rows into R fine.
But when I load 1e7 rows, R crashes with the bus error shown above.
Here is a link to the file on Box (700MB).
Here is my session info: