Open gorne opened 5 years ago
Thank you for the report.
Confirmed for Windows too.
The tables() call seems to be a red herring. One can reliably trigger the segfault with verbose = TRUE:
omp_get_max_threads() = 12
omp_get_thread_limit() = 2147483647
DTthreads = 0
RestoreAfterFork = true
Input contains no \n. Taking this to be a filename to open
[01] Check arguments
Using 12 threads (omp_get_max_threads()=12, nth=12)
NAstrings = [<<NA>>]
None of the NAstrings look like numbers.
show progress = 0
0/1 column will be read as integer
[02] Opening the file
Opening file 4636.txt
File opened, size = 1.253GB (1345078529 bytes).
Memory mapped ok
[03] Detect and skip BOM
[04] Arrange mmap to be \0 terminated
\n has been found in the input and different lines can end with different line endings (e.g. mixed \n and \r\n in one file). This is common and ideal.
[05] Skipping initial rows if needed
Positioned on line 1 starting: <<LastName FirstName DatasetID D>>
[06] Detect separator, quoting rule, and ncolumns
Detecting sep automatically ...
sep=',' with 34 lines of 9 fields using quote rule 0
sep=0x9 with 100 lines of 28 fields using quote rule 0
Detected 28 columns on line 1. This line is either column names or first data row. Line starts as: <<LastName FirstName DatasetID D>>
Quote rule picked = 0
fill=false and the most number of columns found is 28
[07] Detect column types, good nrow estimate and whether first row is column names
Number of sampling jump points = 100 because (1345078527 bytes from row 1 to eof) / (2 * 55365 jump0size) == 12147
Type codes (jump 000) : AA5AA5A555A5AAAAA7A27A257AA2 Quote rule 0
Type codes (jump 084) : AA5AA5A555A5AAAAA7A57A257AA2 Quote rule 0
Type codes (jump 100) : AA5AA5A555A5AAAAA7A57A257AA2 Quote rule 0
'header' determined to be true due to column 3 containing a string on row 1 and a lower type (int32) in the rest of the 10046 sample rows
=====
Sampled 10046 rows (handled \n inside quoted fields) at 101 jump points
Bytes from first data row on line 2 to the end of last row: 1345078213
Line length: mean=570.01 sd=203.15 min=173 max=1307
Estimated number of rows: 1345078213 / 570.01 = 2359764
Initial alloc = 4719528 rows (2359764 + 100%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
=====
[08] Assign column names
[09] Apply user overrides on column types
After 0 type and 0 drop user overrides : AA5AA5A555A5AAAAA7A57A257AA2
[10] Allocate memory for the datatable
Allocating 28 column slots (28 - 0 dropped) with 4719528 rows
[11] Read the data
jumps=[0..1284), chunk_size=1047568, total_size=1345078213
Restarting team from jump 1283. nSwept==0 quoteRule==1
jumps=[1283..1284), chunk_size=1047568, total_size=1345078213
Restarting team from jump 1283. nSwept==0 quoteRule==2
jumps=[1283..1284), chunk_size=1047568, total_size=1345078213
Restarting team from jump 1283. nSwept==0 quoteRule==3
jumps=[1283..1284), chunk_size=1047568, total_size=1345078213
jumps=[0..1284), chunk_size=1047568, total_size=1345078213
Read 2640813 rows x 28 columns from 1.253GB (1345078529 bytes) file in 00:04.754 wall clock time
[12] Finalizing the datatable
Type counts:
1 : bool8 '2'
9 : int32 '5'
3 : float64 '7'
15 : string 'A'
=============================
0.001s ( 0%) Memory map 1.253GB file
0.012s ( 0%) sep='\t' ncol=28 and header detection
0.000s ( 0%) Column type detection using 10046 sample rows
0.701s ( 15%) Allocation of 4719528 rows x 28 cols (0.809GB) of which 2640813 ( 56%) rows used
4.041s ( 85%) Reading 1284 chunks (0 swept) of 0.999MB (each chunk 2056 rows) using 12 threads
+ 0.437s ( 9%) Parse to row-major thread buffers (grown 34 times)
+ 2.633s ( 55%) Transpose
+ 0.971s ( 20%) Waiting
1.176s ( 25%) Rereading 1 columns due to out-of-sample type exceptions
4.754s Total
Column 23 ("RelUncertaintyPercent") bumped from 'bool8' to 'int32' due to <<50>> on row 665348
Dear HughParsonage,
Do you know how to fix the problem?
I posted the tables() error message as an example, but the segfault appears with almost all operations, such as:
tail(data)
or
table(data$UnitName)
or
TRYdata2 <- TRYdata[!is.na(TRYdata$TraitID),]
So I cannot work with the database, even though it worked perfectly just a few months ago. Any help would be appreciated.
Hi gorne,
No, sorry, I don't know how to fix it. It's definitely a bug, so thank you for reporting it. Hopefully someone with more experience with the fread C code can fix it. Once the maintainers have time to look into it, a fix will be applied.
You may wish to consider the following temporary fix (noting the warning about the blank penultimate line).
fread("4636.txt",
sep = "\t",
colClasses = list("integer" = c("RelUncertaintyPercent", "Replicates")),
quote = "")
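As a minimal sketch of what the colClasses override above does, the following uses a tiny made-up file rather than 4636.txt (the rows and values are hypothetical, and only the two column names are taken from the thread). On a file this small fread sees every row during type detection, so the sketch only illustrates the syntax and the resulting column classes; it does not reproduce the segfault.

```r
library(data.table)

# Toy tab-separated file standing in for 4636.txt (hypothetical values).
toy <- tempfile(fileext = ".txt")
writeLines(c(
  "LastName\tRelUncertaintyPercent\tReplicates",
  "Smith\t0\t1",
  "Jones\t1\t2",
  "Atkin\t50\t3"
), toy)

# Force both columns to integer up front instead of relying on
# type detection from sampled rows.
dt <- fread(toy,
            sep = "\t",
            quote = "",
            colClasses = list(integer = c("RelUncertaintyPercent", "Replicates")))

# Inspect the classes that were actually assigned.
sapply(dt, class)
```

Declaring the type up front means fread does not have to change the column's type partway through the file, which is the step reported in the verbose output as "Rereading 1 columns due to out-of-sample type exceptions".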
Thank you very much, it seems to work.
@HughParsonage For me your command emits the following warning:
Warning message:
In fread("~/Downloads/4636.txt", sep = "\t", colClasses = list(integer = c("RelUncertaintyPercent", :
Discarded single-line footer: <<Atkin Owen 286 Global Respiration Database Acacia boriensis 200160 Acacia boboensis 2460341 23503718 40 Leaf photosynthesis rate per leaf dry mass 45 Photosynthesis per leaf dry mass (Amass) Am_sat 0.148726655348048 micro mol g-1 s-1 1.27766 Atkin OK, KJ Bloomfield, PB Reich, MG Tjoelker, GP Asner, D Bonal, G B?nisch, M Bradford, LA Cernusak, EG Cosio, D Creek, KY Crous, T Domingues, JS Dukes, JJG Egerton, JR Evans, GD Farquhar, NM Fyllas, PPG Gauthier, E Gloor, TE Gimeno, K. Griffin, R >>
even though that last line is not a "single-line footer" but a real data record. As far as I can see it is valid too; it even has the same 28 fields as the rest of the file.
That line can be read separately via
> fread("tail -n 1 ~/Downloads/4636.txt", sep='\t', header=FALSE)
and then rbind-ed to the main data.table.
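A minimal sketch of that read-the-last-line-and-rbind approach, on a small made-up file (hypothetical rows; the real file has 28 columns). It assumes a Unix-like system with `tail` on the PATH and a data.table version that supports the `cmd=` argument of fread. The dropped footer is simulated here with `nrows`, since fread reads the toy file completely.

```r
library(data.table)

# Toy tab-separated file standing in for 4636.txt (hypothetical values).
toy <- tempfile(fileext = ".txt")
writeLines(c(
  "LastName\tDatasetID\tValue",
  "Smith\t1\t1.5",
  "Jones\t2\t2.5",
  "Atkin\t286\t0.15"
), toy)

# Simulate the discarded footer by reading only the first two data rows.
main <- fread(toy, sep = "\t", quote = "", nrows = 2)

# Read the final line on its own; it has no header, so columns arrive as V1, V2, ...
last <- fread(cmd = paste("tail -n 1", shQuote(toy)),
              sep = "\t", quote = "", header = FALSE)

# Give the single row the main table's column names, then append it.
setnames(last, names(main))
combined <- rbind(main, last)
```

If a column's type in the single-row table differs from the main table, rbind may coerce it, so it is worth comparing `sapply(main, class)` with `sapply(last, class)` before appending.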
The following code samples seem to show that subset is also vulnerable to this segfault bug. Toggle code params accordingly. Rolling back to version 1.11.8 fixes this particular problem under a few environments tested.
PR #3469 (v1.12.2) catches the malformed data.table and avoids the segfaults. PR #3471 (v1.12.4) unpacks a data.frame column that is itself a data.frame, which resolves the malformed data.table behind the problem @drewabbot reported just above.
I reproduced the fread problem and traced it to the rereading of the last line of the file. I couldn't see a quick fix, so I will postpone it from this release since it's a relatively rare case.
@gorne This doesn't reproduce for me on 1.14.5 dev. Can you please verify that it is fixed? (note @jangorecki)
###########################The problem
I cannot provide a reduced example because I do not know where the problem is. So I am giving you the entire database and a few lines where the problem appears. A few months ago I worked with the same database without any problems.
The database is 1.3 GB (1,345,078,529 bytes). The database file is "4636.txt" or "here"
library(data.table)
data <- fread("4636.txt", header = TRUE, sep = "\t", dec = ".", quote = "", data.table = TRUE)
tables()
It produces the following message:
###########################Output of sessionInfo()
sessionInfo()
###########################But I also tried with:
########################### When I ran the same script under valgrind I obtained:
library(data.table)
data <- fread("4636.txt", header = TRUE, sep = "\t", dec = ".", quote = "", data.table = TRUE)
tables()