Rdatatable / data.table

R's data.table package extends data.frame:
http://r-datatable.com
Mozilla Public License 2.0

caught segfault with fread() and tables(), memory not mapped (with TRY plant trait database) #3369

Open gorne opened 5 years ago

gorne commented 5 years ago

########################### The problem
I cannot provide a reduced example because I do not know where the problem is, so I am providing the entire database and a few lines where the problem appears. A few months ago the same code worked perfectly with the same database.

The database is 1.3 GB (1,345,078,529 bytes). The database file is "4636.txt" or "here".

library(data.table)
data <- fread("4636.txt", header = T, sep = "\t", dec = ".", quote = "", data.table = T)
tables()

It produces the following message:

caught segfault address (nil), cause 'memory not mapped'

Traceback:
 1: structure(.Call(C_objectSize, x), class = "object_size")
 2: object.size(DT)
 3: set(info_i, , "MB", round(as.numeric(object.size(DT))/1024^2))
 4: FUN(X[[i]], ...)
 5: lapply(DT_names, function(dt_n) { DT = get(dt_n, envir = env) info_i = data.table(NAME = dt_n, NROW = nrow(DT), NCOL = ncol(DT)) if (mb) set(info_i, , "MB", round(as.numeric(object.size(DT))/1024^2)) set(info_i, , "COLS", list(list(names(DT)))) set(info_i, , "KEY", list(list(key(DT)))) if (index) set(info_i, , "INDICES", list(list(indices(DT)))) info_i})
 6: rbindlist(lapply(DT_names, function(dt_n) { DT = get(dt_n, envir = env) info_i = data.table(NAME = dt_n, NROW = nrow(DT), NCOL = ncol(DT)) if (mb) set(info_i, , "MB", round(as.numeric(object.size(DT))/1024^2)) set(info_i, , "COLS", list(list(names(DT)))) set(info_i, , "KEY", list(list(key(DT)))) if (index) set(info_i, , "INDICES", list(list(indices(DT)))) info_i}))
 7: tables()

Possible actions:
1: abort (with core dump, if enabled)
2: normal R exit
3: exit R without saving workspace
4: exit R saving workspace

########################### Output of sessionInfo()
sessionInfo()

R version 3.5.2 (2018-12-20)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.5 LTS

Matrix products: default
BLAS: /usr/lib/libblas/libblas.so.3.6.0
LAPACK: /usr/lib/lapack/liblapack.so.3.6.0

locale:
 [1] LC_CTYPE=es_AR.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=es_AR.UTF-8        LC_COLLATE=es_AR.UTF-8
 [5] LC_MONETARY=es_AR.UTF-8    LC_MESSAGES=es_AR.UTF-8
 [7] LC_PAPER=es_AR.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=es_AR.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] data.table_1.12.0

loaded via a namespace (and not attached):
[1] compiler_3.5.2 tools_3.5.2

########################### I also ran the same script under valgrind and obtained:

library(data.table)
data <- fread("4636.txt", header = T, sep = "\t", dec = ".", quote = "", data.table = T)

==8341== Warning: set address range perms: large range [0x395d8000, 0x8989d000) (defined)
--------------------------------------------------
==================================================
=================================================
==8341== Thread 4:
==8341== Invalid write of size 4
==8341==    at 0xD59A949: memcpy (string3.h:53)
==8341==    by 0xD59A949: pushBuffer (freadR.c:460)
==8341==    by 0xD59090B: freadMain._omp_fn.0 (fread.c:2312)
==8341==    by 0x6F2A43D: ??? (in /usr/lib/x86_64-linux-gnu/libgomp.so.1.0.0)
==8341==    by 0x54B56B9: start_thread (pthread_create.c:333)
==8341==    by 0x57D241C: clone (clone.S:109)
==8341==  Address 0xa4037f20 is 0 bytes after a block of size 10,563,296 alloc'd
==8341==    at 0x4C2DB8F: malloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==8341==    by 0x4FA65EC: Rf_allocVector3 (in /usr/lib/R/lib/libR.so)
==8341==    by 0xD598CF3: allocateDT (freadR.c:362)
==8341==    by 0xD597E53: freadMain (fread.c:2380)
==8341==    by 0xD5998B7: freadR (freadR.c:180)
==8341==    by 0x4F27605: ??? (in /usr/lib/R/lib/libR.so)
==8341==    by 0x4F27D7C: ??? (in /usr/lib/R/lib/libR.so)
==8341==    by 0x4F61867: ??? (in /usr/lib/R/lib/libR.so)
==8341==    by 0x4F6B7AF: Rf_eval (in /usr/lib/R/lib/libR.so)
==8341==    by 0x4F6D110: ??? (in /usr/lib/R/lib/libR.so)
==8341==    by 0x4F6B760: Rf_eval (in /usr/lib/R/lib/libR.so)
==8341==    by 0x4F6FB23: ??? (in /usr/lib/R/lib/libR.so)
==8341==
=|
==8341== Warning: set address range perms: large range [0x395d8000, 0x8989d000) (noaccess)
Warning message:
In fread("4636.txt", header = T, sep = "\t", dec = ".", quote = "", :
  Found and resolved improper quoting out-of-sample. First healed line 2640814: <<Atkin Owen 286 Global Respiration Database Acacia boriensis 200160 Acacia boboensis 2460341 23503718 40 Leaf photosynthesis rate per leaf dry mass 45 Photosynthesis per leaf dry mass (Amass) Am_sat 0.148726655348048 micro mol g-1 s-1 1.27766 Atkin OK, KJ Bloomfield, PB Reich, MG Tjoelker, GP Asner, D Bonal, G B�nisch, M Bradford, LA Cernusak, EG Cosio, D Creek, KY Crous, T Domingues, JS Dukes, JJG Egerton, JR Evans, GD Farquhar, NM Fyllas, PPG Gauthier, E Gloor, TE Gimeno, K. Griffin, R >>. If the fields are not quoted (e.g. field separator does not appear within any field), try quote="" to avoid this warning.

tables()

==8341== Thread 1:
==8341== Invalid read of size 8
==8341==    at 0x4FA2B0E: ??? (in /usr/lib/R/lib/libR.so)
==8341==    by 0x4FA619C: Rf_allocVector3 (in /usr/lib/R/lib/libR.so)
==8341==    by 0x5035A98: Rf_csduplicated (in /usr/lib/R/lib/libR.so)
==8341==    by 0xACC82D7: ??? (in /usr/lib/R/library/utils/libs/utils.so)
==8341==    by 0xACC874E: ??? (in /usr/lib/R/library/utils/libs/utils.so)
==8341==    by 0xACC8DB5: ??? (in /usr/lib/R/library/utils/libs/utils.so)
==8341==    by 0x4F61E50: ??? (in /usr/lib/R/lib/libR.so)
==8341==    by 0x4F6B7AF: Rf_eval (in /usr/lib/R/lib/libR.so)
==8341==    by 0x4F6BEF8: ??? (in /usr/lib/R/lib/libR.so)
==8341==    by 0x4F6C3F7: ??? (in /usr/lib/R/lib/libR.so)
==8341==    by 0x4F61AB6: ??? (in /usr/lib/R/lib/libR.so)
==8341==    by 0x4F6B7AF: Rf_eval (in /usr/lib/R/lib/libR.so)
==8341==  Address 0xa86bcdd0 is 0 bytes after a block of size 21,126,544 alloc'd
==8341==    at 0x4C2DB8F: malloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==8341==    by 0x4FA65EC: Rf_allocVector3 (in /usr/lib/R/lib/libR.so)
==8341==    by 0xD581B01: growVector (dogroups.c:487)
==8341==    by 0xD598DB7: allocateDT (freadR.c:362)
==8341==    by 0xD597E53: freadMain (fread.c:2380)
==8341==    by 0xD5998B7: freadR (freadR.c:180)
==8341==    by 0x4F27605: ??? (in /usr/lib/R/lib/libR.so)
==8341==    by 0x4F27D7C: ??? (in /usr/lib/R/lib/libR.so)
==8341==    by 0x4F61867: ??? (in /usr/lib/R/lib/libR.so)
==8341==    by 0x4F6B7AF: Rf_eval (in /usr/lib/R/lib/libR.so)
==8341==    by 0x4F6D110: ??? (in /usr/lib/R/lib/libR.so)
==8341==    by 0x4F6B760: Rf_eval (in /usr/lib/R/lib/libR.so)
==8341==
==8341== Invalid read of size 8
==8341==    at 0x502E4D9: ??? (in /usr/lib/R/lib/libR.so)
==8341==    by 0x502DEEA: ??? (in /usr/lib/R/lib/libR.so)
==8341==    by 0x5035B4D: Rf_csduplicated (in /usr/lib/R/lib/libR.so)
==8341==    by 0xACC82D7: ??? (in /usr/lib/R/library/utils/libs/utils.so)
==8341==    by 0xACC874E: ??? (in /usr/lib/R/library/utils/libs/utils.so)
==8341==    by 0xACC8DB5: ??? (in /usr/lib/R/library/utils/libs/utils.so)
==8341==    by 0x4F61E50: ??? (in /usr/lib/R/lib/libR.so)
==8341==    by 0x4F6B7AF: Rf_eval (in /usr/lib/R/lib/libR.so)
==8341==    by 0x4F6BEF8: ??? (in /usr/lib/R/lib/libR.so)
==8341==    by 0x4F6C3F7: ??? (in /usr/lib/R/lib/libR.so)
==8341==    by 0x4F61AB6: ??? (in /usr/lib/R/lib/libR.so)
==8341==    by 0x4F6B7AF: Rf_eval (in /usr/lib/R/lib/libR.so)
==8341==  Address 0x980cedd0 is 0 bytes after a block of size 21,126,544 alloc'd
==8341==    at 0x4C2DB8F: malloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==8341==    by 0x4FA65EC: Rf_allocVector3 (in /usr/lib/R/lib/libR.so)
==8341==    by 0xD581B01: growVector (dogroups.c:487)
==8341==    by 0xD598DB7: allocateDT (freadR.c:362)
==8341==    by 0xD597E53: freadMain (fread.c:2380)
==8341==    by 0xD5998B7: freadR (freadR.c:180)
==8341==    by 0x4F27605: ??? (in /usr/lib/R/lib/libR.so)
==8341==    by 0x4F27D7C: ??? (in /usr/lib/R/lib/libR.so)
==8341==    by 0x4F61867: ??? (in /usr/lib/R/lib/libR.so)
==8341==    by 0x4F6B7AF: Rf_eval (in /usr/lib/R/lib/libR.so)
==8341==    by 0x4F6D110: ??? (in /usr/lib/R/lib/libR.so)
==8341==    by 0x4F6B760: Rf_eval (in /usr/lib/R/lib/libR.so)
==8341==
==8341== Invalid read of size 8
==8341==    at 0xACC8658: ??? (in /usr/lib/R/library/utils/libs/utils.so)
==8341==    by 0xACC874E: ??? (in /usr/lib/R/library/utils/libs/utils.so)
==8341==    by 0xACC8DB5: ??? (in /usr/lib/R/library/utils/libs/utils.so)
==8341==    by 0x4F61E50: ??? (in /usr/lib/R/lib/libR.so)
==8341==    by 0x4F6B7AF: Rf_eval (in /usr/lib/R/lib/libR.so)
==8341==    by 0x4F6BEF8: ??? (in /usr/lib/R/lib/libR.so)
==8341==    by 0x4F6C3F7: ??? (in /usr/lib/R/lib/libR.so)
==8341==    by 0x4F61AB6: ??? (in /usr/lib/R/lib/libR.so)
==8341==    by 0x4F6B7AF: Rf_eval (in /usr/lib/R/lib/libR.so)
==8341==    by 0x4F6D110: ??? (in /usr/lib/R/lib/libR.so)
==8341==    by 0x4F62E46: ??? (in /usr/lib/R/lib/libR.so)
==8341==    by 0x4F6B7AF: Rf_eval (in /usr/lib/R/lib/libR.so)
==8341==  Address 0x980cedd0 is 0 bytes after a block of size 21,126,544 alloc'd
==8341==    at 0x4C2DB8F: malloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==8341==    by 0x4FA65EC: Rf_allocVector3 (in /usr/lib/R/lib/libR.so)
==8341==    by 0xD581B01: growVector (dogroups.c:487)
==8341==    by 0xD598DB7: allocateDT (freadR.c:362)
==8341==    by 0xD597E53: freadMain (fread.c:2380)
==8341==    by 0xD5998B7: freadR (freadR.c:180)
==8341==    by 0x4F27605: ??? (in /usr/lib/R/lib/libR.so)
==8341==    by 0x4F27D7C: ??? (in /usr/lib/R/lib/libR.so)
==8341==    by 0x4F61867: ??? (in /usr/lib/R/lib/libR.so)
==8341==    by 0x4F6B7AF: Rf_eval (in /usr/lib/R/lib/libR.so)
==8341==    by 0x4F6D110: ??? (in /usr/lib/R/lib/libR.so)
==8341==    by 0x4F6B760: Rf_eval (in /usr/lib/R/lib/libR.so)
==8341==
==8341== Invalid read of size 1
==8341==    at 0xACC80C1: ??? (in /usr/lib/R/library/utils/libs/utils.so)
==8341==    by 0xACC8A57: ??? (in /usr/lib/R/library/utils/libs/utils.so)
==8341==    by 0xACC874E: ??? (in /usr/lib/R/library/utils/libs/utils.so)
==8341==    by 0xACC8DB5: ??? (in /usr/lib/R/library/utils/libs/utils.so)
==8341==    by 0x4F61E50: ??? (in /usr/lib/R/lib/libR.so)
==8341==    by 0x4F6B7AF: Rf_eval (in /usr/lib/R/lib/libR.so)
==8341==    by 0x4F6BEF8: ??? (in /usr/lib/R/lib/libR.so)
==8341==    by 0x4F6C3F7: ??? (in /usr/lib/R/lib/libR.so)
==8341==    by 0x4F61AB6: ??? (in /usr/lib/R/lib/libR.so)
==8341==    by 0x4F6B7AF: Rf_eval (in /usr/lib/R/lib/libR.so)
==8341==    by 0x4F6D110: ??? (in /usr/lib/R/lib/libR.so)
==8341==    by 0x4F62E46: ??? (in /usr/lib/R/lib/libR.so)
==8341==  Address 0x0 is not stack'd, malloc'd or (recently) free'd
==8341==

caught segfault address (nil), cause 'memory not mapped'

Traceback:
 1: structure(.Call(C_objectSize, x), class = "object_size")
 2: object.size(DT)
 3: set(info_i, , "MB", round(as.numeric(object.size(DT))/1024^2))
 4: FUN(X[[i]], ...)
 5: lapply(DT_names, function(dt_n) { DT = get(dt_n, envir = env) info_i = data.table(NAME = dt_n, NROW = nrow(DT), NCOL = ncol(DT)) if (mb) set(info_i, , "MB", round(as.numeric(object.size(DT))/1024^2)) set(info_i, , "COLS", list(list(names(DT)))) set(info_i, , "KEY", list(list(key(DT)))) if (index) set(info_i, , "INDICES", list(list(indices(DT)))) info_i})
 6: rbindlist(lapply(DT_names, function(dt_n) { DT = get(dt_n, envir = env) info_i = data.table(NAME = dt_n, NROW = nrow(DT), NCOL = ncol(DT)) if (mb) set(info_i, , "MB", round(as.numeric(object.size(DT))/1024^2)) set(info_i, , "COLS", list(list(names(DT)))) set(info_i, , "KEY", list(list(key(DT)))) if (index) set(info_i, , "INDICES", list(list(indices(DT)))) info_i}))
 7: tables()

Possible actions:
1: abort (with core dump, if enabled)
2: normal R exit
3: exit R without saving workspace
4: exit R saving workspace

HughParsonage commented 5 years ago

Thank you for the report.

Confirmed for Windows too.

HughParsonage commented 5 years ago

The tables() call seems to be a red herring. One can reliably trigger the segfault with fread alone; here is the output with verbose = TRUE:

omp_get_max_threads() = 12
omp_get_thread_limit() = 2147483647
DTthreads = 0
RestoreAfterFork = true
Input contains no \n. Taking this to be a filename to open
[01] Check arguments
  Using 12 threads (omp_get_max_threads()=12, nth=12)
  NAstrings = [<<NA>>]
  None of the NAstrings look like numbers.
  show progress = 0
  0/1 column will be read as integer
[02] Opening the file
  Opening file 4636.txt
  File opened, size = 1.253GB (1345078529 bytes).
  Memory mapped ok
[03] Detect and skip BOM
[04] Arrange mmap to be \0 terminated
  \n has been found in the input and different lines can end with different line endings (e.g. mixed \n and \r\n in one file). This is common and ideal.
[05] Skipping initial rows if needed
  Positioned on line 1 starting: <<LastName FirstName   DatasetID   D>>
[06] Detect separator, quoting rule, and ncolumns
  Detecting sep automatically ...
  sep=','  with 34 lines of 9 fields using quote rule 0
  sep=0x9  with 100 lines of 28 fields using quote rule 0
  Detected 28 columns on line 1. This line is either column names or first data row. Line starts as: <<LastName FirstName   DatasetID   D>>
  Quote rule picked = 0
  fill=false and the most number of columns found is 28
[07] Detect column types, good nrow estimate and whether first row is column names
  Number of sampling jump points = 100 because (1345078527 bytes from row 1 to eof) / (2 * 55365 jump0size) == 12147
  Type codes (jump 000)    : AA5AA5A555A5AAAAA7A27A257AA2  Quote rule 0
  Type codes (jump 084)    : AA5AA5A555A5AAAAA7A57A257AA2  Quote rule 0
  Type codes (jump 100)    : AA5AA5A555A5AAAAA7A57A257AA2  Quote rule 0
  'header' determined to be true due to column 3 containing a string on row 1 and a lower type (int32) in the rest of the 10046 sample rows
  =====
  Sampled 10046 rows (handled \n inside quoted fields) at 101 jump points
  Bytes from first data row on line 2 to the end of last row: 1345078213
  Line length: mean=570.01 sd=203.15 min=173 max=1307
  Estimated number of rows: 1345078213 / 570.01 = 2359764
  Initial alloc = 4719528 rows (2359764 + 100%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
  =====
[08] Assign column names
[09] Apply user overrides on column types
  After 0 type and 0 drop user overrides : AA5AA5A555A5AAAAA7A57A257AA2
[10] Allocate memory for the datatable
  Allocating 28 column slots (28 - 0 dropped) with 4719528 rows
[11] Read the data
  jumps=[0..1284), chunk_size=1047568, total_size=1345078213
  Restarting team from jump 1283. nSwept==0 quoteRule==1
  jumps=[1283..1284), chunk_size=1047568, total_size=1345078213
  Restarting team from jump 1283. nSwept==0 quoteRule==2
  jumps=[1283..1284), chunk_size=1047568, total_size=1345078213
  Restarting team from jump 1283. nSwept==0 quoteRule==3
  jumps=[1283..1284), chunk_size=1047568, total_size=1345078213
  jumps=[0..1284), chunk_size=1047568, total_size=1345078213
Read 2640813 rows x 28 columns from 1.253GB (1345078529 bytes) file in 00:04.754 wall clock time
[12] Finalizing the datatable
  Type counts:
         1 : bool8     '2'
         9 : int32     '5'
         3 : float64   '7'
        15 : string    'A'
=============================
   0.001s (  0%) Memory map 1.253GB file
   0.012s (  0%) sep='\t' ncol=28 and header detection
   0.000s (  0%) Column type detection using 10046 sample rows
   0.701s ( 15%) Allocation of 4719528 rows x 28 cols (0.809GB) of which 2640813 ( 56%) rows used
   4.041s ( 85%) Reading 1284 chunks (0 swept) of 0.999MB (each chunk 2056 rows) using 12 threads
   +    0.437s (  9%) Parse to row-major thread buffers (grown 34 times)
   +    2.633s ( 55%) Transpose
   +    0.971s ( 20%) Waiting
   1.176s ( 25%) Rereading 1 columns due to out-of-sample type exceptions
   4.754s        Total
Column 23 ("RelUncertaintyPercent") bumped from 'bool8' to 'int32' due to <<50>> on row 665348
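
The row estimate and initial allocation reported in step [07] can be checked by hand from the figures in the log. Below is a minimal sketch in base R that applies the formula the log itself prints (bytes/max(mean-2*sd,min), clamped between [1.1*estn, 2.0*estn]). The variable names are my own, and because the log rounds the mean line length to two decimals, the results only match the logged values approximately.

```r
# Figures copied from the verbose log above.
bytes     <- 1345078213  # bytes from first data row to end of last row
mean_len  <- 570.01      # mean line length (already rounded in the log)
sd_len    <- 203.15      # standard deviation of line length
min_len   <- 173         # shortest sampled line
rows_read <- 2640813     # rows actually read, from the "Read ... rows" line

# Estimated row count: total bytes divided by mean line length.
est_rows <- bytes / mean_len

# Initial allocation: a pessimistic estimate, clamped to [1.1, 2.0] * est_rows.
upper <- bytes / max(mean_len - 2 * sd_len, min_len)
alloc <- min(max(upper, 1.1 * est_rows), 2.0 * est_rows)

# Share of the logged allocation (4719528 rows) that was actually used.
used_pct <- round(100 * rows_read / 4719528)

print(c(est_rows = est_rows, alloc = alloc, used_pct = used_pct))
```

With these figures the upper clamp (2.0 * estn) is the binding one. The file held roughly 12% more rows than estimated, which the 100% headroom still covers, consistent with the 56% of allocated rows reported as used.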

gorne commented 5 years ago

Dear HughParsonage, do you know how to fix the problem? I posted the tables() error message as an example, but the segfault appears with almost all operations, such as tail(data), table(data$UnitName), or TRYdata2 <- TRYdata[!is.na(TRYdata$TraitID),]

So I cannot work with the database, even though just a few months ago it worked perfectly. Any help will be appreciated.

HughParsonage commented 5 years ago

Hi gorne,

No, sorry, I don't know how to fix it. It's definitely a bug, so thank you for reporting it. Hopefully someone with more experience with the fread C code can fix it. Once the maintainers have the time to put their mind to it, a fix will be applied.

You may wish to consider the following temporary fix (noting the warning about the blank penultimate line).

fread("4636.txt",
      sep = "\t",
      colClasses = list("integer" = c("RelUncertaintyPercent", "Replicates")),
      quote = "")
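
As far as I can tell, the idea behind this workaround is that declaring the column types up front means fread does not have to bump a column's type and reread it after meeting an out-of-sample value (as happened with RelUncertaintyPercent in the verbose log). Here is a minimal sketch on a toy tab-separated file. The toy data and the temporary file are made up for illustration; only the two column names come from the real file.

```r
library(data.table)

# Toy stand-in for 4636.txt: the integer value appears only after some NAs,
# loosely mimicking a column whose early rows are all empty.
tmp <- tempfile(fileext = ".txt")
toy <- data.table(
  TraitID               = c(40L, 45L, 40L),
  RelUncertaintyPercent = c(NA, NA, 50L),
  Replicates            = c(NA, 3L, NA)
)
fwrite(toy, tmp, sep = "\t")

# Declare the two columns as integer instead of relying on type detection.
dt <- fread(tmp,
            sep = "\t",
            colClasses = list("integer" = c("RelUncertaintyPercent", "Replicates")),
            quote = "")

print(sapply(dt, class))
```

Whether this also sidesteps the underlying bug on the real file is not something the sketch can show; it only illustrates the colClasses syntax used above.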

gorne commented 5 years ago

Thank you very much, it seems to work.

st-pasha commented 5 years ago

@HughParsonage For me your command emits the following warning:

Warning message:
In fread("~/Downloads/4636.txt", sep = "\t", colClasses = list(integer = c("RelUncertaintyPercent",  :
  Discarded single-line footer: <<Atkin Owen    286 Global Respiration Database Acacia boriensis    200160  Acacia boboensis    2460341 23503718    40  Leaf photosynthesis rate per leaf dry mass  45  Photosynthesis per leaf dry mass (Amass)    Am_sat                          0.148726655348048   micro mol g-1 s-1           1.27766 Atkin OK, KJ Bloomfield, PB Reich, MG Tjoelker, GP Asner, D Bonal, G B?nisch, M Bradford, LA Cernusak, EG Cosio, D Creek, KY Crous, T Domingues, JS Dukes, JJG Egerton, JR Evans, GD Farquhar, NM Fyllas, PPG Gauthier, E Gloor, TE Gimeno, K. Griffin, R >>

even though that last line is not a "single-line footer" but a real data record. As far as I can see it is valid too; it even has the same 28 fields as the rest of the file.

That line can be read separately via

> fread("tail -n 1 ~/Downloads/4636.txt", sep='\t', header=FALSE)

and then rbind-ed to the main data.table.
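
A portable sketch of the same approach that avoids shelling out to tail, using fread's nrows and skip arguments on a toy file. The file contents and row counts here are made up for illustration; on the real file the number of complete rows would have to be taken from the verbose output, and I have not checked that this avoids the segfault there.

```r
library(data.table)

# Toy tab-separated file: a header plus three data rows.
tmp <- tempfile(fileext = ".txt")
writeLines(c("LastName\tDatasetID\tStdValue",
             "Atkin\t286\t0.15",
             "Reich\t20\t1.28",
             "Owen\t286\t0.75"), tmp)

# Read everything except the last data row.
main <- fread(tmp, sep = "\t", quote = "", nrows = 2)

# Read the last row on its own: skip the header and the first two data rows.
last <- fread(tmp, sep = "\t", quote = "", skip = 3, header = FALSE)

# The separately read row has default names (V1, V2, ...), so align them
# with the main table before binding.
setnames(last, names(main))
full <- rbindlist(list(main, last))

print(full)
```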

drewabbot commented 5 years ago

The following code samples seem to show that subset is also vulnerable to this segfault bug. Toggle the code parameters accordingly. Rolling back to version 1.11.8 fixes this particular problem in the few environments tested.

subset.segfault.zip

mattdowle commented 5 years ago

PR #3469 (v1.12.2) catches the malformed data.table and avoids the segfaults. PR #3471 (v1.12.4) unpacks a data.frame column that is itself a data.frame, which resolves the malformed data.table that caused the problem @drewabbot reported just above.
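
For readers unfamiliar with the shape being described, here is a minimal base R sketch of a data.frame that has a data.frame as one of its columns, together with one manual way of flattening it. This is only an illustration of the structure; it does not use data.table and is not the code from either PR.

```r
# A data.frame whose column "b" is itself a two-column data.frame.
df <- data.frame(a = 1:2)
df$b <- data.frame(x = 3:4, y = 5:6)

# The outer frame still reports two columns, but "b" holds a nested frame.
print(ncol(df))
print(is.data.frame(df$b))

# One manual way to unpack the nested column into ordinary columns.
flat <- cbind(df["a"], df$b)
print(names(flat))
```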

I reproduced the fread problem and traced it to the rereading of the last line of the file. I couldn't see a quick fix, so I will postpone it from this release since it's a relatively rare case.

OfekShilon commented 2 years ago

@gorne This doesn't reproduce for me on 1.14.5 dev. Can you please verify that it is fixed? (note @jangorecki)