Rdatatable / data.table

R's data.table package extends data.frame:
http://r-datatable.com
Mozilla Public License 2.0

IDate conversion from "YYYY-MM-DD" is slow #2503

Open aadler opened 6 years ago

aadler commented 6 years ago

@mattdowle, this time I'm using an issue and not twitter :)

The most recent version of data.table (built 11-23, IIRC) is freezing once again at IDate conversion. The call is:

 DT[, Month := as.IDate(Month, format = "%Y-%m-%d")]

I'm sorry I don't have the session info or traces, but I had to downgrade to 1.10.4-3 to get the script to run and I'm up against a time limit. What happens is that it freezes at the conversion and just doesn't proceed.

MichaelChirico commented 6 years ago

Could you add more detail? That works fine for me:

library(data.table)
# data.table 1.10.5 IN DEVELOPMENT built 2017-11-27 22:38:43 UTC; travis
#   The fastest way to learn (by data.table authors): https://www.datacamp.com/courses/data-analysis-the-data-table-way
#   Documentation: ?data.table, example(data.table) and browseVignettes("data.table")
#   Release notes, videos and slides: http://r-datatable.com
set.seed(20349)
NN = 100
DT = data.table(Month = sprintf('%04d-%02d-%02d',
                                sample(2500, NN, TRUE),
                                sample(12, NN, TRUE), 
                                sample(28, NN, TRUE)))
DT[, Month := as.IDate(Month, format = "%Y-%m-%d")]
aadler commented 6 years ago

When I read in a file of 53,945,186 rows with only the date field, it works fine. But when I read in the other needed columns, so the table is 53,945,186 rows x 14 columns, it freezes. Unless there is something else I've done, which I doubt; a sanitized version of the code looks like this:

message(paste0('Now reading ', cur_year))
    DT <- fread(file_name, colClasses = colClass_Z, header = TRUE, select = select_Z, key = c('ID', 'Month'))
    d_rows <- d_rows + nrow(DT)
    if (!is.null(t_subset) || !is.null(v_subset) || !is.null(f_subset)){
      DT <- DT[ID %chin% valid_IDs]
    }
    message(paste0('Now converting dates in ', cur_year))
    DT[, Month := as.IDate(Month, format = "%Y-%m-%d")]

and it's been running far longer than the 1-column test suggests should be necessary to convert the dates.

The code works just fine in 1.10.4-3, albeit much more slowly.

MichaelChirico commented 6 years ago

Can you reproduce on a subset of the data? Can you share a smaller version of a guilty file (perhaps anonymized)?

mattdowle commented 6 years ago

Thanks for reporting. Can you share the type of the column (e.g. the output of str(DT)) as well, please? Also, please set options(datatable.verbose=TRUE) and paste all the output here. How long does the fread call take, for example (if the progress meter is kicking in, and you're running in RStudio)? What's the output of sessionInfo()?

aadler commented 6 years ago

Oh dear.

I tried importing the original file to see what I could cut out, and fell into stack imbalance/unprotected pointer issues multiple times.

> library(data.table)
data.table 1.10.5 IN DEVELOPMENT built 2017-11-27 22:38:43 UTC; travis
  The fastest way to learn (by data.table authors): https://www.datacamp.com/courses/data-analysis-the-data-table-way
  Documentation: ?data.table, example(data.table) and browseVignettes("data.table")
  Release notes, videos and slides: http://r-datatable.com
> DT <- fread('2017-11-22_1999_Performance.csv', colClasses = colClass_Fred_P, header = TRUE, select = select_col_P, key = c('LoanID', 'Month'), verbose = TRUE)
Input contains no \n. Taking this to be a filename to open
[01] Check arguments
  Using 40 threads (omp_get_max_threads()=40, nth=40)
  NAstrings = [<<NA>>]
  None of the NAstrings look like numbers.
  show progress = 1
  0/1 column will be read as boolean
[02] Opening the file
  Opening file 2017-11-22_1999_Performance.csv
  File opened, size = 6.355GB (6823372783 bytes).
  Memory mapping ... ok
[03] Detect and skip BOM
[04] Arrange mmap to be \0 terminated
  \r-only line endings are not allowed because \n is found in the data
[05] Skipping initial rows if needed
  Positioned on line 1 starting: <<Orig_Year,Orig_Qtr,LoanID,Mont>>
[06] Detect separator, quoting rule, and ncolumns
  Detecting sep ...
  sep=','  with 100 lines of 37 fields using quote rule 0
  Detected 37 columns on line 1. This line is either column names or first data row. Line starts as: <<Orig_Year,Orig_Qtr,LoanID,Mont>>
  Quote rule picked = 0
  fill=false and the most number of columns found is 37
[07] Detect column types, good nrow estimate and whether first row is column names
  'header' changed by user from 'auto' to true
  Number of sampling jump points = 101 because (6823372781 bytes from row 1 to eof) / (2 * 12905 jump0size) == 264369
  Type codes (jump 000)    : 51101071551015107111111111111771111177715  Quote rule 0
  Type codes (jump 001)    : 511010715510151071111111111117711111777110  Quote rule 0
  Type codes (jump 008)    : 511010755510151071111111111117711111777110  Quote rule 0
  Type codes (jump 009)    : 51101075551010510711015555555717711111777510  Quote rule 0
  Type codes (jump 042)    : 55101075551010510711015555555717711111777510  Quote rule 0
  Type codes (jump 064)    : 551010755510105107110110555555717711111777510  Quote rule 0
  Type codes (jump 100)    : 551010755510105107110110555555717711111777510  Quote rule 0
  =====
  Sampled 10049 rows (handled \n inside quoted fields) at 101 jump points
  Bytes from first data row on line 1 to the end of last row: 6823372781
  Line length: mean=126.15 sd=8.30 min=100 max=359
  Estimated number of rows: 6823372781 / 126.15 = 54088821
  Initial alloc = 62279495 rows (54088821 + 15%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
  =====
[08] Assign column names
[09] Apply user overrides on column types
  After 4 type and 23 drop user overrides : 001010700000000000000000070775555077750
[10] Allocate memory for the datatable
  Allocating 14 column slots (37 - 23 dropped) with 62279495 rows
[11] Read the data
  jumps=[0..6520), chunk_size=1046529, total_size=6823372422
Read 6%. ETA 00:00 Error in fread("2017-11-22_1999_Performance.csv", colClasses = colClass_Fred_P,  : 
  unprotect_ptr: pointer not found
aadler commented 6 years ago

The most recent fix for the stack imbalance issue in #2481 seems to have fixed the IDate conversion freeze as well.

> colCLASS <- c(rep('integer', 2L), 'character', 'Date', 'numeric',
+               rep('integer', 3L), rep('character', 2L),
+               'integer', 'Date', rep('numeric', 2L), 'Date',
+               rep('numeric', 12L), rep('integer', 5),
+               rep('numeric', 3L), 'integer', 'character')
> library(data.table)
data.table 1.10.5 IN DEVELOPMENT built 2017-12-02 12:05:42 UTC; appveyor
  The fastest way to learn (by data.table authors): https://www.datacamp.com/courses/data-analysis-the-data-table-way
  Documentation: ?data.table, example(data.table) and browseVignettes("data.table")
  Release notes, videos and slides: http://r-datatable.com
> sessionInfo()
R version 3.4.2 beta (2017-09-17 r73296)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows Server >= 2012 x64 (build 9200)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252    LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C                           LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] data.table_1.10.5

loaded via a namespace (and not attached):
[1] compiler_3.4.2 tools_3.4.2   
> DT <- fread('LargeFile.csv', colClasses = colCLASS, select = 'Month', key = 'Month', header = TRUE, verbose = TRUE)
Input contains no \n. Taking this to be a filename to open
[01] Check arguments
  Using 40 threads (omp_get_max_threads()=40, nth=40)
  NAstrings = [<<NA>>]
  None of the NAstrings look like numbers.
  show progress = 1
  0/1 column will be read as boolean
[02] Opening the file
  Opening file LargeFile.csv
  File opened, size = 6.355GB (6823372783 bytes).
  Memory mapped ok
[03] Detect and skip BOM
[04] Arrange mmap to be \0 terminated
  \n has been found in the data so any mixture of line endings is allowed other than \r-only line endings. This is common and ideal.
[05] Skipping initial rows if needed
  Positioned on line 1 starting: <<Orig_Year,Orig_Qtr,LoanID,Mont>>
[06] Detect separator, quoting rule, and ncolumns
  Detecting sep ...
  sep=','  with 100 lines of 37 fields using quote rule 0
  Detected 37 columns on line 1. This line is either column names or first data row. Line starts as: <<Orig_Year,Orig_Qtr,LoanID,Mont>>
  Quote rule picked = 0
  fill=false and the most number of columns found is 37
[07] Detect column types, good nrow estimate and whether first row is column names
  'header' changed by user from 'auto' to true
  Number of sampling jump points = 101 because (6823372781 bytes from row 1 to eof) / (2 * 12905 jump0size) == 264369
  Type codes (jump 000)    : 51AA7155A15A7111111111111771111177715  Quote rule 0
  Type codes (jump 001)    : 51AA7155A15A711111111111177111117771A  Quote rule 0
  Type codes (jump 008)    : 51AA7555A15A711111111111177111117771A  Quote rule 0
  Type codes (jump 009)    : 51AA7555AA5A71A155555557177111117775A  Quote rule 0
  Type codes (jump 042)    : 55AA7555AA5A71A155555557177111117775A  Quote rule 0
  Type codes (jump 064)    : 55AA7555AA5A71A1A5555557177111117775A  Quote rule 0
  Type codes (jump 100)    : 55AA7555AA5A71A1A5555557177111117775A  Quote rule 0
  =====
  Sampled 10049 rows (handled \n inside quoted fields) at 101 jump points
  Bytes from first data row on line 1 to the end of last row: 6823372781
  Line length: mean=126.15 sd=8.30 min=100 max=359
  Estimated number of rows: 6823372781 / 126.15 = 54088821
  Initial alloc = 62279495 rows (54088821 + 15%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
  =====
[08] Assign column names
[09] Apply user overrides on column types
  After 0 type and 36 drop user overrides : 000A000000000000000000000000000000000
[10] Allocate memory for the datatable
  Allocating 1 column slots (37 - 36 dropped) with 62279495 rows
[11] Read the data
  jumps=[0..6520), chunk_size=1046529, total_size=6823372422
|--------------------------------------------------|
|==================================================|
Read 53945186 rows x 1 columns from 6.355GB (6823372783 bytes) file in 00:05.951 wall clock time
[12] Finalizing the datatable
  Type counts:
        36 : drop      '0'
         1 : string    'A'
=============================
   0.000s (  0%) Memory map 6.355GB file
   0.032s (  1%) sep=',' ncol=37 and header detection
   0.001s (  0%) Column type detection using 10049 sample rows
   0.355s (  6%) Allocation of 62279495 rows x 37 cols (0.464GB) of which 53945186 ( 87%) rows used
   5.563s ( 93%) Reading 6520 chunks of 0.998MB (8295 rows) using 40 threads
   =    0.097s (  2%) Finding first non-embedded \n after each jump
   +    1.396s ( 23%) Parse to row-major thread buffers (grown 0 times)
   +    2.108s ( 35%) Transpose
   +    1.962s ( 33%) Waiting
   0.000s (  0%) Rereading 0 columns due to out-of-sample type exceptions
   5.951s        Total
> str(DT)
Classes ‘data.table’ and 'data.frame':  53945186 obs. of  1 variable:
 $ Month: chr  "1999-02-01" "1999-02-01" "1999-02-01" "1999-02-01" ...
 - attr(*, ".internal.selfref")=<externalptr> 
 - attr(*, "sorted")= chr "Month"
> system.time(DT[, Month := as.IDate(Month, format = "%Y-%m-%d")])
   user  system elapsed 
  43.55    1.69   45.26 
> str(DT)
Classes ‘data.table’ and 'data.frame':  53945186 obs. of  1 variable:
 $ Month: IDate, format: "1999-02-01" "1999-02-01" "1999-02-01" "1999-02-01" ...
 - attr(*, ".internal.selfref")=<externalptr> 
mattdowle commented 6 years ago

Thanks @aadler. OK, this gives us enough to work on. Yeah, 45s for that conversion is pretty slow! Will look into what's going wrong.

MichaelChirico commented 6 years ago

@mattdowle that's similar to the times I'm seeing on similarly huge data sets (53M rows)... I guess this is a duplicate of #1451 if that's still slow 😬

cf.

library(microbenchmark)
kk = 20L
sizes = 10^seq(1, 7.7, length.out = kk)

times = numeric(kk)
for (size_i in seq_along(sizes)) {
  cat(size_i, '\n')
  size = sizes[size_i]
  times[size_i] =
    mean(replicate(10, {
      charv = 
        sprintf('%s-%02d-%02d', 
                sample(2000, size, TRUE),
                sample(12, size, TRUE),
                sample(28, size, TRUE))
      t0 = get_nanotime()
      as.Date(charv)
      get_nanotime() - t0
    }))
}

png('~/Desktop/parse_scaling.png')
plot(sizes, times, log = 'xy',
     xlab = 'Length of Vector',
     ylab = 'Average Parse Time',
     type = 'l', lwd = 3L,
     main = 'Scaling of Time for as.Date.character')
dev.off()

(plot: scaling of parse time for as.Date.character)

Typical time at 50,000,000 entries on my machine is roughly 3 minutes. So the bottleneck really is as.Date itself.

aadler commented 6 years ago

as.IDate converts char to Date and then to IDate?

MichaelChirico commented 6 years ago

Yes, but the latter step is substantially cheaper (the same as converting an integer stored as numeric to an integer stored as integer). It's duct tape, but IIUC there was no motivation to rewrite the C-level parsing API initially.


MichaelChirico commented 6 years ago

e.g.:

NN = 1e6
charv = sprintf('%d-%02d-%02d', sample(2000, NN, TRUE), sample(12, NN, TRUE), sample(28, NN, TRUE))
system.time(x <- as.Date(charv))
#    user  system elapsed 
#   2.380   0.224   2.604 
system.time(y <- as.IDate(x))
#    user  system elapsed 
#   0.004   0.000   0.005 

(scales roughly linearly -- with NN = 1e8):

system.time(x <- as.Date(charv))
#    user  system elapsed 
# 226.596  23.200 249.685 
system.time(y <- as.IDate(x))
#    user  system elapsed 
#   0.436   0.052   0.484 
MichaelChirico commented 6 years ago

@mattdowle random idea -- what do you think about keeping a lookup table to speed up char-to-IDate conversions? Basically storing this internally (or upon user request):

dates = seq.Date(as.Date('1900-01-01'),
                 as.Date('2099-12-31'), by = 'day')
date_lookup = data.table(
  date_char = as.character(dates),
  date_int = as.IDate(dates),
  key = 'date_char'
)
format(object.size(date_lookup), 'Mb')
# [1] "4.7 Mb"
nrow(date_lookup)
# [1] 73049

Example usage:

set.seed(3082)
NN = 1e8
smp_dt = copy(date_lookup[sample(.N, NN, TRUE)])

system.time(
  smp_dt[ , IDate_as := as.IDate(date_char)]
)
#    user  system elapsed 
# 104.584  25.264 129.783 

system.time(
  smp_dt[date_lookup, IDate_lookup := i.date_int, on = 'date_char']
)
#    user  system elapsed 
#   5.200   0.220   5.413 

(roughly 20x speed-up). Could also add columns like year, mday, yday, ... to make those functions faster as well.

And if the convertee happens to be keyed already, the speed-up factor grows to about 150x:

setkey(smp_dt, date_char)
system.time(
  smp_dt[date_lookup, IDate_keyed := i.date_int]
)
#    user  system elapsed 
#   0.668   0.104   0.771 

Downsides:

- Silent RAM usage (more of an issue if we add the other columns).
- Doesn't work for dates outside some range (200 years seems like it would cover the vast majority of use cases, and we can easily revert to as.Date.character for the subset of unmatched elements).
- It only works for %F-formatted input (I believe this is by far the most common format, but I could well be mistaken; we could add a few other columns in other formats at the cost of more RAM, or just revert to strptime).
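The "revert for unmatched elements" idea could be sketched as a small wrapper. This is purely illustrative: fast_idate and the 1900-2099 range are my own assumptions, not an existing data.table API.

```r
library(data.table)

# Precomputed lookup table: every "%Y-%m-%d" string in 1900-2099 mapped
# to its IDate integer, keyed on the character form for fast binary search.
dates = seq.Date(as.Date('1900-01-01'), as.Date('2099-12-31'), by = 'day')
date_lookup = data.table(
  date_char = as.character(dates),
  date_int  = as.IDate(dates),
  key = 'date_char'
)

# Hypothetical converter: keyed join against the table, then revert to the
# ordinary as.IDate (i.e. as.Date.character) only for out-of-range elements.
fast_idate = function(x) {
  out  = date_lookup[.(x), date_int]      # NA where x is outside the table
  miss = which(is.na(out) & !is.na(x))
  if (length(miss)) out[miss] = as.IDate(x[miss])
  out
}
```

For typical input the expensive strptime-style parse then runs only on the (usually empty) unmatched subset.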

In any case, @aadler, you may want to use this in your case.

aadler commented 6 years ago

For anyone struggling with slow date conversions (which is an as.Date problem, not a data.table problem): if you can take advantage of @s-u (Simon Urbanek)'s fasttime package, you should see a speedup of multiple orders of magnitude.

MichaelChirico commented 6 years ago

@aadler at some point (see #587), there was some push to include fasttime in fread.

st-pasha commented 6 years ago

NB: fasttime package is issued under a GPL-2 license, which means its code cannot be included in data.table, nor can it be used as a dependency.

That being said, fread would certainly benefit from being able to parse various date/time formats natively.

aadler commented 5 years ago

Is a clean room design of a fast parser possible?

jangorecki commented 5 years ago

@aadler not sure if it is necessary, could you please check timings on this branch https://github.com/Rdatatable/data.table/pull/3279 ?

MichaelChirico commented 4 years ago

Is it possible to reuse the date parser in fread.c for an as.IDate.character method? As a first pass for the ISO case, then revert to the generalized parser if that "doesn't work".