Rdatatable / data.table

R's data.table package extends data.frame:
http://r-datatable.com
Mozilla Public License 2.0
3.63k stars 987 forks source link

Just use base functionality for `isoweek`? #5111

Open MichaelChirico opened 3 years ago

MichaelChirico commented 3 years ago

Happened to notice that format((Date), "%V") gives identical results to our isoweek, and it's a decent amount faster. Should we just migrate our backend to use base?

e.g.

x = rep(Sys.Date() - 0:5000, 1000)

system.time(out1 <- isoweek(x))
#    user  system elapsed 
# 12.428  21.093  34.376 
system.time(out2 <- as.integer(format(x, '%V')))
#    user  system elapsed 
#   4.346   0.221   4.685 

identical(out1, out2)
# [1] TRUE

Or should we try and optimize it better?

mattdowle commented 3 years ago

First thought is that investigating the root cause (which seems to be as.IDate) would fix more than isoweek, and then we could see where we are with isoweek after that. Possible related: #2503

> x = rep(Sys.Date() - 0:5000, 1000)
> system.time(out2 <- as.integer(format(x, '%V')))
   user  system elapsed 
  1.316   0.000   1.316 
> system.time(out1 <- isoweek(x))
   user  system elapsed 
  4.962   2.903   7.866    # aside: surprised my timings appear 4x faster than yours
> Rprof()
> system.time(out1 <- isoweek(x))
   user  system elapsed 
  4.701   2.984   7.685 
> Rprof(NULL)
> summaryRprof()
$by.self
                  self.time self.pct total.time total.pct
"strptime"             6.08    78.96       6.08     78.96
"format.POSIXlt"       0.76     9.87       0.76      9.87
"as.Date.POSIXlt"      0.40     5.19       0.40      5.19
"as.POSIXlt.Date"      0.32     4.16       0.32      4.16
"%/%"                  0.04     0.52       0.04      0.52
"is.na<-.default"      0.04     0.52       0.04      0.52
"-.IDate"              0.02     0.26       0.02      0.26
"gc"                   0.02     0.26       0.02      0.26
"setattr"              0.02     0.26       0.02      0.26

$by.total
                    total.time total.pct self.time self.pct
"system.time"             7.70    100.00      0.00     0.00
"isoweek"                 7.68     99.74      0.00     0.00
"as.IDate"                7.64     99.22      0.00     0.00
"as.IDate.default"        6.52     84.68      0.00     0.00
"as.Date.character"       6.50     84.42      0.00     0.00
"as.Date"                 6.50     84.42      0.00     0.00
"charToDate"              6.10     79.22      0.00     0.00
"strptime"                6.08     78.96      6.08    78.96
"format.Date"             1.08     14.03      0.00     0.00
"format"                  1.08     14.03      0.00     0.00
"format.POSIXlt"          0.76      9.87      0.76     9.87
"as.Date.POSIXlt"         0.40      5.19      0.40     5.19
"as.POSIXlt.Date"         0.32      4.16      0.32     4.16
"as.POSIXlt"              0.32      4.16      0.00     0.00
"%/%"                     0.04      0.52      0.04     0.52
"is.na<-.default"         0.04      0.52      0.04     0.52
"is.na<-"                 0.04      0.52      0.00     0.00
"-.IDate"                 0.02      0.26      0.02     0.26
"gc"                      0.02      0.26      0.02     0.26
"setattr"                 0.02      0.26      0.02     0.26
"+.IDate"                 0.02      0.26      0.00     0.00

$sample.interval
[1] 0.02

$sampling.time
[1] 7.7
zkamvar commented 3 years ago

Hello, I was lurking in the issues to see how y'all were fixing #5122 (I'm also one of the lucky ones to catch that bug) and I noticed this issue right below it and want to point out that "%V" may not be a recognized format on Windows (at least older systems anyways): https://stackoverflow.com/q/47509567

bastistician commented 3 years ago

"%V" may not be a recognized format on Windows (at least older systems anyways): https://stackoverflow.com/q/47509567

I think this information is outdated. The "%V" format string is widely supported since R 3.1.0, even on Windows. The corresponding NEWS item in R 3.1.0 was:

(On Windows, by default on OS X and optionally elsewhere.) The system C function strftime has been replaced by a more comprehensive version with closer conformance to the POSIX 2008 standard.