Rdatatable / data.table

R's data.table package extends data.frame:
http://r-datatable.com
Mozilla Public License 2.0

Is there anything to do to speed up reading nanotime in fread #4377

Open statquant opened 4 years ago

statquant commented 4 years ago

Hello, I was trying to get a feel for how efficiently fread reads nanotime. I ran the very naive comparison below against kdb.

When not reading nanotimes, fread approximately matches kdb.

R

library(data.table)
library(nanotime)

N <- 1e6 
set.seed(1)
l <- sample(letters, size = N, replace = TRUE)
w <- replicate(expr = paste(sample(letters, size = 5L), collapse = ""), n = N)
n <- nanotime("1970-01-01T00:00:00.000000001+00:00") + 30 * 365 * 86400 * 1e9 * abs(runif(N))
r <- rnorm(N)
dt <- data.table(l = l, w = w, n = n, r = r)
fwrite(dt, "/tmp/dt.txt")

system.time(
    dt2 <- fread("/tmp/dt.txt", showProgress = FALSE)                                  
)                                                             

The first 5 runs give the following:

   user  system elapsed                                                                                                
  2.352   0.004   1.373                                                                                                
   user  system elapsed                                                                                                
  2.187   0.006   1.110                                                                                                
   user  system elapsed                                                                                                
  1.708   0.011   0.867                                                                                                
   user  system elapsed                                                                                                
  1.693   0.004   0.856                                                                                                
   user  system elapsed                                                                                                
  1.681   0.006   0.850                                                                                                

kdb

q)\t data:("SSSF";enlist",")0:`:/tmp/dt.txt
redacted
q)\t data:("SSSF";enlist",")0:`:/tmp/dt.txt
redacted
q)\t data:("SSSF";enlist",")0:`:/tmp/dt.txt
redacted

When reading nanotimes, fread is slower, while kdb is approximately as fast as when reading symbols.

system.time(
    dt2 <- fread("/tmp/dt.txt", colClasses = c("n" = "nanotime"), showProgress = FALSE)
)                                                                                      

timings are:

   user  system elapsed                                                                                                
  2.127   0.001   1.260                                                                                                
   user  system elapsed                                                                                                
  2.368   0.004   1.383                                                                                                
   user  system elapsed                                                                                                
  2.312   0.006   1.346                                                                                                
   user  system elapsed                                                                                                
  2.357   0.011   1.381                                                                                                
   user  system elapsed                                                                                                
  2.313   0.006   1.351                                                                                                

kdb

q)\t data:("SSPF";enlist",")0:`:/tmp/dt.txt
redacted
q)\t data:("SSPF";enlist",")0:`:/tmp/dt.txt
redacted
q)\t data:("SSPF";enlist",")0:`:/tmp/dt.txt
redacted
q)\t data:("SSPF";enlist",")0:`:/tmp/dt.txt
redacted
q)\t data:("SSPF";enlist",")0:`:/tmp/dt.txt
redacted

I know 5 runs is probably insufficient and that mmap is tricky, so the above results might be useless, but the point is: is there something that can be done on the user side to speed things up, or is it just that nanotime is not as efficient at parsing strings as kdb is?

session

R version 3.6.2 (2019-12-12)                                                                                           
Platform: x86_64-redhat-linux-gnu (64-bit)                                                                             
Running under: Fedora 31 (Workstation Edition)                                                                                                                                                                                                
Matrix products: default                                                                                               
BLAS/LAPACK: /usr/lib64/R/lib/libRblas.so                                                                              
attached base packages:                                                                                                
[1] stats     graphics  grDevices utils     datasets  methods   base                                                                                                                                                                          
other attached packages:                                                                                               
[1] nanotime_0.2.4.5.3 data.table_1.12.9  nvimcom_0.9-83                                                                                                                                                                                    
loaded via a namespace (and not attached):                                                                             
[1] zoo_1.8-7       bit_1.1-15.2    compiler_3.6.2  tools_3.6.2     RcppCCTZ_0.2.7  Rcpp_1.0.4.6    bit64_0.9-7        
[8] grid_3.6.2      lattice_0.20-38

jangorecki commented 4 years ago

This does not address your question at all, but I wanted to mention it: please check whether publishing benchmarks of kdb conflicts with their license. Many (if not most) closed-source projects unfortunately have this kind of restriction in their license. As a result, we are unable to publish data.table benchmarks against them.

jangorecki commented 4 years ago

Addressing your question: you could try to unclass nanotime before writing to csv, and apply the class back after reading from csv into R. This is what #1656 is about.
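A minimal sketch of that round trip, assuming nanotime's `as.integer64()` method and its `nanotime()` constructor for integer64 input behave as in recent nanotime releases. The temporary file and the column name `n` are illustrative only.

```r
library(data.table)
library(nanotime)
library(bit64)

# Three timestamps one nanosecond apart (illustrative data).
x <- nanotime("2020-01-01T00:00:00.123456789+00:00") + c(0, 1, 2)

# Drop the nanotime class and keep the raw integer64 nanoseconds since the epoch.
dt <- data.table(n = as.integer64(x))

f <- tempfile(fileext = ".csv")
fwrite(dt, f)                             # written as plain integers, no timestamp formatting

dt2 <- fread(f, integer64 = "integer64")  # read back as integer64, no timestamp parsing
dt2[, n := nanotime(n)]                   # re-apply the nanotime class

roundtrip_ok <- all(dt2$n == x)
```

This only helps when every consumer of the file accepts integer nanoseconds since the epoch instead of formatted timestamps.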

statquant commented 4 years ago

Hello @jangorecki kdb notoriously forbids it (as I guess you knew) but given they do similar things against data.table (read https://kx.com/blog/kdb-interface-r/) I think it is only fair... anyway I redacted results.

statquant commented 4 years ago

@jangorecki thanks for your suggestion, I would indeed write int64 if the only reader was R but I have several readers (kdb is one of them) so this is unfortunately out of the question. Does fread plan to have a special parser for nanotime (given your suggestion I am guessing there is a call to nanotime somewhere in fread.[R|c]) ?

MichaelChirico commented 4 years ago

@statquant can you try your timings again on the fread-iso8601 branch? Something tells me we won't be able to get the precision you're after with double storage, in any case.
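The precision concern can be checked in base R. The sketch below assumes timestamps are stored as POSIXct-style doubles counting seconds since the epoch; the variable names are illustrative.

```r
# 2020-01-01 00:00:00 UTC as seconds since the epoch, stored as a double.
secs <- 1577836800

# Adding one nanosecond does not change the stored value.
one_ns_lost <- (secs + 1e-9) == secs

# Gap between adjacent doubles at this magnitude (52 explicit mantissa bits).
spacing <- 2^(floor(log2(secs)) - 52)

# A step of one full spacing is the smallest change that is reliably kept.
one_step_kept <- (secs + spacing) != secs

print(spacing)
```

The spacing is a few tenths of a microsecond for present-day timestamps, so double storage cannot distinguish individual nanoseconds, whereas nanotime's integer64 storage can.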

statquant commented 4 years ago

Not sure what magic you did but it is now 3x faster to load in R on my laptop (0.7s vs 2.1) and much faster (> 2x) than "the one who must not be named" (for loading up to millisec POSIXct resolution it is good enough)

MichaelChirico commented 4 years ago

Love to hear it! 😎

jangorecki commented 4 years ago

@statquant I just looked at the link to the kdb site you posted. I don't know what "R experts" they have (Louise Totten?), but in their benchmarks they benchmark as.data.frame rather than the actual operation in question.

statquant commented 4 years ago

@MichaelChirico sorry if I seem cheeky, but given what you've done for POSIXct, would pushing towards nanosecs and casting to nanotime require a lot of additional work?

jangorecki commented 4 years ago

@MichaelChirico @statquant I thought the same, that would address this issue well.
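To illustrate what "pushing towards nanosecs" involves, here is a rough R-level sketch that keeps the whole seconds and the fractional digits separate and combines them in integer64, so no precision is lost. `iso_to_ns` is a hypothetical helper that handles only fixed-width UTC strings; it is not how fread parses timestamps.

```r
library(bit64)

# Hypothetical helper: accepts only "YYYY-MM-DDTHH:MM:SS.fffffffff" strings in UTC.
iso_to_ns <- function(s) {
  # Whole seconds since the epoch, parsed from the first 19 characters.
  secs <- as.numeric(as.POSIXct(substr(s, 1, 19),
                                format = "%Y-%m-%dT%H:%M:%S", tz = "UTC"))
  # The nine fractional digits, kept as text so they never pass through a double.
  frac <- substr(s, 21, 29)
  # Combine in 64-bit integer arithmetic.
  as.integer64(secs) * as.integer64(1000000000) + as.integer64(frac)
}

ns <- iso_to_ns("2020-01-01T00:00:00.123456789")
ns_chr <- as.character(ns)
```

The resulting integer64 vector is the kind of value that `nanotime()` can wrap directly, which is why a dedicated parser would need an integer path rather than the double path used for POSIXct.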

jangorecki commented 4 years ago

I don't mind others breaking a license agreement, but the fact is that once a company can claim a loss due to a practice that breaks the license agreement, they could easily win the case in court. The deciding factor is probably how much the lawyers will cost and how much loss they can claim. Unfortunately it is a common practice among closed-source software and applies to many other tools; kdb+ is just one of them.

DogDaodao commented 1 year ago

Hello everyone, I have a related performance problem: freading timestamps directly from a csv file into a "nanotime" object is slower than reading them into the default POSIXct object.

Here is the code and benchmark:

library(nanotime)
library(data.table)
library(microbenchmark)
N <- 10000
set.seed(1)

options("nanotimeFormat"="%Y-%m-%d %H:%M:%E9S")
timestamp <- nanotime("1970-01-01 00:00:00.00000000") +  30 * 365 * 86400 * 1e9 * abs(runif(N))
timestamp = as.character(timestamp)
dt <- data.table(timestamp = timestamp)
fwrite(dt, "~/nanotime.csv")

microbenchmark(
    fread.posixct = fread("~/nanotime.csv", sep = ","),
    fread.nanotime = fread("~/nanotime.csv", sep = ",", colClasses = c(timestamp = "nanotime"))
)

Unit: milliseconds
           expr      min       lq     mean   median       uq       max neval
  fread.posixct 1.507113 1.562507 1.661992 1.617307 1.692918  3.296795   100
 fread.nanotime 6.879102 7.264173 7.695965 7.486686 7.814365 11.739935   100

But using as.nanotime to convert timestamp strings is faster than using as.POSIXct.

timestamp.str = fread("~/nanotime.csv", sep = ",", tz = "")$timestamp
microbenchmark(
    as.posixct = as.POSIXct(timestamp.str, tz = "UTC"),
    as.nanottime = as.nanotime(timestamp.str, tz = "UTC")
)
Unit: milliseconds
         expr       min        lq      mean    median        uq       max neval
   as.posixct 13.134221 13.334536 14.244936 13.529305 14.255547 25.690105   100
 as.nanottime  1.886608  1.937244  2.095484  1.971293  2.059592  4.541754   100
sessionInfo()

R version 4.1.3 (2022-03-10)
Platform: x86_64-redhat-linux-gnu (64-bit)
Running under: Fedora Linux 36 (Workstation Edition)

Matrix products: default
BLAS/LAPACK: /usr/lib64/libflexiblas.so.3.1

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] microbenchmark_1.4.9 data.table_1.14.4    nanotime_0.3.6      

loaded via a namespace (and not attached):
[1] zoo_1.8-10      bit_4.0.4       compiler_4.1.3  RcppCCTZ_0.2.11
[5] Rcpp_1.0.9      bit64_4.0.5     grid_4.1.3      lattice_0.20-45

jangorecki commented 1 year ago

To wrap up, this issue is about a nanotime parser for fread. Maybe that part could live inside the nanotime package at C level, and data.table would just call their C routine. I'm not sure how well that would fit, but it seems to be the proper place for it.

DogDaodao commented 1 year ago

@jangorecki thanks for your reply. Your suggestion reminded me that we should first locate the performance bottleneck of parsing nanotime in fread (freadR.c and fread.R) and then look for possible optimizations.

Here is my new test. In terms of time cost: fread.nanotime ≈ fread.str + as.nanotime

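# t.str is not defined in the snippet as posted; presumably it is the file read as strings:
t.str <- fread("~/nanotime.csv", sep = ",", tz = "")
microbenchmark(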
        fread.posixct = fread("~/nanotime.csv", sep = ","),
        fread.nanotime = fread("~/nanotime.csv", sep = ",", colClasses = c(timestamp = "nanotime")),
        fread.str = fread("~/nanotime.csv", sep = ",", tz = ""),
        as.nanotime = as.nanotime(t.str$timestamp)
    )
Unit: milliseconds
           expr      min       lq     mean   median       uq       max neval
  fread.posixct 1.587341 1.658027 1.744677 1.693936 1.760134  2.559966   100
 fread.nanotime 7.423895 7.728727 8.165997 7.901187 8.156232 16.042013   100
      fread.str 1.820626 1.933135 2.085893 2.000488 2.131145  4.908817   100
    as.nanotime 5.207591 5.400650 5.756606 5.499481 5.687385 13.326389   100

I'm also not sure how to improve it. As shown above, fread.posixct is also faster than as.nanotime with string input.

jangorecki commented 1 year ago

Possibly fread.nanotime is str+as.nanotime under the hood, while posixct has its own parser.

DogDaodao commented 1 year ago

"Possibly fread.nanotime is str+as.nanotime under the hood, while posixct has its own parser."

I think so.
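The two-step path discussed above can be written out explicitly. This is a sketch under assumptions: it needs a nanotime version that provides `as.nanotime()` for character input (the 0.3.x series used above does), and it only shows that both routes give the same values, not what fread does internally. The sample file is made up for illustration.

```r
library(data.table)
library(nanotime)

# Small illustrative file with ISO 8601 timestamps at nanosecond resolution.
f <- tempfile(fileext = ".csv")
writeLines(c("timestamp",
             "2020-01-01T00:00:00.123456789+00:00",
             "2020-01-02T12:34:56.000000001+00:00"), f)

# One step: let fread produce the nanotime column.
one_step <- fread(f, colClasses = c(timestamp = "nanotime"))

# Two steps: read the column as strings, then convert explicitly.
two_step <- fread(f, colClasses = c(timestamp = "character"))
two_step[, timestamp := as.nanotime(timestamp)]

same_values <- all(one_step$timestamp == two_step$timestamp)
```

Timing the two routes separately with microbenchmark, as in the benchmark above, is what suggests where the cost sits: if the one-step read costs about as much as the string read plus the conversion, the conversion is the part worth optimizing.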