Open statquant opened 4 years ago
Not addressing your question at all, but I just wanted to mention: please check whether publishing benchmarks of kdb conflicts with its license. Many (if not most) closed-source projects unfortunately have this kind of restriction in their license. As a result we are unable to publish data.table benchmarks against them.
Addressing your question. You could try to unclass nanotime before writing to csv, and apply class back after reading from csv into R. This is what #1656 is about.
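A minimal sketch of that round trip, assuming nanotime's underlying representation is an integer64 count of nanoseconds since the epoch (and that `as.integer64()` / `nanotime()` convert losslessly between the two):

```r
library(data.table)
library(bit64)
library(nanotime)

nt <- nanotime("1970-01-02T03:04:05.123456789+00:00")
dt <- data.table(ts = nt)

# write the raw integer64 nanosecond counts instead of formatted strings
dt[, ts := as.integer64(ts)]
fwrite(dt, "ts.csv")

# read back as integer64 (no precision loss) and restore the class
dt2 <- fread("ts.csv", colClasses = list(integer64 = "ts"))
dt2[, ts := nanotime(ts)]
```

The CSV then holds plain integers, which fread parses at full speed, at the cost of human readability of the file.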
Hello @jangorecki, kdb notoriously forbids it (as I guess you knew), but given that they do similar things against data.table (read https://kx.com/blog/kdb-interface-r/) I think it is only fair... anyway, I redacted the results.
@jangorecki thanks for your suggestion. I would indeed write int64 if the only reader were R, but I have several readers (kdb is one of them), so this is unfortunately out of the question. Does fread plan to have a special parser for nanotime (given your suggestion, I am guessing there is a call to nanotime somewhere in fread.[R|c])?
@statquant can you try your timings again on the fread-iso8601 branch? Something tells me we won't be able to get the precision you're after with double storage, in any case.
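To illustrate the double-storage concern (a quick sketch, not from the thread): a double has a 53-bit mantissa, while a nanosecond count since 1970 is already around 1.6e18, so the low-order digits simply cannot be represented:

```r
# nanoseconds since the epoch for 2020-01-01, stored as a double
x <- 1e9 * as.numeric(as.POSIXct("2020-01-01", tz = "UTC"))

# at this magnitude the spacing between adjacent doubles is 256,
# so adding a single nanosecond changes nothing
x + 1 == x  # TRUE
```

This is why a nanosecond-precision column needs 64-bit integer storage (as nanotime/bit64 provide) rather than numeric.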
Not sure what magic you did, but it is now 3x faster to load in R on my laptop (0.7s vs 2.1s) and much faster (> 2x) than "the one who must not be named" (for loading up to millisecond POSIXct resolution it is good enough).
Love to hear it! 😎
@statquant I just looked at the link to the kdb site you posted. I don't know what "R experts" they have (Louise Totten?), but in their benchmarks they benchmark as.data.frame rather than the actual operation in question.
@MichaelChirico sorry if I seem cheeky, but given what you've done for POSIXct, would pushing towards nanoseconds and casting to nanotime require a lot of additional work?
@MichaelChirico @statquant I thought the same, that would address this issue well.
I don't mind others breaking a license agreement, but the fact is that once a company can claim a loss due to a practice that breaks the license agreement, they could easily win the case in court. The deciding factor is probably how much the lawyers would cost and how much loss they can claim. Unfortunately this is a common practice among closed-source software; it applies to many other tools, and kdb+ is just one of them.
Hello everyone, I have a related performance problem: fread-ing timestamps directly from a csv file into a "nanotime" column is slower than into the default POSIXct.
Here is the code and benchmark:
library(nanotime)
library(data.table)
library(microbenchmark)
N <- 10000
set.seed(1)
options("nanotimeFormat"="%Y-%m-%d %H:%M:%E9S")
timestamp <- nanotime("1970-01-01 00:00:00.00000000") + 30 * 365 * 86400 * 1e9 * abs(runif(N))
timestamp <- as.character(timestamp)
dt <- data.table(timestamp = timestamp)
fwrite(dt, "~/nanotime.csv")
microbenchmark(
fread.posixct = fread("~/nanotime.csv", sep = ","),
fread.nanotime = fread("~/nanotime.csv", sep = ",", colClasses = c(timestamp = "nanotime"))
)
Unit: milliseconds
expr min lq mean median uq max neval
fread.posixct 1.507113 1.562507 1.661992 1.617307 1.692918 3.296795 100
fread.nanotime 6.879102 7.264173 7.695965 7.486686 7.814365 11.739935 100
But using as.nanotime is faster than as.POSIXct for converting the timestamp strings.
timestamp.str = fread("~/nanotime.csv", sep = ",", tz = "")$timestamp
microbenchmark(
as.posixct = as.POSIXct(timestamp.str, tz = "UTC"),
as.nanotime = as.nanotime(timestamp.str, tz = "UTC")
)
Unit: milliseconds
expr min lq mean median uq max neval
as.posixct 13.134221 13.334536 14.244936 13.529305 14.255547 25.690105 100
as.nanotime 1.886608 1.937244 2.095484 1.971293 2.059592 4.541754 100
sessionInfo()
R version 4.1.3 (2022-03-10)
Platform: x86_64-redhat-linux-gnu (64-bit)
Running under: Fedora Linux 36 (Workstation Edition)
Matrix products: default
BLAS/LAPACK: /usr/lib64/libflexiblas.so.3.1
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] microbenchmark_1.4.9 data.table_1.14.4 nanotime_0.3.6
loaded via a namespace (and not attached):
[1] zoo_1.8-10 bit_4.0.4 compiler_4.1.3 RcppCCTZ_0.2.11
[5] Rcpp_1.0.9 bit64_4.0.5 grid_4.1.3 lattice_0.20-45
To wrap up, this issue is about a nanotime parser for fread. Maybe that part could live inside the nanotime package at the C level, and data.table would just call their C routine. Not sure how well that would fit, but it seems to be the proper place for it.
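In the meantime, a user-level sketch of the same idea: parse the column as character, then convert the whole seconds and the fractional digits separately so the nanoseconds never pass through a double. This assumes fixed-format `"%Y-%m-%d %H:%M:%S"` text with exactly 9 fractional digits; `to_nanotime` is a hypothetical helper, not part of either package:

```r
library(bit64)
library(nanotime)

# "2020-01-01 00:00:00.123456789" -> nanotime, via integer64 arithmetic
# (assumes exactly 9 fractional digits; fewer digits would be misread)
to_nanotime <- function(s) {
  secs <- as.integer64(as.numeric(as.POSIXct(substr(s, 1, 19), tz = "UTC")))
  frac <- as.integer64(substr(s, 21, 29))  # the 9 fractional digits
  nanotime(secs * 1e9 + frac)
}
```

Whether this beats as.nanotime() on a full column would need benchmarking; the point is only that the conversion can stay in integer space end to end.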
@jangorecki thanks for your reply. Your suggestion reminded me that we should first identify the performance bottleneck of parsing nanotime in fread (freadR.c and fread.R) and then look for possible optimizations.
Here is my new test; the cost is roughly: fread.nanotime ≈ fread.str + as.nanotime
r$> t.str <- fread("~/nanotime.csv", sep = ",", tz = "")
r$> microbenchmark(
fread.posixct = fread("~/nanotime.csv", sep = ","),
fread.nanotime = fread("~/nanotime.csv", sep = ",", colClasses = c(timestamp = "nanotime")),
fread.str = fread("~/nanotime.csv", sep = ",", tz = ""),
as.nanotime = as.nanotime(t.str$timestamp)
)
Unit: milliseconds
expr min lq mean median uq max neval
fread.posixct 1.587341 1.658027 1.744677 1.693936 1.760134 2.559966 100
fread.nanotime 7.423895 7.728727 8.165997 7.901187 8.156232 16.042013 100
fread.str 1.820626 1.933135 2.085893 2.000488 2.131145 4.908817 100
as.nanotime 5.207591 5.400650 5.756606 5.499481 5.687385 13.326389 100
I'm also not sure how to improve it. As shown above, fread.posixct is also faster than as.nanotime on string input.
Possibly fread.nanotime is str+as.nanotime under the hood, while posixct has its own parser.
> Possibly fread.nanotime is str+as.nanotime under the hood, while posixct has its own parser.

I think so
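The medians in the table above are consistent with that hypothesis; a quick sanity check on the reported numbers:

```r
# median timings (ms) copied from the benchmark above
fread_str      <- 2.000488
as_nanotime    <- 5.499481
fread_nanotime <- 7.901187

fread_str + as_nanotime  # 7.499969, close to fread.nanotime's 7.901187
```

The small remainder would be the per-column dispatch overhead, but that is speculation without profiling.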
Hello, I was trying to get a feel of how efficient fread is at reading nanotime, so I ran the very naive comparison below against kdb.

When not reading nanotimes, fread approximately matches kdb; the first 5 runs give the following timings: [R and kdb timings redacted].

When reading nanotimes, fread is slower, while kdb is approximately as fast as reading symbols; the timings are: [kdb timings redacted].

I know 5 runs is probably insufficient and that mmap is tricky, so the above results might be useless, but the point is: is there something that can be done on the user side to speed things up, or is it just that nanotime is not as efficient at parsing strings as kdb is?

[session info omitted]