KxSystems / rkdb

R client for kdb+
https://code.kx.com/q/interfaces
Apache License 2.0
41 stars 30 forks source link

timestamp conversion not precise #30

Closed lwshang closed 6 years ago

lwshang commented 6 years ago

I'm aware of the truth that it is not possible to convert a timestamp of q into POSIXct in R perfectly. It is still reasonable to request that: every R POSIXct converted to q timestamp and then converted back to R POSIXct should give the same value as original R POSIXct.

However, current rkdb has a problem with it. For example:

> Qtorigin <- ISOdatetime(2000,1,1,0,0,0,tz='UTC')
> x <- as.POSIXct(1534458275.511368513107,origin=ISOdatetime(1970,1,1,0,0,0,tz='UTC'))
> format(x,'%Y-%m-%d %H:%M:%OS6 %Z',tz='UTC')
[1] "2018-08-16 22:24:35.511368 UTC"
# xs is a string that should generate the closest Q timestamp
> xs <- sprintf('12h$%.0fj',1e9*as.numeric(x-Qtorigin,units='secs'))
> xs
[1] "12h$587773475511368576j"

In a q session,

q)12h$587773475511368576j
2018.08.16D22:24:35.511368576
> y <- execute(db,xs)                                   
#  It is the same to the maximum 6 decimal places that R shows
> format(y,'%Y-%m-%d %H:%M:%OS6 %Z',tz='UTC')
[1] "2018-08-16 22:24:35.511368 UTC"

#  It is close but it is not the same
> x == y
[1] FALSE
> y - x

This issue is caused by floating-point arithmetic in https://github.com/KxSystems/rkdb/blob/cd36c3eaf76e0bd379d3f767ba27f3bbf3b9e262/src/common.c#L521-L534

The problem is that the computation 946684800 + x->j / 1e9; "promotes" the long integer x->j to double precision in order to divide by 1e9. But a double precision mantissa has only 53 bits, and x->j is so large that it requires 59 bits (as of year 2018). So precision is lost.

I've come up with a fix by converting the integer number of seconds and nanoseconds separately. I would like to implement it in my fork of this project and send a pull request shortly.

lwshang commented 6 years ago

There is another example to show the problem.

> library(rkdb)
> db <- open_connection(port=9999)
> diff(which(diff(execute(db, ".z.P + til 10000"))>0))
 [1] 129 383 129 256 127 385 127 385 127 385 383 129 383 129 383 129 256 127 385
[20] 127 385 127 385 127 256 129 383 129 383 129 383 385 127 385 127 385 127 256
[39] 129

I convert 10,000 q consecutive timestamps into R. It's expected to see that a bunch of timestamps point to the same R POSIXct. But the length of each step should also be uniform. More precisely, considering the binary representation of the R POSIXct (numeric), the minimal increment of it is 2 ** -22 = 2.384186e-07 which is about 238.4 nanoseconds. So the step should be close to 238.

sv commented 6 years ago

Thanks. Could you please add these as test cases in rkdb/tests/testthat/?

lwshang commented 6 years ago

During the procedure of running the tests, I got following test failures:

✔ | OK F W S | Context
✖ | 58 5     | basic
────────────────────────────────────────────────────────────────────────────────
test-basic.R:55: failure: kdb types to R types
as.POSIXct(timestamp) not equal to as.POSIXct("2015-01-01 00:01:00.000000").
1/1 mismatches
[1] 2014-12-31 19:01:00 - 2015-01-01 00:01:00 == -5 hours

test-basic.R:73: failure: kdb types to R types
`timespan` not equal to as.difftime("12:00:00.000000000").
Attributes: < Component “units”: 1 string mismatch >
Mean relative difference: 0.9997222

test-basic.R:84: failure: kdb types to R types
`time` inherits from `integer` not `POSIXt`.

test-basic.R:85: failure: kdb types to R types
`time` not equal to as.POSIXct("12:00:00.000", format = "%H:%M:%S").
Classes differ: integer is not POSIXct/POSIXt

test-basic.R:161: failure: R types to kdb types
`namedVector` not equal to c(okType = TRUE, okValue = TRUE).
2 element mismatches
────────────────────────────────────────────────────────────────────────────────
✔ |  3       | dict
✔ |  1       | large [2.2 s]
✔ |  2       | tables
✔ |  1       | timestamp

Could you check it? Maybe we need another issue to track it.