KxSystems / rkdb

R client for kdb+
https://code.kx.com/q/interfaces
Apache License 2.0
41 stars 29 forks source link

null conversion is incorrect #26

Closed statquant closed 6 years ago

statquant commented 6 years ago

Quick question to understand if this package will be developed further and new feature added (and/or debugged)

sv commented 6 years ago

yes. What features/fixes you would like to see? There is a pending release with following updates(current master vs 0.10) https://github.com/KxSystems/rkdb/compare/v0.10.0...master

We are happy to take patches as well.

statquant commented 6 years ago

Well as a start I'd think there should be test to make sure types are correctly cast from R to q and vice versa (as people are more likely to use this package to pass R data to q for processing than just querying a q server) I think it misses a clear description of what to expect. I understand patches are accepted, but it is still a kx product, so I was just wondering if kx had any plan to develop this package at a stage where it could be reliably used. Many thanks for starting this effort, it is very much appreciated.

statquant commented 6 years ago

In kdb -> R it is not too far

Fetching the following table

t:([]                                                                                               
    booleans:010b;                                                                                  
    shorts:`short$(1 2 0N);                                                                         
    ints:`int$(1 2 0N);                                                                             
    longs:`long$(1 2 0N);                                                                           
    reals:`real$(1 2 0N);                                                                           
    floats:`float$(1 2 0N);                                                                         
    chars:"ab ";                                                                                    
    symbols:`aa`bb`;                                                                                
    timestamps:(2015.01.01T00:00:00.000000000 2015.01.02T00:00:01.200000001 0Np);                   
    months:`month$(2006.07.01 2006.08.01 0N);                                                       
    dates:(2006.07.01 2006.08.01 0Nd);                                                              
    datetimes:(2006.07.21T09:13:39 2006.07.21T09:13:39.123 0Nz);                                    
    timespan:(12:00:00.000000000 12:00:00.123456789 0Nn);                                           
    minutes:(23:59 23:58 0Nu);                                                                      
    seconds:(23:59:59 23:59:58 0Nv);                                                                
    times:(09:01:02.042 09:01:02.043 0Nt);                                                          
    ts:(([] c1:`a`b`c; c2:10 20 30);([] c1:`a`b`c; c2:10 20 32);([] c1:`a`b`c; c2:10 20 33));       
    dictionaries:((`a`b`v!10 20 30);(`a`b`v!10 20 31);(`a`b`v!10 20 32));                           
    functions:({x};{x+1};{x+2})                                                                     
 );         

This arrives in R as:

   booleans shorts ints         longs reals floats chars symbols          timestamps months          dates           datetimes           timespan  minutess    seconds    times           ts dictionaries functions
1:    FALSE      1    1  1.000000e+00     1      1     a      aa 2015-01-01 00:00:00     78     2006-07-01 2006-07-21 09:13:39  4.320000e+04 secs 1439 mins 86399 secs 32462042 <data.frame>     10,20,30       {x}
2:     TRUE      2    2  2.000000e+00     2      2     b      bb 2015-01-02 00:00:01     79     2006-08-01 2006-07-21 09:13:39  4.320012e+04 secs 1438 mins 86398 secs 32462043 <data.frame>     10,20,31     {x+1}
3:    FALSE -32768   NA -9.223372e+18   NaN    NaN               1707-09-22 00:11:28     NA -5877611-06-22                <NA> -9.223372e+09 secs   NA mins    NA secs       NA <data.frame>     10,20,32     {x+2}

a look at types within the result in R

> str(DF)
Classes ‘data.table’ and 'data.frame':  3 obs. of  19 variables:
 $ booleans    : logi  FALSE TRUE FALSE
 $ shorts      : int  1 2 -32768
 $ ints        : int  1 2 NA
 $ longs       : num  1.00 2.00 -9.22e+18
 $ reals       : num  1 2 NaN
 $ floats      : num  1 2 NaN
 $ chars       : chr  "a" "b" " "
 $ symbols     : chr  "aa" "bb" ""
 $ timestamps  : POSIXct, format: "2015-01-01 00:00:00" "2015-01-02 00:00:01" "1707-09-22 00:11:28"
 $ months      : int  78 79 NA
 $ dates       : Date, format: "2006-07-01" "2006-08-01" "-5877611-06-22"
 $ datetimes   : POSIXct, format: "2006-07-21 09:13:39" "2006-07-21 09:13:39" NA
 $ timespan    :Class 'difftime'  atomic [1:3] 4.32e+04 4.32e+04 -9.22e+09
  .. ..- attr(*, "units")= chr "secs"
 $ minutess    :Class 'difftime'  atomic [1:3] 1439 1438 NA
  .. ..- attr(*, "units")= chr "mins"
 $ seconds     :Class 'difftime'  atomic [1:3] 86399 86398 NA
  .. ..- attr(*, "units")= chr "secs"
 $ times       : int  32462042 32462043 NA
 $ ts          :List of 3
  ..$ :'data.frame':    3 obs. of  2 variables:
  .. ..$ c1: chr  "a" "b" "c"
  .. ..$ c2: num  10 20 30
  ..$ :'data.frame':    3 obs. of  2 variables:
  .. ..$ c1: chr  "a" "b" "c"
  .. ..$ c2: num  10 20 32
  ..$ :'data.frame':    3 obs. of  2 variables:
  .. ..$ c1: chr  "a" "b" "c"
  .. ..$ c2: num  10 20 33
 $ dictionaries:List of 3
  ..$ : Named num  10 20 30
  .. ..- attr(*, "names")= chr  "a" "b" "v"
  ..$ : Named num  10 20 31
  .. ..- attr(*, "names")= chr  "a" "b" "v"
  ..$ : Named num  10 20 32
  .. ..- attr(*, "names")= chr  "a" "b" "v"
 $ functions   : chr  "{x}" "{x+1}" "{x+2}"

As such the bugs are

  1. short 0N are not handled correctly => should be NaN
  2. int 0N are not consistently transformed to NaN but to NA => should be NaN
  3. short 0N are not handled correctly => should be NaN
  4. timstamp 0N are not handled correctly => should be NA
  5. date 0N are not handled correctly => should be NA
  6. timespan 0N are not handled correctly => should be NA

There are also questionable choices

  1. months are casted to int, I thing we would be better of as Character like "2015-01"
  2. times seem to be casted as integer why not difftime like all the other
  3. time measurement in general are coming back as difftime, I personally hate that but I can see the point. Problem is that resolution then is limited to milliseconds, I advise we use difftime for kdb objects that have a resolution less or equal to milli and character for others.
  4. timestamp comes back as POSIXct, this implies the same milli (to macro depending on plateform) limitation. Can I suggest looking at https://github.com/eddelbuettel/nanotime that seems to be supporting nanos (by storing on int64)
sv commented 6 years ago

We have a strong interest in getting this package to the stage when it can be reliably used. Development is user and usage driven based on reports like this. There are few people who have some code built around rkdb with lots of experience of using it(including you) and would be nice if we can push some of the common code to this package.

sv commented 6 years ago

I have pushed fixes for null conversions in my repo https://github.com/sv/rkdb/commit/036e2131e9c0119d8149d9ba8086c7b9c2f6cc3d would be great if you could try it via devtools

sv commented 6 years ago

raised PR https://github.com/KxSystems/rkdb/pull/29 and would be great if you give it a go

statquant commented 6 years ago

Ok I did using kdn 3.5 and your lattest version from your sv repo

q)t
booleans shorts ints longs reals floats chars symbols timestamps                    months  dates      datetimes               timespan             minutes seconds  times        ts                        dictionaries    functions
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
0        1      1    1     1     1      a     aa      2015.01.01D00:00:00.000000000 2006.07 2006.07.01 2006.07.21T09:13:39.000 0D12:00:00.000000000 23:59   23:59:59 09:01:02.042 +`c1`c2!(`a`b`c;10 20 30) `a`b`v!10 20 30 {x}      
1        2      2    2     2     2      b     bb      2015.01.02D00:00:01.200000001 2006.08 2006.08.01 2006.07.21T09:13:39.123 0D12:00:00.123456789 23:58   23:59:58 09:01:02.043 +`c1`c2!(`a`b`c;10 20 32) `a`b`v!10 20 31 {x+1}    
0                                                                                                                                                                                 +`c1`c2!(`a`b`c;10 20 33) `a`b`v!10 20 32 {x+2}  

comes in R

R> execute(con,'t')
  booleans shorts ints longs reals floats chars symbols            timestamps months      dates               datetimes   timespan   minutes    seconds    times                  ts dictionaries functions
1    FALSE      1    1     1     1      1     a      aa 2014-12-31 19:00:00.0     78 2006-07-01 2006-07-21 09:13:39.000 43200 secs 1439 mins 86399 secs 32462042 a, b, c, 10, 20, 30   10, 20, 30       {x}
2     TRUE      2    2     2     2      2     b      bb 2015-01-01 19:00:01.2     79 2006-08-01 2006-07-21 09:13:39.122 43200 secs 1438 mins 86398 secs 32462043 a, b, c, 10, 20, 32   10, 20, 31     {x+1}
3    FALSE     NA   NA   NaN   NaN    NaN                                <NA>     NA       <NA>                    <NA>   NaN secs   NA mins    NA secs       NA a, b, c, 10, 20, 33   10, 20, 32     {x+2}

0Nx are all handled, though not consistently as to what is NA and NaN I realized this test on a NYC server (both R and kdb) I just realized that the timestamp and the datetime seem to handle timezone differently as the datetime is matching but the timestamp has a 5 hours shift in R (my local might be Europe/London)

Regards

sv commented 6 years ago

Please raise another issue for time shift if needed. This has been merged.