DavZim / RITCH

An R interface to the ITCH Protocol
https://davzim.github.io/RITCH/

Added support for nanosecond timestamps #14

Closed · eddelbuettel closed this issue 4 years ago

eddelbuettel commented 4 years ago

Really nice package, congrats! I was looking for a data sample with actual market data containing nanosecond resolution, and my co-author @lsilvest pointed me to your blog post, package and the NASDAQ ftp site. Right on!

I made a quick modification importing our nanotime package. Would you be interested in a PR? The change is pretty simple: nanotime() can convert from a date (which you have), and we just add the nanoseconds since midnight (which you parsed). On the 20190530.BX_ITCH_50 file I now get the following (you need to scroll to the right to see the converted column):

edd@rob:~/git/ritch-demo$ r quickTest.R 
[Counting]   36458117 messages found
[Converting] to data.table
227386 messages found
[Loading]    ...........
[Converting] to data.table
[Formatting]
        msg_type locate_code tracking_number   timestamp order_ref  buy shares stock  price match_number cross_type       date                            datetime
     1:        P        1143               2 2.57541e+13         0 TRUE    200  BYND 102.72        17582       <NA> 2019-05-30 2019-05-30T07:09:14.082195056+00:00
     2:        P        1143               2 2.57541e+13         0 TRUE     64  BYND 102.72        17583       <NA> 2019-05-30 2019-05-30T07:09:14.083232401+00:00
     3:        P        7914               2 2.65237e+13         0 TRUE    100  TVIX  23.90        17585       <NA> 2019-05-30 2019-05-30T07:22:03.749811889+00:00
     4:        P        7914               4 2.65237e+13         0 TRUE      1  TVIX  23.91        17586       <NA> 2019-05-30 2019-05-30T07:22:03.749811889+00:00
     5:        P        7914               2 2.66370e+13         0 TRUE     50  TVIX  23.91        17587       <NA> 2019-05-30 2019-05-30T07:23:56.984628764+00:00
    ---                                                                                                                                                           
227382:        P        7967               6 6.26253e+13         0 TRUE    700  UBER  41.20      1072268       <NA> 2019-05-30 2019-05-30T17:23:45.331694633+00:00
227383:        P        7967               2 6.26508e+13         0 TRUE    100  UBER  41.22      1072271       <NA> 2019-05-30 2019-05-30T17:24:10.812690889+00:00
227384:        P        5798               2 6.27109e+13         0 TRUE      1  OKTA 114.96      1072273       <NA> 2019-05-30 2019-05-30T17:25:10.873864531+00:00
227385:        P        3463               4 6.29019e+13         0 TRUE    200   GPS  18.25      1072275       <NA> 2019-05-30 2019-05-30T17:28:21.864020526+00:00
227386:        P        1017               4 6.78570e+13         0 TRUE    100  BRFS   7.79      1072284       <NA> 2019-05-30 2019-05-30T18:50:57.005154488+00:00
edd@rob:~/git/ritch-demo$
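
The conversion itself is plain arithmetic: a nanotime is nanoseconds since the UNIX epoch, so midnight of the session date plus the parsed nanoseconds since midnight gives the value in the datetime column. A minimal standalone C++ sketch of that arithmetic, with the day count for 2019-05-30 and the nanosecond value of the first row above filled in by hand:

```cpp
#include <cstdint>
#include <iostream>

int main() {
    // midnight of the session date, expressed as days since 1970-01-01
    const int64_t days_since_epoch  = 18046;                 // 2019-05-30
    const int64_t ns_per_day        = 86400LL * 1000000000LL;
    // parsed nanoseconds since midnight, first row of the output above
    const int64_t ns_since_midnight = 25754082195056LL;      // 07:09:14.082195056
    std::cout << days_since_epoch * ns_per_day + ns_since_midnight << "\n";
    // prints 1559200154082195056, i.e. 2019-05-30T07:09:14.082195056+00:00
}
```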

Of course, this would add a new dependency, which maintainers don't always want. On the other hand, data.table already supports it too, so ... Let me know what you think.

DavZim commented 4 years ago

I'm glad that you like the package and found it useful!

I am always happy to receive a PR, and I don't mind adding nanotime as a dependency; I have already used it and found it quite useful and stable.

If you have any other comments, ideas, or use cases, feel free to open another issue or otherwise get in contact with me!

eddelbuettel commented 4 years ago

Great, I will send a PR; it also cleans up a few other (minor) R CMD check issues. If you find the PR too large, we could disentangle it.

eddelbuettel commented 4 years ago

We could do one better and return the timestamp (and other, potentially 'large' values such as message ids) as integer64 from the C++ side -- this is described here:

https://gallery.rcpp.org/articles/creating-integer64-and-nanotime-vectors/

Not too urgent but maybe an improvement worth making. If you want I can probably add another simple commit, or you could follow the Gallery piece.
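
Roughly following that Gallery piece, the idea is to move the 64-bit payload into an R numeric vector bit-for-bit and then tag the vector with the nanotime S4 class. A hedged sketch (the function name and signature are made up for illustration, this is not RITCH's code):

```cpp
#include <Rcpp.h>
#include <cstdint>
#include <cstring>
#include <vector>

// Hedged sketch following the Rcpp Gallery piece linked above (not RITCH's
// actual code): the int64_t nanosecond values travel bit-for-bit inside an
// R numeric (double) vector, which is then tagged with the nanotime S4 class.
// [[Rcpp::export]]
Rcpp::NumericVector make_nanotime(const std::vector<int64_t>& ns) {
    Rcpp::NumericVector out(ns.size());
    if (!ns.empty()) {
        // copy the raw 64-bit patterns; sizeof(double) == sizeof(int64_t) == 8
        std::memcpy(&(out[0]), ns.data(), ns.size() * sizeof(double));
    }
    Rcpp::CharacterVector cl = Rcpp::CharacterVector::create("nanotime");
    cl.attr("package") = "nanotime";      // nanotime is an S4 class from that package
    out.attr(".S3Class") = "integer64";   // underlying bit64 representation
    out.attr("class") = cl;
    SET_S4_OBJECT(out);                   // flag the object as S4
    return out;
}
```

Once the nanotime package is loaded on the R side, such a vector prints and computes as nanotime values.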

DavZim commented 4 years ago

So far this package uses C++'s unsigned long long for most numbers related to counting. Do we lose (too much) time on the conversion to R types?

I think the biggest bottleneck right now is that we parse everything into std::vector<unsigned long long/string> (see MessageTypes.h), then bundle it together with Rcpp::DataFrame::create() (MessageTypes.cpp), bring it back to R, and convert it to a data.table (get_orders.R). It would be a lot leaner if we were able to create and write a data.table directly in Rcpp, but so far I haven't found an API for that. If you have any resources on that, I could reimplement it.
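
For readers following along, a much simplified schematic of that flow (this is not the actual MessageTypes.cpp code; the two columns stand in for the many real ones):

```cpp
#include <Rcpp.h>
#include <string>
#include <vector>

// Much simplified schematic of the flow described above (not the actual
// MessageTypes.cpp code): columns accumulate in std::vector while parsing,
// get bundled into an Rcpp::DataFrame here, and get_orders.R then converts
// the result to a data.table on the R side.
// [[Rcpp::export]]
Rcpp::DataFrame bundle_messages(const std::vector<unsigned long long>& timestamp,
                                const std::vector<std::string>& stock) {
    // unsigned long long is wrapped as R numeric (double), which is why the
    // timestamp column prints as 2.57541e+13 etc. in the output above
    return Rcpp::DataFrame::create(
        Rcpp::Named("timestamp")        = timestamp,
        Rcpp::Named("stock")            = stock,
        Rcpp::Named("stringsAsFactors") = false);
}
```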

FYI: I have updated some smaller parts of the functions' verbosity (added timings) and added an R Markdown README after your PR.

eddelbuettel commented 4 years ago

Couple quick things:

First, "so far I haven't found an API for that".

There isn't :-/ I asked years ago (there is an old issue ticket). The first bits of code forming an API were added by @lsilvest, who had opened that can of worms himself and then contributed the code to data.table.

Next, "conversion". There is simply no way around it: R only has (signed) int32 and double, nothing else, so we always have that one copy to make.

Lastly, "style". I also used to program with unsigned int etc., but eventually I converted to the more common and explicit style of using int32_t, int64_t, ... plus the unsigned types such as uint32_t. Only with these are you guaranteed how many bits are spent on your data; the C standard (and hence C++) leaves room for compiler details, which is generally not a good thing :) We haven't had a transition in a while, but it would be worthwhile to clean this up.

In R, and thanks to the ingenuity of Jens, we sorta/kinda have int64_t. We do not have uint64_t so there is a risk of lossy conversion at the very top of the range. Nothing really we can do about it (unless we hack this up a la Jens and implement a uint64_t to double mapping).
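
To make the "top of the range" point concrete, a small standalone illustration (the behaviour of the cast at 2^63 shown here is the usual two's-complement wraparound on common platforms):

```cpp
#include <cstdint>
#include <iostream>

int main() {
    // integer64 (int64_t) covers unsigned values only up to 2^63 - 1;
    // reinterpreting anything above that flips negative on the usual
    // two's-complement platforms, which is the "top of the range" risk.
    const uint64_t ok  = 9223372036854775807ULL;    // 2^63 - 1 == INT64_MAX
    const uint64_t big = ok + 1;                    // 2^63, no int64_t equivalent
    std::cout << static_cast<int64_t>(ok)  << "\n"  // 9223372036854775807
              << static_cast<int64_t>(big) << "\n"; // -9223372036854775808
}
```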

Maybe we should open fresh issues for new topics.

DavZim commented 4 years ago

Very good to know about these tweaks; I will apply them as soon as I find the time. Thank you for taking the time to give these valuable comments; they are greatly appreciated.

To summarise my to-dos: change unsigned long long to int64_t, or, even better, initialise the vectors directly as Rcpp::NumericVector and use memcpy to copy the 64-bit values, then set the S4 class of that vector to nanotime (following your example). This means we lose a bit of flexibility as we include more Rcpp code (for example, if someone wants to port this version of the ITCH parser to Python), but that sounds like a small price to pay for a potential speedup and overall nicer code - will do!
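
A hedged sketch of what that could look like (ITCH 5.0 stores timestamps as 6-byte big-endian nanoseconds since midnight; read_ts6() and the simplified input layout are illustrative assumptions, not the eventual implementation):

```cpp
#include <Rcpp.h>
#include <cstdint>
#include <cstring>

// Hedged sketch of the to-do above, not the eventual implementation. ITCH 5.0
// timestamps are 6-byte big-endian nanoseconds since midnight; read_ts6() is a
// stand-in for the real field parser, and the input is simplified to a raw
// vector of n consecutive 6-byte timestamp fields.
inline int64_t read_ts6(const Rbyte* b) {
    int64_t v = 0;
    for (int i = 0; i < 6; ++i) v = (v << 8) | b[i];
    return v;
}

// [[Rcpp::export]]
Rcpp::NumericVector parse_timestamps(Rcpp::RawVector fields) {
    const R_xlen_t n = fields.size() / 6;
    Rcpp::NumericVector col(n);                    // pre-allocated, filled in place
    const Rbyte* p = fields.begin();
    for (R_xlen_t i = 0; i < n; ++i) {
        const int64_t ts = read_ts6(p + i * 6);
        std::memcpy(&(col[i]), &ts, sizeof(ts));   // copy the bits, not the value
    }
    // the integer64 / nanotime class attributes would be set afterwards,
    // exactly as in the Gallery-style sketch further up
    return col;
}
```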

Lastly, wait for a native data.table Rcpp API :) or get active in that direction.

eddelbuettel commented 4 years ago

Sounds good! Conversion of unsigned long long int to int64_t sounds like tedium but is better in the long run. In your case converting 'down in C++' to nanotime and/or int64 may not matter much performance-wise, but it is a nice trick to know.

lsilvest commented 4 years ago

About creating a data.table at the C or C++ level without a deep copy: I think the data.table package exports the main function that is needed, and it seems to be relatively straightforward. I've created https://github.com/lsilvest/data.table_at_c_level to show a solution.

That's something I will be testing extensively, as I need this functionality for my rztsdb package, so I will be able to report on how the solution fares.
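
For anyone who wants a feel for the idea without following the link, one hedged way to sketch it at the Rcpp level (this is not necessarily what the linked repo does, which works at the C level; data.table::setalloccol() performs only a shallow over-allocation copy, not a deep copy of the column data):

```cpp
#include <Rcpp.h>

// Hedged sketch, not necessarily the approach in the linked repo: build the
// columns as a plain list on the C++ side, stamp the data.table class and the
// compact row.names on it, then let data.table::setalloccol() perform the
// over-allocation / self-reference setup that setDT() would normally do.
// [[Rcpp::export]]
Rcpp::List make_dt(Rcpp::NumericVector timestamp, Rcpp::CharacterVector stock) {
    const int n = static_cast<int>(timestamp.size());   // columns assumed equal length
    Rcpp::List res = Rcpp::List::create(
        Rcpp::Named("timestamp") = timestamp,
        Rcpp::Named("stock")     = stock);
    res.attr("class")     = Rcpp::CharacterVector::create("data.table", "data.frame");
    res.attr("row.names") = Rcpp::IntegerVector::create(NA_INTEGER, -n);
    Rcpp::Function setalloccol =
        Rcpp::Environment::namespace_env("data.table")["setalloccol"];
    return setalloccol(res);
}
```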

eddelbuettel commented 4 years ago

Fantastic, I will look more closely later! @lsilvest I may tie this up in an Rcpp Gallery story. You may want to contribute the helper functions as a header to data.table with properly inlined code so that we can just include it elsewhere, or make it a minuscule helper package.

DavZim commented 4 years ago

Indeed, a really helpful comment and code that I was able to use right away (it shaves off about 1 second from get_orders() et al.). A push will come to this repo as soon as it is fully operational.