Closed by eddelbuettel 4 years ago
I'm glad that you like the package and found it useful!
I am always happy to receive a PR and don't mind adding nanotime as a dependency as I have already used it and found it quite useful and stable.
If you have any other comment, idea, or use case, feel free to open another issue or otherwise get in contact with me!
Great. I will send a PR; it cleaned up a few other (minor) R CMD check
issues too. If you find the PR too large we could disentangle it.
We could do one better and return the timestamp (and other potentially 'large' values, such as message ids) as `integer64` from the C++ side -- this is described here:
https://gallery.rcpp.org/articles/creating-integer64-and-nanotime-vectors/
Not too urgent but maybe an improvement worth making. If you want I can probably add another simple commit, or you could follow the Gallery piece.
So far this package uses C++'s `unsigned long long` for most numbers related to counting. Do we lose (too much) time on the conversion to R types?
I think right now the biggest bottleneck is that we parse everything into `std::vector<unsigned long long>` (or `std::vector<std::string>`; see MessageTypes.h), then bundle it together with `Rcpp::DataFrame::create()` (MessageTypes.cpp), bring it back to R, and convert it to a data.table (get_orders.R). It would be a lot leaner if we were able to directly create and write a data.table in Rcpp, but so far I haven't found an API for that. If you have any resources on that, I could reimplement it.
FYI: I have updated some smaller parts of the functions' verbosity (added timings) and added an R Markdown README after your PR.
A couple of quick things:

so far I haven't found an API for that

There isn't :-/ I asked years ago (there is an old issue ticket). The first bits of code forming an API were added by @lsilvest, who had opened that can himself and then contributed the code to data.table.
Next, "conversion". There is simply no way around it: R only has (signed) `int32` and `double`. Nothing else. So we always have that one copy to make.
Lastly, "style". I also used to program with `unsigned int` etc., but eventually I converted to the more common and explicit style of using `int32_t`, `int64_t`, ... plus the unsigned types such as `uint32_t`. Only with these are you guaranteed how many bits are spent on your data; the C standard (and hence C++) leaves room for compiler details, which is generally not a good thing :) We haven't had a transition in a while, but it would be worthwhile to clean this up.
In R, and thanks to the ingenuity of Jens, we sorta/kinda have `int64_t`. We do not have `uint64_t`, so there is a risk of lossy conversion at the very top of the range. Nothing really we can do about it (unless we hack this up a la Jens and implement a `uint64_t`-to-`double` mapping).
Maybe we should open fresh issues for new topics.
Very good to know these tweaks; I will apply them as soon as I find the time. Thank you for taking the time to leave these valuable comments, they are greatly appreciated.
To summarise my todos:

- Change `unsigned long long` to `int64_t`; better yet, initialise the vectors directly as `Rcpp::NumericVector` and use `memcpy` to copy the 64-bit values, then set the S4 class of that vector to nanotime (following your example). This means we lose a bit of flexibility as we include more Rcpp code (for example if someone wants to port this version of the ITCH parser to Python or so), but that sounds like a cheap price to pay for a potential speedup and overall nicer code - will do!
- Lastly, wait for a native data.table Rcpp API :) or get active in that direction.
Sounds good! Conversion of `unsigned long long int` to `int64_t` sounds like tedium but is better in the long run. In your case, converting 'down in C++' to `nanotime` and/or `int64` may not matter much performance-wise, but it is a nice trick to know.
About creating a data.table at the C or C++ level without a deep copy: I think the data.table package exports the main function that is needed, and it seems to be relatively straightforward. I've created https://github.com/lsilvest/data.table_at_c_level to show a solution.
That's something I will be testing extensively, as I need that functionality for my rztsdb package, so I will be able to report on how the solution fares.
Fantastic, will look more closely later! @lsilvest I may tie this up in an Rcpp Gallery story. You may want to contribute the helper functions as a header to data.table with properly inlined code so that we can just include it elsewhere. Or make it a minuscule helper package.
Indeed, a really helpful comment and code that I was able to use right away (shaves off about 1 second from the `get_orders()` et al. functions). A push will come to this repo as soon as it is fully operational.
Really nice package, congrats! I was looking for a data sample with actual market data containing nanosecond resolution, and my co-author @lsilvest pointed me to your blog post, package and the NASDAQ ftp site. Right on!
I made a quick modification importing our nanotime package. Would you be interested in a PR? The change is pretty simple, as `nanotime()` can convert from a date (which you have) and just add the nanoseconds since midnight (which you parsed). On the `20190530.BX_ITCH_50` file I now get the converted column (you need to scroll to the right to see it). Of course, this would add a new dependency, and maintainers don't always want that. On the other hand, data.table already supports it too, so ... Let me know what you think.