Closed: eddelbuettel closed this issue 4 years ago.
I am not really sure about this one. Let me explain:

I see two options to include zlib:

1. At each file operation (`fopen()`, `fread()`, `ftell()`, `fseek()`, `fclose()`), add the appropriate zlib case depending on the file type (a sketch follows below).
2. Replace `R.utils::gunzip()` by an internal Rcpp function, which "just" inflates the gzipped ITCH file to a temporary one, reducing the dependency on R.utils.
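To illustrate option 1, a minimal sketch of dispatching between plain and gzip-compressed reads could look like the following (the `GenericReader` helper and the extension check are purely illustrative, not RITCH's actual code):

```cpp
// Hypothetical sketch of option 1: a thin reader that dispatches between
// plain stdio and zlib depending on whether the file is gzip-compressed.
#include <cstdio>
#include <string>
#include <zlib.h>

struct GenericReader {
  FILE*  plain = nullptr;
  gzFile gz    = nullptr;

  explicit GenericReader(const std::string& path) {
    // crude check: treat files ending in ".gz" as gzip-compressed
    const bool is_gz = path.size() > 3 &&
                       path.compare(path.size() - 3, 3, ".gz") == 0;
    if (is_gz) gz = gzopen(path.c_str(), "rb");
    else       plain = std::fopen(path.c_str(), "rb");
  }

  // read up to 'len' bytes into 'buf', returning the number of bytes read
  int read(void* buf, unsigned int len) {
    return gz ? gzread(gz, buf, len)
              : static_cast<int>(std::fread(buf, 1, len, plain));
  }

  ~GenericReader() {
    if (gz)    gzclose(gz);
    if (plain) std::fclose(plain);
  }
};
```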
Ad 1: Currently the `get_*` functions pass over the file twice under normal use (this can be seen in the function `getMessages()`): first to count all messages in order to reserve space for the vectors that hold the message information (`countMessages()`), then to actually parse the message information into the vectors (`loadToMessages()`).

I have added all appropriate zlib calls, but because this inflates the file twice, replacing R.utils with zlib increases the runtime both for `countMessages()` and for `loadToMessages()` (5.3s; on plain files the runtimes for the two functions are 0.5s and 1.2s). We see that zlib by itself inflates the data faster than R.utils does. But to properly incorporate 1. I would need to restructure RITCH's internal logic to avoid first counting and then parsing the information, i.e., passing/reading the file twice (this, by the way, only applies to the case where no `end_msg_count` or `count_message()` is given).
A brute-force option would be to resize the vectors to some reasonably large number (say 100m), parse the information, and then prune the vectors again. Not sure if this is a smart way to go about it.
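For illustration, that brute-force pattern would amount to something like this (using `std::vector` and placeholder names purely as a sketch, not RITCH's actual data structures):

```cpp
#include <cstdint>
#include <vector>

// Placeholder for one parsed ITCH message (illustrative only)
struct Msg { std::uint64_t timestamp; char type; };

// Hypothetical parser hook; a real version would read one message from the file
bool parse_next_message(Msg& /*out*/) { return false; }

void load_single_pass() {
  std::vector<Msg> messages;
  messages.reserve(100'000'000);   // "reasonably large" upper bound (~100m)
  Msg m;
  while (parse_next_message(m))    // single pass: parse and append as we go
    messages.push_back(m);
  messages.shrink_to_fit();        // prune the unused capacity again
}
```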
Ad 2: This seems more promising; it takes around 4 seconds (compared to 5s with R.utils) and requires little restructuring.
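For reference, option 2 essentially boils down to a loop like the following (a minimal sketch with abbreviated error handling and illustrative names, not the actual internal function):

```cpp
// Minimal sketch of option 2: inflate a .gz ITCH file to a plain temporary
// file using zlib, then let the existing plain-file parser run on it.
#include <cstdio>
#include <stdexcept>
#include <string>
#include <vector>
#include <zlib.h>

void gunzip_to_file(const std::string& infile, const std::string& outfile) {
  gzFile in = gzopen(infile.c_str(), "rb");
  if (in == nullptr) throw std::runtime_error("cannot open " + infile);
  FILE* out = std::fopen(outfile.c_str(), "wb");
  if (out == nullptr) { gzclose(in); throw std::runtime_error("cannot open " + outfile); }

  std::vector<char> buf(1 << 20);  // 1 MB chunks; stays well below UINT_MAX
  int n;
  while ((n = gzread(in, buf.data(), static_cast<unsigned int>(buf.size()))) > 0)
    std::fwrite(buf.data(), 1, static_cast<std::size_t>(n), out);

  std::fclose(out);
  gzclose(in);
  if (n < 0) throw std::runtime_error("gzread failed on " + infile);
}
```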
Does this make sense to you?
It does make sense. But consider
```
edd@rob:~/git/ritch-demo$ ls -l *gz *rds
-rw-r--r-- 1 edd edd 487235329 Apr 20 08:00 20190530.BX_ITCH_50.gz
-rw-r--r-- 1 edd edd 721906618 Apr 20 08:20 20200130.BX_ITCH_50.gz
edd@rob:~/git/ritch-demo$
```
The source files are compressed, so we would save the disk space. You could skip the if/else and just use the compressed file.
But I didn't realize the dual read and hence dual decompression. So it really is a 'space vs time' tradeoff. Your package, your call. I just wanted to mention that one can work off a .gz.
As for 2., I would definitely go the route of R's uncompressor and drop one Depends: if you can. But then I also believe in the tinyverse :grinning:
Closed with 225e7ab287ad74c818d93c79b9fe41f9192b9c01
This includes the zlib.h library, meaning that the decompression is done in Rcpp (which is faster). Also, I implemented some functionality that hopefully makes it easier to deal with the files. If you specify a `*.XX_ITCH_50.gz` file, it first checks whether `*.XX_ITCH_50` already exists. If no extra option to overwrite that file was specified, it takes the already gunzipped file; otherwise it gunzips the file again.

Additionally, a `force_cleanup` flag was added, which, if set to `TRUE`, removes the raw file afterwards if a gz file was given.

I think this is a nice middle way, which both boosts the speed of the package and adds the functionality to easily work with either raw files or .gz files.
Fyi, I just "verschlimmbessert" (made it worse while trying to improve it) the library and broke it in the process. Will be fixed later.
Turns out, I was comparing apples with oranges: the newest version has around the same runtimes but uses considerably less RAM ("newest version" referring to the approach using gzopen etc.).

For the sample file `20191230.BX_ITCH_50.gz`, max RAM usage for the `get_*` functions is now at 1.4GB, down from 2.2GB.
Count me in the 'verschlimmbessert' camp. Fresh pull and rebuild:
```
edd@rob:~/git/ritch-demo$ Rscript createTradesArrayLarge.r
[Decompressing] 20200130.BX_ITCH_50.gz

 *** caught segfault ***
address (nil), cause 'memory not mapped'

Traceback:
 1: gunzipFile_impl(infile, outfile, buffer_size)
 2: gunzip_file(file, raw_file, buffer_size)
 3: check_and_gunzip(file, buffer_size, force_gunzip, quiet)
 4: count_messages(file, add_meta_data = TRUE)
An irrecoverable exception occurred. R is aborting now ...
Segmentation fault (core dumped)
edd@rob:~/git/ritch-demo$
```
Works from the uncompressed file, though. I'm sure we'll get there eventually.
```
edd@rob:~/git/ritch-demo$ zcat 20200130.BX_ITCH_50.gz > 20200130.BX_ITCH_50
edd@rob:~/git/ritch-demo$ Rscript createTradesArrayLarge.r
[INFO] Unzipped file 20200130.BX_ITCH_50 already found, using that (overwrite with force_gunzip=TRUE)
[Counting] 53,268,643 messages found
[Converting] to data.table
[Done] in 1.11 secs
[INFO] Unzipped file 20200130.BX_ITCH_50 already found, using that (overwrite with force_gunzip=TRUE)
[Counting] 226,312 messages found
[Loading] .
[Converting] to data.table
[Done] in 1.27 secs
edd@rob:~/git/ritch-demo$
```
That is interesting... First of all, another reason to work on #17, especially having an ITCH writer and automated tests...
I'm downloading the file and will try to replicate the error. Will get back to you!
Ok, I was able to replicate and resolve the issue:

The default `buffer_size` in the gunzip function is 4 times the compressed file size. The issue got triggered when that value exceeded 2.1e9 (to be more precise, `.Machine$integer.max`).

First I thought that somehow passing that number to Rcpp triggered an integer overflow, as I found it was truncated to -1 in the gunzip function. A buffer of size -1 is a wonderful idea... or so the session thought, and crashed.

However, as I found out now, zlib's `gzread()` uses an `unsigned int` for the buffer length... That is fixed and now everything should work as expected - again.
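In other words, whatever buffer size comes in from R has to be clamped so that each individual `gzread()` call stays within what its `unsigned int` length parameter (and the signed `int` on the R/Rcpp side) can represent, roughly along these lines (a sketch of the idea, not the actual fix):

```cpp
// gzread() declares its length parameter as 'unsigned int':
//   int gzread(gzFile file, voidp buf, unsigned len);
// so a buffer_size of 4 * compressed file size can overflow once it exceeds
// INT_MAX (~2.1e9). A defensive clamp keeps each read request in range.
#include <algorithm>
#include <climits>
#include <cstdint>

unsigned int clamp_chunk_size(std::int64_t requested_buffer_size) {
  // never ask gzread() for more than fits safely into its length parameter
  const std::int64_t max_chunk = INT_MAX;  // conservative upper bound
  return static_cast<unsigned int>(
      std::min<std::int64_t>(requested_buffer_size, max_chunk));
}
```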
Feel free to prove me wrong... :D
One thing that may work and make a decent difference is using `gzopen` (from `zlib`, which R always has) as well as `gzseek()`, `gztell()` etc. I am not 100% sure it will, as I do not know the ITCH spec as well as you -- but for mostly straight reading I already used `gzopen` (instead of `fopen`) decades ago :) Might be worth a try.
See e.g. https://refspecs.linuxbase.org/LSB_3.0.0/LSB-Core-generic/LSB-Core-generic/libzman.html for docs, and e.g. my RcppCNPy package for a (much) simpler use case.
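For completeness, the seek/tell side of that API mirrors stdio quite closely, so position bookkeeping in a reader can stay largely unchanged; a tiny illustrative snippet (file name and offsets are arbitrary):

```cpp
// gzseek()/gztell() work on offsets in the *uncompressed* stream,
// much like fseek()/ftell() on a plain file.
#include <cstdio>
#include <zlib.h>

int main() {
  gzFile f = gzopen("20191230.BX_ITCH_50.gz", "rb");
  if (f == nullptr) return 1;

  char header[64];
  gzread(f, header, sizeof(header));   // read a block ...
  z_off_t pos = gztell(f);             // ... note the (uncompressed) offset
  gzseek(f, 1024, SEEK_CUR);           // skip ahead, as fseek() would
  std::printf("offset after header: %ld\n", static_cast<long>(pos));

  gzclose(f);
  return 0;
}
```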