Closed: eddelbuettel closed this issue 4 years ago.
I am not really sure about this one. Let me explain:

I see two options to include zlib:

1. At each file operation (`fopen()`, `fread()`, `ftell()`, `fseek()`, `fclose()`), add the appropriate zlib case depending on the file type (a sketch follows below).
2. Replace `R.utils::gunzip()` by an internal Rcpp function, which "just" inflates the gzipped ITCH file to a temporary one, reducing the dependency on R.utils.
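To illustrate option 1, a minimal sketch of dispatching between plain and gzip-compressed reads could look like the following (the `GenericReader` helper and the extension check are purely illustrative, not RITCH's actual code):

```cpp
// Hypothetical sketch of option 1: a thin reader that dispatches between
// plain stdio and zlib depending on whether the file is gzip-compressed.
#include <cstdio>
#include <string>
#include <zlib.h>

struct GenericReader {
  FILE*  plain = nullptr;
  gzFile gz    = nullptr;

  explicit GenericReader(const std::string& path) {
    // crude check: treat files ending in ".gz" as gzip-compressed
    const bool is_gz = path.size() > 3 &&
                       path.compare(path.size() - 3, 3, ".gz") == 0;
    if (is_gz) gz = gzopen(path.c_str(), "rb");
    else       plain = std::fopen(path.c_str(), "rb");
  }

  // read up to 'len' bytes into 'buf', returning the number of bytes read
  int read(void* buf, unsigned int len) {
    return gz ? gzread(gz, buf, len)
              : static_cast<int>(std::fread(buf, 1, len, plain));
  }

  ~GenericReader() {
    if (gz)    gzclose(gz);
    if (plain) std::fclose(plain);
  }
};
```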
Ad 1: Currently the `get_*` functions pass over the file twice under normal use (this can be seen in the function `getMessages()`): first to count all messages in order to reserve space for the vectors that hold the message information (`countMessages()`), then to actually parse the message information into the vectors (`loadToMessages()`).

I have added all appropriate zlib calls, but because this inflates the file twice, replacing R.utils with zlib increases the runtime both for `countMessages()` and for `loadToMessages()` (5.3s; on plain files the runtimes for the two functions are 0.5s and 1.2s). We see that zlib by itself inflates the data faster than R.utils does. But to properly incorporate 1. I would need to restructure RITCH's internal logic to avoid first counting and then parsing the information, i.e., passing/reading the file twice (this, by the way, only applies to the case where no `end_msg_count` or `count_message()` is given).
A brute-force option would be to resize the vectors to some reasonably large number (say 100m), parse the information, and then prune the vectors again. Not sure if this is a smart way to go about it.
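For illustration, that brute-force pattern would amount to something like this (using `std::vector` and placeholder names purely as a sketch, not RITCH's actual data structures):

```cpp
#include <cstdint>
#include <vector>

// Placeholder for one parsed ITCH message (illustrative only)
struct Msg { std::uint64_t timestamp; char type; };

// Hypothetical parser hook; a real version would read one message from the file
bool parse_next_message(Msg& /*out*/) { return false; }

void load_single_pass() {
  std::vector<Msg> messages;
  messages.reserve(100'000'000);   // "reasonably large" upper bound (~100m)
  Msg m;
  while (parse_next_message(m))    // single pass: parse and append as we go
    messages.push_back(m);
  messages.shrink_to_fit();        // prune the unused capacity again
}
```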
Ad 2: This seems more promising; it takes around 4 seconds (compared to 5s with R.utils) and requires little restructuring.
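For reference, option 2 essentially boils down to a loop like the following (a minimal sketch with abbreviated error handling and illustrative names, not the actual internal function):

```cpp
// Minimal sketch of option 2: inflate a .gz ITCH file to a plain temporary
// file using zlib, then let the existing plain-file parser run on it.
#include <cstdio>
#include <stdexcept>
#include <string>
#include <vector>
#include <zlib.h>

void gunzip_to_file(const std::string& infile, const std::string& outfile) {
  gzFile in = gzopen(infile.c_str(), "rb");
  if (in == nullptr) throw std::runtime_error("cannot open " + infile);
  FILE* out = std::fopen(outfile.c_str(), "wb");
  if (out == nullptr) { gzclose(in); throw std::runtime_error("cannot open " + outfile); }

  std::vector<char> buf(1 << 20);  // 1 MB chunks; stays well below UINT_MAX
  int n;
  while ((n = gzread(in, buf.data(), static_cast<unsigned int>(buf.size()))) > 0)
    std::fwrite(buf.data(), 1, static_cast<std::size_t>(n), out);

  std::fclose(out);
  gzclose(in);
  if (n < 0) throw std::runtime_error("gzread failed on " + infile);
}
```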
Does this make sense to you?
It does make sense. But consider
```
edd@rob:~/git/ritch-demo$ ls -l *gz *rds
-rw-r--r-- 1 edd edd 487235329 Apr 20 08:00 20190530.BX_ITCH_50.gz
-rw-r--r-- 1 edd edd 721906618 Apr 20 08:20 20200130.BX_ITCH_50.gz
edd@rob:~/git/ritch-demo$
```
The source files are compressed, so we would save the disk space. You could skip the if/else and just use the compressed file.
But I didn't realize the dual read and hence dual decompression. So it really is a 'space vs time' tradeoff. Your package, your call. I just wanted to mention that one can work off a .gz.
As for 2., I would definitely go the route of R's uncompressor and drop one Depends: if you can. But then I also believe in the tinyverse :grinning:
Closed with 225e7ab287ad74c818d93c79b9fe41f9192b9c01
This includes the zlib.h library, meaning that the decompression is done in Rcpp (which is faster). Also, I implemented some functionality that hopefully makes it easier to deal with the files. If you specify a `*.XX_ITCH_50.gz` file, it first checks whether `*.XX_ITCH_50` already exists. If no extra option to overwrite that file was specified, it takes the already gunzipped file; otherwise it gunzips the file again.

Additionally, a `force_cleanup` flag was added, which, if set to `TRUE`, removes the raw file afterwards if a gz file was given.

I think this is a nice middle way, which both boosts the speed of the package and adds the functionality to easily work with either raw files or .gz files.
Fyi, I just "verschlimmbessert" (made it worse while trying to improve it) the library and broke it in the process. Will be fixed later.
Turns out, I was comparing apples with oranges: the newest version has around the same runtimes but uses considerably less RAM ("newest version" referring to the approach using gzopen etc.).

For the sample file `20191230.BX_ITCH_50.gz`, max RAM usage for the `get_*` functions is now at 1.4GB, down from 2.2GB.
Count me in the 'verschlimmbessert' camp. Fresh pull and rebuild:
```
edd@rob:~/git/ritch-demo$ Rscript createTradesArrayLarge.r
[Decompressing] 20200130.BX_ITCH_50.gz

 *** caught segfault ***
address (nil), cause 'memory not mapped'

Traceback:
 1: gunzipFile_impl(infile, outfile, buffer_size)
 2: gunzip_file(file, raw_file, buffer_size)
 3: check_and_gunzip(file, buffer_size, force_gunzip, quiet)
 4: count_messages(file, add_meta_data = TRUE)
An irrecoverable exception occurred. R is aborting now ...
Segmentation fault (core dumped)
edd@rob:~/git/ritch-demo$
```
Works from the uncompressed file, though. I'm sure we'll get there eventually.
```
edd@rob:~/git/ritch-demo$ zcat 20200130.BX_ITCH_50.gz > 20200130.BX_ITCH_50
edd@rob:~/git/ritch-demo$ Rscript createTradesArrayLarge.r
[INFO] Unzipped file 20200130.BX_ITCH_50 already found, using that (overwrite with force_gunzip=TRUE)
[Counting] 53,268,643 messages found
[Converting] to data.table
[Done] in 1.11 secs
[INFO] Unzipped file 20200130.BX_ITCH_50 already found, using that (overwrite with force_gunzip=TRUE)
[Counting] 226,312 messages found
[Loading] .
[Converting] to data.table
[Done] in 1.27 secs
edd@rob:~/git/ritch-demo$
```
That is interesting... First of all, another reason to work on #17, especially having an ITCH writer and automated tests...
I'm downloading the file and will try to replicate the error. Will get back to you!
Ok, I was able to replicate and resolve the issue:

The default `buffer_size` in the gunzip function is 4 times the compressed file size. The issue got triggered when that value exceeded 2.1e9 (to be more precise, `.Machine$integer.max`).

First I thought that somehow passing that number to Rcpp triggered an integer overflow, as I found it was truncated to -1 in the gunzip function. A buffer of size -1 is a wonderful idea... or so the session thought, and crashed.

However, as I found out now, zlib's `gzread()` uses an `unsigned int` for the buffer length... That is fixed and now everything should work as expected - again.
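In other words, whatever buffer size comes in from R has to be clamped so that each individual `gzread()` call stays within what its `unsigned int` length parameter (and the signed `int` on the R/Rcpp side) can represent, roughly along these lines (a sketch of the idea, not the actual fix):

```cpp
// gzread() declares its length parameter as 'unsigned int':
//   int gzread(gzFile file, voidp buf, unsigned len);
// so a buffer_size of 4 * compressed file size can overflow once it exceeds
// INT_MAX (~2.1e9). A defensive clamp keeps each read request in range.
#include <algorithm>
#include <climits>
#include <cstdint>

unsigned int clamp_chunk_size(std::int64_t requested_buffer_size) {
  // never ask gzread() for more than fits safely into its length parameter
  const std::int64_t max_chunk = INT_MAX;  // conservative upper bound
  return static_cast<unsigned int>(
      std::min<std::int64_t>(requested_buffer_size, max_chunk));
}
```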
Feel free to prove me wrong... :D
One thing that may work and make a decent difference is using `gzopen` (from `zlib`, which R always has) as well as `gzseek()`, `gztell()` etc. I am not 100% sure it will, as I do not know the ITCH spec as well as you -- but for mostly straight reading I already used `gzopen` (instead of `fopen`) decades ago :) Might be worth a try.
See e.g. https://refspecs.linuxbase.org/LSB_3.0.0/LSB-Core-generic/LSB-Core-generic/libzman.html for docs, and e.g. my RcppCNPy package for a (much) simpler use case.
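For completeness, the seek/tell side of that API mirrors stdio quite closely, so position bookkeeping in a reader can stay largely unchanged; a tiny illustrative snippet (file name and offsets are arbitrary):

```cpp
// gzseek()/gztell() work on offsets in the *uncompressed* stream,
// much like fseek()/ftell() on a plain file.
#include <cstdio>
#include <zlib.h>

int main() {
  gzFile f = gzopen("20191230.BX_ITCH_50.gz", "rb");
  if (f == nullptr) return 1;

  char header[64];
  gzread(f, header, sizeof(header));   // read a block ...
  z_off_t pos = gztell(f);             // ... note the (uncompressed) offset
  gzseek(f, 1024, SEEK_CUR);           // skip ahead, as fseek() would
  std::printf("offset after header: %ld\n", static_cast<long>(pos));

  gzclose(f);
  return 0;
}
```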