JMMackenzie / IOQP

Impact Ordered Query Processing
Apache License 2.0
3 stars 0 forks source link

Read gzip CIFF file #35

Closed lgrz closed 6 months ago

lgrz commented 2 years ago

Detect if the input CIFF file is gzip encoded and decode the entire file to byte buffer. Otherwise use the existing mmap buffer implementation for uncompressed CIFF files.

Also looking for any comments on if this could be improved in any way. Thanks.

JMMackenzie commented 1 year ago

@mpetri - any comments?

lgrz commented 1 year ago

The limitations of this patch are that mmap is not used for gzip files. This means that (using CC-News as an example) an uncompressed ciff file will index CC-News with 256GB RAM (with the help of mmap), while a gzip compressed ciff file will require >256GB RAM due to not using mmap and being dumb by decoding everything to a byte buffer (/usr/bin/time reported around 305GB RAM). Smaller gzip CIFF files should work ok though of course.

While it does work, the gzip version can use up too much resources. I've been looking at ways to fix that so that mmap is used in all cases, and since we make multiple passes over the ciff file, we can build a "gzip index" on the first pass to decode sections of the gzip file on the fly. It turns out to be more involved that I thought it would be. It may take me sometime to get a prototype working though and flate2 doesn't support all of the zlib operations we need to make the index of entry points. Also not clear (to me) how this would work with parallel iterators either.

So there are a number of considerations here: accept the patch with its limitations (and wait for possible improvements); reject it because of the limitations as there is low benefit; possibly wait for an improved way of processing gzip ciff files with mmap support.

JMMackenzie commented 6 months ago

Coming back to this...

I think we merge and revisit later, it's unlikely anyone will be reading >RAM sized CIFF files I guess?

lgrz commented 6 months ago

Agree that we revisit later. I'm in favor of not merging now. And, if it is revisited then it will probably be useful and have some benefit to some usecase, and at that time the details can get another look then?

JMMackenzie commented 6 months ago

OK, I wont merge, will convert to draft and leave it as-is for now. Thanks!

lgrz commented 6 months ago

Some of my intentions that were missing from the previous comment. It is not clear to me that the changes in their current form, are a net improvement.

Therefore, this issue could be closed and be available for reference if this gets revisited in the future.