Closed: quinnj closed this issue 8 years ago.
Hmm, interesting use case. I think something like this should do the trick:
```julia
using Libz, BufferedStreams

input = ZlibInflateInputStream(open(filename))
output_buffer = BufferedOutputStream()
output_stream = ZlibDeflateOutputStream(output_buffer)

block_size = 100_000_000  # ~100 MB uncompressed per block
bytes_read = 0

for line in eachline(input)
    write(output_stream, line)
    bytes_read += sizeof(line)  # count bytes, not characters
    if bytes_read > block_size
        # finish the current compressed block
        close(output_stream)
        block = takebuf_array(output_buffer)
        # TODO: do something with block
        # open a new stream for the next block
        output_stream = ZlibDeflateOutputStream(output_buffer)
        bytes_read = 0
    end
end

# flush remaining data
close(output_stream)
block = takebuf_array(output_buffer)
# TODO: do something with block
```
There's currently no built-in way to track the number of bytes written to an output stream, hence the manual bookkeeping with `bytes_read`. I think that could be solved by implementing `position()` on `BufferedOutputStream`.
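As a rough illustration of what a byte-count-based `position()` could look like, here is a minimal sketch of a counting wrapper around any `IO`. The `CountingStream` name and design are hypothetical, not part of BufferedStreams, and the sketch targets modern Julia (where `take!` replaces `takebuf_array`):

```julia
# Hypothetical sketch: wrap any IO and count the bytes written through it,
# so position(stream) reports the number of (uncompressed) bytes written.
mutable struct CountingStream{T<:IO} <: IO
    io::T
    nwritten::Int
end
CountingStream(io::IO) = CountingStream(io, 0)

# Single-byte writes (the fallback method for IO).
function Base.write(s::CountingStream, b::UInt8)
    n = write(s.io, b)
    s.nwritten += n
    return n
end

# Bulk writes (strings and arrays route through unsafe_write).
function Base.unsafe_write(s::CountingStream, p::Ptr{UInt8}, n::UInt)
    m = unsafe_write(s.io, p, n)
    s.nwritten += m
    return m
end

Base.position(s::CountingStream) = s.nwritten
Base.close(s::CountingStream) = close(s.io)

# Usage: count uncompressed bytes going into an in-memory buffer.
buf = IOBuffer()
cs = CountingStream(buf)
write(cs, "hello, world\n")
position(cs)  # 13
```

Wrapping the deflate stream in something like this would let the block loop test `position(output_stream)` instead of carrying `bytes_read` by hand.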
Thanks BTW; this package is working great for me.
Hey @dcjones, quick question on how to do what I want to do. I think I want to do something like: call `readline(io)` (which inflates a single line from the CSV file), open a new gzip file for writing, and write to it line-by-line until I have ~90MB uncompressed, which I could detect using `position(buf)`, with `buf` being my new gzip file. I'd then do `readall(tmp_file)` and put that in the body of my web request.

Obviously my process above has some inefficiencies, particularly because nothing is in-memory or buffered. I don't think I can get around having to inflate and then deflate, though, since I need to make sure to send the file in line-based chunks.

Any tips?