Shouldn't the buffered version just be the default?
@jakebolewski yes it would be nice to make the buffered stream the default eventually. The interface is not complete though. (I didn't implement buffered writes)
Have you seen http://www.htslib.org/benchmarks/zlib.html?
On one hand, this is definitely a good direction to go. I would think/hope with some work we could reach gunzip speeds, but I haven't looked at this code in a long time, so I don't know where the gains would come from.
Rather than a separate name, I'd prefer something like a keyword parameter to open/gzopen (e.g., buffered=true).
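To illustrate the suggestion, here is a minimal sketch of what such a keyword-based entry point could look like. This is an assumption for illustration only, not GZip.jl's actual API; GZBufferedStream refers to the buffered wrapper sketched later in this thread, and the keyword-dispatching gzopen shown here does not exist in the package.

# Hypothetical sketch: choose buffered vs. unbuffered behavior with a keyword.
function gzopen(fname::AbstractString, mode::AbstractString="rb"; buffered::Bool=false)
    io = GZip.open(fname, mode)                 # existing unbuffered GZipStream
    return buffered ? GZBufferedStream(io) : io # wrap it only when requested
end

# Usage:
# io = gzopen("data.gz", "rb", buffered=true)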
For kicks, it would be good to test against the Zlib Stream API.
I'd also recommend just making GZipStream buffered. There's no point in having two APIs, one slow and one fast, if the fast one can support all of the same functionality.
@stevengj @jakebolewski I'd like to merge this functionality as is and wait until we get buffered writes (#23) before we get rid of the current GZipStream.
I like @kmsquire's idea of the keyword parameter to open/gzopen. It would probably help with the transition to buffered I/O.
Adding to the API only to remove it seems weird; why not just write the buffered write? It will be ~30 lines of code.
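As a rough illustration of what that buffered write could look like: the following is a sketch only, mirroring the buffered-read approach discussed in this thread. GZWriteBuffer is a hypothetical name; only the gz_file and buf_size fields and the gzwrite C call are taken from GZip.jl/zlib.

# Sketch of a buffered write wrapper; GZWriteBuffer is a hypothetical type.
type GZWriteBuffer <: IO
    io::GZipStream
    buf::Vector{UInt8}
    n::Int                      # number of bytes currently buffered
end
GZWriteBuffer(io::GZipStream) = GZWriteBuffer(io, Array(UInt8, io.buf_size), 0)

function Base.write(io::GZWriteBuffer, b::UInt8)
    io.n += 1
    io.buf[io.n] = b
    io.n == length(io.buf) && flush(io)     # flush once the buffer fills up
    1
end

function Base.flush(io::GZWriteBuffer)
    io.n > 0 && ccall((:gzwrite, GZip._zlib), Int32,
        (Ptr{Void}, Ptr{Void}, UInt32), io.io.gz_file, io.buf, io.n)
    io.n = 0
    nothing
end

Base.close(io::GZWriteBuffer) = (flush(io); close(io.io))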
It's not a priority for me now. You're more than welcome to implement it :)
I've profiled this code and it looks like ~40% of the execution time is spent in the bowels of iobuffer.jl. Inlining the byte read (vide JuliaLang/julia#12364) may be able to cut down the overhead somewhat, but it looks like IOBuffer is too heavy an abstraction for this problem.
Here is a reimplementation using a plain Vector{UInt8} that achieves a 1/3 speedup relative to gunzip:
using GZip

# Buffered wrapper around a GZipStream: holds one gzread-sized chunk in `buf`
# and hands out bytes from it, refilling via gzread when the chunk is exhausted.
type GZBufferedStream <: IO
    io::GZipStream
    buf::Vector{UInt8}
    len::Int    # number of valid bytes in buf
    ptr::Int    # index of the next byte to return

    function GZBufferedStream(io::GZipStream)
        buf = Array(UInt8, io.buf_size)
        len = ccall((:gzread, GZip._zlib), Int32,
            (Ptr{Void}, Ptr{Void}, UInt32), io.gz_file, buf, io.buf_size)
        new(io, buf, len, 1)
    end
end

Base.close(io::GZBufferedStream) = close(io.io)

@inline function Base.read(io::GZBufferedStream, ::Type{UInt8})
    c = io.buf[io.ptr]
    io.ptr += 1
    if io.ptr == io.len + 1     # no more data in the buffer; refill with gzread
        io.len = ccall((:gzread, GZip._zlib), Int32,
            (Ptr{Void}, Ptr{Void}, UInt32), io.io.gz_file, io.buf, io.io.buf_size)
        io.ptr = 1
    end
    c
end

# gzread returns 0 once the underlying stream is exhausted
Base.eof(io::GZBufferedStream) = io.len == 0

function bench()
    io = GZBufferedStream(GZip.open("random.gz", "rb"))
    thischar = 0x00
    n = 0
    while !eof(io)
        thischar = read(io, UInt8)
        n += 1
    end
    close(io)
    println(n)
    thischar
end

bench()

# random.gz contains 10^8 rand(UInt8)s
@time run(`gunzip -k -f random.gz`)   # 255 ms
@time bench()                         # 168 ms
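For anyone who wants to reproduce these numbers, here is a hedged sketch of how the random.gz test file could be generated. It assumes that writing a UInt8 array to a GZipStream is supported; the exact call used to produce the file is not shown in this thread.

# Sketch: write 10^8 random bytes through GZip and close the stream.
using GZip
fh = GZip.open("random.gz", "wb")
write(fh, rand(UInt8, 10^8))   # assumes write(::GZipStream, ::Array{UInt8}) is available
close(fh)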
Nice!
@jiahao why the close?
We also have Zlib.jl that looks like it does buffered reads / writes. It seems silly to have 2 packages with duplicated functionality.
@quinnj I was going to prepare a new PR based on the code snippet here: https://github.com/JuliaLang/GZip.jl/pull/32#issuecomment-125854042
It will take a significant rewrite of GZip.jl though.
@jakebolewski you're echoing #23.
Cool. I may take a stab at it in the next week or two since I want to really integrate with this in the CSV/ODBC/SQLite packages.
I just benchmarked Zlib.Reader for the same random data and it took 9.38 seconds.
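For comparison, something along these lines is presumably what was timed. This is a sketch under stated assumptions: that Zlib.jl's Reader can wrap a raw file handle opened on the gzip file, and that it supports read(io, UInt8) and eof; the exact benchmark code was not posted.

# Sketch: byte-by-byte read through Zlib.jl's Reader on the same test file.
using Zlib
function bench_zlib()
    io = Zlib.Reader(open("random.gz", "r"))   # assumed constructor: wrap a raw IO handle
    n = 0
    while !eof(io)
        read(io, UInt8)
        n += 1
    end
    close(io)
    n
end
@time bench_zlib()   # reported above as ~9.38 s on the same data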
Hmm, that seems suspiciously close to the unbuffered timings you posted above.
I'm not sure why that is the case...
Also introduces a new gzbopen() function that generates GZipBufferedStreams. Reading a character stream from a GZipBufferedStream is ~6x faster than from an ordinary GZipStream. The speedup comes from managing an internal IOBuffer and populating it with gzread(), rather than making a call to the unbuffered gzgetc() each time a character is sought.

Here are some benchmark numbers for how the performance varies with buffer size. 0 = ordinary GZipStream, sys = system time for gunzip. 2^13 = 8192 is the default buffer size and 2^17 = 131072 is the "big" buffer size used by GZip. (gunzip is still 2.8x faster though...)

Benchmark code:
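The original benchmark code is not reproduced here. The following is a rough reconstruction under stated assumptions only: that gzbopen accepts a buffer size the way gzopen(fname, mode, buf_size) does, and that the test file is the same random.gz used earlier in the thread.

# Sketch: time a full byte-by-byte read for a range of buffer sizes.
using GZip
function readall_bytes(io)
    n = 0
    while !eof(io)
        read(io, UInt8)
        n += 1
    end
    close(io)
    n
end
for k in 12:17
    bufsize = 2^k
    io = gzbopen("random.gz", "rb", bufsize)   # assumed signature, mirroring gzopen(fname, mode, buf_size)
    @time readall_bytes(io)
end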