Add GZipBufferedStream for buffered reading

jiahao commented 9 years ago

Also introduces new gzbopen() function that generates GZipBufferedStreams.

reading a character stream from a GZipBufferedStream is ~6x faster than from an ordinary GZipStream. The speedup comes from managing an internal IOBuffer and populating it with gzread(), rather than making a call to the unbuffered gzgetc() each time a character is sought.

Here are some benchmark numbers for how the performance varies with buffer size. 0 = ordinary GZipStream, sys = system time for gunzip. 2^12 = 8192 is the default buffer size and 2^17 = 131072 is the "big" buffer size used by GZip.

(gunzip is still 2.8x faster though...)

log2(buf_size)	time(s)
0	9.434
12 (default)	1.484
13	1.505
14	1.504
15	1.479
16	1.504
17	1.497
18	1.481
19	1.496
20	1.497
21	1.487
22	1.467
23	1.470
24	1.442
25	1.417
26	1.376
27	1.310
28 (entire file fits)	1.264
sys	0.519

Benchmark code:

function benchmark(nreps=10)
  filename = "ALL.chrY.phase3_integrated_v1a.20130502.genotypes.vcf.gz"
  isfile(filename) || download("ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ALL.chrY.phase3_integrated_v1a.20130502.genotypes.vcf.gz")

  timings=Dict{Int,Vector}()
  for i=1:nreps
    gzbopen(filename) do io
        n = 0
        t = @elapsed while true
            read(io, UInt8)
            n += 1
            eof(io) && break
        end
        timings[0] = push!(get(timings, 0, Float64[]), t)
    end

    for sexp in 28:-1:12
        println("Buffer size: ", 2^sexp)
        n = 0
        io = gzbopen(filename, "rb", 2^sexp)
        t = @elapsed while true
            read(io, UInt8)
            n += 1
            eof(io) && break
        end
        timings[sexp] = push!(get(timings, sexp, Float64[]), t)
        close(io)
    end

    for k in sort!(collect(keys(timings)))
        println(k, '\t', minimum(timings[k]))
    end
  end
end

jakebolewski commented 9 years ago

Shouldn't the buffered version just be the default?

jiahao commented 9 years ago

@jakebolewski yes it would be nice to make the buffered stream the default eventually. The interface is not complete though. (I didn't implement buffered writes)

jakebolewski commented 9 years ago

Have you seen http://www.htslib.org/benchmarks/zlib.html?

kmsquire commented 9 years ago

On one hand, this is definitely a good direction to go. I would think/hope with some work we could reach gunzip speeds, but I haven't looked at this code in a long time, so I don't know where the gains would come from.

Rather than a separate name, I'd prefer something like a keyword parameter to open / gzopen (e.g., buffered=true).

For kicks, it would be good to test against the Zlib Stream API.

stevengj commented 9 years ago

I'd also recommend just making GZipStream buffered. There's no point in having two APIs, one slow and one fast, if the fast one can support all of the same functionality.

jiahao commented 9 years ago

@stevengj @jakebolewski I'd like to merge this functionality as is and wait until we get buffered writes (#23) before we get rid of the current GZipStream.

jiahao commented 9 years ago

I like @kmsquire's idea of the keyword parameter to open/gzopen. It would probably help with the transition to buffered I/O.

jakebolewski commented 9 years ago

Adding to the API only to remove it seems weird, why not just write the buffered write? It will be ~30 lines of code.

jiahao commented 9 years ago

It's not a priority for me now. You're more than welcome to implement it :)

jiahao commented 9 years ago

I've profiled this code and it looks like ~40% of the execution time is spent in the bowels of iobuffer.jl. Inlining the byte read (vide JuliaLang/julia#12364) may be able to cut down the overhead somewhat, but it looks like IOBuffer is too heavy an abstraction for this problem.

Here is a reimplementation using a plain Vector{UInt8} that achieves a 1/3 speedup relative to gunzip:

using GZip

type GZBufferedStream <: IO
    io::GZipStream
    buf::Vector{UInt8}
    len::Int
    ptr::Int

    function GZBufferedStream(io::GZipStream)
        buf = Array(UInt8, io.buf_size)

        len = ccall((:gzread, GZip._zlib), Int32,
            (Ptr{Void}, Ptr{Void}, UInt32), io.gz_file, buf, io.buf_size)

        new(io, buf, len, 1)
    end
end

Base.close(io::GZBufferedStream) = close(io.io)

@inline function Base.read(io::GZBufferedStream, ::Type{UInt8})
    c = io.buf[io.ptr]
    io.ptr += 1

    if io.ptr == io.len+1 #No more data
        io.len = ccall((:gzread, GZip._zlib), Int32,
            (Ptr{Void}, Ptr{Void}, UInt32), io.io.gz_file, io.buf, io.io.buf_size)
        io.ptr = 1
    end
    c
end

Base.eof(io::GZBufferedStream) = io.len == 0

function bench()
    io = GZBufferedStream(GZip.open("random.gz", "rb"))

    thischar = 0x00
    n = 0
    while !eof(io)
        thischar = read(io, UInt8)
        n += 1
    end
    close(io)

    println(n)
    thischar
end

bench()

#random contains 10^8 rand(UInt8)s
@time run(`gunzip -k -f random.gz`) #255 ms
@time bench() #168 ms

kmsquire commented 9 years ago

Nice!

quinnj commented 9 years ago

@jiahao why the close?

jakebolewski commented 9 years ago

We also have Zlib.jl that looks like it does buffered reads / writes. It seems silly to have 2 packages with duplicated functionality.

jiahao commented 9 years ago

@quinnj I was going to prepare a new PR based on the code snippet here: https://github.com/JuliaLang/GZip.jl/pull/32#issuecomment-125854042

It will take a significant rewrite of GZip.jl though.

@jakebolewski you're echoing #23.

quinnj commented 9 years ago

Cool. I may take a stab at it in the next week or two since I want to really integrate with this in the CSV/ODBC/SQLite packages.

jiahao commented 9 years ago

I just benchmarked Zlib.Reader for the same random data and it took 9.38 seconds.

jakebolewski commented 9 years ago

Hmm, that seems suspiciously close to the unbuffered timings you posted above.

jiahao commented 9 years ago

I'm not sure why that is the case...

JuliaIO / GZip.jl

Add GZipBufferedStream for buffered reading #32