crystal-lang / crystal

The Crystal Programming Language
https://crystal-lang.org
Apache License 2.0

`Compress::Deflate::Writer` is slow #11614

Open asterite opened 2 years ago

asterite commented 2 years ago

Take this benchmark:

require "benchmark"
require "compress/zlib"

string = "The quick brown fox jumps over the lazy dog"
n = 100_000

Benchmark.bm do |bm|
  bm.report("deflating") do
    n.times do
      deflated = String.build do |io|
        Compress::Deflate::Writer.open(io) do |writer|
          writer.write(string.to_slice)
        end
      end
    end
  end
end

Output:

                user     system      total        real
deflating   2.015208   2.753990   4.769198 (  2.046066)

But if we change the way memory is allocated for zlib:

diff --git a/src/compress/deflate/writer.cr b/src/compress/deflate/writer.cr
index d72ef4586..6cc01509e 100644
--- a/src/compress/deflate/writer.cr
+++ b/src/compress/deflate/writer.cr
@@ -20,8 +20,8 @@ class Compress::Deflate::Writer < IO

     @buf = uninitialized UInt8[8192] # output buffer used by zlib
     @stream = LibZ::ZStream.new
-    @stream.zalloc = LibZ::AllocFunc.new { |opaque, items, size| GC.malloc(items * size) }
-    @stream.zfree = LibZ::FreeFunc.new { |opaque, address| GC.free(address) }
+    @stream.zalloc = LibZ::AllocFunc.new { |opaque, items, size| LibC.malloc(items * size) }
+    @stream.zfree = LibZ::FreeFunc.new { |opaque, address| LibC.free(address) }
     @closed = false
     ret = LibZ.deflateInit2(pointerof(@stream), level, LibZ::Z_DEFLATED, -LibZ::MAX_BITS, LibZ::DEF_MEM_LEVEL,
       strategy.value, LibZ.zlibVersion, sizeof(LibZ::ZStream))

and run the benchmark again, we get:

                user     system      total        real
deflating   0.809320   0.613414   1.422734 (  0.869127)

I think Ruby uses xmalloc:

https://github.com/ruby/ruby/blob/6a1365d725c72107dd45e1b06ce4acc5549ebaf5/ext/zlib/zlib.c#L632-L633

https://github.com/ruby/ruby/blob/6a1365d725c72107dd45e1b06ce4acc5549ebaf5/ext/zlib/zlib.c#L608-L610

And a similar benchmark for Ruby:

require "benchmark"
require "zlib"

string = "The quick brown fox jumps over the lazy dog"
n = 100_000

Benchmark.bm do |bm|
  bm.report("deflating") do
    n.times { Zlib::Deflate.deflate(string) }
  end
end

gives:

       user     system      total        real
deflating  0.691750   0.079820   0.771570 (  0.774366)

So maybe that's why Ruby is faster?

We also have something similar in place for PCRE:

https://github.com/crystal-lang/crystal/blob/fdcbcf4149cda377070f0f1b41583f54bb24b3f0/src/regex/lib_pcre.cr#L31-L32

If I change those to LibC.malloc and LibC.free, and run a similar benchmark testing `"foo".match(/o/)`, I also get a significant speedup.

So maybe as a first step we can change it to use LibC, and then consider xmalloc if it's available or if we make sure to distribute it (or maybe jemalloc; no idea which is faster and why).

BlobCodes commented 2 years ago

I cannot really reproduce the performance gains. Actually, when I tried this patch, the results got worse for me.

My benchmark code:

require "benchmark"
require "compress/zlib"

class Compress::Deflate::WriterLibCMalloc < Compress::Deflate::Writer
  def initialize(@output : IO, level : Int32 = Compress::Deflate::DEFAULT_COMPRESSION,
                 strategy : Compress::Deflate::Strategy = Compress::Deflate::Strategy::DEFAULT,
                 @sync_close : Bool = false, @dict : Bytes? = nil)
    unless -1 <= level <= 9
      raise ArgumentError.new("Invalid Flate level: #{level} (must be in -1..9)")
    end

    @buf = uninitialized UInt8[8192] # output buffer used by zlib
    @stream = LibZ::ZStream.new
    @stream.zalloc = LibZ::AllocFunc.new { |opaque, items, size| LibC.malloc(items * size) }
    @stream.zfree = LibZ::FreeFunc.new { |opaque, address| LibC.free(address) }
    @closed = false
    ret = LibZ.deflateInit2(pointerof(@stream), level, LibZ::Z_DEFLATED, -LibZ::MAX_BITS, LibZ::DEF_MEM_LEVEL,
      strategy.value, LibZ.zlibVersion, sizeof(LibZ::ZStream))
    raise Compress::Deflate::Error.new(ret, @stream) unless ret.ok?
  end
end

string = "The quick brown fox jumps over the lazy dog"

Benchmark.ips do |x|
  x.report("GC malloc/free") do
    deflated = String.build do |io|
      Compress::Deflate::Writer.open(io) do |writer|
        writer.write(string.to_slice)
      end
    end
  end

  x.report("LibC malloc/free") do
    deflated = String.build do |io|
      Compress::Deflate::WriterLibCMalloc.open(io) do |writer|
        writer.write(string.to_slice)
      end
    end
  end
end

My results:

  GC malloc/free  48.44k ( 20.64µs) (± 3.19%)   271kB/op        fastest
LibC malloc/free  39.61k ( 25.25µs) (± 1.69%)  8.39kB/op   1.22× slower

The Ruby version only takes around 4μs for me:

Warming up --------------------------------------
           deflating    25.395k i/100ms
Calculating -------------------------------------
           deflating    253.249k (± 0.6%) i/s -      1.270M in   5.014039s
asterite commented 2 years ago

OS? I'm using Mac

BlobCodes commented 2 years ago

OS? I'm using Mac

Fedora Linux 35.20211211.0 (Silverblue)

BlobCodes commented 2 years ago

But I just noticed that LibC malloc/free is faster for bigger inputs (I just tried the benchmark using string = "The quick brown fox jumps over the lazy dog" * 1000).

asterite commented 2 years ago

It would be nice to profile the benchmark program then, to see where most time is spent.

BlobCodes commented 2 years ago

Raw callgrind output: callgrind.zip

(Screenshot: callgrind call graph, 2021-12-17 22:36)

BlobCodes commented 2 years ago

Oops, those were the results for string = "The quick brown fox jumps over the lazy dog" * 1000.

These are the results for the short string: callgrind2.zip (Screenshot: callgrind call graph, 2021-12-17 22:51)

dbackeus commented 2 years ago

Here are my results (running on macOS 12):

without patch:                                  1.741514   2.111190   3.852704 (  1.745935)
with malloc patch:                              0.801642   0.363411   1.165053 (  0.855656)
with patch + 1MB buffer (faster than Ruby 🥳):  0.679146   0.097878   0.777024 (  0.723027)

I also looked at the Writer implementation and picked the minimum amount of code needed for my very specific use-case (take an existing String and turn it into a deflated Slice, without supporting output larger than the buffer size):

def deflate(string : String)
  stream = LibZ::ZStream.new
  stream.zalloc = LibZ::AllocFunc.new { |opaque, items, size| LibC.malloc(items * size) }
  stream.zfree = LibZ::FreeFunc.new { |opaque, address| LibC.free(address) }
  ret = LibZ.deflateInit2(
    pointerof(stream),
    Compress::Deflate::DEFAULT_COMPRESSION,
    LibZ::Z_DEFLATED,
    -LibZ::MAX_BITS,
    LibZ::DEF_MEM_LEVEL,
    Compress::Deflate::Strategy::DEFAULT.value,
    LibZ.zlibVersion,
    sizeof(LibZ::ZStream),
  )
  raise "error in deflateInit2" unless ret.ok?

  string_slice = string.to_slice
  stream.avail_in = string_slice.size.to_u32
  stream.next_in = string_slice.to_unsafe

  buffer = uninitialized UInt8[1024]
  stream.next_out = buffer.to_unsafe
  stream.avail_out = buffer.size.to_u32

  LibZ.deflate(pointerof(stream), LibZ::Flush::FINISH)

  ret = LibZ.deflateEnd(pointerof(stream))
  raise "error in deflateEnd" unless ret.ok?

  # Copy out of the stack-allocated buffer so the returned slice
  # doesn't point into this method's frame after it returns
  size = buffer.size - stream.avail_out.to_i
  Bytes.new(size).tap &.copy_from(buffer.to_unsafe, size)
end

Using this one I get even faster results: 0.587206   0.042694   0.629900 (  0.631345)

asterite commented 2 years ago

It's interesting: in the last benchmark one can see that GC_malloc takes a lot of time, called from deflateInit2.

What's the benchmark if LibC.malloc is used? Maybe that's still slow on Linux, but using another allocator could improve things.

BlobCodes commented 2 years ago

I think it may be useful to try the benchmark using musl libc. Maybe glibc is just slow here? As for other allocators, I have no idea how to try them.

didactic-drunk commented 2 years ago

Shouldn't GC.malloc be GC.malloc_atomic?

BlobCodes commented 2 years ago

@asterite

I think Ruby uses xmalloc:

https://github.com/ruby/ruby/blob/6a1365d725c72107dd45e1b06ce4acc5549ebaf5/ext/zlib/zlib.c#L632-L633

https://github.com/ruby/ruby/blob/6a1365d725c72107dd45e1b06ce4acc5549ebaf5/ext/zlib/zlib.c#L608-L610

I just looked at the code, it seems that's just LibC malloc. xmalloc2 is defined as ruby_xmalloc2: https://github.com/ruby/ruby/blob/cb4e2cb55a59833fc4f1f6db2f3082d1ffcafc80/include/ruby/internal/xmalloc.h#L54

ruby_xmalloc2 is just a normal malloc: https://github.com/ruby/ruby/blob/cb4e2cb55a59833fc4f1f6db2f3082d1ffcafc80/include/ruby/internal/xmalloc.h#L117-L119


So maybe as a first step we can change it to use LibC, then maybe consider xmalloc if available or if we make sure to distribute it (or maybe jemalloc, no idea which is faster and why)

My tests (Fedora Silverblue 36, in toolbox):

⬢[blobcodes@toolbox crystal]$ ./test-musl
                user     system      total        real
deflating   2.329043   3.422368   5.751411 (  2.094589)
⬢[blobcodes@toolbox crystal]$ ./test 
                user     system      total        real
deflating   2.329441   3.502011   5.831452 (  2.157158)
⬢[blobcodes@toolbox crystal]$ LD_PRELOAD=/usr/lib64/libmimalloc.so.2 ./test 
                user     system      total        real
deflating   2.954194   3.942054   6.896248 (  2.413770)
⬢[blobcodes@toolbox crystal]$ LD_PRELOAD=/usr/lib64/libjemalloc.so.2 ./test 
                user     system      total        real
deflating   3.233276   2.321265   5.554541 (  1.678961)

jemalloc seems to be the best here, but not by a huge margin.

I also tried GC_malloc_atomic (which segfaulted) and GC_malloc_atomic_uncollectable (which was slow):

                user     system      total        real
deflating   5.884300   15.726243   21.610543 (  5.964796)

I also tried getting rid of the allocator to see what is even possible:

class Compress::Deflate::Writer < IO
  # If `#sync_close?` is `true`, closing this IO will close the underlying IO.
  property? sync_close : Bool

  # Using a preallocated memory area, only one writer can exist at a time
  @@og_memory = Pointer(Void).malloc(2_000_000_000)
  @@memory = Pointer(Void).null

  # Creates an instance of Flate::Writer. `close` must be invoked after all data
  # has been written.
  def initialize(@output : IO, level : Int32 = Compress::Deflate::DEFAULT_COMPRESSION,
                 strategy : Compress::Deflate::Strategy = Compress::Deflate::Strategy::DEFAULT,
                 @sync_close : Bool = false, @dict : Bytes? = nil)
    unless -1 <= level <= 9
      raise ArgumentError.new("Invalid Flate level: #{level} (must be in -1..9)")
    end

    @@memory = @@og_memory
    @buf = uninitialized UInt8[8192] # output buffer used by zlib
    @stream = LibZ::ZStream.new
    @stream.zalloc = LibZ::AllocFunc.new do |opaque, items, size|
      ret = @@memory
      @@memory += ((items.to_u64 &* size.to_u64) &+ 7_u64) & -8
      LibC.printf("ERROR") if @@memory > @@og_memory + 2_000_000_000
      ret
    end
    @stream.zfree = LibZ::FreeFunc.new { |opaque, address| }
    @closed = false
    ret = LibZ.deflateInit2(pointerof(@stream), level, LibZ::Z_DEFLATED, -LibZ::MAX_BITS, LibZ::DEF_MEM_LEVEL,
      strategy.value, LibZ.zlibVersion, sizeof(LibZ::ZStream))
    raise Compress::Deflate::Error.new(ret, @stream) unless ret.ok?
  end
end

Results:

                user     system      total        real
deflating   1.895543   0.328011   2.223554 (  0.839783)

Normal benchmarks for comparison:

Normal LibC benchmark:

                user     system      total        real
deflating   0.975823   3.485814   4.461637 (  3.465718)

Using GC:

                user     system      total        real
deflating   2.312876   3.502425   5.815301 (  2.237870)
HertzDevil commented 2 years ago

It seems odd that GC.malloc_atomic crashes here.

BlobCodes commented 2 years ago

It seems odd that GC.malloc_atomic crashes here.

Not really: the GC frees objects that are only referenced from objects which themselves were allocated using malloc_atomic, because atomic memory is never scanned for pointers.

straight-shoota commented 1 year ago

@BlobCodes What exactly happens with malloc_atomic? And what's your code?

This patch works quite well:

diff --git i/src/compress/deflate/reader.cr w/src/compress/deflate/reader.cr
index cdb84bbe6..c704c65b6 100644
--- i/src/compress/deflate/reader.cr
+++ w/src/compress/deflate/reader.cr
@@ -24,3 +24,3 @@ class Compress::Deflate::Reader < IO
     @stream = LibZ::ZStream.new
-    @stream.zalloc = LibZ::AllocFunc.new { |opaque, items, size| GC.malloc(items * size) }
+    @stream.zalloc = LibZ::AllocFunc.new { |opaque, items, size| GC.malloc_atomic(items * size) }
     @stream.zfree = LibZ::FreeFunc.new { |opaque, address| GC.free(address) }
diff --git i/src/compress/deflate/writer.cr w/src/compress/deflate/writer.cr
index d72ef4586..e044a4b72 100644
--- i/src/compress/deflate/writer.cr
+++ w/src/compress/deflate/writer.cr
@@ -22,3 +22,3 @@ class Compress::Deflate::Writer < IO
     @stream = LibZ::ZStream.new
-    @stream.zalloc = LibZ::AllocFunc.new { |opaque, items, size| GC.malloc(items * size) }
+    @stream.zalloc = LibZ::AllocFunc.new { |opaque, items, size| GC.malloc_atomic(items * size) }
     @stream.zfree = LibZ::FreeFunc.new { |opaque, address| GC.free(address) }

I think GC.malloc_atomic would probably be better than LibC.malloc (as used in #12457). We're using the GC for almost all allocations, and it makes sense not to spread allocations across multiple allocators, because that fragments the free memory in a way that could make the program use more memory than necessary.

BlobCodes commented 1 year ago

@straight-shoota

I think GC.malloc_atomic would probably be better than LibC.malloc

Those C libraries probably store pointers to allocated memory inside allocated memory (which happens very easily for example with linked lists). Allocating them with malloc_atomic means that they will be collected by the GC but not scanned for pointers. When the GC pressure is high enough, this memory (which is still in use) will be collected.

The current way the bindings work (using GC.malloc) is also not equivalent to the C behaviour, and it is quite impressive that it doesn't cause any issues.

The BDWGC function (mostly) equivalent to LibC.malloc is GC_malloc_atomic_uncollectable (allocates non-GCd memory, not scanned for pointers), but that is very slow (as shown in my tests above).

The issue I faced when using malloc_atomic was segmentation faults. To test this yourself, you can try encoding a very large string (e.g. "test string" * 10000) and setting the environment variable GC_FREE_SPACE_DIVISOR=100 (which makes the GC collect very often and aggressively). The chance of getting a segfault should be very high this way.

Code to quickly test using `crystal eval`:

GC_FREE_SPACE_DIVISOR=1000 crystal eval '
require "benchmark"
require "compress/zlib"

class Compress::Deflate::Writer < IO
  def initialize(@output : IO, level : Int32 = Compress::Deflate::DEFAULT_COMPRESSION,
                 strategy : Compress::Deflate::Strategy = Compress::Deflate::Strategy::DEFAULT,
                 @sync_close : Bool = false, @dict : Bytes? = nil)
    unless -1 <= level <= 9
      raise ArgumentError.new("Invalid Flate level: #{level} (must be in -1..9)")
    end

    @buf = uninitialized UInt8[8192] # output buffer used by zlib
    @stream = LibZ::ZStream.new
    @stream.zalloc = LibZ::AllocFunc.new { |opaque, items, size| GC.malloc_atomic(items * size) }
    @stream.zfree = LibZ::FreeFunc.new { |opaque, address| GC.free(address) }
    @closed = false
    ret = LibZ.deflateInit2(pointerof(@stream), level, LibZ::Z_DEFLATED, -LibZ::MAX_BITS, LibZ::DEF_MEM_LEVEL,
      strategy.value, LibZ.zlibVersion, sizeof(LibZ::ZStream))
    raise Compress::Deflate::Error.new(ret, @stream) unless ret.ok?
  end
end

string = "The quick brown fox jumps over the lazy dog" * 100000
n = 100_000

Benchmark.bm do |bm|
  bm.report("deflating") do
    n.times do
      deflated = String.build do |io|
        Compress::Deflate::Writer.open(io) do |writer|
          writer.write(string.to_slice)
        end
      end
    end
  end
end
'

And what's your code?

I basically used the same patch as you along with the benchmark in the issue description, only with a larger string to cause GC pressure.

What exactly happens with malloc_atomic?

Error message:

Invalid memory access (signal 11) at address 0x20
[0x475bd6] *Exception::CallStack::print_backtrace:Nil +118 in /var/home/blobcodes/.cache/crystal/crystal-run-eval.tmp
[0x463f2a] ~procProc(Int32, Pointer(LibC::SiginfoT), Pointer(Void), Nil) +330 in /var/home/blobcodes/.cache/crystal/crystal-run-eval.tmp
[0x7fc4b6f4da70] ?? +140482859817584 in /lib64/libc.so.6
[0x509ba9] GC_free +41 in /var/home/blobcodes/.cache/crystal/crystal-run-eval.tmp
[0x48f0f6] *GC::free:Nil +6 in /var/home/blobcodes/.cache/crystal/crystal-run-eval.tmp
[0x4642dd] ~procProc(Pointer(Void), Pointer(Void), Nil) +13 in /var/home/blobcodes/.cache/crystal/crystal-run-eval.tmp
[0x7fc4b72ec405] deflateEnd +53 in /lib64/libz.so.1
[0x4f1128] *Compress::Deflate::Writer#close:Nil +120 in /var/home/blobcodes/.cache/crystal/crystal-run-eval.tmp
[0x464266] ~procProc(Nil) +374 in /var/home/blobcodes/.cache/crystal/crystal-run-eval.tmp
[0x4ef9b3] *Benchmark::BM::Job#execute:Nil +803 in /var/home/blobcodes/.cache/crystal/crystal-run-eval.tmp
[0x450e2f] __crystal_main +1407 in /var/home/blobcodes/.cache/crystal/crystal-run-eval.tmp
[0x503586] *Crystal::main_user_code:Nil +6 in /var/home/blobcodes/.cache/crystal/crystal-run-eval.tmp
[0x5034ee] *Crystal::main:Int32 +62 in /var/home/blobcodes/.cache/crystal/crystal-run-eval.tmp
[0x45e436] main +6 in /var/home/blobcodes/.cache/crystal/crystal-run-eval.tmp
[0x7fc4b6f38550] ?? +140482859730256 in /lib64/libc.so.6
[0x7fc4b6f38609] __libc_start_main +137 in /lib64/libc.so.6
[0x4507e5] _start +37 in /var/home/blobcodes/.cache/crystal/crystal-run-eval.tmp
[0x0] ???
BlobCodes commented 1 year ago

Actually, the performance difference between Crystal and Ruby is not that big.

Here's a benchmark using Benchmark.ips for both the given example string (43 bytes) and a more realistic size (430 kB = example string * 10000):

Benchmark code:

require "benchmark"
require "compress/zlib"

class Compress::Deflate::Writer::LibCMalloc < Compress::Deflate::Writer
  def initialize(@output : IO, level : Int32 = Compress::Deflate::DEFAULT_COMPRESSION,
                 strategy : Compress::Deflate::Strategy = Compress::Deflate::Strategy::DEFAULT,
                 @sync_close : Bool = false, @dict : Bytes? = nil)
    unless -1 <= level <= 9
      raise ArgumentError.new("Invalid Flate level: #{level} (must be in -1..9)")
    end

    @buf = uninitialized UInt8[8192] # output buffer used by zlib
    @stream = LibZ::ZStream.new
    @stream.zalloc = LibZ::AllocFunc.new { |opaque, items, size| LibC.malloc(items &* size) }
    @stream.zfree = LibZ::FreeFunc.new { |opaque, address| LibC.free(address) }
    @closed = false
    ret = LibZ.deflateInit2(pointerof(@stream), level, LibZ::Z_DEFLATED, -LibZ::MAX_BITS, LibZ::DEF_MEM_LEVEL,
      strategy.value, LibZ.zlibVersion, sizeof(LibZ::ZStream))
    raise Compress::Deflate::Error.new(ret, @stream) unless ret.ok?
  end
end

class Compress::Deflate::Writer::NoMalloc < Compress::Deflate::Writer
  # Using a preallocated memory area, only one writer can exist at a time
  @@og_memory = Pointer(Void).malloc(2_000_000_000)
  @@memory = Pointer(Void).null

  def initialize(@output : IO, level : Int32 = Compress::Deflate::DEFAULT_COMPRESSION,
                 strategy : Compress::Deflate::Strategy = Compress::Deflate::Strategy::DEFAULT,
                 @sync_close : Bool = false, @dict : Bytes? = nil)
    unless -1 <= level <= 9
      raise ArgumentError.new("Invalid Flate level: #{level} (must be in -1..9)")
    end

    @@memory = @@og_memory
    @buf = uninitialized UInt8[8192] # output buffer used by zlib
    @stream = LibZ::ZStream.new
    @stream.zalloc = LibZ::AllocFunc.new do |opaque, items, size|
      ret = @@memory
      @@memory += ((items.to_u64 &* size.to_u64) &+ 7_u64) & -8
      LibC.printf("ERROR") if @@memory > @@og_memory + 2_000_000_000
      ret
    end
    @stream.zfree = LibZ::FreeFunc.new { |opaque, address| }
    @closed = false
    ret = LibZ.deflateInit2(pointerof(@stream), level, LibZ::Z_DEFLATED, -LibZ::MAX_BITS, LibZ::DEF_MEM_LEVEL,
      strategy.value, LibZ.zlibVersion, sizeof(LibZ::ZStream))
    raise Compress::Deflate::Error.new(ret, @stream) unless ret.ok?
  end
end

string = "The quick brown fox jumps over the lazy dog" * 1
output = ""

puts "Benchmark for small input:"
Benchmark.ips do |x|
  x.report("gc malloc") { output = String.build { |io| Compress::Deflate::Writer.open(io) { |writer| writer.write(string.to_slice) } } }
  x.report("libc malloc") { output = String.build { |io| Compress::Deflate::Writer::LibCMalloc.open(io) { |writer| writer.write(string.to_slice) } } }
  x.report("no malloc") { output = String.build { |io| Compress::Deflate::Writer::NoMalloc.open(io) { |writer| writer.write(string.to_slice) } } }
end

string = "The quick brown fox jumps over the lazy dog" * 10000
output = ""

puts "\nBenchmark for normal input:"
Benchmark.ips do |x|
  x.report("gc malloc") { output = String.build { |io| Compress::Deflate::Writer.open(io) { |writer| writer.write(string.to_slice) } } }
  x.report("libc malloc") { output = String.build { |io| Compress::Deflate::Writer::LibCMalloc.open(io) { |writer| writer.write(string.to_slice) } } }
  x.report("no malloc") { output = String.build { |io| Compress::Deflate::Writer::NoMalloc.open(io) { |writer| writer.write(string.to_slice) } } }
end

My results to this benchmark are:

Benchmark for small input:
  gc malloc 113.64k (  8.80µs) (±14.15%)   271kB/op   1.77× slower
libc malloc  34.95k ( 28.61µs) (± 7.97%)  8.39kB/op   5.75× slower
  no malloc 200.89k (  4.98µs) (±20.30%)  8.39kB/op        fastest

Benchmark for normal input:
  gc malloc 704.74  (  1.42ms) (± 0.50%)   274kB/op   1.01× slower
libc malloc 678.36  (  1.47ms) (± 0.55%)  12.4kB/op   1.05× slower
  no malloc 711.63  (  1.41ms) (± 3.10%)  12.4kB/op        fastest

...and now the Ruby benchmark (using the benchmark-ips gem):

Ruby benchmark code:

require "benchmark/ips"
require "zlib"

string = "The quick brown fox jumps over the lazy dog" * 1
output = ""

puts "Benchmark for small input:"
Benchmark.ips do |x|
  x.report("zlib write (original)") { output = Zlib::Deflate.deflate(string) }
  x.report("zlib write (using new)") { output = Zlib::Deflate.new.deflate(string) }
end

string = "The quick brown fox jumps over the lazy dog" * 10000
output = ""

puts "\nBenchmark for normal input:"
Benchmark.ips do |x|
  x.report("zlib write (original)") { output = Zlib::Deflate.deflate(string) }
  x.report("zlib write (using new)") { output = Zlib::Deflate.new.deflate(string) }
end

My results to this benchmark are:

Benchmark for small input:
Warming up --------------------------------------
zlib write (original)
                        23.484k i/100ms
zlib write (using new)
                        13.995k i/100ms
Calculating -------------------------------------
zlib write (original)
                        237.280k (± 1.2%) i/s -      1.198M in   5.048318s
zlib write (using new)
                        151.681k (± 8.0%) i/s -    755.730k in   5.013517s

Benchmark for normal input:
Warming up --------------------------------------
zlib write (original)
                        59.000  i/100ms
zlib write (using new)
                        65.000  i/100ms
Calculating -------------------------------------
zlib write (original)
                        668.743  (± 0.4%) i/s -      3.363k in   5.028931s
zlib write (using new)
                        655.595  (± 0.9%) i/s -      3.315k in   5.056929s

For the normal-sized input (430 kB), Ruby took 1/668.743 seconds = 1.49 ms per operation, while the current Crystal implementation took 1.42 ms. So in the scenarios where you would actually consider using deflate, Crystal is faster.

For the small input (43 bytes), Ruby took 1/237280 seconds = 4.21 μs while the current Crystal implementation took 8.8 μs. However, as you can see, there are two Ruby benchmarks! The class method Deflate.deflate works very differently from the instance method Deflate#deflate, which actually takes 6.59 μs. The remaining difference probably has something to do with the overhead of String.build versus just returning a String.
For the small input (43b) ruby took 1/237280 seconds = 4.21μs while the current crystal implementation took 8.8μs. However, as you can see there are two ruby benchmarks! The class method Deflate.deflate works very differently to Deflate#deflate. Deflate#deflate actually takes 6.59μs. The remaining difference probably has something to do with the overhead caused by String.build versus just returning a String.