asterite opened this issue 2 years ago
I cannot really reproduce the performance gains. Actually, while I tried this patch, the results got worse for me.
My results:
GC malloc/free 48.44k ( 20.64µs) (± 3.19%) 271kB/op fastest
LibC malloc/free 39.61k ( 25.25µs) (± 1.69%) 8.39kB/op 1.22× slower
The ruby version only takes around 4μs for me:
Warming up --------------------------------------
deflating 25.395k i/100ms
Calculating -------------------------------------
deflating 253.249k (± 0.6%) i/s - 1.270M in 5.014039s
OS? I'm using Mac
Fedora Linux 35.20211211.0 (Silverblue)
But I just noticed that LibC malloc/free is faster for bigger inputs (I just tried the benchmark using string = "The quick brown fox jumps over the lazy dog" * 1000).
It would be nice to profile the benchmark program then, to see where most time is spent.
Raw callgrind output: callgrind.zip
Oops, those were the results from string = "The quick brown fox jumps over the lazy dog" * 1000.
These are the results for the short string: callgrind2.zip
Here are my results (running on MacOS 12):
without patch:
1.741514 2.111190 3.852704 ( 1.745935)
with malloc patch:
0.801642 0.363411 1.165053 ( 0.855656)
with patch + reducing buffer size to 1MB (faster than Ruby 🥳):
0.679146 0.097878 0.777024 ( 0.723027)
I also looked at the Writer implementation and picked the minimum amount of code needed for my very specific use-case (take an existing String and turn it into a deflated Slice, without supporting output larger than the buffer size):
def deflate(string : String)
  stream = LibZ::ZStream.new
  stream.zalloc = LibZ::AllocFunc.new { |opaque, items, size| LibC.malloc(items * size) }
  stream.zfree = LibZ::FreeFunc.new { |opaque, address| LibC.free(address) }
  ret = LibZ.deflateInit2(
    pointerof(stream),
    Compress::Deflate::DEFAULT_COMPRESSION,
    LibZ::Z_DEFLATED,
    -LibZ::MAX_BITS, # negative window bits: raw deflate, no zlib header/trailer
    LibZ::DEF_MEM_LEVEL,
    Compress::Deflate::Strategy::DEFAULT.value,
    LibZ.zlibVersion,
    sizeof(LibZ::ZStream),
  )
  raise "error in deflateInit2" unless ret.ok?
  string_slice = string.to_slice
  stream.avail_in = string_slice.size.to_u32
  stream.next_in = string_slice.to_unsafe # next_in expects a pointer, not a Slice
  buffer = uninitialized UInt8[1024]
  stream.next_out = buffer.to_unsafe
  stream.avail_out = buffer.size.to_u32
  LibZ.deflate(pointerof(stream), LibZ::Flush::FINISH)
  ret = LibZ.deflateEnd(pointerof(stream))
  raise "error in deflateEnd" unless ret.ok?
  # NOTE: buffer lives on the stack, so the returned slice must be
  # consumed (or copied) before it goes out of scope
  buffer.to_slice[0, buffer.size - stream.avail_out]
end
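As a cross-check (not part of any patch), the same raw-deflate setup (negative window bits, so no zlib header or trailer) can be reproduced with Ruby's stdlib zlib binding. A minimal sketch for verifying that output produced this way round-trips:

```ruby
require 'zlib'

# Raw deflate: the negative window_bits mirrors the -MAX_BITS passed to
# deflateInit2 above (no zlib header/trailer around the compressed data).
def raw_deflate(string)
  z = Zlib::Deflate.new(Zlib::DEFAULT_COMPRESSION, -Zlib::MAX_WBITS)
  out = z.deflate(string, Zlib::FINISH)
  z.close
  out
end

s = "The quick brown fox jumps over the lazy dog"
compressed = raw_deflate(s)

# Raw inflate to verify the round trip:
restored = Zlib::Inflate.new(-Zlib::MAX_WBITS).inflate(compressed)
raise "round trip failed" unless restored == s
```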
Using this one I get even faster results: 0.587206 0.042694 0.629900 ( 0.631345)
It's interesting because in the last benchmark one can see that GC_malloc takes a lot of time, being called by deflateInit2. What's the benchmark like if LibC.malloc is used? Maybe malloc on Linux is still slow, but using another allocator could improve things.
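For context on why deflateInit2 shows up so prominently: zlib's own documentation estimates deflate's internal state at (1 << (windowBits+2)) + (1 << (memLevel+9)) bytes, all allocated up front through zalloc. With the defaults used here (windowBits = 15, memLevel = 8) that comes to 256 KiB per stream, which lines up with the ~271kB/op reported in the benchmark above. A quick check of the arithmetic:

```ruby
# zlib's documented memory estimate for a deflate stream:
#   (1 << (windowBits + 2)) + (1 << (memLevel + 9)) bytes
window_bits = 15 # MAX_WBITS
mem_level   = 8  # DEF_MEM_LEVEL

estimate = (1 << (window_bits + 2)) + (1 << (mem_level + 9))
puts estimate # 262144 bytes = 256 KiB
```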
I think it may be useful to try the benchmark using musl libc. Maybe glibc is just slow here? As for other allocators, I have no idea how to try that.
Shouldn't GC.malloc be GC.malloc_atomic?
@asterite
I think Ruby uses xmalloc:
https://github.com/ruby/ruby/blob/6a1365d725c72107dd45e1b06ce4acc5549ebaf5/ext/zlib/zlib.c#L632-L633
https://github.com/ruby/ruby/blob/6a1365d725c72107dd45e1b06ce4acc5549ebaf5/ext/zlib/zlib.c#L608-L610
I just looked at the code, it seems that's just LibC malloc.
xmalloc2 is defined as ruby_xmalloc2:
https://github.com/ruby/ruby/blob/cb4e2cb55a59833fc4f1f6db2f3082d1ffcafc80/include/ruby/internal/xmalloc.h#L54
ruby_xmalloc2 is just a normal malloc:
https://github.com/ruby/ruby/blob/cb4e2cb55a59833fc4f1f6db2f3082d1ffcafc80/include/ruby/internal/xmalloc.h#L117-L119
So maybe as a first step we can change it to use LibC, then maybe consider xmalloc if available or if we make sure to distribute it (or maybe jemalloc, no idea which is faster and why)
My tests (Fedora Silverblue 36, in toolbox):
⬢[blobcodes@toolbox crystal]$ ./test-musl
user system total real
deflating 2.329043 3.422368 5.751411 ( 2.094589)
⬢[blobcodes@toolbox crystal]$ ./test
user system total real
deflating 2.329441 3.502011 5.831452 ( 2.157158)
⬢[blobcodes@toolbox crystal]$ LD_PRELOAD=/usr/lib64/libmimalloc.so.2 ./test
user system total real
deflating 2.954194 3.942054 6.896248 ( 2.413770)
⬢[blobcodes@toolbox crystal]$ LD_PRELOAD=/usr/lib64/libjemalloc.so.2 ./test
user system total real
deflating 3.233276 2.321265 5.554541 ( 1.678961)
jemalloc seems to be the best in here, but not by a huge margin.
I also tried GC_malloc_atomic (which segfaulted) and GC_malloc_atomic_uncollectable (which was slow):
user system total real
deflating 5.884300 15.726243 21.610543 ( 5.964796)
I also tried getting rid of the allocator to see what is even possible:
class Compress::Deflate::Writer < IO
  # If `#sync_close?` is `true`, closing this IO will close the underlying IO.
  property? sync_close : Bool

  # Using a preallocated memory area; only one writer can exist at a time
  @@og_memory = Pointer(Void).malloc(2_000_000_000)
  @@memory = Pointer(Void).null

  # Creates an instance of Flate::Writer. `close` must be invoked after all data
  # has been written.
  def initialize(@output : IO, level : Int32 = Compress::Deflate::DEFAULT_COMPRESSION,
                 strategy : Compress::Deflate::Strategy = Compress::Deflate::Strategy::DEFAULT,
                 @sync_close : Bool = false, @dict : Bytes? = nil)
    unless -1 <= level <= 9
      raise ArgumentError.new("Invalid Flate level: #{level} (must be in -1..9)")
    end
    @@memory = @@og_memory
    @buf = uninitialized UInt8[8192] # output buffer used by zlib
    @stream = LibZ::ZStream.new
    @stream.zalloc = LibZ::AllocFunc.new do |opaque, items, size|
      ret = @@memory
      # bump the pointer, rounding each request up to an 8-byte boundary
      @@memory += ((items.to_u64 &* size.to_u64) &+ 7_u64) & ~7_u64
      LibC.printf("ERROR") if @@memory > @@og_memory + 2_000_000_000
      ret
    end
    @stream.zfree = LibZ::FreeFunc.new { |opaque, address| } # never free
    @closed = false
    ret = LibZ.deflateInit2(pointerof(@stream), level, LibZ::Z_DEFLATED, -LibZ::MAX_BITS, LibZ::DEF_MEM_LEVEL,
      strategy.value, LibZ.zlibVersion, sizeof(LibZ::ZStream))
    raise Compress::Deflate::Error.new(ret, @stream) unless ret.ok?
  end
end
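The zalloc replacement above is a bump allocator: it hands out the current pointer, then advances it by the request size rounded up to the next multiple of 8 (add 7, then clear the low three bits), so every returned pointer stays 8-byte aligned. The rounding step in isolation:

```ruby
# Round n up to the next multiple of 8, as the bump allocator does
# to keep every returned pointer 8-byte aligned.
def align8(n)
  (n + 7) & ~7
end

p align8(1)    # => 8
p align8(8)    # => 8
p align8(4093) # => 4096
```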
Results:
user system total real
deflating 1.895543 0.328011 2.223554 ( 0.839783)
Normal benchmarks for comparison:
Normal LibC benchmark:
user system total real
deflating 0.975823 3.485814 4.461637 ( 3.465718)
Using GC:
user system total real
deflating 2.312876 3.502425 5.815301 ( 2.237870)
It seems odd that GC.malloc_atomic crashes here.
Not really, the GC just frees objects that are only referenced from objects which were themselves allocated using malloc_atomic (atomic memory is not scanned for pointers).
@BlobCodes What exactly happens with malloc_atomic? And what's your code?
This patch works quite well:
diff --git i/src/compress/deflate/reader.cr w/src/compress/deflate/reader.cr
index cdb84bbe6..c704c65b6 100644
--- i/src/compress/deflate/reader.cr
+++ w/src/compress/deflate/reader.cr
@@ -24,3 +24,3 @@ class Compress::Deflate::Reader < IO
@stream = LibZ::ZStream.new
- @stream.zalloc = LibZ::AllocFunc.new { |opaque, items, size| GC.malloc(items * size) }
+ @stream.zalloc = LibZ::AllocFunc.new { |opaque, items, size| GC.malloc_atomic(items * size) }
@stream.zfree = LibZ::FreeFunc.new { |opaque, address| GC.free(address) }
diff --git i/src/compress/deflate/writer.cr w/src/compress/deflate/writer.cr
index d72ef4586..e044a4b72 100644
--- i/src/compress/deflate/writer.cr
+++ w/src/compress/deflate/writer.cr
@@ -22,3 +22,3 @@ class Compress::Deflate::Writer < IO
@stream = LibZ::ZStream.new
- @stream.zalloc = LibZ::AllocFunc.new { |opaque, items, size| GC.malloc(items * size) }
+ @stream.zalloc = LibZ::AllocFunc.new { |opaque, items, size| GC.malloc_atomic(items * size) }
@stream.zfree = LibZ::FreeFunc.new { |opaque, address| GC.free(address) }
I think GC.malloc_atomic would probably be better than LibC.malloc (as used in #12457). We're using the GC for almost all allocations. And it makes sense not to distribute across multiple allocators, because that just spreads the free memory in a way that might make the program use more memory than necessary.
@straight-shoota

I think GC.malloc_atomic would probably be better than LibC.malloc

Those C libraries probably store pointers to allocated memory inside allocated memory (which happens very easily, for example with linked lists). Allocating them with malloc_atomic means that they will be collected by the GC but not scanned for pointers. When the GC pressure is high enough, this memory (which is still in use) will be collected.
The current way the bindings work (using GC.malloc) is also not equivalent to C behaviour, and it is quite impressive it doesn't cause any issues.
The BDWGC function (mostly) equivalent to LibC.malloc is GC_malloc_atomic_uncollectable (allocates non-collected memory that is not scanned for pointers), but that is very slow (as shown in my tests above).
The issue I faced when using malloc_atomic was segmentation faults.
To test this yourself, you can try encoding a very large string (e.g. "test string" * 10000) and setting the environment variable GC_FREE_SPACE_DIVISOR=100 (which will cause the GC to collect very often and aggressively). The chance of getting a segfault should be very high this way.
And what's your code?
I basically used the same patch as you along with the benchmark in the issue description, only with a larger string to cause GC pressure.
Actually, the performance difference between Crystal and Ruby is not that big. Here's a benchmark using Benchmark.ips for both the given example string (43 bytes) and a more realistic size (430kB = example string * 10000).
My results to this benchmark are:
Benchmark for small input:
gc malloc 113.64k ( 8.80µs) (±14.15%) 271kB/op 1.77× slower
libc malloc 34.95k ( 28.61µs) (± 7.97%) 8.39kB/op 5.75× slower
no malloc 200.89k ( 4.98µs) (±20.30%) 8.39kB/op fastest
Benchmark for normal input:
gc malloc 704.74 ( 1.42ms) (± 0.50%) 274kB/op 1.01× slower
libc malloc 678.36 ( 1.47ms) (± 0.55%) 12.4kB/op 1.05× slower
no malloc 711.63 ( 1.41ms) (± 3.10%) 12.4kB/op fastest
...and now the Ruby benchmark (using the benchmark-ips gem):
My results to this benchmark are:
Benchmark for small input:
Warming up --------------------------------------
zlib write (original)
23.484k i/100ms
zlib write (using new)
13.995k i/100ms
Calculating -------------------------------------
zlib write (original)
237.280k (± 1.2%) i/s - 1.198M in 5.048318s
zlib write (using new)
151.681k (± 8.0%) i/s - 755.730k in 5.013517s
Benchmark for normal input:
Warming up --------------------------------------
zlib write (original)
59.000 i/100ms
zlib write (using new)
65.000 i/100ms
Calculating -------------------------------------
zlib write (original)
668.743 (± 0.4%) i/s - 3.363k in 5.028931s
zlib write (using new)
655.595 (± 0.9%) i/s - 3.315k in 5.056929s
For the normal-sized input (430kB), Ruby took 1/668.743 seconds = 1.49ms while the current Crystal implementation took 1.42ms. So in normal scenarios where you would actually consider using deflate, Crystal is faster.
For the small input (43 bytes), Ruby took 1/237280 seconds = 4.21μs while the current Crystal implementation took 8.8μs. However, as you can see, there are two Ruby benchmarks! The class method Deflate.deflate works very differently from Deflate#deflate; Deflate#deflate actually takes 6.59μs. The remaining difference probably has something to do with the overhead caused by String.build versus just returning a String.
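The per-operation times quoted here are just reciprocals of the iterations/second reported by benchmark-ips; the conversion can be double-checked directly:

```ruby
# Convert benchmark-ips throughput (iterations/second) to time per iteration.
normal_input = 1.0 / 668.743   # seconds per op, 430kB input
small_input  = 1.0 / 237_280   # seconds per op, 43-byte input

puts "%.3f ms" % (normal_input * 1_000)     # ≈ 1.495 ms
puts "%.2f µs" % (small_input * 1_000_000)  # ≈ 4.21 µs
```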
Take this benchmark:
Output:
But if we change the way memory is allocated for zlib:
and run the benchmark again, we get:
I think Ruby uses xmalloc:
https://github.com/ruby/ruby/blob/6a1365d725c72107dd45e1b06ce4acc5549ebaf5/ext/zlib/zlib.c#L632-L633
https://github.com/ruby/ruby/blob/6a1365d725c72107dd45e1b06ce4acc5549ebaf5/ext/zlib/zlib.c#L608-L610
And a similar benchmark for Ruby:
gives:
So maybe that's why Ruby is faster?
We also have something similar in place for PCRE:
https://github.com/crystal-lang/crystal/blob/fdcbcf4149cda377070f0f1b41583f54bb24b3f0/src/regex/lib_pcre.cr#L31-L32
If I change those to LibC.malloc and LibC.free, and run a similar benchmark where I test "foo".match(/o/), I also get a significant speedup.

So maybe as a first step we can change it to use LibC, then maybe consider xmalloc if available or if we make sure to distribute it (or maybe jemalloc, no idea which is faster and why)