fastruby / fast-ruby

:dash: Writing Fast Ruby :heart_eyes: -- Collect Common Ruby idioms.
https://github.com/fastruby/fast-ruby
5.67k stars 376 forks source link

Parallel versus sequential assignment is microbenchmark (not representative) and scales with number of inputs #208

Open schneems opened 2 years ago

schneems commented 2 years ago

Update: Parallel assignment doesn't allocate unless it's the last statement in a method/proc. There's a benchmark provided below. I'm

I think this benchmark https://github.com/fastruby/fast-ruby/blob/main/code/general/assignment.rb needs to be removed or at least heavily annotated on the readme. It is not representative of real code. In my experience, removing parallel assignments into sequential ones speeds up apps.

One reason is that parallel assignments [can] allocate an intermediate array [if it is the last statement in a method for example], the other is that the benchmark uses MANY assignments and the performance difference seems to scale with assignment numbers. In the wild the most common number is 2.

Also worth noting that by using integers, we're isolating to only the assignment time (mostly) but this distorts the possible impact of the code. If you switch to strings that allocate then the pressure on the GC actually causes sequential assignment to be faster (at least sometimes).

Benchmarks below.

Reasoning

In https://github.com/fastruby/fast-ruby/blob/main/code/general/assignment.rb it states that parallel assignment is much faster than sequential assignment, but if you have only 2 assignments, the difference is negligible:

require 'benchmark/ips'

def fast
  _a, _b= 1, 2
  nil
end

def slow
  _a = 1
  _b = 2
  nil
end

Benchmark.ips do |x|
  x.report('Parallel Assignment')   { fast }
  x.report('Sequential Assignment') { slow }
  x.compare!
end
Warming up --------------------------------------
 Parallel Assignment     1.494M i/100ms
Sequential Assignment
                         1.344M i/100ms
Calculating -------------------------------------
 Parallel Assignment     14.659M (± 4.9%) i/s -     73.229M in   5.007646s
Sequential Assignment
                         15.020M (± 4.0%) i/s -     75.261M in   5.019081s

Comparison:
Sequential Assignment: 15019526.2 i/s
 Parallel Assignment: 14658684.7 i/s - same-ish: difference falls within error

Note that sequential assignment is faster here (but within error). While the parallel assignment is fewer operations, it also introduces an intermediate array that requires an allocation:

$ cat scratch.rb
require 'memory_profiler'
report = MemoryProfiler.report do
  _a, _b = 1, 2
end

report.pretty_print
⛄️ 3.1.2 🚀 /tmp
$ ruby scratch.rb
Total allocated: 40 bytes (1 objects)
Total retained:  40 bytes (1 objects)

allocated memory by gem
-----------------------------------
        40  other

allocated memory by file
-----------------------------------
        40  scratch.rb

allocated memory by location
-----------------------------------
        40  scratch.rb:3

allocated memory by class
-----------------------------------
        40  Array
# ...

[Note that the above only works because the parallel assignment is returned _a, _b = 1, 2; nil does not allocate]

Even in the case of MANY sequential versus parallel assignments, if the function is doing anything non-trivial (such as allocating strings as opposed to using integers which are all singletons, then the effect of the "faster" parallel assignment is pretty much negligible. This run actually has sequential being faster:

require 'benchmark/ips'

def fast
  _a, _b, _c, _d, _e, _f, _g, _h = String.new("1"), String.new("2"), String.new("3"), String.new("4"), String.new("5"), String.new("6"), String.new("7"), String.new("8")
  nil
end

def slow
  _a = String.new("1")
  _b = String.new("2")
  _c = String.new("3")
  _d = String.new("4")
  _e = String.new("5")
  _f = String.new("6")
  _g = String.new("7")
  _h = String.new("8")
  nil
end

Benchmark.ips do |x|
  x.report('Parallel Assignment')   { fast }
  x.report('Sequential Assignment') { slow }
  x.compare!
end
Warming up --------------------------------------
 Parallel Assignment    80.318k i/100ms
Sequential Assignment
                        77.943k i/100ms
Calculating -------------------------------------
 Parallel Assignment    796.245k (± 3.7%) i/s -      4.016M in   5.050973s
Sequential Assignment
                        767.116k (± 5.9%) i/s -      3.897M in   5.099053s

Comparison:
 Parallel Assignment:   796245.3 i/s
Sequential Assignment:   767115.8 i/s - same-ish: difference falls within error

The sequential assignment is also faster here (but within error).

eregon commented 2 years ago

While the parallel assignment is fewer operations, it also introduces an intermediate array that requires an allocation:

But only because the code you used is:

report = MemoryProfiler.report do
  _a, _b = 1, 2
end

Can you try with:

report = MemoryProfiler.report do
  _a, _b = 1, 2
  nil
end

Then there should be no extra array allocation. And in fact the first benchmark's results almost prove there is no extra array, otherwise we'd see a significant slowdown.

I agree measuring with 2 elements is more representative.

I think the conclusion should be: sequential/parallel assignments are basically the same speed. However, this needs parallel assignment to not be the last expression, or in other words the method/block return value. In that case, the benefit of sequential parallel should be clearly visible.

FWIW, I personally think fast-ruby is not a good resource for "what one should use". It all depends on the Ruby version, implementation, context, how it's used, etc, etc.

eregon commented 2 years ago

Quick numbers:

require 'benchmark/ips'

def parallel
  _a, _b = 1, 2
  # nil
end

def sequential
  _a = 1
  _b = 2
  # nil
end

Benchmark.ips do |x|
  x.report('Parallel Assignment')   { parallel }
  x.report('Sequential Assignment') { sequential }
  x.compare!
end

With the nils uncommented:

 Parallel Assignment     16.221M (± 1.7%) i/s -     81.308M in   5.014057s
Sequential Assignment
                         16.024M (± 2.2%) i/s -     80.591M in   5.031993s

Comparison:
 Parallel Assignment: 16221113.6 i/s
Sequential Assignment: 16024025.0 i/s - same-ish: difference falls within error

So it's the same speed (and sequential is not any faster for this case, it'd just be noise).

With the nils commented:

 Parallel Assignment     11.434M (± 0.2%) i/s -     58.219M in   5.091781s
Sequential Assignment
                         16.122M (± 0.2%) i/s -     80.610M in   5.000155s

Comparison:
Sequential Assignment: 16121590.3 i/s
 Parallel Assignment: 11434027.0 i/s - 1.41x  (± 0.00) slower

As expected Sequential is much faster here, because Parallel must do an array allocation with the Parallel Assignment as last expression.

eregon commented 2 years ago

And with MemoryProfiler:

$ ruby -rmemory_profiler -e 'MemoryProfiler.report { _a, _b = 1, 2 }.pretty_print'
Total allocated: 40 bytes (1 objects)
Total retained:  40 bytes (1 objects)
...

$ ruby -rmemory_profiler -e 'MemoryProfiler.report { _a, _b = 1, 2; nil }.pretty_print'
Total allocated: 0 bytes (0 objects)
Total retained:  0 bytes (0 objects)
schneems commented 2 years ago

Thanks for catching that. The nil return nuance is new to me (from this morning even). I think maybe I associate parallel assignment with array allocation due to this profiling quirk. I'll update my main post to take that bit out.

eregon commented 2 years ago

This run actually has sequential being faster:

I believe that's just noise or too small to matter. I ran it 3 times locally, and the first I got Sequential 1.02x slower, 2nd and 3rd run same-ish: difference falls within error. The many allocations (BTW, 2 per element since no frozen_string_literal) completely bury the costs of the assignments, so that makes the benchmark a bit useless because it's mostly measuring those String allocations + GC.

I know you'd like Sequential Assignment to be faster, but it's not true/not that simple in general. It is very true if the Parallel Assignment is the last expression/a return value of a method/block, and this is worth making explicit about this benchmark. In other cases, they are basically the same and it's only a syntactic/bytecode difference. On CRuby Parallel Assignment is a bit faster likely due to bytecode details, but this "a bit faster" is likely too small to ever matter in practice.

For a general Ruby JIT, there should be no significant difference in generated code between both kind of assignments (for the non-last-expression case). For a Ruby JIT with inline and escape analysis, both kind of assignments are exactly the same, even if it's the last expression, as long as the JIT can see the result is unused (inlines enough to see that), as shown here: https://twitter.com/eregontp/status/1564950367485452288

eregon commented 2 years ago

In other cases, they are basically the same and it's only a syntactic/bytecode difference.

To illustrate this point, we can use ruby --dump=insns assign.rb on the original benchmark:

== disasm: #<ISeq:fast@assign.rb:3 (3,0)-(6,3)> (catch: FALSE)
local table (size: 8, argc: 0 [opts: 0, rest: -1, post: 0, block: -1, kw: -1@-1, kwrest: -1])
[ 8] _a@0       [ 7] _b@1       [ 6] _c@2       [ 5] _d@3       [ 4] _e@4       [ 3] _f@5       [ 2] _g@6       [ 1] _h@7
0000 putobject_INT2FIX_1_                                             (   4)[LiCa]
0001 putobject                              2
0003 putobject                              3
0005 putobject                              4
0007 putobject                              5
0009 putobject                              6
0011 putobject                              7
0013 putobject                              8
0015 setlocal_WC_0                          _h@7
0017 setlocal_WC_0                          _g@6
0019 setlocal_WC_0                          _f@5
0021 setlocal_WC_0                          _e@4
0023 setlocal_WC_0                          _d@3
0025 setlocal_WC_0                          _c@2
0027 setlocal_WC_0                          _b@1
0029 setlocal_WC_0                          _a@0
0031 putnil
0032 leave                                                            (   6)[Re]

== disasm: #<ISeq:slow@assign.rb:8 (8,0)-(18,3)> (catch: FALSE)
local table (size: 8, argc: 0 [opts: 0, rest: -1, post: 0, block: -1, kw: -1@-1, kwrest: -1])
[ 8] _a@0       [ 7] _b@1       [ 6] _c@2       [ 5] _d@3       [ 4] _e@4       [ 3] _f@5       [ 2] _g@6       [ 1] _h@7
0000 putobject_INT2FIX_1_                                             (   9)[LiCa]
0001 setlocal_WC_0                          _a@0
0003 putobject                              2                         (  10)[Li]
0005 setlocal_WC_0                          _b@1
0007 putobject                              3                         (  11)[Li]
0009 setlocal_WC_0                          _c@2
0011 putobject                              4                         (  12)[Li]
0013 setlocal_WC_0                          _d@3
0015 putobject                              5                         (  13)[Li]
0017 setlocal_WC_0                          _e@4
0019 putobject                              6                         (  14)[Li]
0021 setlocal_WC_0                          _f@5
0023 putobject                              7                         (  15)[Li]
0025 setlocal_WC_0                          _g@6
0027 putobject                              8                         (  16)[Li]
0029 setlocal_WC_0                          _h@7
0031 putnil
0032 leave                                                            (  18)[Re]

So it's literally the same instructions, just in a slightly different order in CRuby 3.1.2. Which then makes it a little bit surprising why we can even see Parallel Assignment faster (maybe some better cache locality or maybe consecutive instructions are done slightly better?).

schneems commented 2 years ago

I know you'd like Sequential Assignment to be faster, but it's not true/not that simple in general.

I understand the limits of the measurements and statistics. I'm able to detect a 1% request speedup with 99% confidence in a Rails app end to end. Via derailed https://github.com/zombocom/derailed_benchmarks/blob/0102612a9915179abca304ad0392aa84ed6bdf0e/lib/derailed_benchmarks/stats_from_dir.rb#L143-L165.

I generally try annotating with things like (but within error). But sometimes I miss a few.

If we're talking probability. I would say my prior was strong that parallel assignment allocates, then you came along and made a chip in that prior, then adam came along and told me what I wanted to hear. I'm not being intentionally misleading, but I am very much intentionally trying to prove or disprove that concept to myself.

On CRuby Parallel Assignment is a bit faster likely due to bytecode details, but this "a bit faster" is likely too small to ever matter in practice.

100% agreed here. It's an interesting trivia night question, but won't affect IRL performance.

So it's literally the same instructions, just in a slightly different order in CRuby 3.1.2. Which then makes it a little bit surprising why we can even see Parallel Assignment faster (maybe some better cache locality or maybe consecutive instructions are done slightly better?).

Fascinating 🤷🏻‍♂️. Thanks for all the extra info