1641 Fibers created by hexapdf per iteration for yjit-bench benchmark

eregon commented 1 year ago

Hello,

I'm investigating a performance issue for TruffleRuby on the hexapdf yjit-bench benchmark and I noticed through this change to https://github.com/Shopify/yjit-bench/blob/main/benchmarks/hexapdf/benchmark.rb :

diff --git a/benchmarks/hexapdf/benchmark.rb b/benchmarks/hexapdf/benchmark.rb
index ff0aa71..d5367ad 100644
--- a/benchmarks/hexapdf/benchmark.rb
+++ b/benchmarks/hexapdf/benchmark.rb
@@ -10,6 +10,15 @@ use_gemfile
 require "hexapdf"
 require "fileutils"

+class << Fiber
+  alias_method :_new, :new
+  def new(...)
+    r = _new(...)
+    p r
+    r
+  end
+end
+
 IN_FILENAME = "odyssey.txt"
 WIDTH = 50
 HEIGHT = 1000

That each iteration of that benchmark, which renders the Odyssey once per iteration, creates 1641 Fibers (I checked on both CRuby 3.1 and TruffleRuby, counted by xsel -b | grep Fiber | wc -l for the output between 2 iterations). It looks like:

iter #  5: 5367ms, mad=0.2452/0.0000, median=8466ms
#<HexaPDF::FiberWithLength:0x315c8 @length=-1 /home/eregon/code/yjit-bench/benchmarks/hexapdf/vendor/bundle/truffleruby/3.1.3.5/gems/hexapdf-0.16.0/lib/hexapdf/content/canvas.rb:206 (created)>
#<Fiber:0x315e8 /home/eregon/code/yjit-bench/benchmarks/hexapdf/vendor/bundle/truffleruby/3.1.3.5/gems/hexapdf-0.16.0/lib/hexapdf/filter/flate_decode.rb:89 (created)>
#<HexaPDF::FiberWithLength:0x315f8 @length=-1 /home/eregon/code/yjit-bench/benchmarks/hexapdf/vendor/bundle/truffleruby/3.1.3.5/gems/hexapdf-0.16.0/lib/hexapdf/content/canvas.rb:206 (created)>
#<Fiber:0x31618 /home/eregon/code/yjit-bench/benchmarks/hexapdf/vendor/bundle/truffleruby/3.1.3.5/gems/hexapdf-0.16.0/lib/hexapdf/filter/flate_decode.rb:89 (created)>
#<HexaPDF::FiberWithLength:0x31628 @length=-1 /home/eregon/code/yjit-bench/benchmarks/hexapdf/vendor/bundle/truffleruby/3.1.3.5/gems/hexapdf-0.16.0/lib/hexapdf/content/canvas.rb:206 (created)>
#<Fiber:0x31648 /home/eregon/code/yjit-bench/benchmarks/hexapdf/vendor/bundle/truffleruby/3.1.3.5/gems/hexapdf-0.16.0/lib/hexapdf/filter/flate_decode.rb:89 (created)>
#<HexaPDF::FiberWithLength:0x31658 @length=-1 /home/eregon/code/yjit-bench/benchmarks/hexapdf/vendor/bundle/truffleruby/3.1.3.5/gems/hexapdf-0.16.0/lib/hexapdf/content/canvas.rb:206 (created)>
#<Fiber:0x31678 /home/eregon/code/yjit-bench/benchmarks/hexapdf/vendor/bundle/truffleruby/3.1.3.5/gems/hexapdf-0.16.0/lib/hexapdf/filter/flate_decode.rb:89 (created)>
...
#<Fiber:0x3adf8 /home/eregon/code/yjit-bench/benchmarks/hexapdf/vendor/bundle/truffleruby/3.1.3.5/gems/hexapdf-0.16.0/lib/hexapdf/filter/flate_decode.rb:89 (created)>
#<HexaPDF::FiberWithLength:0x3ae08 @length=-1 /home/eregon/code/yjit-bench/benchmarks/hexapdf/vendor/bundle/truffleruby/3.1.3.5/gems/hexapdf-0.16.0/lib/hexapdf/content/canvas.rb:206 (created)>
#<Fiber:0x3ae28 /home/eregon/code/yjit-bench/benchmarks/hexapdf/vendor/bundle/truffleruby/3.1.3.5/gems/hexapdf-0.16.0/lib/hexapdf/filter/flate_decode.rb:89 (created)>
#<HexaPDF::FiberWithLength:0x3ae38 @length=-1 /home/eregon/code/yjit-bench/benchmarks/hexapdf/vendor/bundle/truffleruby/3.1.3.5/gems/hexapdf-0.16.0/lib/hexapdf/content/canvas.rb:206 (created)>
#<Fiber:0x3ae58 /home/eregon/code/yjit-bench/benchmarks/hexapdf/vendor/bundle/truffleruby/3.1.3.5/gems/hexapdf-0.16.0/lib/hexapdf/filter/flate_decode.rb:89 (created)>
#<HexaPDF::FiberWithLength:0x3ae68 @length=36814 /home/eregon/code/yjit-bench/benchmarks/hexapdf/vendor/bundle/truffleruby/3.1.3.5/gems/hexapdf-0.16.0/lib/hexapdf/filter.rb:105 (created)>
#<Fiber:0x3ae88 /home/eregon/code/yjit-bench/benchmarks/hexapdf/vendor/bundle/truffleruby/3.1.3.5/gems/hexapdf-0.16.0/lib/hexapdf/filter/flate_decode.rb:89 (created)>
#<HexaPDF::FiberWithLength:0x3ae98 @length=30319 /home/eregon/code/yjit-bench/benchmarks/hexapdf/vendor/bundle/truffleruby/3.1.3.5/gems/hexapdf-0.16.0/lib/hexapdf/filter.rb:105 (created)>
#<Fiber:0x3aeb8 /home/eregon/code/yjit-bench/benchmarks/hexapdf/vendor/bundle/truffleruby/3.1.3.5/gems/hexapdf-0.16.0/lib/hexapdf/filter/flate_decode.rb:89 (created)>
#<HexaPDF::FiberWithLength:0x3aec8 @length=30518 /home/eregon/code/yjit-bench/benchmarks/hexapdf/vendor/bundle/truffleruby/3.1.3.5/gems/hexapdf-0.16.0/lib/hexapdf/filter.rb:105 (created)>
#<Fiber:0x3aee8 /home/eregon/code/yjit-bench/benchmarks/hexapdf/vendor/bundle/truffleruby/3.1.3.5/gems/hexapdf-0.16.0/lib/hexapdf/filter/flate_decode.rb:89 (created)>
#<HexaPDF::FiberWithLength:0x3aef8 @length=30721 /home/eregon/code/yjit-bench/benchmarks/hexapdf/vendor/bundle/truffleruby/3.1.3.5/gems/hexapdf-0.16.0/lib/hexapdf/filter.rb:105 (created)>
#<Fiber:0x3af18 /home/eregon/code/yjit-bench/benchmarks/hexapdf/vendor/bundle/truffleruby/3.1.3.5/gems/hexapdf-0.16.0/lib/hexapdf/filter/flate_decode.rb:89 (created)>
#<HexaPDF::FiberWithLength:0x3af28 @length=4142 /home/eregon/code/yjit-bench/benchmarks/hexapdf/vendor/bundle/truffleruby/3.1.3.5/gems/hexapdf-0.16.0/lib/hexapdf/filter.rb:105 (created)>
#<Fiber:0x3af48 /home/eregon/code/yjit-bench/benchmarks/hexapdf/vendor/bundle/truffleruby/3.1.3.5/gems/hexapdf-0.16.0/lib/hexapdf/filter/flate_decode.rb:89 (created)>
#<HexaPDF::FiberWithLength:0x3af58 @length=11480 /home/eregon/code/yjit-bench/benchmarks/hexapdf/vendor/bundle/truffleruby/3.1.3.5/gems/hexapdf-0.16.0/lib/hexapdf/filter.rb:105 (created)>
#<Fiber:0x3af88 /home/eregon/code/yjit-bench/benchmarks/hexapdf/vendor/bundle/truffleruby/3.1.3.5/gems/hexapdf-0.16.0/lib/hexapdf/filter/predictor.rb:151 (created)>
#<Fiber:0x3af98 /home/eregon/code/yjit-bench/benchmarks/hexapdf/vendor/bundle/truffleruby/3.1.3.5/gems/hexapdf-0.16.0/lib/hexapdf/filter/flate_decode.rb:89 (created)>
iter #  6: 5091ms, mad=0.2961/0.0000, median=7428ms

Is this intended? Is it something that changed in hexapdf or it was always like that?

It's not necessarily a problem, but I was surprised to see so many Fibers created. I suspect the overhead of creating all these Fibers and passing data around could be significant, even on CRuby. On TruffleRuby/JRuby it's a much higher overhead currently due to Fibers using a real thread (Loom is still experimental in the last JDK LTS, 17).

Maybe some of these Fibers could be reused, or it could be written in a way e.g. that a block is passed and that block would be called instead of Fiber.yield?

eregon commented 1 year ago

FWIW a bit more than a year ago it did run quite a bit faster on TruffleRuby so at least the Fibers don't seem to prevent overall speedups. But they might be the big part of the noise (error bars) there, creating many native threads is quite noisy performance-wise. And of course they might limit the speedups by JITs as well.

eregon commented 1 year ago

https://gist.github.com/eregon/fa8ea01f1b2d2a2c1d60f68e0886f7f9 is the full list of Fibers on CRuby for one iteration, in case it's useful. It seems mostly Fibers created at these two places:

#<HexaPDF::FiberWithLength:0x00007fc2897c69d0 /home/eregon/code/yjit-bench/benchmarks/hexapdf/vendor/bundle/ruby/3.1.0/gems/hexapdf-0.16.0/lib/hexapdf/content/canvas.rb:206 (created)>
#<Fiber:0x00007fc2897c6728 /home/eregon/code/yjit-bench/benchmarks/hexapdf/vendor/bundle/ruby/3.1.0/gems/hexapdf-0.16.0/lib/hexapdf/filter/flate_decode.rb:89 (created)>

gettalong commented 1 year ago

I don't think that anything has changed in HexaPDF regarding its use of Fibers. They are mainly used for parsing and serializing stream objects, i.e. parts of the PDF that potentially contain large amounts of (binary) data.

In context of the benchmark one fiber is used per content stream (i.e. page) to flate encode it. So my guess is that the PDF contains around 1641 pages.

The reason for using fibers instead of blocks is that fibers can be restarted anywhere during their execution. This allows HexaPDF to read and process streams in chunks without greatly altering the underlying algorithmic flow.

eregon commented 1 year ago

Maybe some of these Fibers could be reused, or it could be written in a way e.g. that a block is passed and that block would be called instead of Fiber.yield?

Ah BTW something else that could be useful is Enumerator.new { |y| ... y << 42 ... }, as long as next/peek is not used that also doesn't use a Fiber, but it has all Enumerable methods available.

eregon commented 1 year ago

The resulting PDF has 814 pages, so it seems like 2 Fibers per page.

gettalong commented 1 year ago

The resulting PDF has 814 pages, so it seems like 2 Fibers per page.

Yeah, that may also be the case, one providing the content and the other flate-encoding it.

eregon commented 1 year ago

The reason for using fibers instead of blocks is that fibers can be restarted anywhere during their execution. This allows HexaPDF to read and process streams in chunks without greatly altering the underlying algorithmic flow.

Right. I wonder how much one would need to change the code to avoid some of these Fibers, but given I don't know the hexapdf codebase I have little clue. I think it could be interesting as an experiment to try to remove one or both of these 2 main sources of Fibers, or reuse a Fiber across pages and see what effect this has on performance, footprint (each Fiber needs a native stack) and how much the logic would need to change. But this is of course totally up to you, and also feel free to close the issue.

gettalong commented 1 year ago

Ah BTW something else that could be useful is Enumerator.new { |y| ... y << 42 ... }, as long as next/peek is not used that also doesn't use a Fiber, but it has all Enumerable methods available.

I'm not sure if this would help... Right now I can say: Here is the content, wrap it in flate encode and then encode it as ASCII hex characters. With the Enumerator and not using #next, the control flow would be reversed, wouldn't it? Ie. if I can only use Enumerator#each.

gettalong commented 1 year ago

Right. I wonder how much one would need to change the code to avoid some of these Fibers, but given I don't know the hexapdf codebase I have little clue.

It is assumed everywhere that PDF filters (like flate encoding) are implemented with Fibers. So this would need to be changed on a global basis.

However...

I think it could be interesting as an experiment to try to remove one or both of these 2 main sources of Fibers, or reuse a Fiber across pages and see what effect this has on performance, footprint (each Fiber needs a native stack) and how much the logic would need to change.

In the context of the benchmark it is possible to easily remove one of them, the one responsible for providing the data. You can try the branch https://github.com/gettalong/hexapdf/tree/fiber-removal. It should work with the YJIT benchmark code but I didn't run it. Local testing with the original line wrapping benchmark code showed that half of the fibers were gone.

eregon commented 1 year ago

With the Enumerator and not using #next, the control flow would be reversed, wouldn't it?

Yes indeed. That's of course a nice property of Fibers and I don't think it can really be replaced by anything else without reversing control flow. Sometimes that's easy and sometimes that's hard.

https://github.com/gettalong/hexapdf/tree/fiber-removal looks good to me, IIUC that's fully equivalent but avoids half of the Fibers, so then it would make sense to merge it (with a different name), no?

I'll benchmark it, I need a fix in TruffleRuby/GraalVM first and then I should be able to compare before/after that branch.

gettalong commented 1 year ago

No, that branch is just a quick fix to get the benchmark working. It doesn't cover all cases and needs more work. If your tests show better results, I will have a more in-depth look at it.

eregon commented 1 year ago

I ran the benchmark on TruffleRuby JVM CE, unmodified and with that commit locally applied (on top of hexapdf-0.16.0), 2 runs of each. Original: 952ms, 942ms fiber-removal: 851ms, 844ms

So that's a clear 12% speedup from that change.

gettalong commented 1 year ago

Thanks for benchmarking! I will see if I can make the change work generally.

gettalong commented 1 year ago

@eregon Btw I didn't see any difference when running on MRI.

eregon commented 1 year ago

I tried to run it with 3.2 YJIT but it was a bit noisy and I didn't have much time to benchmark. It looked like a small improvement. One thing to consider is TruffleRuby is about 2x faster on this benchmark (once warmed up, on my machine), so the impact of Fibers might be less noticeable (but could become more as JITs get better). And of course Fibers are also cheaper to create on CRuby.

gettalong commented 1 year ago

@eregon I have implemented the change in a more general way and pushed the changes to the https://github.com/gettalong/hexapdf/tree/devel branch. Could you run your benchmark again to see if that still has benefits for TruffleRuby?

I tested locally using the benchmark/raw_text/hexapdf.rb script and Ruby 3.2 (with and without YJIT) but saw no tangible performance benefits but also no performance drops.

eregon commented 1 year ago

I have implemented the change in a more general way

Thank you!

Could you run your benchmark again to see if that still has benefits for TruffleRuby?

I'm confident it would give the same or very similar gains, and the difference with the previous patch is marginal in terms of performance. I don't have too much time this week, I have many urgent things going on. The cost is creating Fibers (which creates an OS Thread on TruffleRuby) and switching from/to Fibers (which is a full thread context switch on TruffleRuby), plus not being able to inline anything between the Fiber and its caller.

gettalong commented 1 year ago

Thanks @eregon - then I will close this issue.

gettalong / hexapdf

1641 Fibers created by hexapdf per iteration for yjit-bench benchmark #240