Shopify / yjit

Optimizing JIT compiler built inside CRuby

More of yjit documentation #312

Closed. konalegi closed this issue 12 months ago.

konalegi commented 1 year ago

First of all, thanks for the great job with YJIT and your work on Ruby's performance!

Recently we enabled YJIT for our monolith application and got not-so-great results: around a 6% speed-up in Rack latency. I'm thinking maybe I could leverage different command-line options like --yjit-exec-mem-size and --yjit-call-threshold. Do you know which stats in RubyVM::YJIT.runtime_stats I should look at to determine the best values for them? For instance, how do I determine that the default 64MB of memory is enough for my application? Overall, if you could document all the stats in detail, that would be really awesome.

Thanks!

noahgibbs commented 1 year ago

--yjit-call-threshold will only help if there are things you want compiled and they're not getting compiled. The easiest way to test that is to use --yjit-call-threshold=1 and see if anything improves. If not, the threshold wasn't your problem! The default is normally fine for Rack servers and other long-running processes. It's rare to have a method that's slow enough to make a major performance difference but that you don't call at least 30 times.
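For example, a quick check could look something like this (just a sketch; compiled_iseq_count is a key name I'd expect in RubyVM::YJIT.runtime_stats, but it may not be present in every build):

```ruby
# Run your app (or this script, after exercising the hot paths) once with the
# default threshold and once with a very low one, e.g.:
#   ruby --yjit --yjit-call-threshold=1 check_yjit.rb
# then compare how much actually got compiled and whether latency improved.
stats = RubyVM::YJIT.runtime_stats || {}
puts "YJIT enabled:     #{RubyVM::YJIT.enabled?}"
puts "compiled iseqs:   #{stats[:compiled_iseq_count].inspect}"  # assumed key name
puts "inline code size: #{stats[:inline_code_size]} bytes"
```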

We don't normally keep a full, detailed breakdown of the stats documented, because we change them over time. It's a reasonable request, but we're not convinced they're going to stay stable long-term. It is probably time to talk with the team about whether documenting some of the longest-lived production-mode stats (e.g. inline_code_size, total_exit_count, avg_len_in_yjit) is a good idea -- I'll be surprised if we get rid of those or change them significantly.

If you're trying to tell whether you have enough memory, you can look at inline_code_size plus outlined_code_size -- those are the two things that get put into your executable (generated code) memory. You can also look at code_gc_count. It's normal for it to rise occasionally, and slowly -- like with regular memory, we do a GC periodically, and that's fine. But with too little memory, YJIT has to do code GCs more often to free up space.
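Concretely, something like this (a sketch using only the stats named above, assuming they're present in your build):

```ruby
stats = RubyVM::YJIT.runtime_stats || {}

# Total generated machine code, to compare against the --yjit-exec-mem-size budget.
code_bytes = stats.fetch(:inline_code_size, 0) + stats.fetch(:outlined_code_size, 0)
puts "generated code:  #{(code_bytes / (1024.0 * 1024.0)).round(1)} MiB"
puts "code GCs so far: #{stats[:code_gc_count]}"
```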

As with threshold, if you increase the memory a lot (128MB, say, or 256MB) and you don't see much difference in code_gc_count then you had plenty of memory and it was fine. Code GCs aren't all that slow. The main worry with them is that you may be getting a lot of GCs because there's just not enough space -- the GCs aren't the problem, but they're the first obvious symptom of it.

I wouldn't normally change --yjit-max-versions or --yjit-greedy-versioning, and I'm not going to suggest it in yjit.md. You can, but we rarely use them internally. When we do it's mainly for trying to figure out if we should change default settings. There are rare cases where they might make sense to tune in production, but that would be a last resort for a case where the default YJIT performance is bad because of a use case we didn't expect. If you change those, it means you'll want to re-test them with every Ruby upgrade. We don't recommend most people do it.

noahgibbs commented 1 year ago

I've added a PR that helps a bit with this. https://github.com/ruby/ruby/pull/7840

maximecb commented 1 year ago

I understand you would like to see more of a speedup, but 6% is not bad.

How much of a speedup you get is dependent on a variety of factors, some of which are out of YJIT's control, such as how much time your app spends waiting on I/O or database requests.

In terms of helping figure out what YJIT could do better, the most useful thing would be the output of running your app with --yjit-stats. If you can dump some sample RubyVM::YJIT.runtime_stats output in this thread, we may be able to provide some useful feedback.
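For example, something like this would capture a snapshot to paste here (just a sketch; the initializer path and output file name are arbitrary):

```ruby
# e.g. in a hypothetical config/initializers/yjit_stats.rb
require "json"

at_exit do
  if defined?(RubyVM::YJIT) && RubyVM::YJIT.enabled?
    stats = RubyVM::YJIT.runtime_stats
    File.write("yjit_stats.json", JSON.pretty_generate(stats)) if stats
  end
end
```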

konalegi commented 1 year ago

@noahgibbs Thank you for your explanation, it was really helpful. Sorry for the slow reply, but I wanted to play around with the new information :) I want to share my little story of YJIT optimization; it might be helpful for your team or for someone else. First of all, it looks like there is no problem with memory (the graph below is (inline_code_size + outlined_code_size) / (1024 * 1024)):

[Screenshot 2023-05-26: graph of (inline_code_size + outlined_code_size) in MB over time]

and the graph of code_gc_count

[Screenshot 2023-05-26: graph of code_gc_count over time]

these graphs are from RubyVM::YJIT.runtime_stats of production servers without --yjit-stats.
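(For context, the graphs are fed by periodically sampling runtime_stats, roughly like the sketch below; the interval and metric names are only illustrative and the real metrics client is omitted.)

```ruby
# Background sampler reporting the two values graphed above (illustrative only).
Thread.new do
  loop do
    if (stats = RubyVM::YJIT.runtime_stats)
      code_mib = (stats[:inline_code_size] + stats[:outlined_code_size]) / (1024.0 * 1024.0)
      # Replace with your metrics client (StatsD, Prometheus, etc.)
      puts "yjit.code_size_mib=#{code_mib.round(2)} yjit.code_gc_count=#{stats[:code_gc_count]}"
    end
    sleep 60
  end
end
```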

So, based on the explanation above, I made the assumption that only a tiny part of my app is being JITted. I created local synthetic tests to play around with the other parameters and got exciting results. The parameter that caught my attention is avg_len_in_yjit; if I understand it correctly, it's the amount of code that stays inside YJIT without side-exiting to the default interpreter.

[Screenshot 2023-05-26: graph of avg_len_in_yjit over time]

And it looks like, for my setup, increasing --yjit-call-threshold gave me an additional ~6% (in production), with lower memory usage (~13MB, down from ~37MB) and fewer CPU spikes during deployment. Now I get around a 12% improvement for Rack requests, which I think is excellent.

The numbers from the synthetic tests (in production they are worse), measuring Rack request time of a pretty big monolithic Rails application (mainly a GraphQL API):

- default (YJIT disabled): ~1100ms
- YJIT (default settings): ~975ms
- YJIT with --yjit-call-threshold=2000: ~889ms
- YJIT with --yjit-call-threshold=2000 --yjit-greedy-versioning: ~887ms

Modifying --yjit-max-versions or --yjit-greedy-versioning does not change anything.

Thanks for your attention! :)

noahgibbs commented 1 year ago

Most of that makes a lot of sense. I'm a little confused by how it works out with an increased --yjit-call-threshold. That just increases the time before YJIT starts compiling. As you can see on your graph above, it gets to about the same avg_len_in_yjit over time (which is, on average, how many JITted instructions it executes per side exit).

But I'm not sure why a higher threshold would result in a faster sustained speed for Rack requests. Clearly it does -- that's what you measured. I'm just a little bit confused about why :-)

But your earlier analysis makes sense. As you say, those graphs don't show any memory pressure, or any need to increase exec-mem.
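(For reference, you can eyeball that ratio directly; this is a sketch, and the numbers are only really meaningful in a --yjit-stats build, where total_exit_count is populated.)

```ruby
stats = RubyVM::YJIT.runtime_stats || {}

exits = stats[:total_exit_count]
if exits && exits > 0
  # avg_len_in_yjit: JITted instructions executed per side exit, on average
  puts "avg_len_in_yjit: #{stats[:avg_len_in_yjit]}"
  puts "total exits:     #{exits}"
end
```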

maximecb commented 1 year ago

Increasing the call threshold can help the JIT skip over warm-up code, which can be beneficial. I'm curious: what did you increase the call threshold to?

We also have a new option in Ruby 3.3 to start YJIT in a "paused" state and enable it manually after your app is past its boot stage, which could potentially be helpful if you are running Ruby head.
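Usage would look roughly like this (a sketch based on the Ruby 3.3+ API as released, where the runtime call is RubyVM::YJIT.enable; the exact flag and method names on ruby head at the time may differ):

```ruby
# e.g. in a Rails initializer (or any hook) that runs after boot has finished:
if defined?(RubyVM::YJIT) && !RubyVM::YJIT.enabled?
  RubyVM::YJIT.enable
end
```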

noahgibbs commented 1 year ago

Looks like he used a call threshold of 2000.

konalegi commented 1 year ago

@maximecb I added a warm-up period to my synthetic tests, so the tests were already running on a warmed-up app. My call threshold was set to 2000.

konalegi commented 1 year ago

> But I'm not sure why a higher threshold would result in a faster sustained speed for Rack requests. Clearly it does -- that's what you measured. I'm just a little bit confused about why :-)

It would be nice for me to understand as well why that happened :)

maximecb commented 1 year ago

> But I'm not sure why a higher threshold would result in a faster sustained speed for Rack requests. Clearly it does -- that's what you measured. I'm just a little bit confused about why :-)

> It would be nice for me to understand as well why that happened :)

We would need to look at the stats before and after, and see whether anything differs significantly, such as avg_len_in_yjit, or side exits on a particular instruction.
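Something like this hypothetical helper would do it, given two runtime_stats dumps saved as JSON (the file names are just examples):

```ruby
require "json"

before = JSON.parse(File.read("stats_default.json"))
after  = JSON.parse(File.read("stats_threshold_2000.json"))

# Print every numeric stat that changed between the two runs.
(before.keys & after.keys).sort.each do |key|
  b, a = before[key], after[key]
  next unless b.is_a?(Numeric) && a.is_a?(Numeric) && b != a
  puts format("%-28s %14s -> %s", key, b, a)
end
```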