Shopify / ruby

The Ruby Programming Language [mirror]
https://www.ruby-lang.org/

Lobsters is experiencing a memory leak with YJIT #556

Closed: tenderlove closed this issue 2 months ago

tenderlove commented 6 months ago

I met with the person who runs Lobsters at RailsConf. He said they were experiencing a memory leak when using YJIT.

Basically the processes will slowly grow in size until about 12 hours in, at which point they get OOM killed. I had him disable YJIT so we could verify it's actually a YJIT problem. When he disables YJIT the process size plateaus and there is no problem. He said he'll do anything we need to try and debug the problem.

I'm not sure what information we should ask him to get.

tenderlove commented 6 months ago

Also he said he was using Ruby version 3.3.1

maximecb commented 6 months ago

If we could somehow repro this offline it would be really helpful. Potentially with the code of the lobsters benchmark for example. Otherwise any kind of memory profiling to give us a hint where the leak comes from would be helpful.

@tenderworks can you tag the person who reported the problem on this issue?

tenderlove commented 6 months ago

One thing we didn't check: it could be that there is no memory leak, just YJIT keeps compiling stuff and never plateaus. Is there a good way to check whether or not code compilation is plateauing? Or maybe we need to set the maximum memory usage for YJIT on the lobsters deployment.

@pushcx What (if any) configuration options have you set for YJIT? Also, how much memory does the machine have? How many puma processes are you running? And how large are the processes when they get OOM killed?

Thanks!

pushcx commented 6 months ago

Howdy! The code and config are available and, yep, I recently updated to 3.3.1.

I enabled YJIT with 20 puma processes, no config options. The box has 7.8GiB of RAM and no swap. I know with YJIT disabled the processes plateau at 160MB but don't know how large the processes were when they got oomkilled. This box only runs nginx, puma, and cron jobs (db is on the other vps) and htop confirms nothing else is using a significant amount of RAM, so I'd guess they were around 1/20th of RAM.

I don't mind re-enabling YJIT and running it until it crashes again, but as a hobby project the hosting setup is pretty immature and has rounds-to-zero metrics/monitoring, so you'll have to tell me what's interesting to you to gather it.

tenderlove commented 5 months ago

@pushcx as a way to determine whether or not this is a memory leak, would you mind enabling YJIT but putting a cap on the executable memory size?

Basically, change your Ruby options to something like this: RUBYOPT='--yjit --yjit-exec-mem-size=8'

That would limit the executable memory size to 8 megabytes. If memory plateaus, then we know there is no leak and the default executable memory size might be too big for your machine.

Also, one thing I just thought of: if you're booting the process with --yjit, YJIT is possibly trying to compile boot-time code. In future versions of Rails we're going to add an initializer that enables YJIT after boot, so that we don't try to compile any boot-time code.

After we've debugged this excess memory usage, we should investigate adding this initializer to Lobsters.
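For reference, a minimal sketch of what such an initializer could look like, assuming Ruby 3.3+ (which exposes RubyVM::YJIT.enable); the file name and hook are illustrative, not necessarily what Rails will ship:

# config/initializers/enable_yjit.rb (hypothetical)
# Enable YJIT only once the app has finished booting, so boot-time code
# is not compiled.
if defined?(RubyVM::YJIT) && RubyVM::YJIT.respond_to?(:enable)
  Rails.application.config.after_initialize do
    RubyVM::YJIT.enable
  end
end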

pushcx commented 5 months ago

https://github.com/lobsters/lobsters-ansible/commit/988c3ef68c892d3c23bc62b05a630974e0c0efcb

Added and deployed. I'll let you know if we crash in the next day or two.

(For eagle-eyed diff readers - before this commit --yjit was present but Aaron and I commented it out in prod on 2024-05-09. I didn't get around to a commit to disable yjit because I've been coding elsewhere the last couple weeks.)

pushcx commented 5 months ago

I haven’t been able to get a clean result out of this yet. I’ve seen unexpected restarts but an attacker has been running a vulnerability scanner against the site so the restarts have rolled out of the log, which is overwhelmed with Rack backtraces about malformed requests. I’ve banned some IPs and we’ll see how it goes.

tenderlove commented 5 months ago

@pushcx thanks for the feedback, and good luck. I look forward to results! (when you have them of course 😄 )

pushcx commented 5 months ago

We're still getting oomkilled, caught one about 15 minutes ago.

maximecb commented 4 months ago

@pushcx can you tell us how long it took approximately to get oomkilled, and how high the limit is set?

Ideally we'd need some kind of offline repro to troubleshoot this issue. We do have an offline benchmark of lobsters. I'll try running it in a loop to see what happens, but it may not be up to date with the latest lobsters code.

maximecb commented 4 months ago

I tried running our lobsters benchmark in a loop for 24 hours with Ruby 3.3.3 while monitoring memory usage, and could not see any clear memory usage growth trend. That being said, this benchmark is not going to be representative of real web traffic.

@pushcx we could really use your help in terms of diagnosing some kind of local reproduction. Also, are you able to track YJIT runtime stats counters over time? There is a dictionary of counters available through RubyVM::YJIT.runtime_stats.

pushcx commented 4 months ago

Looks like they're happening about 16 hours apart, same as before adding --yjit-exec-mem-size=8. The ansible repo is current with prod config.

Does it matter which puma worker I access it through? I could add a /yjit-stats endpoint and poll it every 5 minutes or something, but that'll hit a random worker.

maximecb commented 4 months ago

It would be best if we could watch the same worker evolve over time, but if not, any data is better than no data.

maximecb commented 4 months ago

Also, thanks for your patience and help in diagnosing this issue 🙏

pushcx commented 4 months ago

Can you think of a good way to pin it to a worker? Or get a worker to log it to a standalone file? I can imagine something heinous like including worker number in the endpoint and polling it until I get the right one, but it seems real likely there's a reasonable way to do this I'm ignorant of.

I'm really happy to help debug and give back to the projects Lobsters is built on. I just hope this isn't some weird "clever" code I wrote prompting a snipe hunt.

maximecb commented 4 months ago

Can you think of a good way to pin it to a worker?

@casperisfine may have better advice here than I would. AFAIK the strategy we've used before is to enable periodic logging for only some fraction of the workers. For logging, I would periodically append the serialized dict/string to a file.

I'm really happy to help debug and give back to the projects Lobsters is built on. I just hope this isn't some weird "clever" code I wrote prompting a snipe hunt.

If you are getting OOM with exec mem size 8 then it does look like there's some problem, and if YJIT has some strange memory leak issue, we'd really like to fix it. Just sorry it's taking so much back and forth to track down the problem.

Can you provide some details as to how strict your memory limits are for OOM kill? Like, do you know how much memory is used without YJIT vs how much memory usage triggers the OOM kill?

casperisfine commented 4 months ago

Can you think of a good way to pin it to a worker?

I don't think there's a way to pin a worker with Puma. What you could do however is have a single worker dump its stats to a file on a regular basis, e.g.:

# puma.rb
on_worker_boot do |worker_index|
  if worker_index == 0
    Thread.new do
      loop do
        sleep 30
        # File.atomic_write comes from ActiveSupport and yields a file to write to
        File.atomic_write("/tmp/yjit-stats.txt") do |f|
          f.write(RubyVM::YJIT.runtime_stats.inspect)
        end
      end
    end
  end
end

Alternatively, you can dump it every X requests (e.g. in a middleware that you enable from that hook), and instead of always overwriting the stats file, you can just send it to the general logs, etc.

Caution: My production experience with Puma is limited, so don't blindly copy this snippet; make sure it works :)
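For illustration, a rough sketch of the "every X requests" alternative as a tiny Rack middleware; the name YjitStatsLogger, the 1000-request interval, and the use of Rails.logger are assumptions, not anything the thread prescribes:

# A Rack middleware that appends YJIT stats to the app log every N requests.
class YjitStatsLogger
  def initialize(app, every: 1000)
    @app = app
    @every = every
    @count = 0 # not synchronized across threads; good enough for rough sampling
  end

  def call(env)
    @count += 1
    if (@count % @every).zero?
      Rails.logger.info("[yjit-stats] #{RubyVM::YJIT.runtime_stats.inspect}")
    end
    @app.call(env)
  end
end

It could then be registered with config.middleware.use YjitStatsLogger in an environment file.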

pushcx commented 4 months ago

Thanks, @casperisfine. I had a little hassle because ActiveSupport was available in dev but not in prod (!?), but I got something working. Here are the first couple of lines; I'll post a longer log to a gist tomorrow.

2024-06-26T20:11:57.941+0000 {:inline_code_size=>35003, :outlined_code_size=>28708, :freed_page_count=>0, :freed_code_size=>0, :live_page_count=>5, :code_region_size=>81920, :yjit_alloc_size=>198696, :vm_insns_count=>280279, :live_iseq_count=>2869, :code_gc_count=>0, :compiled_iseq_entry=>33, :cold_iseq_entry=>0, :compiled_iseq_count=>44, :compiled_blockid_count=>452, :compiled_block_count=>530, :compiled_branch_count=>902, :compile_time_ns=>21425766, :object_shape_count=>359}
2024-06-26T20:16:57.944+0000 {:inline_code_size=>2954384, :outlined_code_size=>2508633, :freed_page_count=>0, :freed_code_size=>0, :live_page_count=>400, :code_region_size=>6553600, :yjit_alloc_size=>16425604, :vm_insns_count=>5343164, :live_iseq_count=>52774, :code_gc_count=>0, :compiled_iseq_entry=>1870, :cold_iseq_entry=>0, :compiled_iseq_count=>3213, :compiled_blockid_count=>29545, :compiled_block_count=>42595, :compiled_branch_count=>81903, :compile_time_ns=>17290725926, :object_shape_count=>3831}
2024-06-26T20:21:57.946+0000 {:inline_code_size=>3670715, :outlined_code_size=>3077024, :freed_page_count=>0, :freed_code_size=>0, :live_page_count=>498, :code_region_size=>8155136, :yjit_alloc_size=>20826096, :vm_insns_count=>6393527, :live_iseq_count=>53445, :code_gc_count=>0, :compiled_iseq_entry=>2283, :cold_iseq_entry=>2, :compiled_iseq_count=>3983, :compiled_blockid_count=>37173, :compiled_block_count=>53106, :compiled_branch_count=>102386, :compile_time_ns=>18036527965, :object_shape_count=>3852}
pushcx commented 4 months ago

OK, ran to an oomkill. Note that the last two lines of this log are the replacement worker 0 after the kill. It seemed unlikely there'd be anything too interesting, but why not.

yjit-stats.txt

casperisfine commented 4 months ago

Thanks for the stats. Here's what I see:

Good things:

Weird / potentially bad things:

casperisfine commented 4 months ago
  • The very first line has very low numbers. Is eager loading enabled?

It is https://github.com/lobsters/lobsters/blob/32930ab2e7180b00b5baa774057e7bfb3f222fab/config/environments/production.rb#L15

But I don't see preload_app! in the puma config: https://github.com/lobsters/lobsters/blob/32930ab2e7180b00b5baa774057e7bfb3f222fab/config/puma.rb

casperisfine commented 4 months ago

Ah it was disabled to support puma phased restarts: https://github.com/lobsters/lobsters/commit/9c80ba726116f429ff9c09db3dd2e8bd2bba0ad6

I'd suggest trying with preload_app! to see if the problem persists (if it's not too much trouble, of course).

pushcx commented 4 months ago

Do you think that enabling it will improve diagnosis of this issue? I'm asking because you clearly know more than I do on this, but I know I'm easily misled by "hey, look, low-hanging fruit" tweaks that distract from the problem at hand. I've been reluctant to deploy code that might create red herrings, including adding --enable-frozen-string-literal (from your PR) to RUBYOPT in the service.

And I guess I'm a little reluctant to enable preload_app! because of how long it takes workers to come up. They dogpile each other, so deploys take the site down for 3-4 minutes. I really like that we only incur that cost on a small minority of deploys. I recently removed my kludge to ~halve startup times by staggering worker starts because it wasn't working correctly and I figured one fewer distraction in puma would be an improvement right now.

(It's sort of ironic how Lobsters hits these inconveniences because it is small enough to run on a single VPS but efficiently serves enough traffic that a number of visitors notice the downtime.)

byroot commented 4 months ago

I'll wait for the others to weigh in. What's weird is that insns are growing but not live iseqs, so my suspicion is that something is compiling an unbounded amount of code but not holding onto it. So I suspect there is a leak in the VM but not YJIT specifically.

So I'm curious whether YJIT is actually the cause of the leak, or if it just makes the OOM happen much sooner because of the higher memory usage baseline.

As for preload_app, on paper it shouldn't impact what we are tracking, but it can save quite a bit of memory, so it would help, though not fix the issue. As for your issues with preload_app, I'd be very interested to dive into them more in the future because I don't quite understand them. But I don't want to derail this thread, and I'm on my way out for the weekend.

maximecb commented 4 months ago

vm_insns_count grows linearly, and quite fast. It starts at 5M, after 1 hour it's at 17M, then 22M, and it reaches 289M right before OOM.

@casperisfine the vm_insns_count is just the count of instructions executed by the interpreter.

It does look like YJIT execution stops pretty early on and YJIT stops compiling new code. Ideally we would need some memory profiling of some sort, or maybe we could look at GC stats?

One of the things that's different with YJIT enabled is that we do pin some GC'd object references in the machine code. We could be keeping more GC roots alive. Though it's not fully clear how that could result in unbounded memory growth over time after compilation is done? @XrXr

tenderlove commented 4 months ago

maybe we could look at GC stats

IMO we should do this. If we could dump GC.stat on an interval, that would tell us whether it's an object leak.
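A minimal sketch of what that could look like, building on the earlier on_worker_boot snippet; the interval and file path are illustrative:

# Inside the worker-0 logging thread: record GC.stat alongside the YJIT counters.
Thread.new do
  loop do
    sleep 300
    entry = {
      time: Time.now.utc,
      yjit: RubyVM::YJIT.runtime_stats,
      gc: GC.stat
    }
    File.open("/tmp/yjit-gc-stats.log", "a") { |f| f.puts(entry.inspect) }
  end
end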

pushcx commented 4 months ago

OK, I just deployed the above commit to add GC.stat to the output. I'll run it to an oomkill and post a new log tomorrow.

maximecb commented 4 months ago

Thanks again Peter 🙏

pushcx commented 4 months ago

Here's the latest, it bounced about 50 minutes ago.

yjit-stats.txt

casperisfine commented 4 months ago

Thanks for the stats. I graphed heap_live_slots (number of live objects):

[screenshot: graph of heap_live_slots over time]

There's definitely some slow growth over time. Not a huge amount though; I'm not sure I'd really qualify it as a leak, to be honest. It looks more like things that are lazily initialized.

In terms of allocated pages, it looks like this:

[screenshot: graph of allocated pages over time]

To really know if YJIT is to blame, we'd need to compare with a worker that doesn't have YJIT on; I very much suspect we'll see the same growth.

maximecb commented 4 months ago

It does seem like heap live slots matches allocated pages, so presumably the growth is coming from Ruby objects 🤔

Also curious to know if we see the same thing without YJIT. Is it possible that YJIT processes just hit OOM sooner than non-YJIT processes would?

casperisfine commented 4 months ago

Is it possible that YJIT processes just hit OOM sooner than non-YJIT processes would?

That's my current theory, yes. I'm not fully certain this is a leak per se; leaks are generally much faster than that. Here the growth is really slow, so it looks more like some infrequently executed code paths that lazily initialize global state. So over time the memory grows a bit, and it probably stabilizes after a long time, or grows so slowly that without YJIT there's enough memory available that the OOM is never reached.

So yeah, I suspect that since YJIT adds a noticeable extra memory baseline on each worker, there's less leeway and OOM is hit after ~18 hours.

Now the next step to debug this, I'm afraid, is ObjectSpace.dump_all. The usual way to go about it is to dump the heap once after boot, then once again after X hours or X requests, and diff the two heap dumps to see what was retained after boot.
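For reference, a rough sketch of taking such a dump; the paths and the helper name dump_heap are illustrative (objspace is the stdlib extension that provides dump_all):

require "objspace"

# Optionally record allocation sites so the dump includes file/line information.
ObjectSpace.trace_object_allocations_start

def dump_heap(path)
  GC.start # collect first, so the dump mostly contains retained objects
  File.open(path, "w") { |io| ObjectSpace.dump_all(output: io) }
end

dump_heap("/tmp/heap-boot.json")    # once, shortly after boot
# ... hours or many thousands of requests later ...
dump_heap("/tmp/heap-bloated.json") # again, then diff the two dumps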

Unfortunately it's a bit tricky to do with Puma, because you want to do that outside of the request cycle, and Puma being threaded, there's not an ideal way. Another option is to just use rbtrace and to manually get two dumps of the same process.

I'll leave https://stevenharman.net/so-we-have-a-memory-leak here which is a pretty good post describing how to dig into memory leaks.

If you need help to dig into these heap dumps, we can help, but keep in mind all strings in the heap are dumped, so if you think about sharing a memory dump, make sure to strip all string values (in case there might be keys or some personal data in the heap).

pushcx commented 4 months ago

At the beginning of this, @tenderlove pointed out an obvious test, which was: does it go away if we disable yjit? (Yes.) In that spirit, another test occurs to me: what if I just turn down the number of workers? If it's a leak, we'll just take a little longer to crash. But maybe there's too many workers, yjit asks for a bit more RAM per worker, they slowly initialize/grow, and then after long enough they exhaust the box. I'm going to try turning down the number of workers first.

As for dump_all, I think I'd replace the contents of the on_worker_boot that you wrote, @casperisfine. Do it once at startup and then in the thread after, say, 6 hours. Along the lines of your note on doing it outside of the request cycle, Aaron and I struggled a bit trying to attach to a running worker and dump it; it didn't finish before the supervisor recycled the worker for being unresponsive.

byroot commented 4 months ago

As for dump_all, I think I'd replace the contents of the on_worker_boot that you wrote

The issue with that is that the dump will happen in the middle of a request (or several), so the heap will be full of objects that aren't retained, which will make the dump hard to make sense of.

pushcx commented 4 months ago

Ah, shoot. Thanks for explaining why that wouldn't work.

I guess after a couple days of this experiment with fewer puma workers I'll try @maximecb's suggestion of disabling yjit and gathering the same stats to compare.

(Experiments this week will take longer. Lobsters traffic is highly correlated to the US work day. There's a national holiday tomorrow and many people will be out on Friday as well.)

pushcx commented 4 months ago

yjit-stats.txt

I think there may even have been two oomkills in the last 20h, which is surprising. Unfortunately another vulnerability scanner has scrolled the relevant messages out of journalctl.

byroot commented 4 months ago

What's the difference with these last stats? Fewer workers?

byroot commented 4 months ago

Nevermind, looking at the stats YJIT seems disabled.

byroot commented 4 months ago

Here's the live slot graphs from these new stats:

[screenshot: graph of live slots over time]

Just skimming it, the number of objects and the growth pattern look very similar to me.

casperisfine commented 4 months ago

It's also interesting to note how in both cases there is a relatively big jump after some indefinite amount of time. I wonder if it's a particular endpoint that is rarely accessed that causes this.

pushcx commented 4 months ago

Uh, that’s surprising; it shouldn’t be disabled. It hasn’t been disabled since before we opened this issue, and the most recent tweak left it enabled as well.

The jumps… we have a couple endpoints with fairly long cache timeouts. StatsController for /stats is the first thing that comes to mind for long cache and querying a fair amount of data. Maybe /users, and it also has a one day cache (though I thought it was shorter, and it probably should be). Otherwise I’d have to grep for use of Rails.cache to see if there’s anything like that.

The other possibility for infrequent oddities is bad bots. Aside from the vulnerability scanners I have griped about a couple times here, the site is a popular choice for people writing their very first bots. (And we still have naive limit/offset pagination, so there are temptingly incrementable numbers in many URLs.)

byroot commented 4 months ago

Uh, that’s surprising, it shouldn’t be disabled

Nah, you are right, I read the logs too fast. It's indeed enabled.

we have a couple endpoints with fairly long cache timeouts

Yeah, but that's something that's tricky to identify like this. The only reliable way is a heap dump, we gotta figure out a proper way to get a clean one. I'll try to ask people who run puma how they do this sort of stuff.

byroot commented 4 months ago

Alright, I think we could try to use Puma's out_of_band hook. Its semantics are a bit weird: basically it is called when a request finishes and no other thread is busy. So under constant load it might never be called, but it's worth a try.
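For illustration, a rough sketch of using that hook to grab the two heap dumps; the thresholds, paths, and request counter are assumptions, not necessarily what the PR linked below does:

# puma.rb
require "objspace"

oob_requests = 0

out_of_band do
  # Runs in the worker after a request completes and no other thread is busy.
  oob_requests += 1
  case oob_requests
  when 100     # baseline dump shortly after boot
    File.open("/tmp/heap-boot.json", "w") { |io| ObjectSpace.dump_all(output: io) }
  when 50_000  # much later: the bloated dump to diff against the baseline
    File.open("/tmp/heap-bloated.json", "w") { |io| ObjectSpace.dump_all(output: io) }
  end
end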

byroot commented 4 months ago

Hoping this will do: https://github.com/lobsters/lobsters/pull/1292

pushcx commented 3 months ago

I'm giving this a try now. Sorry for the delay, things have been a little hectic for me.

maximecb commented 3 months ago

No problem. Thanks for not giving up on it :)

pushcx commented 3 months ago

I haven't rigorously tracked it, but I'm confident in the observation given that it's been a couple of weeks: turning down the number of workers seems to have significantly shortened the time between oomkills. We've gotten them more like every 6-8 hours instead of 11-13. I'm really surprised by this; it's the opposite of what I expected.

I mention it because puma just rebooted after 6 hours without triggering the second dump (heap-bloated.json) from lobsters/lobsters#1292.

byroot commented 3 months ago

That's very interesting indeed. Not unseen territory but rare. Could be explained by the smaller pool making individual workers process more requests, hence be more busy, hence leak / bloat faster.

If the second dump never triggered, you can try lowering the threshold a bit.

pushcx commented 3 months ago

@byroot I had to turn it down a couple times but I got the two heap dumps. Next steps?

byroot commented 3 months ago

Next steps?

Now you can use https://github.com/oxidize-rb/reap to generate a flame graph of each snapshot, which should help visualize what takes memory in your heap and, by comparing the two, what is not present on a fresh worker.

Other tools include https://github.com/Shopify/heap-profiler, which gives you a summary of the object types in the heap.

And then if you find something weird, you can dig into it with https://github.com/jhawthorn/sheap.

You should be able to share the generated flamegraphs, as they don't contain any secrets; I'll be happy to have a look.

You can also share the actual heaps, but you'd first need to use jq or something like that to strip all the "value" fields.
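For reference, an equivalent of that stripping step in Ruby; it assumes the dump is newline-delimited JSON with one object record per line (as ObjectSpace.dump_all produces), and the file names are illustrative:

require "json"

# Drop the "value" field from every record before sharing the dump.
File.open("heap-bloated-stripped.json", "w") do |out|
  File.foreach("heap-bloated.json") do |line|
    record = JSON.parse(line)
    record.delete("value")
    out.puts(JSON.generate(record))
  end
end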