Also he said he was using Ruby version 3.3.1
If we could somehow repro this offline it would be really helpful. Potentially with the code of the lobsters benchmark for example. Otherwise any kind of memory profiling to give us a hint where the leak comes from would be helpful.
@tenderworks can you tag the person who reported the problem on this issue?
One thing we didn't check: it could be that there is no memory leak, just YJIT keeps compiling stuff and never plateaus. Is there a good way to check whether or not code compilation is plateauing? Or maybe we need to set the maximum memory usage for YJIT on the lobsters deployment.
@pushcx What (if any) configuration options have you set for YJIT? Also, how much memory does the machine have? How many puma processes are you running? And how large are the processes when they get OOM killed?
Thanks!
Howdy! The code and config are available and, yep, I recently updated to 3.3.1.
I enabled YJIT with 20 puma processes, no config options. The box has 7.8GiB of RAM and no swap. I know with YJIT disabled the processes plateau at 160MB but don't know how large the processes were when they got oomkilled. This box only runs nginx, puma, and cron jobs (db is on the other vps) and htop confirms nothing else is using a significant amount of RAM, so I'd guess they were around 1/20th of RAM.
I don't mind re-enabling YJIT and running it until it crashes again, but as a hobby project the hosting setup is pretty immature and has rounds-to-zero metrics/monitoring, so you'll have to tell me what's interesting to you to gather it.
@pushcx as a way to determine whether or not this is a memory leak, would you mind enabling YJIT but putting a cap on the executable memory size?
Basically, change your Ruby options to something like this: `RUBYOPT='--yjit --yjit-exec-mem-size=8'`
That would limit the executable memory size to 8 megabytes. If memory plateaus, then we know there is no leak and the default executable memory size might be too big for your machine.
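As a quick sanity check that the option took effect, something like the following could be run from a console in the app (a minimal sketch, not from this thread; `code_region_size` is one of the counters that shows up in the stats dumps later on):

```ruby
# Ruby 3.3+: confirm YJIT is on and see how much executable memory is mapped.
if RubyVM::YJIT.enabled?
  stats = RubyVM::YJIT.runtime_stats
  # With --yjit-exec-mem-size=8 this should plateau at or below roughly 8 MiB.
  puts "code_region_size: #{stats[:code_region_size]} bytes"
else
  puts "YJIT is not enabled"
end
```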
Also, one thing I just thought of is that if you're booting the process with `--yjit`, it means YJIT is possibly trying to compile boot-time code. In future versions of Rails we're going to add an initializer that enables YJIT after boot; that way we don't try to compile any boot-time code.
After we've debugged this excess memory usage, we should investigate adding this initializer to Lobsters.
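For reference, a minimal sketch of what such an initializer could look like in an app today (the file name is hypothetical; `RubyVM::YJIT.enable` exists as of Ruby 3.3):

```ruby
# config/initializers/enable_yjit.rb  (hypothetical file name)
# Start the process without --yjit, then turn YJIT on only after the
# framework has booted, so boot-time code stays interpreted.
Rails.application.config.after_initialize do
  RubyVM::YJIT.enable if defined?(RubyVM::YJIT) && RubyVM::YJIT.respond_to?(:enable)
end
```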
https://github.com/lobsters/lobsters-ansible/commit/988c3ef68c892d3c23bc62b05a630974e0c0efcb
Added and deployed. I'll let you know if we crash in the next day or two.
(For eagle-eyed diff readers: before this commit `--yjit` was present, but Aaron and I commented it out in prod on 2024-05-09. I didn't get around to a commit to disable YJIT because I've been coding elsewhere the last couple weeks.)
I haven’t been able to get a clean result out of this yet. I’ve seen unexpected restarts but an attacker has been running a vulnerability scanner against the site so the restarts have rolled out of the log, which is overwhelmed with Rack backtraces about malformed requests. I’ve banned some IPs and we’ll see how it goes.
@pushcx thanks for the feedback, and good luck. I look forward to results! (when you have them of course 😄 )
We're still getting oomkilled, caught one about 15 minutes ago.
@pushcx can you tell us how long it took approximately to get oomkilled, and how high the limit is set?
Ideally we'd need some kind of offline repro to troubleshoot this issue. We do have an offline benchmark of lobsters. I'll try running it in a loop to see what happens, but it may not be up to date with the latest lobsters code.
I tried running our lobsters benchmark in a loop for 24 hours with Ruby 3.3.3 while monitoring memory usage, and could not see any clear memory usage growth trend. That being said, this benchmark is not going to be representative of real web traffic.
@pushcx we could really use your help in terms of diagnosing some kind of local reproduction. Also, are you able to track YJIT runtime stats counters over time? There is a dictionary of counters available through `RubyVM::YJIT.runtime_stats`.
Looks like they're happening about 16 hours apart, same as before adding `--yjit-exec-mem-size=8`. The ansible repo is current with prod config.
Does it matter which puma worker I access it through? I could add a `/yjit-stats` endpoint and poll it every 5 minutes or something, but that'll hit a random worker.
It would be best if we could watch the same worker evolve over time, but if not, any data is better than no data.
Also, thanks for your patience and help in diagnosing this issue 🙏
Can you think of a good way to pin it to a worker? Or get a worker to log it to a standalone file? I can imagine something heinous like including worker number in the endpoint and polling it until I get the right one, but it seems real likely there's a reasonable way to do this I'm ignorant of.
I'm really happy to help debug and give back to the projects Lobsters is built on. I just hope this isn't some weird "clever" code I wrote prompting a snipe hunt.
Can you think of a good way to pin it to a worker?
@casperisfine may have better advice here than I would. AFAIK the strategy we've used before is to enable periodic logging for only some fraction of the workers. For logging, I would periodically append the serialized dict/string to a file.
I'm really happy to help debug and give back to the projects Lobsters is built on. I just hope this isn't some weird "clever" code I wrote prompting a snipe hunt.
If you are getting OOM with exec mem size 8 then it does look like there's some problem, and if YJIT has some strange memory leak issue, we'd really like to fix it. Just sorry it's taking so much back and forth to track down the problem.
Can you provide some details as to how strict your memory limits are for OOM kill? Like, do you know how much memory is used without YJIT vs how much memory usage triggers the OOM kill?
Can you think of a good way to pin it to a worker?
I don't think there's a way to pin a worker with Puma. What you could do, however, is have a single worker dump its stats to a file on a regular basis, e.g.:
```ruby
# puma.rb
on_worker_boot do |worker_index|
  if worker_index == 0
    Thread.new do
      loop do
        sleep 30
        # File.atomic_write comes from ActiveSupport and expects a block.
        File.atomic_write("/tmp/yjit-stats.txt") do |file|
          file.write(RubyVM::YJIT.runtime_stats.inspect)
        end
      end
    end
  end
end
```
Alternatively, you can dump it every X requests (e.g. in a middleware that you enable from that hook), and instead of always overwriting the stats file, you can just send it to the general logs, etc.
Caution: my production experience with Puma is limited, so don't blindly copy this snippet; make sure it works :)
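For the middleware variant mentioned just above, a rough sketch of what it could look like (purely illustrative; the class name, interval, and use of Rails.logger are assumptions):

```ruby
# Hypothetical Rack middleware that logs YJIT stats every LOG_EVERY requests
# instead of overwriting a single stats file.
class YjitStatsLogger
  LOG_EVERY = 1_000

  def initialize(app)
    @app = app
    @requests = 0  # not strictly thread-safe, but good enough for sampling
  end

  def call(env)
    @requests += 1
    if (@requests % LOG_EVERY).zero?
      Rails.logger.info("yjit_stats=#{RubyVM::YJIT.runtime_stats.inspect}")
    end
    @app.call(env)
  end
end
```

It could then be inserted with `config.middleware.use YjitStatsLogger` or the Rack equivalent.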
Thanks, @casperisfine. I had a little hassle because ActiveSupport was available in dev but not in prod (!?) but I got something working. Here are the first couple of lines; I'll post a longer log to a gist tomorrow.
2024-06-26T20:11:57.941+0000 {:inline_code_size=>35003, :outlined_code_size=>28708, :freed_page_count=>0, :freed_code_size=>0, :live_page_count=>5, :code_region_size=>81920, :yjit_alloc_size=>198696, :vm_insns_count=>280279, :live_iseq_count=>2869, :code_gc_count=>0, :compiled_iseq_entry=>33, :cold_iseq_entry=>0, :compiled_iseq_count=>44, :compiled_blockid_count=>452, :compiled_block_count=>530, :compiled_branch_count=>902, :compile_time_ns=>21425766, :object_shape_count=>359}
2024-06-26T20:16:57.944+0000 {:inline_code_size=>2954384, :outlined_code_size=>2508633, :freed_page_count=>0, :freed_code_size=>0, :live_page_count=>400, :code_region_size=>6553600, :yjit_alloc_size=>16425604, :vm_insns_count=>5343164, :live_iseq_count=>52774, :code_gc_count=>0, :compiled_iseq_entry=>1870, :cold_iseq_entry=>0, :compiled_iseq_count=>3213, :compiled_blockid_count=>29545, :compiled_block_count=>42595, :compiled_branch_count=>81903, :compile_time_ns=>17290725926, :object_shape_count=>3831}
2024-06-26T20:21:57.946+0000 {:inline_code_size=>3670715, :outlined_code_size=>3077024, :freed_page_count=>0, :freed_code_size=>0, :live_page_count=>498, :code_region_size=>8155136, :yjit_alloc_size=>20826096, :vm_insns_count=>6393527, :live_iseq_count=>53445, :code_gc_count=>0, :compiled_iseq_entry=>2283, :cold_iseq_entry=>2, :compiled_iseq_count=>3983, :compiled_blockid_count=>37173, :compiled_block_count=>53106, :compiled_branch_count=>102386, :compile_time_ns=>18036527965, :object_shape_count=>3852}
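Those lines are inspected Ruby hashes; as a rough sketch (the file path is a placeholder), a single counter can be pulled out of such a log for graphing like this:

```ruby
# Extract one counter (here yjit_alloc_size) from the stats log into CSV.
counter = "yjit_alloc_size"
File.foreach("/tmp/yjit-stats.log") do |line|
  timestamp = line[/\A\S+/]
  value = line[/#{counter}=>(\d+)/, 1]
  puts "#{timestamp},#{value}" if value
end
```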
OK, ran to an oomkill. Note that the last two lines of this log are the replacement worker 0 after the kill. It seemed unlikely there'd be anything too interesting, but why not.
Thanks for the stats. Here's what I see.

Good things:

- `inline_code_size` and `outlined_code_size` are essentially stable after ~10 minutes, so YJIT stops allocating executable memory very early on.
- `yjit_alloc_size` is also stable at the same time, so YJIT stops allocating memory for metadata very early too.
- The `compiled_*` metrics are also stable at the same time, so YJIT compilation is essentially done.
- `object_shape_count` only grows a little (3831 -> 4808), so it's unlikely to be the issue.

Weird / potentially bad things:

- `vm_insns_count` grows linearly, and quite fast. It starts at 5M, after 1 hour it's at 17M, then 22M, and it reaches 289M right before the OOM.
- `live_iseq_count` is stable, so I suspect something evals code that then gets garbage collected.
- Neither the code sizes nor `yjit_alloc_size` (metadata kept by YJIT) are growing, so according to YJIT it stopped allocating anything after 10 minutes.
- The very first line has very low numbers. Is eager loading enabled?

But I don't see `preload_app!` in the puma config: https://github.com/lobsters/lobsters/blob/32930ab2e7180b00b5baa774057e7bfb3f222fab/config/puma.rb
Ah it was disabled to support puma phased restarts: https://github.com/lobsters/lobsters/commit/9c80ba726116f429ff9c09db3dd2e8bd2bba0ad6
I'd suggest trying with `preload_app!` to see if the problem persists (if it's not too much trouble, of course).
Do you think that enabling it will improve diagnosis of this issue? I'm asking because you clearly know more than I do on this, but I know I'm easily misled by "hey look, low-hanging fruit" tweaks that distract from the problem at hand. I've been reluctant to deploy code that might create red herrings, including adding `--enable-frozen-string-literals` (from your PR) to `RUBYOPTS` in the service.
And I guess I'm a little reluctant to enable `preload_app!` because of how long it takes workers to come up. They dogpile each other, so deploys take the site down for 3-4 minutes. I really like that we only incur that cost on a small minority of deploys. I recently removed my kludge to ~halve startup times by staggering worker starts because it wasn't working correctly, and I figured one fewer distraction in puma would be an improvement right now.
(It's sort of ironic how Lobsters hits these inconveniences because it is small enough to run on a single VPS but efficiently serves enough traffic that a number of visitors notice the downtime.)
I'll wait for the others to weigh in. What's weird is that insns are growing but not live iseqs, so my suspicion is that something is compiling an unbounded amount of code but not holding onto it. So I suspect there is a leak in the VM, but not in YJIT specifically.
So I'm curious whether YJIT is actually the cause of the leak, or if it just makes the OOM happen much sooner because of the higher memory usage baseline.
As for preload app: on paper it shouldn't impact what we're tracking, but it can save quite a bit of memory, so it would help, though not fix anything. As for your issues with preload app, I'd be very interested to dive into them more in the future because I don't quite understand them. But I don't want to distract this thread, and I'm on my way out for the weekend.
`vm_insns_count` grows linearly, and quite fast. It starts at 5M, after 1 hour it's at 17M, then 22M, and it reaches 289M right before the OOM.
@casperisfine the `vm_insns_count` is just the count of instructions executed by the interpreter.
It does look like YJIT execution stops pretty early on and YJIT stops compiling new code. Ideally we'd need some kind of memory profiling, or maybe we could look at GC stats?
One of the things that's different with YJIT enabled is that we do pin some GC'd object references in the machine code. We could be keeping more GC roots alive. Though it's not fully clear how that could result in unbounded memory growth over time after compilation is done? @XrXr
maybe we could look at GC stats
IMO we should do this. If we could dump `GC.stat` on an interval, that would tell us if it's an object leak.
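For instance, the loop in the earlier puma.rb snippet could be extended to log both hashes side by side; a rough sketch, with the interval, path, and format as assumptions:

```ruby
# Inside the Thread spawned in on_worker_boot (see the earlier snippet):
# append a timestamped line with both YJIT and GC stats to a log file.
require "time"

loop do
  sleep 300  # every 5 minutes
  line = "#{Time.now.utc.iso8601} #{RubyVM::YJIT.runtime_stats.inspect} #{GC.stat.inspect}\n"
  File.open("/tmp/yjit-gc-stats.log", "a") { |f| f.write(line) }
end
```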
OK, I just deployed the above commit to add GC.stat to the output. I'll run it to an oomkill and post a new log tomorrow.
Thanks again Peter 🙏
Here's the latest, it bounced about 50 minutes ago.
Thanks for the stats. I graphed `heap_live_slots` (the number of live objects):
There's definitely some slow growth over time. Not a huge amount though; I'm not sure I'd really qualify it as a leak, to be honest. It looks more like things that are lazily initialized.
In terms of allocated pages, it looks like this:
To really know if YJIT is to blame, we'd need to compare with a worker that doesn't have YJIT on; I very much suspect we'll see the same growth.
It does seem like heap live slots matches allocated pages, so presumably the growth is coming from Ruby objects 🤔
Also curious to know if we see the same thing without YJIT. Is it possible that YJIT processes just hit OOM sooner than non-YJIT processes would?
Is it possible that YJIT processes just hit OOM sooner than non-YJIT processes would?
That's my current theory, yes. I'm not fully certain this is a leak per se; leaks are generally much faster than that. Here the growth is really slow, so it looks more like some infrequently executed code paths that lazily initialize global state. So over time the memory grows a bit, and probably stabilizes after a long time, or grows so slowly that without YJIT there's enough memory available that the OOM is never reached.
So yeah, I suspect that since YJIT adds a noticeable extra memory baseline on each worker, there's less leeway and OOM is hit after ~18 hours.
Now the next step to debug this, I'm afraid, is `ObjectSpace.dump_all`. The usual way to go about it is to dump the heap once after boot, and then once again after X hours or X requests, and diff the two heap dumps to see what was retained after boot.
Unfortunately it's a bit tricky to do with Puma, because you want to do that outside of the request cycle, and Puma being threaded, there's no ideal way. Another option is to just use `rbtrace` and manually get two dumps of the same process.
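For reference, the core of either approach boils down to the same call; a minimal sketch (the paths and helper name are made up):

```ruby
require "objspace"

# Dump every live object as JSON lines. Running GC first means the dump is
# mostly retained objects rather than garbage that hasn't been collected yet.
def dump_heap(path)
  GC.start
  File.open(path, "w") do |file|
    ObjectSpace.dump_all(output: file)
  end
end

# e.g. dump_heap("/tmp/heap-baseline.json") right after boot, then
# dump_heap("/tmp/heap-later.json") hours later, and diff the two dumps.
```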
I'll leave https://stevenharman.net/so-we-have-a-memory-leak here which is a pretty good post describing how to dig into memory leaks.
If you need help to dig into these heap dumps, we can help, but keep in mind all strings in the heap are dumped, so if you think about sharing a memory dump, make sure to strip all string values (in case there might be keys or some personal data in the heap).
At the beginning of this, @tenderlove pointed out an obvious test, which was: does it go away if we disable yjit? (Yes.) In that spirit, another test occurs to me: what if I just turn down the number of workers? If it's a leak, we'll just take a little longer to crash. But maybe there's too many workers, yjit asks for a bit more RAM per worker, they slowly initialize/grow, and then after long enough they exhaust the box. I'm going to try turning down the number of workers first.
As for `dump_all`, I think I'd replace the contents of the `on_worker_boot` that you wrote, @casperisfine: do it once at startup and then in the thread after, say, 6 hours. Along the lines of your note on doing it outside of the request cycle, Aaron and I struggled a bit trying to attach to a running worker and dump it; it didn't finish before the supervisor recycled the worker for being unresponsive.
As for `dump_all`, I think I'd replace the contents of the `on_worker_boot` that you wrote

The issue with that is that the dump will happen in the middle of one or more requests, so the heap will be full of objects that aren't retained, and it will be hard to make use of the dump.
Ah, shoot. Thanks for explaining why that wouldn't work.
I guess after a couple days of this experiment with fewer puma workers I'll try @maximecb's suggestion of disabling yjit and gathering the same stats to compare.
(Experiments this week will take longer. Lobsters traffic is highly correlated to the US work day. There's a national holiday tomorrow and many people will be out on Friday as well.)
I think there may even have been two oomkills in the last 20h, which is surprising. Unfortunately another vulnerability scanner has scrolled off the journalctl messages.
What's the difference with these last stats? Fewer workers?
Nevermind, looking at the stats YJIT seems disabled.
Here are the live-slot graphs from these new stats:
Just skimming them, the number of objects and the growth pattern look very similar to me.
It's also interesting to note how in both cases there is a relatively big jump after some indefinite amount of time. I wonder if it's a particular endpoint that is rarely accessed that causes this.
Uh, that’s surprising; it shouldn’t be disabled. It hasn’t been disabled since before we opened this issue, and it stayed enabled through the most recent tweak as well.
The jumps… we have a couple endpoints with fairly long cache timeouts. StatsController for /stats is the first thing that comes to mind for a long cache and querying a fair amount of data. Maybe /users, and it also has a one-day cache (though I thought it was shorter, and it probably should be). Otherwise I'd have to grep for uses of `Rails.cache` to see if there's anything like that.
The other possibility for infrequent oddities is bad bots. Aside from the vulnerability scanners I have griped about a couple times here, the site is a popular choice for people writing their very first bots. (And we still have naive limit/offset pagination, so there are temptingly incrementable numbers in many URLs.)
Uh, that’s surprising, it shouldn’t be disabled
Nah, you are right, I read the logs too fast. It's indeed enabled.
we have a couple endpoints with fairly long cache timeouts
Yeah, but that's something that's tricky to identify like this. The only reliable way is a heap dump, we gotta figure out a proper way to get a clean one. I'll try to ask people who run puma how they do this sort of stuff.
Alright, I think we could try to use Puma's `out_of_band` hook. Its semantics are a bit weird: basically, it is called when a request finishes and no other thread is busy. So assuming constant load it might never be called, but it's worth a try.
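Something along these lines, purely as a sketch of the idea (the call threshold and dump trigger are arbitrary placeholders; this is not the actual code of the PR that follows):

```ruby
# puma.rb -- illustrative only
require "objspace"

out_of_band_calls = 0
heap_dumped = false

out_of_band do
  # Called when a request finishes and no other thread is busy, so the dump
  # happens outside the request cycle.
  out_of_band_calls += 1
  if !heap_dumped && out_of_band_calls >= 10_000
    heap_dumped = true
    GC.start
    File.open("/tmp/heap-bloated.json", "w") { |f| ObjectSpace.dump_all(output: f) }
  end
end
```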
Hoping this will do: https://github.com/lobsters/lobsters/pull/1292
I'm giving this a try now. Sorry for the delay, things have been a little hectic for me.
No problem. Thanks for not giving up on it :)
I haven't rigorously tracked it, but I'm confident in the observation given that it's been a couple weeks. Turning down the number of workers seems to have significantly shortened the time between oomkills. We've gotten them more like every 6-8 hours instead of 11-13. I'm really surprised by this; it's the opposite of what I expected.
I mention it because puma just rebooted after 6 hours without triggering the second dump (`heap-bloated.json`) from lobsters/lobsters#1292.
That's very interesting indeed. Not unseen territory, but rare. It could be explained by the smaller pool making individual workers process more requests, hence be busier, hence leak/bloat faster.
If the second dump never triggered, you can try lowering the threshold a bit.
@byroot I had to turn it down a couple times but I got the two heap dumps. Next steps?
Next steps?
Now you can use https://github.com/oxidize-rb/reap to generate a flame graph of each snapshot, which should help visualize what takes memory in your heap and, by comparing the two, what is not present in a fresh worker.
Other tools include https://github.com/Shopify/heap-profiler, which gives you a summary of the object types in the heap.
And then if you find something weird, you can dig into it with https://github.com/jhawthorn/sheap.
You should be able to share the generated flamegraphs, as they don't contain any secrets, I'll be happy to have a look.
You can also share the actual heaps, but you'd need to first use `jq` or something like that to strip all the "value" fields.
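If jq isn't at hand, the same stripping can be done with a few lines of Ruby, since each line of the dump is a standalone JSON object (a sketch; the file names are placeholders):

```ruby
require "json"

# Rewrite a heap dump with every "value" field removed, so string contents
# (cookies, keys, personal data) don't end up in the shared copy.
File.open("heap-stripped.json", "w") do |out|
  File.foreach("heap-bloated.json") do |line|
    object = JSON.parse(line)
    object.delete("value")
    out.puts(JSON.generate(object))
  end
end
```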
I met with the person that runs lobsters at RailsConf. He said they were experiencing a memory leak when using YJIT.
Basically the processes will slowly grow in size until about 12 hours in, at which point they get OOM killed. I had him disable YJIT so we could verify it's actually a YJIT problem. When he disables YJIT the process size plateaus and there is no problem. He said he'll do anything we need to try and debug the problem.
I'm not sure what information we should ask him to get.