@byroot generously spent some time pairing with me on this today.
He spotted that Rubocop was getting loaded in production and PR'd a fix, and that the flame graph conspicuously had a big ~20M chunk for I18n. Lobsters doesn't ship translations, but it uses helpers like number_with_delimiter that use I18n. We shifted its load to boot time so it's not a distraction in future heap dumps.
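For anyone curious, here's a minimal sketch of what that boot-time load can look like. The initializer file name is hypothetical, and `I18n.backend.load_translations` is just one way to force the lazy load to happen up front:

```ruby
# config/initializers/i18n_boot_load.rb (hypothetical name)
# Force I18n to load its translation data during boot instead of lazily on
# the first number_with_delimiter call, so the allocation shows up at startup
# rather than as a suspicious ~20M chunk in later heap dumps.
Rails.application.config.after_initialize do
  I18n.backend.load_translations
end
```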
Another thing that @byroot brought up is that Lobsters doesn't use jemalloc. I only learned of it recently and didn't want to change something that may be related while this bug was open. There wasn't really anything conspicuous in the flamegraph or output from heap-profiler, so maybe what Lobsters has been seeing is just memory fragmentation rather than a leak. That might also explain the oddity that oomkills got more frequent when I halved the number of workers. If this is it, I'm sorry to distract/cause concern that there might have been a leak in yjit. Maybe something in this issue points towards a way to detect and alert about this scenario? I dunno, I'm looking for a silver lining.
I'm going to dump heaps again in the next few days with these changes to see if the picture's clearer.
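Roughly the kind of dump I mean, for anyone following along (a sketch using the objspace stdlib; the output path is illustrative):

```ruby
# Dump the full Ruby heap to a file that can be diffed or fed to tools like
# heap-profiler. trace_object_allocations_start is optional and only records
# allocation sites for objects created after it's called.
require "objspace"

ObjectSpace.trace_object_allocations_start
GC.start
File.open("/tmp/heap.dump", "w") do |f|
  ObjectSpace.dump_all(output: f)
end
```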
I deployed jemalloc 48 hours ago and, for the first time since I enabled yjit, we've gone more than 14 hours without an oomkill of puma. I still plan to dump heaps and look at them, but I think this is conclusive evidence that Lobsters had been seeing a memory fragmentation issue rather than a memory leak.
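As a sanity check that the new allocator is really the one in use, here's a rough Linux-only sketch; the /proc read covers an LD_PRELOAD setup, and the RbConfig fallback covers a Ruby compiled --with-jemalloc:

```ruby
# Check whether a jemalloc shared library is mapped into this process
# (the LD_PRELOAD case), falling back to RbConfig for Rubies that were
# compiled against jemalloc directly.
require "rbconfig"

maps = File.exist?("/proc/self/maps") ? File.read("/proc/self/maps") : ""
if maps.include?("jemalloc")
  puts "jemalloc is mapped into this process"
elsif RbConfig::CONFIG["MAINLIBS"].to_s.include?("jemalloc")
  puts "Ruby was built against jemalloc"
else
  puts "jemalloc does not appear to be loaded"
end
```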
I don't know enough about the allocator to know whether my hope from my previous comment, that this situation could be detected and alerted on, is realistic, and I don't know this project's workflow well enough to know whether you'd want to keep this issue open or close it in favor of another. Either way, I think it can probably be closed, with @byroot having diagnosed and fixed the core issue here.
> Maybe something in this issue points towards a way to detect and alert about this scenario?
Unfortunately I don't really see a way to do that. The allocator is a bit of a black box for Ruby. There was some suggestion to make jemalloc the default, like Redis does, but it didn't go anywhere.
Anyway, happy your issues are solved and that it wasn't a YJIT issue.
I met with the person who runs Lobsters at RailsConf. He said they were experiencing a memory leak when using YJIT.
Basically, the processes slowly grow in size until about 12 hours in, at which point they get OOM killed. I had him disable YJIT so we could verify it's actually a YJIT problem. When he disabled YJIT, the process size plateaued and there was no problem. He said he'll do anything we need to try and debug the problem.
I'm not sure what information we should ask him to get.
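One hedged starting point, rather than anything definitive: ask for process RSS sampled over those ~12 hours next to the counters Ruby itself exposes, so RSS growth can be compared with growth the VM actually knows about. The interval, output format, and /proc usage (Linux-only) below are illustrative assumptions:

```ruby
# Rough sketch of a periodic sampler to run inside a Puma worker.
Thread.new do
  loop do
    rss_kb = File.read("/proc/self/status")[/VmRSS:\s+(\d+)/, 1].to_i
    stats = {
      rss_kb: rss_kb,
      heap_live_slots: GC.stat(:heap_live_slots),
      heap_available_slots: GC.stat(:heap_available_slots),
      malloc_increase_bytes: GC.stat(:malloc_increase_bytes),
    }
    if defined?(RubyVM::YJIT) && RubyVM::YJIT.enabled?
      # May be nil or sparse unless Ruby was started with --yjit-stats,
      # depending on the Ruby version.
      stats[:yjit] = RubyVM::YJIT.runtime_stats
    end
    warn stats.inspect
    sleep 60
  end
end
```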