Shopify / ruby

The Ruby Programming Language [mirror]
https://www.ruby-lang.org/
Other
41 stars 14 forks source link

YJIT is indirectly retaining ActiveRecord Object instances #552

Closed intrip closed 9 months ago

intrip commented 9 months ago

We noticed a slow and steady memory leak in our web app, to investigate it I enabled ObjectSpace.trace_object_allocations_start and extracted a few heap dumps, then started analysing them with sheap.

I found a common pattern: A yjit_root node holding references to ActiveRecord methods holding references to instances of ActiveRecord models then holding references to their instance attributes. These objects attributes sums up and creates a slow and steady memory leak:

[<ROOT vm (3194 refs)>,
 <DATA 0x7fc02024b070 yjit_root (11268 refs)>,
 <IMEMO 0x7fbffcfa5370 ment (4 refs)>,
 <IMEMO 0x7fc01d729090 iseq (491 refs)>,
 <OBJECT 0x7fc01d43c9e0 (0x7fc01d43c800) (6 refs)>,
 <OBJECT 0x7fc01d466740 ActiveRecord::Associations::HasManyThroughAssociation (7 refs)>,
 <OBJECT 0x7fbfd60d1108 Entry (6 refs)>,
 <HASH 0x7fc0203c6198 (11 refs)>,
 <OBJECT 0x7fc01c866418 ActiveRecord::Associations::BelongsToPolymorphicAssociation (5 refs)>,
 <OBJECT 0x7fbfd5f74f58 Message (3 refs)>,
 <HASH 0x7fc01c6feda0 (1 refs)>,
 <OBJECT 0x7fbfe5875af8 ActiveRecord::Associations::HasOneAssociation (4 refs)>,
 <OBJECT 0x7fbfd5f622b8 ActionText::EncryptedRichText (3 refs)>,
 <OBJECT 0x7fbfe5871a98 ActiveModel::LazyAttributeSet (6 refs)>,
 <HASH 0x7fc01c6f9300 (5 refs)>,
 <OBJECT 0x7fbfd5f61228 ActionText::Content (6 refs)>,
 <OBJECT 0x7fbfd5f3da08 ActionText::Fragment (1 refs)>,
 <OBJECT 0x7fbfe58c2948 Okra::HTML::Node (1 refs)>,
 <ARRAY 0x7fbfd5f611d8 (3 refs)>,
 <OBJECT 0x7fbfe58c2998 Okra::HTML::Node (3 refs)>,
 <ARRAY 0x7fbfe5871458 (5 refs)>,
 <OBJECT 0x7fbfe58c2c68 Okra::HTML::Node (3 refs)>,
 <ARRAY 0x7fc01f2a0008 (55 refs)>,
 <OBJECT 0x7fbfe58ab8b0 Okra::HTML::Node (3 refs)>,
 <ARRAY 0x7fbfe58ab9a0 (8 refs)>,
 <ARRAY 0x7fbfd5f5c390 (2 refs)>,
 <STRING 0x7fbfd5f5c3b8 "630">]

This is the node data until you reach the ActiveRecord model instance:

irb#1():068> $diff.after.at("0x7fbffcfa5370").data
=>
{"address"=>"0x7fbffcfa5370",
 "type"=>"IMEMO",
 "shape_id"=>0,
 "slot_size"=>40,
 "imemo_type"=>"ment",
 "references"=>["0x7fc01d578660", "0x7fc01d5d6c38", "0x7fc01d729090", "0x7fc01d854050"],
 "file"=>"/usr/local/bundle/ruby/3.3.0/bundler/gems/rails-139c5678aa5d/activerecord/lib/active_record/relation/query_methods.rb",
 "line"=>1595,
 "method"=>"build_arel",
 "generation"=>18,
 "memsize"=>48,
 "flags"=>{"wb_protected"=>true, "old"=>true, "uncollectible"=>true, "marked"=>true}}
irb#1():069> $diff.after.at("0x7fbffcfa5370").data.except("references")
irb#1():073> $diff.after.at("0x7fc01d729090").data.except("references")
=>
{"address"=>"0x7fc01d729090",
 "type"=>"IMEMO",
 "shape_id"=>0,
 "slot_size"=>40,
 "imemo_type"=>"iseq",
 "file"=>"/usr/local/bundle/ruby/3.3.0/bundler/gems/rails-139c5678aa5d/activerecord/lib/active_record/relation/query_methods.rb",
 "generation"=>4,
 "memsize"=>1736,
 "flags"=>{"wb_protected"=>true, "old"=>true, "uncollectible"=>true, "marked"=>true}}=>
{"address"=>"0x7fc01d43c9e0",
 "type"=>"OBJECT",
 "shape_id"=>5394,
 "slot_size"=>160,
 "class"=>"0x7fc01d43c800",
 "embedded"=>true,
 "ivars"=>18,
 "references"=>["0x7fc000f4c000", "0x7fc007f63168", "0x7fc01d43c760", "0x7fbfec306138", "0x7fc01d466740", "0x7fbfd5fd30f8"],
 "file"=>"<internal:kernel>",
 "line"=>48,
 "method"=>"clone",
 "generation"=>31,
 "memsize"=>160,
 "flags"=>{"wb_protected"=>true, "old"=>true, "uncollectible"=>true, "marked"=>true}}
irb#1():072> $diff.after.at("0x7fc01d466740").data
=>
{"address"=>"0x7fc01d466740",
 "type"=>"OBJECT",
 "shape_id"=>5439,
 "slot_size"=>160,
 "class"=>"0x7fbfed5b4ed8",
 "embedded"=>true,
 "ivars"=>12,
 "references"=>["0x7fc01c55e108", "0x7fbfd60d1108", "0x7fbfd5fdad30", "0x7fbfd5fdad08", "0x7fc01d463b80", "0x7fc01d466560", "0x7fc01d466380"],
 "file"=>"/usr/local/bundle/ruby/3.3.0/bundler/gems/rails-139c5678aa5d/activerecord/lib/active_record/associations.rb",
 "line"=>320,
 "method"=>"new",
 "generation"=>31,
 "memsize"=>160,
 "flags"=>{"wb_protected"=>true, "old"=>true, "uncollectible"=>true, "marked"=>true}}
irb#1():073>
irb#1():075> $diff.after.at("0x7fbfd60d1108").data
=>
{"address"=>"0x7fbfd60d1108",
 "type"=>"OBJECT",
 "shape_id"=>6155,
 "slot_size"=>40,
 "class"=>"0x7fbfed48ed88",
 "ivars"=>26,
 "references"=>["0x7fbfdee4b420", "0x7fc01e29a820", "0x7fc0203c6198", "0x7fbfecef2820", "0x7fbfeceef0a8", "0x7fbfecee2038"],
 "file"=>"/usr/local/bundle/ruby/3.3.0/bundler/gems/rails-139c5678aa5d/activerecord/lib/active_record/persistence.rb",
 "line"=>642,
 "method"=>"allocate",
 "generation"=>31,
 "memsize"=>296,
 "flags"=>{"wb_protected"=>true, "old"=>true, "uncollectible"=>true, "marked"=>true}}

The fact that ActiveRecord instances hold references to their data looks good to me, to me looks like the issue is that yjit_root keeps indirectly referencing them, which prevents the objects from being GCed hence the leak. I've tried disabling YJIT on 1 host (hey-default-app-109) and the leak was gone, you can see it from this overnight graph of memory usage in the hosts:

Screenshot 2024-01-19 at 12 01 11

In these hosts we are running YJIT with the default options so just RUBYOPT="--yjit"

Can you help me figure out what's going on here? This issue applies only to a subset of web hosts, which receives specific traffic that creates the "Entry" > "Message" > "Okra" objects.

Happy to tell you any additional info that could help, the next week I'll be OFF so if I won't answer quickly that's the reason 😅. Thanks!

maximecb commented 9 months ago

Hi @intrip,

Thanks for the bug report. I assume this is running on Ruby 3.3.0 ? IIRC we may have had another report of a memory leak.

@XrXr @k0kubun I'm wondering if this might be happening because we root some objects in YJIT when we bake addresses in the machine code, and we now have code GC disabled, so even dead ISEQs could still have code that never gets collected?

maximecb commented 9 months ago

Might be useful if we could come up with a script that reproduces this memory leak to make it easier to debug. Maybe something that creates ActiveRecord models in a loop?

intrip commented 9 months ago

Hi @maximecb,

Thanks for the bug report. I assume this is running on Ruby 3.3.0 ? IIRC we may have had another report of a memory leak.

Yes, it's Ruby 3.3.0.

To workaround it for now I've reduced the YJIT memory to 32Mib and that mitigated greatly the issue, but it dropped a bit the ratio_in_yjit (from 98 to 95). I've also noticed that enabling GCs in YJIT mitigated it but led to worse overall performance so for now I've just reduced the overall memory.

We noticed it now because it became worse recently (not sure if it's related to the Ruby upgrade though) but TBH it's been a while that we were affected by this issue, at least 90+ days.

Might be useful if we could come up with a script that reproduces this memory leak to make it easier to debug. Maybe something that creates ActiveRecord models in a loop?

I'll give it a try but I'm about to leave now and I'll be OFF the next week so it might take some time sorry 😅.

maximecb commented 9 months ago

No problem. I wish we could fix the issue faster :)

casperisfine commented 9 months ago

@intrip do you make any use of Relation#extending? Or other ways to extend a module in relations?

casperisfine commented 9 months ago

Also, I think if you were to strip all the T_STRING value fields, you should be able to share the dump without leaking any personal data. Either publicly or in direct to one of us, that would be helpful to let us dig in the heap dump.

casperisfine commented 9 months ago

Might be useful if we could come up with a script that reproduces this memory leak to make it easier to debug.

I suspect it's the same bug than https://bugs.ruby-lang.org/issues/19436, but with YJIT (and worse because the references accumulate instead of only retaining the last one).

maximecb commented 9 months ago

I just remembered that Aaron @tenderworks had mentioned a memory leak with YJIT 3.3 and Mastodon. This may be related. Would be good if we could get more info about the Mastodon issue.

k0kubun commented 9 months ago

I'm wondering if this might be happening because we root some objects in YJIT when we bake addresses in the machine code, and we now have code GC disabled, so even dead ISEQs could still have code that never gets collected?

They've been running 3.3.0-preview3 since November https://github.com/Shopify/ruby/issues/548#issuecomment-1823152356, which had code GC disabled unlike preview2. So I'm not sure if it explains the situation.

Also, Blocks are deallocated when the ISEQ is GCed (along with their gc_obj_offsets), so "dead" ISEQs would not mark baked objects whether Code GC is disabled or not.

Anyway, a local repro would be really helpful for identifying the cause.

casperisfine commented 9 months ago

I suspect it's the same bug than https://bugs.ruby-lang.org/issues/19436, but with YJIT

Alright, that's what I suspected:

module Foo
  def bar
  end
end

def call_bar(obj)
  # Here the call cache we'll keep a ref on the method_entry
  # which then keep a ref on the singleton_class, making that
  # instance immortal until the method is called again with
  # another instance.

  # The reference chain is IMEMO(callcache) -> IMEMO(ment) -> ICLASS -> CLASS(singleton) -> OBJECT
  obj.bar
end

obj = Object.new
obj.extend(Foo)

5.times do
  call_bar(obj)
end

id = obj.object_id
obj = nil

4.times { GC.start }
begin
  ObjectSpace._id2ref(id)
  puts "MEMORY LEAK!!!"
  exit!(1)
rescue RangeError
  puts "no leak"
  exit!(0)
end
$ ruby -v /tmp/repro.rb
ruby 3.3.0 (2023-12-25 revision 5124f9ac75) [arm64-darwin23]
no leak
$ ruby -v /tmp/repro.rb
ruby 3.3.0 (2023-12-25 revision 5124f9ac75) [arm64-darwin23]
no leak
$ ruby -v /tmp/repro.rb
ruby 3.3.0 (2023-12-25 revision 5124f9ac75) [arm64-darwin23]
no leak
$ ruby --yjit --yjit-call-threshold=1 -v /tmp/repro.rb
ruby 3.3.0 (2023-12-25 revision 5124f9ac75) +YJIT [arm64-darwin23]
MEMORY LEAK!!!
$ ruby --yjit --yjit-call-threshold=1 -v /tmp/repro.rb
ruby 3.3.0 (2023-12-25 revision 5124f9ac75) +YJIT [arm64-darwin23]
MEMORY LEAK!!!
$ ruby --yjit --yjit-call-threshold=1 -v /tmp/repro.rb
ruby 3.3.0 (2023-12-25 revision 5124f9ac75) +YJIT [arm64-darwin23]
MEMORY LEAK!!!

So @intrip is very likely using Relation#extending, which is a giant footgun and we really should deprecate that API in Rails (Ref: https://github.com/rails/rails/pull/47314)

maximecb commented 9 months ago

Following up on the discussions we had in meetings. Jean said that call caches must be weak references. @xrxr do we root objects / CMEs when compiling Ruby calls in YJIT? If so, could we make them weak references?

XrXr commented 9 months ago

We do root CMEs because we root everything we refer to from generated code. Making them weak references will take some effort, though. Even with a constant pool we can't directly make use of the current mechanism that sets the VALUE to Qundef when the referent dies, because we have a GC yield point before writing out CMEs. If the CME gets collected during the GC we end up pushing a corrupted frame.

byroot commented 9 months ago

Would not compiling singleton instance methods be an easy workaround? Their perf already suck on the interpreter, so doubt it would impact YJIT performance noticeably.

And these methods are very likely not to survive long anyway, so compiling them is a bit of a waste.

XrXr commented 9 months ago

Would not compiling singleton instance methods be an easy workaround? Their perf already suck on the interpreter, so doubt it would impact YJIT performance noticeably.

Should work if perf impact is indeed negligible. Hard to tell without experimenting, though. I guess we need to stop compiling calls on all objects with a singleton class, because we root the singleton class itself to do the known class guard, so it doesn't really matter where the method is. Here's a modified version of your script:

def call_bar(obj)
  obj.itself
end

obj = Object.new
p obj.singleton_class

5.times do
  call_bar(obj)
end

id = obj.object_id
obj = nil

4.times { GC.start }
begin
  ObjectSpace._id2ref(id)
  puts "MEMORY LEAK!!!"
  exit!(1)
rescue RangeError
  puts "no leak"
  exit!(0)
end
$ ruby --yjit-call-threshold=1 test.rb
#<Class:#<Object:0x000000010313c0f0>>
MEMORY LEAK!!!
$ ruby test.rb
#<Class:#<Object:0x00000001029cb208>>
no leak

So it ends up being quite a drastic change. It'd also make a lot of existing tests useless since main has a singleton class :/

byroot commented 9 months ago

It'd also make a lot of existing tests useless since main has a singleton class :/

main could likely be handled differently, just like Class and Module (given it's legit and common for them to have a singleton_class).

Also I'm merely suggesting this has a quick fix that would be backportable to 3.3, not saying it's an ideal solution.

byroot commented 9 months ago

because we root the singleton class itself to do the known class guard

Oh. So that's worse than I thought :/

maximecb commented 9 months ago

@XrXr can you explain in more detail why it's problematic to make the singleton class refs weak references?

In theory, if a singleton class is collected, that means it's dead, and it can't be called anymore, no?

If the machine code gets patched and the singleton class ref is replaced by Qnil using an atomic write, that will just cause chain guards to fail, which is to be expected since the class should already be dead by that point?

XrXr commented 9 months ago

The GC can't patch the generated code because there is no write permission, so we need a constant pool.

eregon commented 9 months ago

Maybe for YJIT, like for TruffleRuby, the easier way to deal with this would be to make the Class#attached_object a weak reference? This was talked about on https://bugs.ruby-lang.org/issues/19436. For TruffleRuby I think a weak reference in inline caches would always be a too big overhead (extra memory and extra indirection) to be acceptable performance-wise.

byroot commented 9 months ago

make the Class#attached_object a weak reference? This was talked about on https://bugs.ruby-lang.org/issues/19436.

I think it's something that would be positive, regardless of this YJIT problem. But the compiled code would still leak T_CLASS instances, which would still wouldn't be great :/

eregon commented 9 months ago

But the compiled code would still leak T_CLASS instances, which would still wouldn't be great :/

Ah, right.

Maybe Code GC or so could remove such compiled code once it notices it refers to a singleton T_CLASS with its instance GC'd?

Code GC would also be useful for cases like test suites which might create many anonymous modules/classes which don't live long but yet might be used a few times and some methods on them might end up being compiled.


IIRC the JVM considers references (object constants) embedded in JITed code to all be weak, i.e., if one of these referenced objects would GC then that code would be removed. (and if a thread is currently executing that JITed code it'd deopt to the interpreter I guess, which might work because the thread might already be in safepoint for this part of GC). I think on TruffleRuby specifically this is not enough because we hold on the Ruby Class in the Truffle AST (inline caches are just part of the Truffle AST).

byroot commented 9 months ago

Maybe Code GC or so could remove such compiled code once it notices it refers to a singleton T_CLASS with its instance GC'd?

For now YJIT's code GC just drop all the compiled code, so the cost/benefit isn't great for most apps, hence why it's disabled by default in 3.3. If the code GC becomes more advanced in the future (e.g. capable of freeing only the dead code), then yeah maybe.

As Alan mentioned, short/medium term a constant pool is more likely to happen.

That's why I suggested just not compiling anything that deals with a singleton class (unless the attached object is main of a Module), as an instance having a singleton class strongly suggest that whatever code is compiled for it will soon be dead anyway, and probably won't be called enough to pay for the compilation. That also pretty much automatically prevent dynamic code generation issues, such as OpenStruct.

Of course that's an heuristic, and there will be counter examples of singleton objects that are stable, but it should hold reasonably well.

maximecb commented 9 months ago

The GC can't patch the generated code because there is no write permission, so we need a constant pool.

We do already change permissions on the generated code for movable GC objects though, no?

Otherwise, if we did have a constant pool, would you foresee any challenges with respect to making the reference weak references for chain guards?

XrXr commented 9 months ago

We do already change permissions on the generated code for movable GC objects though, no?

We do, but the weak reference part of the GC is different. We get a callback for moving objects, and that allows us to flip permission for the places we write. For weak references there is no callback, and the GC can't reasonably distinguish between a weak reference in generated code versus one from somewhere else that it can write to without issues. Unless we want to flip permission on everything every sweep step, which doesn't sound great.

Otherwise, if we did have a constant pool, would you foresee any challenges with respect to making the reference weak references for chain guards?

The problem I mentioned in https://github.com/Shopify/ruby/issues/552#issuecomment-1904748960 could be worked around. We'd also need to rework how we track CME validity to make things fully weak.

maximecb commented 9 months ago

For weak references there is no callback

It might be possible to implement such a callback?

@xrxr You understand the situation better than I do. What is your intuition as to what is the best solution to solve this memory leak problem?

XrXr commented 9 months ago

We want to have something we could backport for 3.3, so I'll try and benchmark rejecting calls through singleton classes. In the long term we'll of course go with some sort of weak reference based solution.

XrXr commented 9 months ago

Benchmark results:

yjit-pre: ruby 3.4.0dev (2024-01-23T19:00:05Z master 557b69e83b) +YJIT [arm64-darwin23]
yjit-post: ruby 3.4.0dev (2024-01-23T19:33:36Z yjit-stop-singleto.. af9a00414b) +YJIT [arm64-darwin23]

--------------  -------------  ----------  --------------  ----------  -----------------  ------------------
bench           yjit-pre (ms)  stddev (%)  yjit-post (ms)  stddev (%)  yjit-post 1st itr  yjit-pre/yjit-post
activerecord    15.9           1.2         17.8            0.7         0.94               0.89              
chunky-png      357.2          1.3         369.4           0.5         0.95               0.97              
erubi-rails     558.8          0.5         590.0           0.6         0.95               0.95              
hexapdf         880.6          1.5         912.7           1.4         1.00               0.96              
liquid-c        23.1           1.6         23.7            1.0         1.04               0.98              
liquid-compile  25.5           5.7         26.9            0.8         0.96               0.95              
liquid-render   34.8           1.7         36.1            1.0         0.97               0.96              
lobsters        397.8          1.9         420.0           2.4         1.09               0.95              
mail            49.1           0.8         48.5            0.8         0.99               1.01              
psych-load      857.1          0.6         855.6           0.4         1.00               1.00              
railsbench      598.0          1.5         644.0           1.8         0.96               0.93              
rubocop         58.2           4.6         58.5            6.0         1.00               0.99              
ruby-lsp        51.9           1.1         53.2            3.1         1.08               0.97              
sequel          27.9           0.8         31.4            6.6         0.87               0.89              
--------------  -------------  ----------  --------------  ----------  -----------------  ------------------

Not horrible, but we definitely take a hit. Is it acceptable?

maximecb commented 9 months ago

Yikes, that is actually a pretty significant performance impact. I would prefer to avoid it.

XrXr commented 9 months ago

Ah, there seems to be no perf hit if I amend the logic to accept class methods. Classes and modules tend to live forever anyways so this should still address the leak:

yjit-pre: ruby 3.4.0dev (2024-01-23T19:00:05Z master 557b69e83b) +YJIT [arm64-darwin23]
yjit-post: ruby 3.4.0dev (2024-01-23T19:33:36Z yjit-stop-singleto.. af9a00414b) +YJIT [arm64-darwin23]

--------------  -------------  ----------  --------------  ----------  -----------------  ------------------
bench           yjit-pre (ms)  stddev (%)  yjit-post (ms)  stddev (%)  yjit-post 1st itr  yjit-pre/yjit-post
activerecord    15.7           0.9         15.6            0.8         1.02               1.01              
chunky-png      360.7          4.0         372.7           0.3         0.95               0.97              
erubi-rails     559.8          0.8         548.0           0.6         1.02               1.02              
hexapdf         872.4          1.0         877.0           0.8         1.00               0.99              
liquid-c        23.8           1.5         23.9            3.5         1.21               1.00              
liquid-compile  25.5           5.7         25.8            5.6         0.99               0.99              
liquid-render   34.8           1.7         36.0            2.4         0.99               0.97              
lobsters        393.4          2.5         402.2           2.3         1.04               0.98              
mail            51.4           0.5         51.2            0.9         0.99               1.00              
psych-load      844.7          0.0         838.3           0.1         1.01               1.01              
railsbench      584.1          0.2         578.8           0.5         1.02               1.01              
rubocop         58.6           4.6         58.3            4.4         1.01               1.00              
ruby-lsp        52.0           1.0         51.8            0.9         1.01               1.00              
sequel          28.1           1.1         28.4            1.7         0.99               0.99              
--------------  -------------  ----------  --------------  ----------  -----------------  ------------------
byroot commented 9 months ago

Yes class methods should definitely continue to be compiled. Does your patch also compile methods on main? I'm surprised to see it have an impact on benchmarks at all

XrXr commented 9 months ago

main is just the top level self on the entry point script passed to ruby, so it really shouldn't impact performance. The impact in the chart is just noise on my laptop. On the benchmarking machine it's more clear that there is not much impact, except maybe on erubi-rails.

casperisfine commented 9 months ago

The upstream fix has been merged and is marked for backporting. I think we can close.

intrip commented 9 months ago

Thank you all for the fix! Limiting YJIT memory worked around it for now, and our ruby build tool doesn't support -dev yet (only rc/preview so far) so I'll wait for a ruby bump to test the fix 😅.

@intrip do you make any use of Relation#extending? Or other ways to extend a module in relations?

We use it but rarely: I only found one usage in the app codebase, but it doesn't match with the relation referenced in the heap dump.

casperisfine commented 9 months ago

Interesting, there might be something else causing relations to have a singleton class. No has_many extend: or has_many do either? I forgot to single these out, but the rely on extending too.

Also did you look in your gems? In https://github.com/mastodon/mastodon/pull/28013 I found that kaminari makes heavy use of it.

intrip commented 9 months ago

Interesting, there might be something else causing relations to have a singleton class. No has_many extend: or has_many do either? I forgot to single these out, but the rely on extending too.

Oh good point. I found one more relation so now the total is 2, this one uses has_many extend: but again doesn't match with the Entry > Message path in the heap dump 🤔.

Also did you look in your gems? In mastodon/mastodon#28013 I found that kaminari makes heavy use of it.

I haven't found any outlier in the installed gems via ag '(\.extending|has_.*extend:|scope :.*do\n)' $(bundle show --paths | grep -v rails). For pagination, we rely on https://github.com/basecamp/geared_pagination which shouldn't be affected.

casperisfine commented 9 months ago

kk. I wish there was a way to check if an object has a singleton class to debug this further. Unless I'm missing something I think a C ext is needed for that today.

Would be interesting to monkey patch Relation in your app to figure out where this is coming from. But ultimately it's mostly for intellectual curiosity because Alan's patch totally fix this concern, so no big deal.