Closed by tycooon 4 years ago
Is it reproducible?
Not sure, this happened in production only 1 time in like 7 days after we upgraded to ruby 2.7. I guess there was some specific response from the server (which I don't have of course).
I don't really know where to start in solving the problem, if it is a problem. When multiple threads are running and a segfault occurs on another thread (in a C library, for instance, and hence not blocked by the GIL), the control frame information is usually inaccurate.
As it stands, the control frame points to a function assigning a field in an FFI::Struct, which seems fine.
I don't even think there would be an issue if different fields in that struct were being assigned in parallel.
However, Ruby 2.7 introduced compacting garbage collection, which moves objects in memory and seems dangerous when dealing with C code. So it might be a race condition caused by that? The referenced code is probably not directly to blame, even if the fault occurred during that operation.
Without some way of reproducing the event I don't think we are going to solve it and my guess is when it happens next it'll point to a totally different location. If it does happen again though, please update this issue as we also use this in production :)
OK, sure. Thanks for your feedback!
Unfortunately, we have had 2 more segfaults since last time, and both happened in http-parser ☹️ Please take a look:
https://pastebin.com/raw/LgrLkHUv https://pastebin.com/raw/ittHt6B3
@larskanis any ideas? I really have no idea; the segfaults seem to be occurring when writing to FFI structs. My only thought is that the compacting GC in Ruby 2.7.0 is moving things, and that's a total stab in the dark.
Compacting GC doesn't start unless explicitly triggered. @tycooon do you call GC.compact somewhere? Unfortunately, none of the stack traces shows ffi_c.so. So possibly this could be a Ruby core bug. I don't have an idea so far without a reproducible script.
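One way to answer the "do you call GC.compact somewhere?" question from inside a running process is to look at the compaction counter in `GC.stat`. This is only a sketch: it assumes the running Ruby exposes a `:compact_count` key (CRuby 2.7 and later should), and the helper name `compaction_runs` is made up for illustration.

```ruby
# Returns how many times this process has compacted its heap, or nil when
# the running Ruby does not report a :compact_count key in GC.stat.
# `compaction_runs` is a hypothetical helper name, not part of any library.
def compaction_runs
  GC.stat.fetch(:compact_count, nil)
end

runs = compaction_runs
if runs.nil?
  puts "This Ruby does not report compact_count"
elsif runs.zero?
  puts "No compaction has run in this process"
else
  puts "Heap has been compacted #{runs} time(s)"
end
```

A non-zero value in a process that later segfaults would at least suggest that something in the app or its gems triggers compaction.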
It turns out we were experimenting with adding the following initializer to our Rails project:
Thread.new do
  loop do
    result = GC.compact
    Rails.logger.debug "#{Process.pid} Run GC.compact with result: #{result.inspect}"
    sleep 60
  end
end
And all 3 segfaults that we got so far happened right after we deployed that version. So I think the problem was exactly us calling GC.compact. We currently don't have this code in the project, so I guess there should not be more segfaults.
I guess I should report this problem to the Ruby issue tracker, but I have no idea how to make it reproducible.
The related Ruby issue describes a lot of details about compacting GC and has a known issue regarding C extensions. Possibly ffi is affected by it. Using GC.verify_compaction_references instead of GC.compact should trigger this kind of bug more reliably.
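A rough sketch of how such a reproduction attempt might look. The workload here is a placeholder that only allocates plain Ruby strings; to exercise the actual bug, it would presumably need to be replaced with the FFI struct writes from the backtraces. The method name `stress_compaction` and its `rounds:` parameter are made up for illustration, and the guard assumes that Rubies without compaction support either lack the method or raise `NotImplementedError`.

```ruby
# Runs a few rounds of "allocate, punch holes, verify compaction".
# Returns the number of rounds completed, or nil when the running Ruby
# does not support GC.verify_compaction_references.
# `stress_compaction` is a hypothetical helper, not part of ffi or Ruby.
def stress_compaction(rounds: 3)
  return nil unless GC.respond_to?(:verify_compaction_references)

  rounds.times do
    # Placeholder workload (assumption): replace with the code under suspicion.
    live = Array.new(10_000) { |i| "object-#{i}" }
    # Drop every other object so the heap has holes to compact into.
    live = live.each_slice(2).map(&:first)

    # Compacts the heap and checks that no reference to a moved object is left.
    GC.verify_compaction_references

    # Objects that survived must still be readable after being moved.
    raise "object corrupted after compaction" unless live.first == "object-0"
  end
  rounds
rescue NotImplementedError
  nil
end

p stress_compaction(rounds: 3)
```

If a C extension holds a stale pointer to a moved object, a run like this is expected to crash or abort much sooner than periodic GC.compact calls in production would.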
Thanks for the explanation! Closing this for now 🎉
ffi-1.12.2 fixes this issue. See https://github.com/ffi/ffi/issues/742.
@tycooon You can re-add GC.compact to your project!
Great news, thanks! 🎉
Here is a backtrace with all app-specific information removed.