arthurnn / memcached

A Ruby interface to the libmemcached C client
Academic Free License v3.0

random jumps in private memory usage with 0.19.x and passenger #19

Closed jnewland closed 14 years ago

jnewland commented 14 years ago

We're seeing some very strange memory characteristics with the memcached 0.19.x releases under Passenger. The private memory of Passenger processes randomly jumps in large increments (dozens of megabytes), while the total VMSize grows only slightly.

Here's a graph showing how rolling back to 0.18.0 drastically reduced memory usage of our Passenger processes.

annotated memory usage chart

(The rise and fall of memory usage you see for the app servers running 0.19.2 in that graph is the result of a reaper script we use to kill Passenger processes that leak too much memory; a rough sketch of that approach follows.)
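For context, something along these lines is what such a reaper does. This is purely illustrative, not the actual script from this thread; the threshold, process match, and signal choice are all assumptions you'd adjust for your own setup.

```ruby
#!/usr/bin/env ruby
# Illustrative memory-reaper sketch (not the script referenced above):
# kill Passenger workers whose resident set exceeds a threshold.
LIMIT_KB = 300 * 1024 # placeholder threshold: 300 MB

`ps -eo pid,rss,command`.each_line do |line|
  pid, rss, command = line.split(' ', 3)
  # Passenger 2.x-era workers show up as "Rails: /path/to/app"; adjust the match for your setup.
  next unless command.to_s.include?('Rails: ')
  if rss.to_i > LIMIT_KB
    # Signal choice is an assumption; Passenger spawns a replacement worker.
    Process.kill('TERM', pid.to_i)
  end
end
```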

I'm wondering if anyone is seeing similar behavior, or if this is specific to my environment. We're running:

I'm planning to take a hard look at this today, as I'd love to get the retry behavior included in 0.19.3. Initial tests of 0.19.3 show that it has the same strange memory characteristics. Bummer.

A couple questions:

Anyway, I'll be staring at this for most of the rest of the day, trying to track it down. :) Thanks for any help/insight you might have.

ghost commented 14 years ago

I don't see anything in the Valgrind runs, even with COW turned on. Can you grab the latest master and try running "rake valgrind" in your production environment? I suspect some weird interaction with your app code.

I am using whatever SASL headers come with OS X Leopard.

evan commented 14 years ago

I may have figured it out. When we had show_backtraces turned on, we had a similar leak in 0.19. I don't know the root cause of that, but there is no reason to run show_backtraces in production in the first place; turning that off improves performance and stops the leak.
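For anyone landing here later, a minimal sketch of what that looks like when constructing the client. The option name comes from this thread; the server address is a placeholder.

```ruby
require 'memcached'

# Minimal sketch (server address is a placeholder): build the client with
# exception backtraces disabled, per the advice above. Backtraces are handy
# while debugging but cost time, and apparently leaked memory under 0.19.x.
CACHE = Memcached.new(
  'localhost:11211',
  :show_backtraces => false
)
```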

jnewland commented 14 years ago

That was it! I see no signs of a leak after an hour in production with show_backtraces turned off on 0.20.1. The default exceptions_to_retry/exception_retry_limit settings seem to be doing the trick too; we haven't had a single occurrence of a memcached timeout or the dreaded 'operation in progress' error bubbling up to cause a 500 since rolling this out. Thanks Evan, I owe you several beers. :)
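A sketch of setting those retry options explicitly, for reference. The option names come from this thread; the retry limit and the exception class listed are illustrative placeholders, not the gem's actual defaults.

```ruby
require 'memcached'

# Sketch: transient failures get retried inside the client instead of
# bubbling up to the application as 500s. Values are illustrative only.
CACHE = Memcached.new(
  'localhost:11211',                                       # placeholder address
  :show_backtraces       => false,
  :exception_retry_limit => 5,                             # placeholder count
  :exceptions_to_retry   => [Memcached::ATimeoutOccurred]  # illustrative list
)

begin
  value = CACHE.get('some_key')
rescue Memcached::NotFound
  value = nil # a cache miss is expected, not an error
end
```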

evan commented 14 years ago

Hooray!