colinmollenhour / Cm_Cache_Backend_Redis

A Zend_Cache backend for Redis with full support for tags (works great with Magento)

Added 'retry_reads_on_master' option for when read slaves are used #134

Closed davidalger closed 5 years ago

davidalger commented 5 years ago

I recently supported the infrastructure launch on a Magento 2 site which sees high enough traffic levels to require the use of a read slave for the object cache. Without the slave, cache reads will saturate the 1Gbps interface between the web servers and the master Redis instance.

During pre-launch load testing, everything worked great and we reached proper traffic levels by simply adding a read slave to each of the web servers and configuring the load_from_slave option on the cache backend. This eliminated the network interface as a bottleneck as well as lowered overall application response times by around 50-60 ms on average (because there is practically no latency to redis on cache hits now). But this was without a long-list of heavy hitting integrations that result in lots of cache purging activity for short periods of time (integrations which are out of our control).

Once live we kept experiencing instances of application responses spiking. They were seemingly very random and didn't always correlate to said integrations, but did seem to be either timing wise around these integrations slamming the API and/or admins editing products in the backend.

When a cache flush occurs, there is always a cache flood, especially under high traffic, and the net effect is compounded when read slaves are used. Under these circumstances the performance degradation was severe enough to grind things to a halt a couple of times during a flash sale. The impact on Redis replication had a toppling effect, resulting in floods of LOADING Redis is loading the dataset in memory errors as the slaves kept attempting full re-synchronizations, with the cache floods filling replication buffers over 4 GB in size.

The solution to this problem was to add support for a retry_reads_on_master setting, which makes it configurable whether a read that comes back empty from the slave is retried on the master. This has been in production for 5 days now and the error has not reoccurred.
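The idea behind the option can be sketched roughly as follows. This is a minimal illustration using stub objects in place of real Credis connections; the class and function names are hypothetical, not the backend's actual internals:

```php
<?php
// Stub standing in for a Redis connection; hGet() returns FALSE on a miss,
// mirroring how an empty read from a lagging slave looks to the backend.
class StubRedis {
    private $data;
    public function __construct(array $data) { $this->data = $data; }
    public function hGet($key, $field) {
        return isset($this->data[$key][$field]) ? $this->data[$key][$field] : false;
    }
}

// Read from the slave first; if the reply is empty and retries are enabled,
// fall back to the master, since the key may simply not have replicated yet.
function loadWithRetry(StubRedis $slave, StubRedis $master, $key, $field, $retryOnMaster) {
    $value = $slave->hGet($key, $field);
    if ($value === false && $retryOnMaster) {
        $value = $master->hGet($key, $field);
    }
    return $value;
}

$master = new StubRedis(['zc:k:foo' => ['d' => 'cached-data']]);
$slave  = new StubRedis([]); // slave lagging behind: key not replicated yet

echo loadWithRetry($slave, $master, 'zc:k:foo', 'd', true); // prints "cached-data"
```

With the option disabled, the same call would return FALSE and trigger a cache miss, which is exactly what compounds the flood during a resync.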

Below is a screenshot of what happens to the object cache during these steady and repeated cache floods. The giant spike in the purple graph is not an increase in data, but a replication buffer filling up to over 4 GB of writes (from the cache floods).

[Screenshot: Magento overview dashboard created via Terraform/Datadog, 2018-09-21 08:49:52]

Simply retrying empty read responses on the master brings the impact of a cache flood back to where it was when only a master Redis instance was used, and it allowed us to bring the replication buffer back down to 768 MB (it could probably go even lower without issue, but we're stable now, so I'm not changing it).

These are the settings we used on the replicating redis instances:

repl-diskless-sync yes
repl-backlog-size 256mb
client-output-buffer-limit slave 768mb 768mb 0
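For context on the last line, the `client-output-buffer-limit` directive takes a hard limit, a soft limit, and a soft-limit window, per the redis.conf documentation (the values below are the ones from this deployment):

```
# client-output-buffer-limit <class> <hard limit> <soft limit> <soft seconds>
# A replica connection is closed immediately once its output buffer exceeds
# the hard limit, or if it stays above the soft limit for <soft seconds>.
# Setting a limit to 0 disables that check.
client-output-buffer-limit slave 768mb 768mb 0
```

With hard and soft limits equal and the window at 0, this effectively acts as a single 768 MB cap on the replication output buffer.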

Cache backend configuration in the env.php file now looks like this:

  'cache' => 
  array (
    'frontend' => 
    array (
      'default' => 
      array (
        'backend' => 'Cm_Cache_Backend_Redis',
        'backend_options' => 
        array (
          'server' => '940365-masterdb',
          'port' => '6379',
          'load_from_slave' => 'tcp://127.0.0.1:6379',
          'retry_reads_on_master' => '1',
        ),
      ),
    ),
  ),

Let me know if there are questions. I'd love to see this merged in so it can trickle its way down into the core and allow us to eventually remove our package override in the composer.json for this site. :)

colinmollenhour commented 5 years ago

Hi David, have you tried this? https://gist.github.com/colinmollenhour/7a91c4a92ccfd2adaeb6

It makes a huge difference for stability but with a high-traffic site you still want to avoid ever clearing the config cache but instead replace it.

The PR looks good though so no objection to merging it.

davidalger commented 5 years ago

@colinmollenhour This site is running Magento 2 not Magento 1 so I haven't tried that. I may look through it though and see how difficult it would be to port it to M2 if we continue running into issues here, so thanks for the tip nonetheless!

colinmollenhour commented 5 years ago

Ahh yes, you did say M2. No idea what the config cache loading looks like in M2 but the three major issues I've seen with M1 config cache are:

davidalger commented 5 years ago

@colinmollenhour Thanks for merging this. Would it be possible to tag a new release version so I can send a PR updating the composer.json dependency on the Magento 2 project? Something tells me specifying a commit hash or dev-master won't fly.

colinmollenhour commented 5 years ago

Tagged 1.10.6

boldhedgehog commented 5 years ago

@davidalger thanks for your work. We still experience the "LOADING" issue, however. Your fix only checks whether the response is FALSE, but in our case the Credis library throws an Exception, which is not handled, so the read is not retried on the master.

                if($this->isMulti || $this->usePipeline) {
                    $response = FALSE;
                } else if ($name == 'evalsha' && substr($reply,0,9) == '-NOSCRIPT') {
                    $response = NULL;
                } else {
                    throw new CredisException(substr($reply,0,4) == '-ERR' ? substr($reply, 5) : substr($reply,1));
                }

As you can see here, if the request is neither pipelined nor multi, an exception is thrown, and from what I saw in the code the hGet() call does not match either of the other conditions.
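One way this could be handled is to catch the exception around the slave read and treat it like an empty reply, so the master fallback still fires. This is only a hypothetical sketch with illustrative names, not the backend's actual code or the Credis API:

```php
<?php
// Stand-in for CredisException so the sketch is self-contained.
class LoadingException extends Exception {}

// Attempt the read on the slave; if it throws (e.g. "LOADING Redis is
// loading the dataset in memory" during a resync) or returns FALSE,
// fall back to reading from the master.
function readPreferSlave(callable $slaveRead, callable $masterRead) {
    try {
        $value = $slaveRead();
    } catch (LoadingException $e) {
        $value = false; // treat a thrown error like an empty slave read
    }
    if ($value === false) {
        $value = $masterRead();
    }
    return $value;
}

// Simulated scenario: the slave is resynchronizing and throws,
// while the master holds the data.
$result = readPreferSlave(
    function () { throw new LoadingException('LOADING Redis is loading the dataset in memory'); },
    function () { return 'cached-data'; }
);
echo $result; // prints "cached-data"
```

When the slave read succeeds, the master is never contacted, so the happy path keeps the latency benefit of the local slave.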

Do you have any advice for me?