Closed jonnythejap closed 11 years ago
An update on this.. it happened again, shortly after I went to bed, so a good 5 hours of mostly down.. some pages got loaded, most didn't. Looking through the exception reports just before it happened was this unusual guy...
a:5:{i:0;s:43:"protocol error, got '?' as reply type byte ";i:1;s:2497:"#0 /var/www/html/magento/lib/Mage/Cache/Backend/Redis.php(210): Credis_Client->__call('hGet', Array)
And then after that, every connection would throw this error until it started throwing the connection error..
a:5:{i:0;s:24:"read error on connection";i:1;s:4126:"#0 /var/www/html/magento/lib/Mage/Cache/Backend/Redis.php(210): Credis_Client->__call('hGet', Array)
Edit: Looking into it more, I am not sure it is related. The previous instance where this happened, no such error about the reply byte came in before all of it started, just straight to the read errors. We're using the phpredis module. I just changed the configuration on the full page caching to force standalone to see if the errors happen through just that Credis_Client class.
Never seen that before but sounds like a Redis server issue.. E.g. full disk, corrupt memory, segfault, etc...
44 gigs free on disk, 6 gigs of memory free (out of 10). We're hosting with rackspace, so it might be corrupt memory but was thoroughly tested before we started using this server in January with no problems, and all other instances on this hypervisor is running without issue. We have the full page cache redis instance separate from the regular cache, and the regular cache hasn't had an issue yet. There is no errors in the redis log, just logs of when it dumps to disk or when I had to restart it to get it going again. Been running with no issues until a day after we upgraded to 1.13... so either a difference in the Redis caching stuff from what the version packaged with 1.13 and what we were using previously? Or an issue in the enterprise full page caching that isn't playing nicely with the redis backend?
The only material change since EE started using the renamed Cm_Cache_Backend_Redis is 28e975c. When you started getting the errors, was the Redis process still running or had it died?
Still running. Could connect to the port and run basic commands with no problem from the web servers and from the local host of the redis server. Didn't think to netstat it to see how many existing connections. Just restarted the service to get it working again. On May 28, 2013 7:51 AM, "Colin Mollenhour" notifications@github.com wrote:
The only material change since EE started using the renamed Cm_Cache_Backend_Redis is 28e975chttps://github.com/colinmollenhour/Cm_Cache_Backend_Redis/commit/28e975c. When you started getting the errors, was the Redis process still running or had it died?
— Reply to this email directly or view it on GitHubhttps://github.com/colinmollenhour/Cm_Cache_Backend_Redis/issues/37#issuecomment-18551528 .
If you can connect, you might try issuing an INFO command to see if there is anything unusual. Also I'd monitor with munin (link to script in README)
INFO command was the first thing I ran the first time it happened when trying to figure out what was initially going wrong that brought the site down... nothing unusual from what I saw. Just setup munin to monitor it and will update with anything unusual the next time it happens. Since switching over to the standalone instead of phpredis for handling the connection, its been going for almost 9 hours without another recurrence of it, and I'm hoping it won't happen again. Also, in case it might make a difference, using snappy for compression. Thanks for the input so far.
The issue hasn't come up again and now been over 24 hours since it last occurred. The change which I did on my end was setting the force_standalone option to 1 to avoid going through phpredis. If the problem doesn't recur, then there is a potential issue with the Credis_Client or the cache data that magento is generating or something when using the redis class from the phpredis extension to handle the connection instead of the standalone.
Interesting. phpredis is quite a bit faster so if you can get it working that would be best. Where did you install phpredis from? I'm using the latest version that was tagged recently (2.2.3) and it seems pretty stable.
The version I had was 2.2.2, pulled down in December from nicolasff's github. Currently running php 5.3.24. I'll try the latest version of php 5.3.x and compile 2.2.3 to see if that will work and be stable.
Hi guys, we seem to be having a similar issue as well. Upgraded to the EE 13.0 version, upgraded phpredis and getting read on error connection. It seems to happen randomly when the guys are adding products and saving. It doesn't seem to bring the site down though. We are on our own VPS with PHP 5.4.15. I'm going to do the update to EE 13.0.1 tonight and see if that gives any change. If I understand correctly, it falls back and uses mysql? I'll try the force_standalone setting as well.
We are hitting the problem still, but only for about 30 second periods when doing things like applying catalog rules which, I assume, is causing a lot of entries to get invalidated in the full page cache. Doesn't happen with the regular cache. If it happens on the FPC it won't default back to the database, or bypass the FPC completely and just fulfill the request, it throws a 500 series error with the ugly magento exception error page.
@DerekMarcinyshyn The cache backend will not fallback on errors, it will throw an exception.
@jonnythejap This sounds like an issue that should be reported to either redis or phpredis. I experience read errors myself occasionally (in Cm_RedisSession) and don't know exactly why but my best guess is the redis server. If the connection is lost the connection is lost so I don't see how this could be a bug in my PHP code.
I don't think the cache backend is the proper place to handle errors; they should be thrown as they are and caught somewhere in a higher level. Perhaps the FPC module should be modified to log the error and generate a fresh page rather than let the error bubble all the way up to the user.
I had this exact same issue as well, and using phpredis instead of the Credis fallback solution solved the issue here.
I've been using the Redis caching module since last August with no difficulties. We just did an update to EE 1.13, and switched over to using the one that came packaged with Magento this last Saturday night, and this problem has happened twice now bringing the site down, once at midnight Monday morning, and then again at 11:11 PM Monday night (I noticed it was the same class essentially, but changed the configuration to point to the Mage one in lib just in case there was an update after the first time this problem happened that might have been the cause).
The site works fine until... well, I'm not sure what, and that is what I'm hoping to find out or get pointed at something to start figuring out what is wrong. Like a switch, all the page requests start returning with errors. A sample of them is here:
a:4:{i:0;s:44:"Connection to Redis failed after 2 failures.";i:1;s:1904:"#0 /var/www/html/magento/lib/Credis/Client.php(378): Credis_Client->connect()
1 /var/www/html/magento/lib/Credis/Client.php(463): Credis_Client->connect()
2 /var/www/html/magento/lib/Mage/Cache/Backend/Redis.php(210): Credis_Client->__call('hGet', Array)
3 /var/www/html/magento/lib/Mage/Cache/Backend/Redis.php(210): Credis_Client->hGet('zc:k:449_FPC_DE...', 'd')
4 /var/www/html/magento/lib/Zend/Cache/Core.php(303): Mage_Cache_Backend_Redis->load('449_FPCDESIGN...', false)
5 /var/www/html/magento/lib/Varien/Cache/Core.php(158): Zend_Cache_Core->load('FPC_DESIGN_EXCE...', false, false)
6 /var/www/html/magento/app/code/core/Mage/Core/Model/Cache.php(379): Varien_Cache_Core->load('FPC_DESIGN_EXCE...')
7 /var/www/html/magento/app/code/core/Enterprise/PageCache/Model/Processor.php(184): Mage_Core_Model_Cache->load('FPC_DESIGN_EXCE...')
8 /var/www/html/magento/app/code/core/Enterprise/PageCache/Model/Processor.php(145): Enterprise_PageCache_Model_Processor->_getDesignPackage()
9 /var/www/html/magento/app/code/core/Enterprise/PageCache/Model/Processor.php(107): Enterprise_PageCache_Model_Processor->_createRequestIds()
10 /var/www/html/magento/app/code/core/Mage/Core/Model/Cache.php(703): Enterprise_PageCache_Model_Processor->__construct()
11 /var/www/html/magento/app/code/core/Mage/Core/Model/Cache.php(685): Mage_Core_Model_Cache->_getProcessor('Enterprise_Page...')
12 /var/www/html/magento/app/code/core/Mage/Core/Model/App.php(340): Mage_Core_Model_Cache->processRequest()
13 /var/www/html/magento/app/Mage.php(683): Mage_Core_Model_App->run(Array)
14 /var/www/html/magento/index.php(87): Mage::run('default', 'store')
15 /var/www/html/magento/index_redir.php(25): require_once('/var/www/html/d...')
16 {main}";s:3:"url";s:1:"/";s:11:"script_name";s:1:"/";}
The only action I can see in the admin log that happened just before it is a catalog pricing rules trying to be applied, and failing (it seems a bug that if the cron is running when someone tries to apply catalog price rules, it will fail, however, if I turn the off the cron from running in crontab, it always applies no problem). So, maybe something wrong if a reindex process fails (like a catalog pricing rules get reapplied and pricing gets reindexed at midnight automatically I think if I recall correctly, and this rule applying failing probably failed on the reindexing part) something messes up the full page cache that causes the module to hang? I don't know, just guessing at this point.. and the one thing I've done to basically jump start it was restart the redis services... I didn't check trying to go into the admin and flushing the cache.
Anyway, I'm not passing through a firewall between the web servers and the redis server, and am hoping that there might be a thought on how to best go about debugging this or an idea of what might be going wrong that the redis backend suddenly becomes non-accessible. Hopefully someone has some insight and can help.