colinmollenhour / Cm_Cache_Backend_Redis

A Zend_Cache backend for Redis with full support for tags (works great with Magento)

Performance issues with the latest minor versions of `colinmollenhour/cache-backend-redis` and `colinmollenhour/credis` #181

Open glo71317 opened 9 months ago

glo71317 commented 9 months ago

Hi,

When we ran `composer update`, the minor versions of colinmollenhour/cache-backend-redis and colinmollenhour/credis were updated, and we then saw a large number of performance issues in Adobe Commerce.

During our investigation we found that these dependencies were causing the issue, so we pinned both packages to their older versions, which resolved the problem for us.

More investigation is still pending; our release is very near, so we decided to unblock delivery for the upcoming release and will continue investigating once we have time after it ships.

You may also want to revisit the changes introduced in the latest versions of the packages mentioned above. The log is attached: report.zip

For more details: https://github.com/magento/magento2/commit/9369884f8c40745291347470c5910a0f01a556af#commitcomment-137202896

Xon commented 9 months ago

The lock file indicates you went back from v1.17.1 to v1.16.0.

The major performance-related change was that Lua mode is enabled by default as of v1.17.0. This means fewer remote calls to Redis to perform cache actions.
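
For reference, the switch is the backend's `use_lua` option; a minimal sketch of constructing the backend with it disabled (connection values are placeholders):

```php
<?php
// Minimal sketch: constructing the backend directly with Lua mode disabled.
// Connection values are placeholders; in Magento this array normally lives in
// the cache configuration rather than in code.
require 'vendor/autoload.php';

$backend = new Cm_Cache_Backend_Redis([
    'server'   => '127.0.0.1',   // placeholder host
    'port'     => 6379,
    'database' => 0,
    // v1.17.0 turns this on by default; setting it to false restores the
    // pre-1.17 multi-command save/clean path.
    'use_lua'  => false,
]);
```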

glo71317 commented 8 months ago

@Xon You are correct! Thanks for highlighting it.

@colinmollenhour We investigated a bit more today and decided to roll back your small change (https://github.com/colinmollenhour/Cm_Cache_Backend_Redis/pull/182/files), then ran our build against the created #182 after updating the minor versions of both libraries, colinmollenhour/cache-backend-redis and colinmollenhour/credis.

[Screenshot: 2024-02-22 at 3:39:25 PM]

The performance issue got sorted out, so I think we can conclude that the major performance-related change was Lua being enabled by default in v1.17.0.

[Screenshot: 2024-02-22 at 3:39:57 PM]

Please take a look and suggest the way forward!

colinmollenhour commented 8 months ago

@glo71317

Note that due to the lack of transactions, not using Lua mode introduces the potential for race conditions. These race conditions would probably not have much impact, but could occasionally cause odd side effects like a cache record that doesn't get cleared when the tag is cleared. Also, I think very large caches have trouble clearing and doing GC reliably without Lua mode.

It may be that the best solution is to not use Lua when saving cache records since these are many small operations, but do use it when clearing tags and gc.

Are you able to pinpoint what exactly is slower with Lua enabled? Does your test framework have an unrealistic number of cache saves and tag clears? Lua mode definitely should not hurt read performance, and a healthy store should easily have far more reads than writes.

Xon commented 8 months ago

If anything, the performance impact should be from disabling Lua. The lockfile went from having Lua enabled to Lua disabled due to the version change.

glo71317 commented 8 months ago

@colinmollenhour You might be right, but Lua should not have been enabled in a minor version; backward-compatibility-breaking changes are best introduced in a major version. Normally, restricting a library to a minor or patch version is not a good way to resolve an issue.

Right now it is time-consuming to pinpoint the issue in Adobe Commerce. Our suggestion would be to disable Lua, because many other users are also impacted.

colinmollenhour commented 8 months ago

I didn't consider this a backward-compatibility issue as the lua mode has been present for a very long time and no code APIs changed. It is unexpected that it had a noticeable negative impact on performance, though. It was released as a minor version update and not a patch update so you could lock your version to ~1.16.0 allowing for future patch updates while not upgrading to 1.17.0.
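
For example, pinning per that suggestion in a project's composer.json would look roughly like this fragment:

```json
{
    "require": {
        "colinmollenhour/cache-backend-redis": "~1.16.0"
    }
}
```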

hostep commented 8 months ago

It appears Magento 2.4.7 will come with a configuration option to enable or disable lua: https://github.com/magento/magento2/commit/b5ede695df511eb42e5a93969e29fdbb3d02eb77
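
If that option lands as described, the app/etc/env.php cache section would presumably look something like this sketch, assuming the flag is passed through backend_options like the other Redis options (verify against the linked commit; hostnames are placeholders):

```php
<?php
// Hedged sketch of a 2.4.7-style Lua toggle in app/etc/env.php.
return [
    'cache' => [
        'frontend' => [
            'default' => [
                'backend' => 'Cm_Cache_Backend_Redis',
                'backend_options' => [
                    'server'   => 'redis-cache',  // placeholder host
                    'port'     => '6379',
                    'database' => '1',
                    'use_lua'  => '0',            // '1' keeps the 1.17 default behaviour
                ],
            ],
        ],
    ],
];
```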

colinmollenhour commented 8 months ago

Thanks for the update, @hostep

@glo71317 I'd still appreciate any insight into under what situations the lua flag is a net negative if you have any more details. If the read:write ratio is "normal" compared to what I've seen in my experience the impact should be minimal so I suspect it is unique to the test environment, but I've been wrong before. :)

glo71317 commented 8 months ago

@colinmollenhour We have adapted the Lua changes on our end. By default, Lua will be disabled in Adobe Commerce when Redis is enabled; if end users want the Lua flag enabled, they can turn it on and continue to meet their requirements. You can close this issue as far as we are concerned. Thanks for the update!

hostep commented 8 months ago

@glo71317: I agree with Colin in that it would be a good idea to dig a bit deeper in Magento's cache usage, to figure out exactly what is the root cause of this problem instead of hiding it with a flag.

glo71317 commented 8 months ago

@hostep We are not hiding anything; we will speak to our PO about investigating this a bit more deeply and sharing the read:write ratio. But we will surely do that once the current release is finished.

igorwulff commented 6 months ago

@hostep We have seen similar issues with Redis after use_lua was set to 1 by default. We had to set it back to 0 to stop Redis read timeout errors and other errors from popping up. The backend_clean_cache cron from Magento also started to fail after use_lua=1 became the default.

Nuranto commented 6 months ago

Maybe related: https://github.com/magento/magento2/issues/38668

damienwebdev commented 3 months ago

I think I have a valid reproduction of this on Magento 2.4.6-p6 and colinmollenhour/cache-backend-redis:v1.17.1.

My redis monitor is full of:

```
1723139520.164118 [0 IP:PORT] "evalsha" "1617c9fb2bda7d790bb1aaa320c1099d81825e64" "3" "40d_CONFIG_SCOPES" "40d_CONFIG" "40d_MAGE" "zc:k:" "d" "t" "m" "i" "zc:tags" "zc:ti:" "zc:ids" "40d_SYSTEM" "gz:..." "40d_CONFIG_SCOPES,40d_CONFIG,40d_MAGE" "1723139520" "1" "" "0"
1723139520.778961 [0 lua] "HMSET" "zc:k:40d_SYSTEM" "d" "gz:..." "t" "40d_CONFIG_SCOPES,40d_CONFIG,40d_MAGE" "m" "1723139520" "i" "1"
```

This shows up on repeated attempts to make cacheable GET requests. I would assume the contents of the cache contain something that causes a hash mismatch for the same content over and over.

I'm going to try to dig further to see what (in these cache keys) is causing a miss.

colinmollenhour commented 2 months ago

Thanks for the info, @damienwebdev

That's the save() command for the 'SYSTEM' key. Are all of the updates referencing the '40d_SYSTEM' value in the same position, or is that value repeated only rarely or not at all?

The Magento config (which consists of many sections/scopes, each with their own key) should be treated as basically immutable; there should be absolutely no code that changes it directly. Only a user-triggered config update or a new code deployment should touch the config. I've seen extensions that do very naughty things like using the config to store frequently updated state, such as data recorded after a cron job runs; this is what core/flag is good for (in M1, not sure what the M2 equivalent is called). Also, it is bad to use "Flush" on a busy site as it is possible to cause a "stampede" of updates (this was fixed thoroughly in OpenMage, not sure about other variants). It's even worse when there is a config flush buried in an extension somewhere.
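
As an illustration of that point, the M2 counterpart of core/flag appears to be \Magento\Framework\FlagManager; a hedged sketch of keeping cron state out of the config (class and flag names are made up for the example):

```php
<?php
// Hedged illustration: FlagManager stores small bits of state in the flag DB
// table, so frequently updated values don't force config cache re-saves.
declare(strict_types=1);

use Magento\Framework\FlagManager;

class CronStateTracker
{
    public function __construct(private FlagManager $flagManager)
    {
    }

    public function markLastRun(): void
    {
        // Persists to the flag table; does not touch the config cache.
        $this->flagManager->saveFlag('my_module_last_cron_run', time());
    }

    public function getLastRun(): ?int
    {
        $value = $this->flagManager->getFlagData('my_module_last_cron_run');

        return $value === null ? null : (int) $value;
    }
}
```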

If users are updating the config frequently, it is true that this is not a well-optimized scenario: Magento re-saves the entire config even if only a single value is changed, so this too should be avoided. For example, perhaps some config data should instead live in separate tables rather than in the config. This really applies to any backend, but the Redis Lua mode does sacrifice some save performance for stability; without it there is some possibility of race conditions, since a save requires multiple updates. In general, the assumption is that saves are rare and reads are not, so if saves are not rare, then Lua mode will not be a good candidate.
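
A simplified sketch of that trade-off, using Credis directly and the key layout visible in the MONITOR output above (this is not the backend's actual code, just an illustration of why the non-Lua path needs several operations):

```php
<?php
// Simplified sketch, not the backend's actual code: why the non-Lua save path
// can race. A single save touches several keys, and without MULTI or Lua
// another client can interleave between these operations (e.g. a tag clean).
require 'vendor/autoload.php';

$redis = new Credis_Client('127.0.0.1', 6379); // placeholder connection

$id   = '40d_SYSTEM';                          // cache id from the MONITOR output above
$tags = ['40d_CONFIG_SCOPES', '40d_CONFIG', '40d_MAGE'];

// Non-Lua mode: several commands, pipelined for fewer round trips but still
// executed as distinct operations on the server.
$pipe = $redis->pipeline();
$pipe->hMSet('zc:k:' . $id, [
    'd' => 'gz:...payload...',                 // data
    't' => implode(',', $tags),                // tag list
    'm' => time(),                             // mtime
    'i' => 0,                                  // infinite-lifetime flag
]);
$pipe->expire('zc:k:' . $id, 86400);
foreach ($tags as $tag) {
    $pipe->sAdd('zc:ti:' . $tag, $id);         // tag -> ids index
    $pipe->sAdd('zc:tags', $tag);              // set of all known tags
}
$pipe->sAdd('zc:ids', $id);                    // set of all ids
$pipe->exec();

// Lua mode collapses all of the above into one EVALSHA executed atomically on
// the server (the "evalsha" entries in the MONITOR output), trading a larger
// single operation for fewer round trips and no interleaving.
```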

If the core uses a pattern that needlessly causes lots of config updates, then I'd argue that should be fixed first.

Also, I've never checked this in M2, but in OpenMage the config uses an infinite lifetime to prevent the config from being evicted; only an intentional act should clear the config. Lastly, the "allkeys-random" eviction policy should definitely not be used.
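
As a generic example (not deployment-specific advice), a TTL-aware policy in redis.conf leaves infinite-lifetime keys like the config alone:

```
maxmemory-policy volatile-lru   # evicts only keys that have a TTL set; never allkeys-random for this cache
```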

However, as I've looked back on this, the current Lua implementation does everything in one call, but since it involves extra encoding/decoding steps and the config data can be quite large for some stores, it may make sense to split it into two calls: one that saves the key data (the first hmset) and then the evalsha to update all of the tags. I think only an implementation comparing benchmarks can answer that. I'm not planning on tackling this at the moment, just to be clear.

damienwebdev commented 2 months ago

@colinmollenhour after digging into this further, it appears my issue was caused by missing the retry_reads_on_master option and then triggering a cache clear. Even on a site with absolutely no volume, I was able to see this behavior on a system with 1 master and 9 replicas simply by loading pages in the Magento admin.
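
For anyone hitting the same thing, the replica-read options referenced here look roughly like the following (as I understand the backend's option names; these belong in the cache backend_options block of app/etc/env.php, and the hostnames are placeholders):

```php
<?php
// Hedged sketch: replica-related backend_options for Cm_Cache_Backend_Redis.
$backendOptions = [
    'server'                => 'redis-master',             // writes and fallback reads
    'load_from_slave'       => 'tcp://redis-replica:6379', // serve reads from a replica
    // Retry a missed read against the master when a freshly written key has
    // not replicated yet; missing this is what triggered the stampede above.
    'retry_reads_on_master' => '1',
];
```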

Once I added that config, my specific issue resolved. Apologies for the noise.

colinmollenhour commented 2 months ago

Ahh, that will do it; it sounds like the replication latency caused a cache-saving stampede. OpenMage fixed this in https://github.com/OpenMage/magento-lts/pull/3533/files, even without retry_reads_on_master, by using MySQL's GET_LOCK. I'm not sure what the M2 implementation is doing.
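
For context, the general shape of that GET_LOCK approach is roughly the following (a rough illustration, not OpenMage's actual code; the lock name and connection details are made up):

```php
<?php
// GET_LOCK pattern: only one process regenerates the expensive cache entry
// while the others wait and then re-read it, preventing a save stampede.
$pdo = new PDO('mysql:host=127.0.0.1;dbname=magento', 'user', 'pass');

$lockName = 'config_cache_regen';
$gotLock  = (bool) $pdo->query("SELECT GET_LOCK('$lockName', 10)")->fetchColumn();

if ($gotLock) {
    try {
        // ... regenerate and save the cache entry here ...
    } finally {
        $pdo->query("SELECT RELEASE_LOCK('$lockName')");
    }
} else {
    // Another process is already regenerating; re-read the cache entry
    // instead of rebuilding it.
}
```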

I tested the idea of moving the hmset that does the main data save outside of the Lua script (see https://github.com/colinmollenhour/Cm_Cache_Backend_Redis/tree/lua-save-performance) and it's surprisingly quite a bit slower, so I don't think it's a good improvement even though it would potentially reduce blocking time. So as of now I don't know of any way to improve either the Lua or non-Lua modes.