MichaCo / CacheManager

CacheManager is an open source caching abstraction layer for .NET written in C#. It supports various cache providers and implements many advanced features.
http://cachemanager.michaco.net
Apache License 2.0

Version 1.1.1 is possibly not working for old keys? (need confirmation) #183

Closed: vantheshark closed this issue 6 years ago

vantheshark commented 7 years ago

Hi,

We've updated from 0.8.0 to 1.1.1 and found many issues, unfortunately in our production environment >.< The packages involved are CacheManager.Core and CacheManager.StackExchange.Redis.

"# Server
redis_version:3.0.7
redis_build_id:f7cfd70bacad9a98
redis_mode:cluster
os:Linux 3.10.0-327.13.1.el7.x86_64 x86_64
gcc_version:4.8.5

1/ There are suddenly a lot of errors: Update failed on ':' because of too many retries. There are many old keys created by the previous version, and when running with version 1.1.1 the problem starts to occur: the old keys stay there, undeleted/unexpired.

More info: the hash value of the key (probably created by the old version): HasValue

The code that reads the key probably checks the expiration mode incorrectly; please confirm (see the linked code in version 1.1.1).

When comparing the versions, it becomes obvious that my theory is correct :angry:

Breaking change

usesDefaultExpiration is true in this case; however, the expirationMode was Sliding, as created by the old version. I think the cache item creation should use the existing expirationMode unless it is None and usesDefaultExpiration == true (a sketch of that follows the current code below).

// Current code
var cacheItem =
    usesDefaultExpiration ?
    string.IsNullOrWhiteSpace(region) ?
        new CacheItem<TCacheValue>(key, value) :
        new CacheItem<TCacheValue>(key, region, value) :
    string.IsNullOrWhiteSpace(region) ?
        new CacheItem<TCacheValue>(key, value, expirationMode, expirationTimeout) :
        new CacheItem<TCacheValue>(key, region, value, expirationMode, expirationTimeout);
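
For illustration, here is a minimal sketch of the change suggested above, reusing the variables from the current code. This is only my sketch, not the library's actual fix: fall back to the default-expiration constructors only when the stored item really has no expiration of its own.

// Suggested change (sketch only): keep the stored expiration unless the item
// has ExpirationMode.None and was flagged to use the configured default.
var useConfiguredDefault = usesDefaultExpiration && expirationMode == ExpirationMode.None;

var cacheItem =
    useConfiguredDefault ?
    string.IsNullOrWhiteSpace(region) ?
        new CacheItem<TCacheValue>(key, value) :
        new CacheItem<TCacheValue>(key, region, value) :
    string.IsNullOrWhiteSpace(region) ?
        new CacheItem<TCacheValue>(key, value, expirationMode, expirationTimeout) :
        new CacheItem<TCacheValue>(key, region, value, expirationMode, expirationTimeout);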

Finally, I saw that the code calls EvictFromOtherHandles if the version is conflicted, but for some reason the key is not deleted from the other handle.

2/ An OutOfMemoryException is thrown from BackplaneMessage.ReadBytes.

And the call comes from DeserializeMessage.

I tried to debug but I'm not really familiar with the code so it might be better for the author to have a look.

We rolled back to 0.8.0 and the problem is gone. Thanks

MichaCo commented 7 years ago

Hi @vanthoainguyen, thanks for reporting those issues and sorry for any inconvenience.

Regarding "Update failed on ':' because of too many retries": does this really have to do with the expiration settings? Can you give me more information about what the key/region is in that case and what the existing hash looks like in Redis?

Regarding the expiration issue: yes, since 1.0 there are some changes to how the expiration of cache items in Redis gets computed. That was actually a bug fix, as the expiration didn't get applied correctly to cache layers above the Redis layer. I usually try to keep versions compatible; in this case, the change intentionally was not. The best approach would be to start with a clean Redis database. Anyway, I'll have a look into that again. Assuming useDefault=true when the flag is not set might not be correct, at least when the expiration type and timeout are set. Maybe that's a way to make it backward compatible again. I'll have to do some testing though...

Regarding the OOM exception, it's pretty hard to tell why that occurs. I did run some long-running tests on it with millions of keys and didn't have any issues. But that is highly dependent on the environment and what else is going on. It is usually just coincidence which part of the app throws the OOM exception if the machine is already low on memory. That being said, I know the code can be optimized by reusing byte arrays, as allocating many small byte arrays can fragment the system memory. I'll have a look into that and might have to roll back to the previous version if I cannot make the allocations more efficient.
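
As an aside, one common way to reuse byte arrays instead of allocating a new one per backplane message is a shared ArrayPool. The snippet below is only a rough sketch under that assumption, not CacheManager's actual code.

using System;
using System.Buffers;
using System.IO;
using System.Text;

// Sketch only: read a length-prefixed payload into a pooled buffer instead of
// allocating a fresh byte[] per message, which reduces small-object churn.
static string ReadPayload(Stream stream)
{
    var reader = new BinaryReader(stream, Encoding.UTF8, leaveOpen: true);
    int length = reader.ReadInt32();                       // length prefix written by the sender

    byte[] buffer = ArrayPool<byte>.Shared.Rent(length);   // buffer is reused across calls
    try
    {
        int read = reader.Read(buffer, 0, length);
        return Encoding.UTF8.GetString(buffer, 0, read);
    }
    finally
    {
        ArrayPool<byte>.Shared.Return(buffer);
    }
}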

Can you give me some more information about what your production environment looks like, hardware-wise? And how many keys do you store, and maybe how many keys change over time? How large are the values for each key (roughly)?

vantheshark commented 7 years ago

Hi, regarding the OOM exception, it's caused by creating a byte array with a far too large size read from reader.ReadInt(), as I showed in the code. I think it's likely related to the conflicting-version issue above and has nothing to do with the hardware.
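
For context, the failing pattern looks roughly like this (an illustrative fragment, not the exact BackplaneMessage.ReadBytes code):

// Illustrative only: when the message was written by an incompatible (0.8.0) sender,
// the length prefix is effectively garbage, so this allocation can request a huge
// array and throw OutOfMemoryException.
int length = reader.ReadInt32();
var bytes = new byte[length];
reader.Read(bytes, 0, length);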

However, I read the code and, as I mentioned, I saw the EvictFromOtherHandles method, which is supposed to delete the key from memory, but for some reason it didn't, and the code keeps failing to update Redis and shows the warning about too many retries. I can actually reproduce it by storing the key in a localhost Redis with 0.8.0 and then reading it with 1.1.1.

I still think you should initialize the cache item the way I suggested, with whatever expirationMode is retrieved from the cache, rather than relying on the usesDefaultExpiration variable.

Anyway, I'll post an update once I've found out why the eviction didn't work.

vantheshark commented 7 years ago

FYI

Check these photos to see what the hash looks like. I think version 0.8.0 created 5 hash entries, but version 1.1.1 reads 7 entries and decides the expirationMode incorrectly.

The region was null, by the way. Due to our use case, our system cannot wipe out all the existing cache items. However, we expect those keys to expire in 30 minutes, but something happened and the entries keep being returned and never get deleted.

I think the culprit is likely related to why the handle couldn't delete the conflicted version.

MichaCo commented 7 years ago

> I think the culprit is likely related to why the handle couldn't delete the conflicted version.

Most likely, yes. The update handling also changed to use a version field on the hash instead of passing the old value in each time for comparison.
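
For context, the general shape of a version-guarded hash update with StackExchange.Redis looks roughly like this; the field names and retry count below are illustrative assumptions, not CacheManager's actual implementation:

using StackExchange.Redis;

// Sketch only: optimistic update guarded by a "version" hash field.
static bool TryUpdate(IDatabase db, RedisKey key, RedisValue newValue, int maxRetries = 50)
{
    for (var attempt = 0; attempt < maxRetries; attempt++)
    {
        var version = db.HashGet(key, "version");

        var tran = db.CreateTransaction();
        // apply the update only if no other writer bumped the version in the meantime
        tran.AddCondition(Condition.HashEqual(key, "version", version));
        tran.HashSetAsync(key, "value", newValue);
        tran.HashIncrementAsync(key, "version");

        if (tran.Execute())
        {
            return true;
        }
        // another writer won the race; re-read the version and retry
    }

    // this is the kind of situation behind "Update failed ... because of too many retries"
    return false;
}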

Again, why do you have to run 0.8.0 (1.5 years old) and 1.1.1 together in one environment? That might cause several issues and is not something I can support as that would require tons more testing/work on my side ;)

vantheshark commented 7 years ago

Nah, that happened because we tried to update the app to the latest package. There were many existing cache keys prior to the deployment, and unfortunately the problem was only found after we deployed :).

vantheshark commented 7 years ago

Found one more clue.

MichaCo commented 7 years ago

Hi @vanthoainguyen, first of all, again: the backplane received some breaking changes and is not compatible with 0.8.0. That being said, I also found one bug which causes the OOM exception. I do validate the length of the array to prevent exactly those things, but I allocated the array beforehand, which was wrong. I added some tweaks to the backplane; any "old" messages may now log warnings and will be skipped due to errors parsing them. At least it should never throw OOM exceptions anymore...
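
In other words, the length check has to run before the allocation. A minimal sketch of that ordering (names like remainingBytes are illustrative assumptions, not the exact code that went into the fix):

// Sketch only: validate the length prefix first, allocate afterwards.
int length = reader.ReadInt32();

if (length <= 0 || length > remainingBytes)   // remainingBytes: assumed sane upper bound
{
    // old/incompatible message: log a warning and skip it instead of allocating
    return null;
}

var bytes = new byte[length];                 // safe to allocate only after validation
reader.Read(bytes, 0, length);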

MichaCo commented 6 years ago

Fixed in release 1.1.2.