TheCloudlessSky / NHibernate.Caches.Redis

An NHibernate caching provider for Redis.
MIT License

SetOfActiveKeysKey is not cleaned up and hogs cpu/memory over time #21

Open nomaddamon opened 8 years ago

nomaddamon commented 8 years ago

Our situation: table "A" has ~100.000.000 records and a rather specific load pattern - if a row is queried, it is likely to be queried again a few thousand times over the next few minutes. Around 5000 different rows are queried every minute (the distribution is near random). After a few minutes, a given row might not be accessed again for days.

A logical setup would be to set the CacheRegion absolute expiry to ~5 minutes and keep only the "hot" rows in cache (~5 * 5.000 ≈ 25.000 rows in cache).

Since SetOfActiveKeysKey items have no expiry, the set builds up over time (items are removed from the cache via expiry, but the :keys records are kept). With our load pattern this leads to 10+ million records in the :keys set while only ~25.000 items are actually cached. I'm quite certain this has a negative effect on lookup speed (not tested, though), and I'm certain about the memory consumption - purging the :keys set frees a few GB of memory.

At the moment we use a scheduled cleanup as a workaround, but this feels like a hack and probably shouldn't be part of this project. At the same time, I can't figure out how to address this issue cleanly given the current cache architecture.

(The hack we use is a job scheduled every minute that does a full SCAN (preferably on a slave), finds the :keys entries, does an SSCAN on each entry, finds the obsolete entries in memory and deletes them in batches - a sketch of the idea follows below.)
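
A minimal sketch of that cleanup job (not our exact code), assuming StackExchange.Redis, the default NHibernate-Cache:<region>:keys naming, and that each set member is the full Redis key of a cached value - adjust to your RedisNamespace configuration:

using System;
using System.Collections.Generic;
using StackExchange.Redis;

public static class ActiveKeysPruner
{
    public static void Prune(ConnectionMultiplexer connection, int databaseNumber = 0)
    {
        var db = connection.GetDatabase(databaseNumber);
        // Any endpoint works for the SCAN; a slave is preferable to keep load off the master.
        var server = connection.GetServer(connection.GetEndPoints()[0]);

        // SCAN for every region's set of active keys.
        foreach (var setKey in server.Keys(databaseNumber, pattern: "NHibernate-Cache:*:keys"))
        {
            var stale = new List<RedisValue>();

            // SSCAN the set and collect members whose cached value has already expired.
            foreach (var member in db.SetScan(setKey))
            {
                if (!db.KeyExists((string)member))
                    stale.Add(member);
            }

            // Remove the obsolete members in batches so each SREM stays small.
            for (var i = 0; i < stale.Count; i += 500)
            {
                var batch = stale.GetRange(i, Math.Min(500, stale.Count - i)).ToArray();
                db.SetRemove(setKey, batch, CommandFlags.FireAndForget);
            }
        }
    }
}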

nomaddamon commented 8 years ago

A more common example of the problem would be querying a database object with an expiry, e.g. session.CreateCriteria<Banner>().Add<Banner>(x => x.ExpiresAt > expiryTime).SetCacheable(true).List(); where expiryTime is DateTime.Now rounded down to the full minute.

Reasoning: every minute we query the database for a list of not-yet-expired banners, and all other queries hit the StandardQueryCache.

As a result, every minute a new entry is added to NHibernate-Cache:NHibernate.Cache.StandardQueryCache:keys, and those entries are only removed on a Redis FLUSHDB.

nomaddamon commented 8 years ago

Went ahead and tried a different approach - instead of keeping track of keys via :keys, SCAN for them and DEL them on ICache.Clear(); see https://github.com/nomaddamon/NHibernate.Caches.Redis

This adds considerable overhead to ICache.Clear() (clearing a small region in a huge database might result in a 100x performance loss), but it does come with a few perks:

This might not be an acceptable approach for this project, but in our use case it works better (ICache.Clear() is not used in our solutions, since we try to avoid named queries). A rough sketch of the idea follows below.
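
A minimal sketch of the idea (not the actual code in the branch), again assuming StackExchange.Redis and the default NHibernate-Cache:<region>:* key prefix:

using System.Linq;
using StackExchange.Redis;

public class ScanBasedClear
{
    private readonly ConnectionMultiplexer connection;
    private readonly int databaseNumber;

    public ScanBasedClear(ConnectionMultiplexer connection, int databaseNumber = 0)
    {
        this.connection = connection;
        this.databaseNumber = databaseNumber;
    }

    // Clears a whole region by scanning for its keys instead of tracking them in a set.
    public void Clear(string regionName)
    {
        var db = connection.GetDatabase(databaseNumber);
        var server = connection.GetServer(connection.GetEndPoints()[0]);

        // server.Keys() uses SCAN under the hood (so it doesn't block like KEYS),
        // but it still has to walk the entire keyspace of the database.
        var keys = server.Keys(databaseNumber, pattern: "NHibernate-Cache:" + regionName + ":*");

        // Delete in batches using the KeyDelete(RedisKey[]) overload.
        foreach (var batch in keys.Select((key, i) => new { key, i }).GroupBy(x => x.i / 1000))
        {
            db.KeyDelete(batch.Select(x => x.key).ToArray(), CommandFlags.FireAndForget);
        }
    }
}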

TheCloudlessSky commented 8 years ago

Thanks for bringing this up. This will be quite a long read, but hopefully it helps you.

This is a known disadvantage of using this library (and it's not particularly easy to fix).

Prerequisite thoughts

The general idea with this project is to be the lowest common denominator for using Redis as a cache for NHibernate. With that in mind, you may have to customize/fork this library to get the best performance with your own application's constraints (as I'm writing this I see you've already stumbled upon one of the solutions :smile:).

Have you profiled to verify that putting a Redis cache in front of your database is actually faster than just querying the database directly? This obviously depends on how your data is modeled and how you're using NHibernate. Adding a cache on top of NHibernate can actually reduce performance because of serialization/deserialization/materialization costs inside NHibernate. In large applications I've created (with specific usage patterns/environments), I've found using the DB directly was as fast as using the cache plus the full round trips. Obviously this depends heavily on your DB, environment, data, etc.

Do you need a distributed NHibernate cache (e.g. are you using only one web server)?

Have you considered using a non-ORM for high-performance reads (Dapper, Massive etc)? For my applications, we usually start with NHibernate (to get going quickly). As we find performance-sensitive areas, we use raw SQL queries with Dapper to avoid all of the costs associated with NHibernate (and extra round-tripping to the cache). We use MediatR to abstract away the implementation of these queries to consumers.

If you're using the NHibernate query cache, be aware that there are broad cases where an update will cause NHibernate to call ICache.Clear(). This, again, depends on how you're using NHibernate, but I've found that in situations like this the NHibernate query cache has no real benefit, so we leave it off completely. We then move the caching "up" (e.g. cache rendered HTML or JSON rather than actual DB entities).

Potential solutions

  1. Turn on keyspace notifications in Redis and listen for "expired" events in your application code. When a key expires, remove it from the set of active keys (you may have race conditions, however) - see the sketch after this list.
    • Because this isn't enabled in Redis by default, this isn't something we can easily bake into this library for everyone to use.
    • It uses additional CPU in Redis.
    • What happens if you have more than one listener (e.g. multiple web servers would all receive the keyspace notification and each would try to remove the key from the set of active keys)?
  2. Use your external timer to "prune" old entries.
    • If we baked this into this library, it'd have to be environment-agnostic (e.g. HostingEnvironment.QueueBackgroundWorkItem is only useful in web scenarios).
    • Instead we could expose a RedisCache.Prune() so that you can determine how often you'd want to prune the entries.
    • The actual pruning, however, really depends on your use case. For example, SSCAN is a server operation, not a full database operation.
    • How many "N" do we query at one time?
    • Should we always target a slave or the master (or use PreferSlave)?
    • There are lots of questions that make Prune() less general and not a good fit to add to this library (however I'll still consider it if we can nail down a great implementation).
    • Could we just publicly expose the RedisNamespace so you can get direct access to the key formatting (and then write your own Prune())?
  3. Fork the library, manually do a SCAN/KEYS in ICache.Clear(), and don't track the set of active keys.
    • Edit: Just saw that you stumbled upon this!
    • This was one of the original ways this was implemented (by other libraries that inspired this one). The reason for not doing it this way is to normalize performance.
    • It's usually very use-case specific and has really bad performance.
    • It has potential race conditions.
    • NHibernate can call ICache.Clear() at different times (e.g. executing native SQL, manually evicting a whole collection/region/type) so you have to be careful about performance.
    • We could think about turning this into a boolean option (turned off by default) for each ICache and bake it into this library. I'd want it to be robust enough. We'd have to take into account that SCAN/KEYS are server (not database) operations.
  4. Fork the library and use "generations" (previous versions of this library used this approach).
    • Bake a "generation" into the key name: NHibernate-Cache:<region>:<generation>:<key>
    • Every ICache has to sync their generation with the other servers (meaning we have to store the generation in Redis and make sure they all update when the generation changes).
      • With versions <= 2, we synced the generation on every operation... so there were more commands hitting Redis than with the way it was changed for v3.
    • Calling ICache.Clear() would be as simple as incrementing the generation and letting everyone else know the generation has been incremented (e.g. you could use Redis pub/sub to get notified and also sync your generation on startup).
    • There's a chance of a race condition since your ICache might have an out-of-date generation (maybe by a few seconds).
    • You'd need an LRU Redis policy and to ensure that all NHibernate cache regions use a decent expiration (so entries expire once they're no longer used after the generation is updated).
    • This is a good lowest common denominator for most applications, but you have to keep generation syncing in mind.
  5. Fork the library and use a separate server/database for each of your NHibernate cache regions.
    • If you have more than 15 regions mapping to individual databases, you'd need to change the Redis config.
    • Map from region name to database number.
    • Won't need the set of active keys.
    • Clear would be as simple as a FLUSHDB for the specific server/database pair.
    • This is one of the best approaches to scale NHibernate cache. For example, you could slice your Users table cache off to its own dedicated server for even better performance.
    • We can't make this assumption for all applications.
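
As a rough illustration of option 1 (not something this library ships), a listener sketch assuming StackExchange.Redis, Redis configured with notify-keyspace-events "Ex", and the default NHibernate-Cache:<region>:<key> / NHibernate-Cache:<region>:keys naming:

using StackExchange.Redis;

public static class ExpiredKeyListener
{
    public static void Start(ConnectionMultiplexer connection, int databaseNumber = 0)
    {
        var db = connection.GetDatabase(databaseNumber);
        var subscriber = connection.GetSubscriber();

        // Channel format: __keyevent@<db>__:expired; the message is the expired key.
        subscriber.Subscribe("__keyevent@" + databaseNumber + "__:expired", (channel, expiredKey) =>
        {
            // Derive the region's ":keys" set from the expired key, e.g.
            // "NHibernate-Cache:MyRegion:some-id" -> "NHibernate-Cache:MyRegion:keys".
            var key = (string)expiredKey;
            var lastColon = key.LastIndexOf(':');
            if (lastColon <= 0) return;

            var setOfActiveKeysKey = key.Substring(0, lastColon) + ":keys";

            // Race conditions are possible if the same key is re-added concurrently,
            // and every server running this listener will issue the same SREM.
            db.SetRemove(setOfActiveKeysKey, expiredKey, CommandFlags.FireAndForget);
        });
    }
}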

Additional thoughts

Thanks!

TheCloudlessSky commented 8 years ago

You should be able to improve your Clear() by using the IDatabase.KeyDelete(RedisKey[] keys) overload instead of doing a batch:

db.KeyDelete(new RedisKey[] { "key1", "key2", "keyN" }, CommandFlags.FireAndForget);

nomaddamon commented 8 years ago

Thanks for the long and informative reply! :)

First the plan:

Additional thoughts:

About our setup: we use a bunch of web servers (16 as of writing, but this can change in minutes) load-balanced by HAProxy. The platform is a mix of WebForms and MVC; ORM is done by NHibernateX (NH for .NET 4.5 with async support). We used Memcached successfully for quite a while, but moved to Redis due to its more reasonable failover architecture. The data structure in our DB is rather complex, and some (potentially frequent) queries are really complex. Most of the data changes slowly; a random entity will probably get a few thousand hits before cache expiry/invalidation. Moving further away from relational data models is the way forward for scaling, but that would mean a huge rewrite and isn't planned at the moment.

Without caching, we would need to improve our relational DB performance by a factor of 10 to keep our current page latency, so yes, we need a cache (and since we have multiple servers, we need a distributed one).

Moving some queries to Dapper or similar doesn't make much sense, since the relational DB is the hardest thing to scale out in a hurry and costs the most to massively over-provision.

We have been using the query cache with NHibernate for a few years, and in our case it makes a huge impact on performance. (On a side note: are you sure about the query cache and ICache.Clear()? We haven't seen this in our logs and I can't trace it down in the NH source.)

We do use multiple caching levels, depending on the point at which the result becomes predictable (and re-use is likely).

About potential solutions

  1. Doesn't seem reasonable due to the increased load.
  2. Doesn't seem reasonable due to the increased load (we do use it in production at the moment until we get 3 through QA); SSCAN/SREM noticeably slow down SISMEMBER on the same key.
  3. We found this to be the most reasonable for our workload. I will try to bake it into this project as a configurable option.
  4. Seems like a reasonable approach, but generation syncing is either a hassle (as I understand it, the multiplexer can reconnect to the master after a momentary network failure without the end user noticing, while the disconnect causes an unsubscribe in Redis) or performance-heavy (like you did it in v2).
  5. Doesn't work - we'd need a separate Redis instance/server for each region. If FLUSHDB locks (e.g. when deleting a :keys set with 10m entries), it locks up the whole Redis instance, not just the given DB. SHUTDOWN NOSAVE on an instance can "flush" it fast - but having a Redis instance for each region seems like a rather peculiar architecture.
nomaddamon commented 8 years ago

Please review this branch: https://github.com/nomaddamon/NHibernate.Caches.Redis/tree/disableIndexSetOnKeys and let me know what to improve in order to get it merged

TheCloudlessSky commented 8 years ago

Regarding ICache.Clear(), it's not entirely obvious where it can be called, but if you do a "find all references" and work backwards you'll see at least the following places:

Here are my thoughts:

Also, I'm curious, how are you serializing cache objects (JSON, XML, etc)?

nomaddamon commented 8 years ago

Unrelated stuff first

We are currently using the default serializer. We played around with different options and found it acceptable (computing power for serializing is easy to scale, and the XML storage overhead in Redis is negligible bandwidth- and latency-wise).

Somewhat related stuff

We use CI and can basically do A/B testing (this is a coincidence of our deployment procedure - the UpdateTimestampsCache lives in one Redis database for all servers, but when deploying, new servers start using a new database to avoid conflicts in data object definitions; this gives us a guaranteed non-stale cache even if some servers are on version X and others on X+1 with different data structures). Over the last few weeks we have tested quite a lot in production :) - basically deploying 10-20% of the servers with some POC code (that has passed QA) to see how it affects Redis performance. We used the scheduled cleanup for a while, but it creates quite a lot of load (compared to minimizing ICache.Clear() usage and not using the index set).

Most ICache.Clear() calls come from usages we consider bad practice - hardcoded SQL/HQL queries, named queries, etc. (and yes, a named query with no result type / sync set will call ICache.Clear() on all regions). We log all ICache.Clear() calls as errors...

Related stuff

Please see commit https://github.com/nomaddamon/NHibernate.Caches.Redis/commit/a0df68f246e744d04b7c650bfe5a1314b7d200d4 and run the new tests on your own machine :)

Test logic: set up a region with ~480k expired keys and ~20k active keys (we wait a bit to make sure Redis cleans up most of the expired keys). The 20k active keys expire after the test finishes, so we are only testing Get/Put/Clear.

Use 5 parallel tasks for testing: 2 tasks simulate cache misses (GET-PUT) to load-test the region, 2 tasks simulate cache misses (GET-PUT) against an unrelated region, and 1 task calls Clear while the others are running.

You might have to change the parameters depending on your test machine, but with the defaults the tests should take about ~50s combined. There are no Asserts, console output only (the code is not meant to be part of this project and is not formatted according to your style - sorry :), ReSharper took over in some places).

My local output for Clear_preformance_without_index:

Setup took 22156
Test db keyspace: [db15, keys=22593,expires=22593,avg_ttl=3717]
Read-Write started
Read-Write started
Read-Write started
Read-Write started
Clear starting
Clear finished, took 442
Read-Write ended
Read-Write ended
Read-Write ended
Read-Write ended

So... 22.5k unexpired keys (2.5k should expire while the test is running) and a 442ms clear - an OK result if you don't use Clear :)

My local output for Clear_preformance_with_index:

Setup took 20831
Test db keyspace: [db15, keys=20002,expires=20001,avg_ttl=816]
Read-Write started
Read-Write started
Read-Write started
Read-Write started
Clear starting
Clear finished, took 11
Read-Write took 474ms on region1
Read-Write took 472ms on region1
Read-Write took 471ms on region2
Read-Write took 475ms on region2
Read-Write ended
Read-Write ended
Read-Write ended
Read-Write ended

The multiplexer was already warmed up, so this round there was only 1 key that would expire while testing. Clear took 11ms (due to FireAndForget) but LOCKED up the ENTIRE Redis for 470+ms (all regions, all databases).

Play around with the parameters a bit and you will be able to get timeout exceptions instead of 400ms read-writes (for me it took 1.5m expired and 50k active keys).

This might not seem that bad, but here we were running at 80 ops per second (best case) with a rather small cache... we see 150k ops per second at peak load. The problem is that if Redis gets locked up, the sentinels start a failover and the current master gets demoted - meaning the slave will resume from the last known good state, which was prior to the clear.

So my point is that you are not adding an option that could potentially make ICache.Clear() unsafe - it already is very unsafe under load. (Under our load, calling ICache.Clear() results in ~500 requests delayed by 1 second (the failover timeout) against an average page load time of 50ms, ~500 error messages, and a lot of work figuring out what happened... not calling ICache.Clear() results in Redis DB growth of 1-2GB per day.)

TheCloudlessSky commented 8 years ago

About memory

Running your load tests

Our setup for running your tests was a local Windows build of Redis without any slaves. We found, as expected, that the non-index ICache.Clear() with SCAN was slower, but that makes sense because the SCAN is done on the master and not a slave. We added a slave and saw performance improvements.

I found some variables were inconsistent between tests (e.g. changing the timeout). I wrote the same tests from scratch to fully understand your load tests.

Our results were similar in that calling DEL on the set of active keys can pause other connections for a few hundred milliseconds when it contains lots of values. This makes sense because, as of Redis 3.2, it uses synchronous deletes. Coming in Redis 3.4 will be the UNLINK command which will enable asynchronous deletes. Eventually, we'll be able to include this in our current implementation depending on what your server supports.
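
As a purely hypothetical sketch (not this library's code), a client could pick UNLINK over DEL once the server supports it; the version cut-off and raw Execute call below are assumptions:

using System;
using StackExchange.Redis;

public static class KeyRemoval
{
    public static void Delete(IDatabase db, IServer server, RedisKey key)
    {
        // Hypothetical cut-off; adjust to whichever Redis release actually ships UNLINK.
        if (server.Version >= new Version(4, 0))
        {
            // UNLINK removes the key immediately but frees its memory asynchronously,
            // so deleting a huge set of active keys no longer pauses other connections.
            db.Execute("UNLINK", key.ToString());
        }
        else
        {
            // Fall back to a regular (synchronous) delete.
            db.KeyDelete(key, CommandFlags.FireAndForget);
        }
    }
}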

Safety (at least strictly for this library) is defined in terms of having atomic operations. We need ICache.Clear() to be atomic for the general use case of this library. I don't think it's a good idea to have mixed safety (even as a config option, because it duplicates the complexity of this library). Our current implementation is slower, but guarantees that ICache.Clear() is atomic. I'd really like to keep this library as simple as possible so that everyone is free to customize it to their use cases. You've done this perfectly: having lots of keys showed you need to take the library in a different direction and can sacrifice atomicity for performance.

I also wanted to note that having a separate Prune() that does a SSCAN on the set of active keys and asynchronously removes expired keys from the set will definitely improve performance of ICache.Clear() since it doesn't have to free up as much memory.

If forking this library is a pain, I've also thought about introducing something like an IRedisCacheStrategy where you'd be able to customize exactly how the values are stored/retrieved/cleared in Redis. It would take some thought to be generalized, but I'm open to adding that to reduce the pains (compilation/maintenance) of your own fork.
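
Nothing is designed yet, but as a rough sketch of what such an extension point might look like (all member names below are hypothetical):

using System;
using StackExchange.Redis;

public interface IRedisCacheStrategy
{
    // How a single value is written, including any bookkeeping
    // (e.g. the set of active keys, or none at all for a SCAN-based strategy).
    void Put(IDatabase db, RedisKey key, RedisValue value, TimeSpan expiration);

    // How a single value is read back.
    RedisValue Get(IDatabase db, RedisKey key);

    // How an entire region is cleared: delete the set of active keys,
    // SCAN and delete, bump a generation, FLUSHDB, etc.
    void Clear(IDatabase db, string regionName);
}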

nomaddamon commented 8 years ago

Thanks again for the detailed answer with a lot of insight. We have switched to your JSON.NET serializer and saw a ~30% memory gain for the Redis cluster :)

With some work you could make it the default serializer and offer the NetDataContractSerializer as an alternative. It has some issues at the moment, but nothing major - I've noted them in comments on that file (they are also the reason we didn't start with that serializer).

We are fine with maintaining our own fork until (and if) we can make this project support both current (guaranteed atomicity) and our (high load, many keys, almost no Clear()) scenarios.

Some thoughts on alternative strategies:

TheCloudlessSky commented 8 years ago

I'm not sure why, but GitHub doesn't send email notifications for Gists so I'm glad you mentioned it here. As I'm sure you're aware, I responded.

I've also merged in #22 (and cleaned it up to fit our code base). Thanks very much for your contribution!