dgraph-io / ristretto

A high performance memory-bound Go cache
https://dgraph.io/blog/post/introducing-ristretto-high-perf-go-cache/
Apache License 2.0

[BUG]: Expiration map grows without limit #329

Closed: nemorales closed this issue 1 month ago

nemorales commented 1 year ago

What version of Ristretto are you using?

v0.1.1

What version of Go are you using?

1.20

Have you tried reproducing the issue with the latest release?

Yes

What is the hardware spec (RAM, CPU, OS)?

16 GB RAM, Apple M1 Pro, macOS Monterey 12.5

What steps will reproduce the bug?

We encountered problems with memory usage in production, so I wrote a local application and stressed it. Configuration of ristretto:

```go
ristretto.Config{
	NumCounters: 1000000,
	MaxCost:     int64(100 * bytesInMb),
	BufferItems: 64,
	Metrics:     true,
}
```

We used a TTL of an hour (3600 * time.Second) and stressed the application to generate 1k items every 10 ms; each item uses approximately 200 bytes.
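Putting the configuration above together with the stress loop, the stresser looks roughly like this (a simplified sketch; the key scheme, payload size, and `bytesInMb` are approximations, not the exact production code):

```go
package main

import (
	"time"

	"github.com/dgraph-io/ristretto"
)

const bytesInMb = 1 << 20

func main() {
	cache, err := ristretto.NewCache(&ristretto.Config{
		NumCounters: 1000000,
		MaxCost:     int64(100 * bytesInMb),
		BufferItems: 64,
		Metrics:     true,
	})
	if err != nil {
		panic(err)
	}

	ttl := 3600 * time.Second
	ticker := time.NewTicker(10 * time.Millisecond)
	defer ticker.Stop()

	for i := 0; ; i++ {
		<-ticker.C
		// 1k items per 10 ms tick, ~200 bytes each.
		for j := 0; j < 1000; j++ {
			cache.SetWithTTL(i*1000+j, make([]byte, 200), 200, ttl)
		}
	}
}
```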

After reaching the eviction point, we left the stresser running for 10 more minutes. Analyzing in-use space with pprof we see this:

[pprof screenshot: in-use space profile]

Expected behavior and actual result.

I would expect a key added to the expiration map to be removed once that same key gets evicted.

What seems to happen is that the key is not removed from the expiration map after eviction, so with a large enough TTL ristretto's memory usage knows no limit.

Additional information

No response

Woutifier commented 1 year ago

Hey @nemorales, it seems we are encountering this as well. Did you run a garbage collection prior to running pprof? (I'm not 100% sure that it happens automatically.)

Looking at the code, there is a chance for the expiration bucket cleanup to be skipped under load. A single goroutine is responsible for a number of things, among them cleaning up the "just expired" bucket. The ticker's channel has a buffer of one, so if the cleanup is not happening fast enough (e.g. due to some kind of lock contention, or a heavy OnEvict function), ticks get dropped. This can leave a bucket uncleaned, and it will remain in place until the parent cache itself is GC'ed.
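To make the dropped-tick mechanism concrete, here is a small standalone demo (illustration only, not ristretto's actual code): a time.Ticker's channel buffers at most one pending tick, so a slow consumer silently misses ticks, the same way a slow cleanup pass can miss expiration buckets.

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	ticker := time.NewTicker(10 * time.Millisecond)
	defer ticker.Stop()

	start := time.Now()
	received := 0
	for time.Since(start) < time.Second {
		<-ticker.C
		received++
		// Simulate a slow cleanup step (e.g. a heavy OnEvict callback).
		time.Sleep(25 * time.Millisecond)
	}
	// A 10ms ticker fires ~100 times per second, but the slow consumer
	// only receives around 40 of them; the rest are silently dropped,
	// just as expiration buckets can be skipped and never cleaned up.
	fmt.Println("ticks received:", received)
}
```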

Unrelated (in terms of cause), but roughly in the same code: clearing the cache also doesn't currently clear the expiryMap, which seems to be an oversight, as it makes no sense to keep it around once the items themselves have been cleared.

Reproducing it is fairly easy: just add a time.Sleep of some duration (I used 1 millisecond) as the OnEvict function, which makes the bucket cleanup too slow to keep up under any kind of load and triggers the behavior.
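For example (a sketch only; `cfg` stands for a *ristretto.Config like the one in the original report, and the Item-based OnEvict signature is the v0.1.x one):

```go
cfg.OnEvict = func(item *ristretto.Item) {
	// Deliberately slow eviction handling so the single maintenance
	// goroutine can't keep up with expiration bucket cleanup.
	time.Sleep(time.Millisecond)
}
```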

gramidt commented 1 year ago

@nemorales @Woutifier - Have you continued to experience this behavior? I'm seeing something similar and was curious if either of you found a solution.

Woutifier commented 1 year ago

I'm thinking of these two changes: https://github.com/Woutifier/ristretto/commits/wouter/clear-expiry

Since then, I've found another issue: if you don't set a TTL at all (just Set, not SetWithTTL), and the item you're setting is already in the cache, it will take the "Update" path. Update doesn't properly filter out items without expiration, and just creates a new "zero time" bucket and adds the item into that.
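A minimal trigger would be something like this (sketch only, assuming an already-constructed cache):

```go
cache.Set("k", "v1", 1) // initial insert, no TTL
cache.Wait()            // ensure the first Set has been applied
cache.Set("k", "v2", 1) // same key again: takes the update path, which files
                        // the item under a "zero time" expiration bucket even
                        // though it never expires
```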

Edit: added a fix for the new issue to the branch above too.

nemorales commented 1 year ago

@Woutifier that's an interesting finding. Now I understand why ristretto's memory usage behaves differently depending on the load: in my use case we decided not to use a TTL, but there are still cases in which memory usage grows without limit.

I'm taking a look at your commits, thanks for sharing what you learned!

mjoyce91 commented 1 year ago

Is there any new information on this issue?

Woutifier commented 11 months ago

I've opened a PR related to this issue: https://github.com/dgraph-io/ristretto/pull/358

makifbaysal commented 5 months ago

I have a similar issue with TTL. Memory usage is okay when I set the same key in the cache with a lower TTL (like 1 hour), but if I increase the TTL to 1 day, memory usage exceeds 1 GB.

github-actions[bot] commented 2 months ago

This issue has been stale for 60 days and will be closed automatically in 7 days. Comment to keep it open.