freelawproject / courtlistener

A fully-searchable and accessible archive of court data including growing repositories of opinions, oral arguments, judges, judicial financial records, and federal filings.
https://www.courtlistener.com

Why is our bandwidth allowance exceeded on ElastiCache #2155

Closed: mlissner closed this 2 years ago

mlissner commented 2 years ago

I saw some suspicious metrics in our ElastiCache instance today and wrote them up on Server Fault. I'm not sure what's going on, but it doesn't look great and might be part of why we keep getting various connection failures:

https://serverfault.com/questions/1105308/elasticache-bandwidth-usage-is-low-but-bandwidth-allowance-exceeded
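
If it helps anyone reproduce the graphs in that question, the counters in play are the ElastiCache host-level metrics NetworkBandwidthInAllowanceExceeded and NetworkBandwidthOutAllowanceExceeded. Here's a rough boto3 sketch for pulling them over time (the cluster id is a placeholder, not our real one):

```python
# Rough sketch: pull the allowance-exceeded counters from CloudWatch so we can
# watch them over time. Assumes boto3 credentials; the cluster id is a placeholder.
from datetime import datetime, timedelta, timezone

import boto3

cloudwatch = boto3.client("cloudwatch")
now = datetime.now(timezone.utc)

for metric in ("NetworkBandwidthInAllowanceExceeded", "NetworkBandwidthOutAllowanceExceeded"):
    stats = cloudwatch.get_metric_statistics(
        Namespace="AWS/ElastiCache",
        MetricName=metric,
        Dimensions=[{"Name": "CacheClusterId", "Value": "cl-redis-001"}],  # placeholder id
        StartTime=now - timedelta(hours=6),
        EndTime=now,
        Period=300,  # 5-minute buckets
        Statistics=["Maximum"],
    )
    for point in sorted(stats["Datapoints"], key=lambda p: p["Timestamp"]):
        print(metric, point["Timestamp"], point["Maximum"])
```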

mlissner commented 2 years ago

I put this question over on AWS's forum too and got a good response: https://repost.aws/questions/QUv105xDmfQMGUiBbfeYW-iQ/elasticache-shows-network-in-and-out-as-exceeded-but-how

It sort of feels like, yes, maybe we are using a lot of bandwidth, but I'm still wondering why 500 Mbps counts as a lot when the instance is supposed to go "Up to 5 Gbps." I added that as a comment on the forum too; we'll see if there's a response.

A couple other observations:

  1. I increased the size of our Redis cache to see if that helps us avoid this problem. It'll be twice the price, so hopefully it pays off; we'll know pretty quickly either way. Both the old size and the new one promise the same "Up to 5 Gbps" speed, but there's some system of baseline network credits that remains a mystery. I suspect it's tied to the CPU count, so a bump may help. We'll see.
  2. It's not entirely clear that this is actually a problem, because I haven't tied it to any specific failures, but since we're not using Redis purely as a cache, I suspect it is. It could be that the failures are only being seen by users, which is, eh, OK, but it could also be causing problems elsewhere, like in Celery.
  3. Our traffic (as you can see in the linked questions) is very spiky on the hour and on the half hour. I don't know what's causing those spikes, but if we can smooth them out, we'd cut our peak network usage and could probably go back to a smaller cache (with the cheaper price). One idea for smoothing is sketched below.
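
If the spikes do turn out to be clock-aligned cache fills or scheduled jobs, one cheap mitigation is a little random jitter on the timeouts so everything doesn't expire and refill at the same instant. A rough sketch of the idea using Django's cache API (the key and values here are made up, not real CourtListener keys):

```python
# Rough sketch: randomize cache timeouts slightly so keys written together
# don't all expire (and get recomputed) at the same moment.
import random

from django.core.cache import cache

BASE_TIMEOUT = 30 * 60  # 30 minutes, matching the suspected TTL


def set_with_jitter(key, value, base_timeout=BASE_TIMEOUT, jitter=0.1):
    """Set a cache key with a timeout randomized by +/- 10% of the base."""
    spread = int(base_timeout * jitter)
    cache.set(key, value, base_timeout + random.randint(-spread, spread))


# Hypothetical usage; "homepage-stats" is a placeholder key.
set_with_jitter("homepage-stats", {"opinion_count": 123})
```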

Anyway, let's monitor to see if this helps.

mlissner commented 2 years ago

Yeah, scaling up didn't help at all. The next best step is probably to figure out what's causing these spikes. I looked around yesterday and couldn't find anything that looked particularly suspicious. It's got to be some sort of cron job, though, because it's always on the hour and the half hour. A cache with a 30-minute TTL wouldn't do that, since its expirations follow a timer rather than the clock.
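
One way to catch the culprit would be to sample the Redis command stream for a minute right at the top of the hour and tally what dominates. A rough sketch with redis-py (the host is a placeholder, and MONITOR adds load of its own, so the window should stay short):

```python
# Rough sketch: sample MONITOR for ~60 seconds around the top of the hour and
# count which commands/key prefixes dominate. MONITOR is expensive; keep it brief.
import time
from collections import Counter

import redis

r = redis.Redis(host="our-redis-host", port=6379)  # placeholder host
counts = Counter()
deadline = time.monotonic() + 60  # one-minute sample window

with r.monitor() as m:
    for event in m.listen():
        # event["command"] is the raw command, e.g. "GET some:key" or "SETEX some:key 1800 ..."
        parts = event["command"].split()
        verb = parts[0].upper()
        prefix = parts[1].split(":")[0] if len(parts) > 1 else ""
        counts[(verb, prefix)] += 1
        if time.monotonic() > deadline:
            break

for (verb, prefix), n in counts.most_common(20):
    print(f"{n:8d}  {verb:10s}  {prefix}")
```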

mlissner commented 2 years ago

The folks at Datamatics are going to try to get an AWS pro to help diagnose this.

mlissner commented 2 years ago

I'm going to close this for now. We still get spikes in traffic, but I don't think they're causing any issues that we aren't recovering from gracefully.

One last thing to note is that if we want to monitor our traffic in order to figure out what's causing the spikes, AWS does have an option for this: https://aws.amazon.com/blogs/aws/new-vpc-traffic-mirroring/

That option would also be useful for security, as in https://github.com/freelawproject/courtlistener/issues/1586
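
For future reference, a mirroring session is only a handful of API calls. A rough boto3 sketch follows; every id below is a placeholder, and I believe we'd have to mirror one of our own instances' ENIs (e.g. a web or Celery node talking to Redis) rather than the ElastiCache node itself, since mirroring only attaches to ENIs we control:

```python
# Rough sketch of setting up VPC Traffic Mirroring with boto3. All ids are
# placeholders; the target ENI is wherever the capture tooling (tcpdump,
# Suricata, etc.) runs, and the source ENI is the instance whose traffic we
# want to inspect.
import boto3

ec2 = boto3.client("ec2")

# Where the mirrored packets get delivered.
target = ec2.create_traffic_mirror_target(
    NetworkInterfaceId="eni-0aaaaaaaaaaaaaaaa",  # placeholder: capture box ENI
    Description="Capture target for diagnosing Redis traffic spikes",
)

# Only mirror Redis traffic (TCP 6379) to keep the mirrored volume down.
mirror_filter = ec2.create_traffic_mirror_filter(Description="Redis traffic only")
filter_id = mirror_filter["TrafficMirrorFilter"]["TrafficMirrorFilterId"]

ec2.create_traffic_mirror_filter_rule(
    TrafficMirrorFilterId=filter_id,
    TrafficDirection="egress",
    RuleNumber=100,
    RuleAction="accept",
    Protocol=6,  # TCP
    DestinationPortRange={"FromPort": 6379, "ToPort": 6379},
    SourceCidrBlock="0.0.0.0/0",
    DestinationCidrBlock="0.0.0.0/0",
)

# Tie it together: mirror the source ENI's traffic to the target, filtered.
ec2.create_traffic_mirror_session(
    NetworkInterfaceId="eni-0bbbbbbbbbbbbbbbb",  # placeholder: source ENI to mirror
    TrafficMirrorTargetId=target["TrafficMirrorTarget"]["TrafficMirrorTargetId"],
    TrafficMirrorFilterId=filter_id,
    SessionNumber=1,
)
```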