department-of-veterans-affairs / va.gov-cms

Editor-centered management for Veteran-centered content.
https://prod.cms.va.gov
GNU General Public License v2.0

Implement monitoring for Elasticache production instances #4928

Closed acrollet closed 3 years ago

acrollet commented 3 years ago

Acceptance Criteria

olivereri commented 3 years ago

| What to Monitor | AWS Metric for Memcached | Description |
| --- | --- | --- |
| memory utilization | UnusedMemory | The amount of memory not used by data. |
| network I/O | BytesReadIntoMemcached, BytesWrittenOutFromMemcached | |
| disk I/O (IOPS) | ? | There is no disk? |
| CPU utilization | CPUUtilization | |
| cache size | BytesUsedForCacheItems | |
| hit rate | CasHits / CasMisses | The number of Cas requests the cache has received where the requested key was found and the Cas value matched, divided by the number of Cas requests the cache has received where the key requested was not found. |

ElijahLynn commented 3 years ago

@mchelen We are removing "disk I/O" as an AC because it is not applicable: this is an in-memory cache. Sizing stays the same.

olivereri commented 3 years ago

We should consider following AWS's recommendation for monitoring Elasticache Metrics. We have CPU Utilization on the list already. https://docs.aws.amazon.com/AmazonElastiCache/latest/mem-ug/CacheMetrics.WhichShouldIMonitor.html


olivereri commented 3 years ago

I'm starting to think that memory utilization may not be an effective metric to track, since there doesn't appear to be an easy way to get and use the total memory for any given node on these AWS-managed devices. Evictions seem to be a better alternative. Evictions are unexpired items in the cache that are removed to make room for new items. An eviction count above whatever threshold we choose can indicate that we are running out of capacity.

The FreeableMemory metric, described as "The amount of free memory available on the host. This is derived from the RAM, buffers, and cache that the OS reports as freeable.", may help somewhat satisfy the memory utilization AC. As that number trends to zero we know we are running out of capacity, but it won't give us a clean utilization %.

Additionally, tracking the SwapUsage metric will uncover memory capacity issues, as Memcache will write to swap space on disk when it runs out of memory. While these don't look like the traditional memory utilization metrics that we are accustomed to for EC2, they should work well to highlight cache capacity issues.

I think these two metrics are more appropriate to keep an eye on the health of the cache rather than how we've described the AC.
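A minimal sketch of how the capacity signals discussed above (Evictions, FreeableMemory, SwapUsage) could be turned into warnings. The threshold values, the `CacheSample` type, and the `capacity_warnings` function are hypothetical and only for illustration; real thresholds would need to be tuned against production traffic.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical thresholds, for illustration only.
MAX_EVICTIONS_PER_PERIOD = 0
MIN_FREEABLE_MEMORY_BYTES = 100 * 1024 * 1024  # 100 MiB
MAX_SWAP_USAGE_BYTES = 50 * 1024 * 1024        # 50 MiB


@dataclass
class CacheSample:
    """One period of datapoints for a single cache node."""
    evictions: int        # Evictions: count over the period
    freeable_memory: int  # FreeableMemory: bytes
    swap_usage: int       # SwapUsage: bytes


def capacity_warnings(sample: CacheSample) -> List[str]:
    """Return a human-readable warning for each threshold that is crossed."""
    warnings = []
    if sample.evictions > MAX_EVICTIONS_PER_PERIOD:
        warnings.append(f"evictions={sample.evictions}: unexpired items are being removed")
    if sample.freeable_memory < MIN_FREEABLE_MEMORY_BYTES:
        warnings.append(f"freeable_memory={sample.freeable_memory}: host memory is running low")
    if sample.swap_usage > MAX_SWAP_USAGE_BYTES:
        warnings.append(f"swap_usage={sample.swap_usage}: cache is spilling to swap")
    return warnings
```

An empty list means none of the assumed thresholds were crossed for that period.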

Edit: There is a computed metric called Fill Percent, but it is computed using bytes and limit_maxbytes from the memcache service. Since Elasticache is a managed service (no server-level access), there's no way to get those metrics sent to Cloudwatch and then consumed by Grafana.
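For reference, a sketch of the Fill Percent calculation, assuming the `bytes` and `limit_maxbytes` values could be obtained as `STAT` lines from the memcached `stats` command. The function names and the sample text are hypothetical.

```python
from typing import Dict


def parse_stats(raw: str) -> Dict[str, str]:
    """Parse 'STAT <name> <value>' lines into a dict, ignoring other lines."""
    stats = {}
    for line in raw.splitlines():
        parts = line.split(" ", 2)
        if len(parts) == 3 and parts[0] == "STAT":
            stats[parts[1]] = parts[2]
    return stats


def fill_percent(stats: Dict[str, str]) -> float:
    """Fill Percent = bytes / limit_maxbytes, expressed as a percentage."""
    used = int(stats["bytes"])
    limit = int(stats["limit_maxbytes"])
    if limit <= 0:
        raise ValueError("limit_maxbytes must be positive")
    return 100.0 * used / limit


# Made-up sample: 16 MiB used out of a 64 MiB limit.
SAMPLE = "STAT bytes 16777216\r\nSTAT limit_maxbytes 67108864\r\nEND\r\n"
```

With the sample above, `fill_percent(parse_stats(SAMPLE))` evaluates to `25.0`.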

olivereri commented 3 years ago

Thinking more about the current ACs, they don't seem appropriate because they tell me what I should be monitoring. Instead, I think they should ask me to research what is appropriate to monitor to track the health of the cache, and then implement it.

olivereri commented 3 years ago

Hit rate can be computed more easily: "This is a calculated metric: get_hits / cmd_get. It indicates how efficient your Memcached server is."

https://blog.serverdensity.com/monitor-memcached/
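A small sketch of the quoted calculation, assuming the two counters are sampled over the same period. The `hit_rate` function is hypothetical; returning `None` when no get commands were issued is a design choice made here to avoid dividing by zero.

```python
from typing import Optional


def hit_rate(get_hits: int, cmd_get: int) -> Optional[float]:
    """Return get_hits / cmd_get as a fraction in [0, 1].

    Returns None when no get commands were issued in the period,
    because the ratio is undefined in that case.
    """
    if cmd_get < 0 or get_hits < 0:
        raise ValueError("counters must be non-negative")
    if cmd_get == 0:
        return None
    return get_hits / cmd_get
```

For example, 90 hits out of 100 get commands gives `hit_rate(90, 100) == 0.9`.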

mchelen-gov commented 3 years ago

reminder to add dashboard link to docs

olivereri commented 3 years ago

The dashboard link to add to the documentation (must be on the SOCKS proxy): http://grafana.vfs.va.gov/d/dxf8a-6Zz/cms-dashboard?orgId=1&refresh=5s The Memcache-specific panels are at the bottom of the page.

mchelen-gov commented 3 years ago

i added an AC about docs, sorry i didn't notice that earlier during grooming