Closed ErinWeisbart closed 1 month ago
The current metric addition was quick and dirty, so we could definitely come up with something better. I'm sure it's just a matter of googling the right SO posts.
A warning about memory, though: hopefully, most of the time memory issues aren't misconfiguration issues, but when they ARE, our current workflow can't detect it. So let's be thoughtful about how we do or don't describe the amount of "available" memory (link below (Broad only)).
We could also explore whether we want to do the actual agent installation as part of DCP - I doubt it, but if it's optional maybe not a terrible idea
If your Docker containers run out of memory, jobs fail silently. It's annoying. Our per-instance logs do regularly print instance metrics that include memory-in-use and memory-available values. However, parsing them is annoying.
It would be nice if we could add a regular, human-readable print statement to the logs that reports memory metrics, so that one could more easily determine whether memory issues are bonking jobs by browsing the logs. Perhaps also include WARNING in the statement if usage is above a certain threshold, so that a CloudWatch dashboard widget could easily report it?
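A rough sketch of what such a log line could look like, in Python using only the standard library. This is not the existing DCP code: the function names, the 90% threshold, and the message format are all made up for illustration, and it assumes a Linux host where `/proc/meminfo` is readable. Note that inside a container `/proc/meminfo` typically reports the host's memory rather than the container's own limit, which is exactly the kind of "available" memory caveat raised above.

```python
WARN_THRESHOLD = 0.9  # hypothetical: fraction of memory in use that triggers WARNING


def parse_meminfo(text):
    """Return {field: kibibytes} parsed from /proc/meminfo-style text."""
    values = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        parts = rest.split()
        if parts and parts[0].isdigit():
            values[name.strip()] = int(parts[0])
    return values


def memory_report(meminfo_text, threshold=WARN_THRESHOLD):
    """Build a one-line, human-readable memory summary.

    The line is prefixed with WARNING when the used fraction reaches the
    threshold, so a log filter can match on that word.
    """
    info = parse_meminfo(meminfo_text)
    total_kb = info["MemTotal"]
    available_kb = info["MemAvailable"]
    used_fraction = 1 - available_kb / total_kb
    used_gib = (total_kb - available_kb) / 1024**2
    total_gib = total_kb / 1024**2
    prefix = "WARNING " if used_fraction >= threshold else ""
    return (f"{prefix}memory used: {used_fraction:.0%} "
            f"({used_gib:.1f} GiB of {total_gib:.1f} GiB)")


def log_memory(path="/proc/meminfo"):
    """Print the report for the current machine (Linux only)."""
    with open(path) as handle:
        print(memory_report(handle.read()))
```

Calling `log_memory()` on whatever interval the worker already logs at would put one readable line per interval into the per-instance logs. A log filter matching the word WARNING could then feed a dashboard widget, though the exact filter setup is left open here.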