Closing as a duplicate of https://github.com/aws/amazon-ecs-agent/issues/488, which is tracking the fix.
@samuelkarp oh... it seems there is already a fix for the root cause, thanks :) I didn't see it yesterday.
But about the logging... Are you sure the disks won't eventually fill up with ecs-agent logs if the agent keeps running long enough? To me the log management / rotation doesn't seem bulletproof.
@vhartikainen Are you still having problems with excessive logging/log size after updating to 1.12.1? The rotation settings are defined here.
@samuelkarp We're updating our clusters at the moment, so I don't yet have experience with it. I found that section of code, but what is your suggestion? Build my own version with modified logging settings? I have to admit that I don't do much development work these days, but to me it looks like a hardcoded setting there:
I haven't tried it, but shouldn't it be possible to rotate that json-file output by putting the proper options in /etc/sysconfig/docker?
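For reference, such a daemon-level default might look like the following on the ECS-optimized AMI (a minimal sketch; the 50m / 5 values are illustrative assumptions, not recommendations):

```
# /etc/sysconfig/docker
# Default json-file log options applied to every container the daemon starts
# (appended to the existing OPTIONS line on the ECS-optimized AMI):
OPTIONS="${OPTIONS} --log-opt max-size=50m --log-opt max-file=5"
```

The Docker daemon has to be restarted for a change like this to take effect, and it only applies to containers created afterwards.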
For convenience when using ECS, I think it should be possible to pass these settings to the ecs-agent / Docker daemon without SSHing into the ECS instance or pushing them via CloudFormation UserData or similar.
These settings have seemed to work well for most cases, but we're open to changing things if they ultimately are problematic for you. Except for bugs that we've attempted to fix, the logs generally shouldn't be overwhelmingly chatty.
> I haven't tried it, but shouldn't it be possible to rotate that json-file output by putting the proper options in /etc/sysconfig/docker?
Yes, you should be able to control the default log driver/log options applied to all containers by editing the options set on the Docker daemon. You can apply these settings with user data, but it's a little tricky since Docker starts before the user data executes. See https://github.com/aws/amazon-ecs-agent/issues/336#issuecomment-198026978 for an example using a #cloud-boothook, or https://github.com/aws/amazon-ecs-agent/issues/464#issuecomment-237093726 for an example of a #cloud-boothook combined with a regular script using MIME multi-part.
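A minimal sketch of that #cloud-boothook approach, modeled on the pattern in the linked comments (the option values and the docker_log_opts name are assumptions): the boothook runs before Docker starts, and cloud-init-per ensures the line is appended only once per instance even though boothooks run on every boot.

```
#cloud-boothook
# Append default json-file rotation options to the Docker daemon config before
# Docker starts; "once" prevents appending again on subsequent boots.
cloud-init-per once docker_log_opts echo 'OPTIONS="${OPTIONS} --log-opt max-size=50m --log-opt max-file=5"' >> /etc/sysconfig/docker
```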
We had our ecs-agent spewing out these errors last night:
{"log":"2016-08-22T02:54:32Z [WARN] Error retrieving stats for container cf0a0c6a480e7d2c6d8eea252b50a9224327786e785e4a72f573ae826cfcdd5e: dial unix /var/run/docker.sock: socket: too many open files\n","stream":"stdout","time":"2016-08-22T02:54:32.600266262Z"}
After a couple of hours our root volume was filled with logs.
In postmortem analysis (after mounting the volumes on a responsive EC2 instance), it seems it's not only the json-file container log (found under /var/lib/docker/containers/685b44ac8601c26c3777962cfe4f1715b7c0bcc46ab6a16cbc3970c7f1236679/685b44ac8601c26c3777962cfe4f1715b7c0bcc46ab6a16cbc3970c7f1236679-json.log) that was ~4 GB in size; the logs are also duplicated to /var/log/ecs/ecs-agent.log by default.
What is really bad is that neither of the log files is rotated by size, so they rapidly fill up the disk during such an outbreak of errors.
It seems seelog is configured to rotate the file hourly with the maximum number of rotations set to 24, but that really doesn't keep up with the rapid pace of logging we saw last night: https://github.com/aws/amazon-ecs-agent/blob/304495bc6defdcc98f15a97b271c41127dba5e48/agent/logger/seelog_config.go#L20-L23
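Since neither the hourly date-based rotation nor the 24-roll cap limits how large any single hour's file can grow, a rough back-of-the-envelope (the per-line size and rate below are assumptions, estimated from the sample line earlier in the thread) shows how quickly a burst like this fills a disk:

```
# ~250 bytes per line at a sustained 2,000 lines/second for one hour:
$ echo "$((250 * 2000 * 3600 / 1024 / 1024)) MB"
1716 MB
```

And because the same lines are also written to the container's json-file log, the on-disk footprint is roughly double that.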
Why the ASG didn't just terminate the instance and launch a new one is another question that remains a mystery to me...
Improvement suggestion: