aws / amazon-ecs-agent

Amazon Elastic Container Service Agent
http://aws.amazon.com/ecs/
Apache License 2.0

ecs-agent logging - fills up root volume rapidly in certain error cases #504

Closed: vhartikainen closed this issue 8 years ago

vhartikainen commented 8 years ago

We had our ecs-agent spewing out these errors last night:

{"log":"2016-08-22T02:54:32Z [WARN] Error retrieving stats for container cf0a0c6a480e7d2c6d8eea252b50a9224327786e785e4a72f573ae826cfcdd5e: dial unix /var/run/docker.sock: socket: too many open files\n","stream":"stdout","time":"2016-08-22T02:54:32.600266262Z"}

After a couple of hours our root volume was filled with logs.

In postmortem analysis (after mounting the volumes on a responsive EC2 instance), it seems it was not only the json-file container log (found under /var/lib/docker/containers/685b44ac8601c26c3777962cfe4f1715b7c0bcc46ab6a16cbc3970c7f1236679/685b44ac8601c26c3777962cfe4f1715b7c0bcc46ab6a16cbc3970c7f1236679-json.log) that had grown to ~4 GB; the same logs are also duplicated to /var/log/ecs/ecs-agent.log by default.

What is really bad is that neither of the log files is rotated by size, so they rapidly fill up the disk during such an outbreak of errors.

It seems seelog is configured to rotate the file hourly with max rotations set to 24, but that doesn't keep up with the rapid pace of logging we saw last night: https://github.com/aws/amazon-ecs-agent/blob/304495bc6defdcc98f15a97b271c41127dba5e48/agent/logger/seelog_config.go#L20-L23
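For comparison, seelog also supports size-based rolling; something along these lines (values here are purely illustrative, not what the agent actually ships with) would cap the file size no matter how fast errors are written:

```xml
<!-- Illustrative sketch only; not the agent's actual configuration -->
<seelog type="asyncloop">
  <outputs formatid="main">
    <!-- roll when the file reaches maxsize bytes, keep at most maxrolls old files -->
    <rollingfile filename="/var/log/ecs/ecs-agent.log" type="size"
                 maxsize="26214400" maxrolls="24" />
  </outputs>
  <formats>
    <format id="main" format="%Date(2006-01-02T15:04:05Z07:00) [%LEVEL] %Msg%n" />
  </formats>
</seelog>
```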

Why the ASG didn't just terminate the instance and launch a new one is another question that remains a mystery to me...

Improvement suggestion: rotate the agent log by size as well (or cap its total size), so that a burst of errors cannot fill the root volume.

samuelkarp commented 8 years ago

Closing as a duplicate of https://github.com/aws/amazon-ecs-agent/issues/488, which is tracking the fix.

vhartikainen commented 8 years ago

@samuelkarp Oh... it seems there is already a fix for the root cause, thanks :) I didn't see it yesterday.

But for the logging... are you sure the disks won't eventually be filled with ecs-agent logs, given that the agent keeps running long enough? To me the log management / rotation doesn't seem bulletproof.

samuelkarp commented 8 years ago

@vhartikainen Are you still having problems with excessive logging/log size after updating to 1.12.1? The rotation settings are defined here.

vhartikainen commented 8 years ago

@samuelkarp We're updating our clusters at the moment, so I don't yet have experience with it. I found that section of code, but what is your suggestion? Should I build my own version with modified logging settings? I have to admit I don't do much development work these days, but that looks like a hardcoded setting to me.

I haven't tried it, but should it be possible to rotate that json-file output by putting the proper options under /etc/sysconfig/docker?
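Something along these lines is what I have in mind (untested, and the size/count limits are just examples):

```sh
# /etc/sysconfig/docker -- untested sketch; values are only examples
# Add default json-file rotation options to the Docker daemon.
OPTIONS="${OPTIONS} --log-driver=json-file --log-opt max-size=50m --log-opt max-file=4"
```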

For the convenience of using ECS, I think it should somehow be possible to pass these settings to the ecs-agent / Docker daemon without SSHing into the ECS instance or pushing the settings via CloudFormation UserData or the like.

samuelkarp commented 8 years ago

These settings have seemed to work well for most cases, but we're open to changing things if they ultimately are problematic for you. Except for bugs that we've attempted to fix, the logs generally shouldn't be overwhelmingly chatty.

> I haven't tried it, but should it be possible to rotate that json-file output by putting the proper options under /etc/sysconfig/docker?

Yes, you should be able to control the default log driver/log options applied to all containers by editing the options set on the Docker daemon. You can apply these settings with user data, but it's a little tricky since Docker starts before the user data executes. See https://github.com/aws/amazon-ecs-agent/issues/336#issuecomment-198026978 for an example using a #cloud-boothook, or https://github.com/aws/amazon-ecs-agent/issues/464#issuecomment-237093726 for an example of a #cloud-boothook combined with a regular script using MIME multi-part user data.
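For illustration, a minimal #cloud-boothook along these lines (the flag values are placeholders, not a recommendation) would append default rotation options before Docker starts:

```bash
#cloud-boothook
# Sketch only: add default json-file rotation options to the Docker daemon's
# options before Docker starts. Guard against appending on every boot.
grep -q 'max-size' /etc/sysconfig/docker || \
  echo 'OPTIONS="${OPTIONS} --log-opt max-size=50m --log-opt max-file=4"' >> /etc/sysconfig/docker
```

With --log-opt max-size and --log-opt max-file set on the daemon, Docker itself rotates each container's json-file log.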