aws / amazon-cloudwatch-agent

CloudWatch Agent enables you to collect and export host-level metrics and logs on instances running Linux or Windows server.
MIT License

Agent exits with code 1 instead of panic when configuration validation phase fails #859

Closed rawahars closed 3 months ago

rawahars commented 1 year ago

Describe the bug

We install the Amazon CloudWatch agent on Windows hosts as specified in the Amazon CloudWatch docs, using the following command:

& "C:\Program Files\Amazon\AmazonCloudWatchAgent\amazon-cloudwatch-agent-ctl.ps1" -a fetch-config -m ec2 -s -c file:configuration-file-path

This script registers the CloudWatch agent as a Windows service. Ideally, whenever the agent crashes, Windows Service Manager (WSM) should restart it. We assume that was the original intention, and it works if the agent actually does crash.

In our use case, we run the agent on an EC2 instance, with the region being used in the config for the agent. However, when the instance boots up, IMDS is unavailable for a few reasons. This causes the agent to assume that it is running in an on-premises environment. As the logs below show, configuration validation then fails because region info is missing for onPrem mode, and the agent exits with code 1.

Since the agent stops with code 1, WSM assumes that the application stopped on its own and therefore never restarts it. We think the correct action would be for the agent to exit with a panic whenever there is a non-recoverable failure.

The logs we see are:

2023-09-21T05:56:23Z D! cloudwatch: publish routine receives the shutdown signal, exiting.
2023/09/21 16:08:10 I! D! [EC2] Found active network interface
E! [EC2] Cannot get EC2 Metadata from IMDS: EC2 metadata is not available.
I! Detected the instance is OnPremise
2023/09/21 16:08:10 Reading json config file path: C:\ProgramData\Amazon\AmazonCloudWatchAgent\\amazon-cloudwatch-agent.json ...
C:\ProgramData\Amazon\AmazonCloudWatchAgent\\amazon-cloudwatch-agent.json does not exist or cannot read. Skipping it.
2023/09/21 16:08:10 Reading json config file path: C:\ProgramData\Amazon\AmazonCloudWatchAgent\Configs\file_config.json ...
2023/09/21 16:08:10 I! Valid Json input schema.
Got Home directory: C:\Users\Administrator
I! Set home dir windows: C:\Users\Administrator
I! SDKRegionWithCredsMap region:  
Got Home directory: C:\Users\Administrator
2023/09/21 16:08:10 E! Failed to generate TOML configuration validation content: [Under path : /agent/ruleRegion/ | Error : Region info is missing for mode: onPrem]
2023/09/21 16:08:10 E! Failed to generate TOML configuration validation content: [Under path : /agent/ruleRegion/ | Error : Region info is missing for mode: onPrem]
2023/09/21 16:08:10 Under path : /agent/ruleRegion/ | Error : Region info is missing for mode: onPrem
2023/09/21 16:08:10 Configuration validation first phase failed. Agent version: 1.0. Verify the JSON input is only using features supported by this version.

2023/09/21 16:08:10 I! Return exit error: exit code=1
2023/09/21 16:08:10 E! Cannot translate JSON, ERROR is exit status 1 
2023/09/21 16:09:21 I! D! [EC2] Found active network interface
E! [EC2] Cannot get EC2 Metadata from IMDS: EC2 metadata is not available.
I! Detected the instance is OnPremise
2023/09/21 16:09:21 Reading json config file path: C:\ProgramData\Amazon\AmazonCloudWatchAgent\\amazon-cloudwatch-agent.json ...
C:\ProgramData\Amazon\AmazonCloudWatchAgent\\amazon-cloudwatch-agent.json does not exist or cannot read. Skipping it.
2023/09/21 16:09:21 Reading json config file path: C:\ProgramData\Amazon\AmazonCloudWatchAgent\Configs\file_config.json ...
2023/09/21 16:09:21 I! Valid Json input schema.
Got Home directory: C:\Users\Administrator
I! Set home dir windows: C:\Users\Administrator
I! SDKRegionWithCredsMap region:  
Got Home directory: C:\Users\Administrator
2023/09/21 16:09:21 E! Failed to generate TOML configuration validation content: [Under path : /agent/ruleRegion/ | Error : Region info is missing for mode: onPrem]
2023/09/21 16:09:21 E! Failed to generate TOML configuration validation content: [Under path : /agent/ruleRegion/ | Error : Region info is missing for mode: onPrem]
2023/09/21 16:09:21 Under path : /agent/ruleRegion/ | Error : Region info is missing for mode: onPrem
2023/09/21 16:09:21 Configuration validation first phase failed. Agent version: 1.0. Verify the JSON input is only using features supported by this version.

2023/09/21 16:09:21 I! Return exit error: exit code=1
2023/09/21 16:09:21 E! Cannot translate JSON, ERROR is exit status 1 

Steps to reproduce

What did you expect to see?

We expected that Windows Service Manager would try to restart the CloudWatch Agent service.

What did you see instead?

We saw in the CloudWatch Agent logs that the agent never restarted.

What version did you use? Version:

What config did you use?

{
  "agent": {
    "debug": true
  },
  "logs": {
    "logs_collected": {
      "files": {
        "collect_list": [
          {
            "file_path": "C:\\ProgramData\\containerd\\root\\panic.log*",
            "log_group_name": "containerd",
            "log_stream_name": "{instance_id}/containerd-daemon-panic",
            "timezone": "UTC"
          }
        ]
      }
    }
  },
  "metrics": {
    "namespace": "Default",
    "append_dimensions": {
      "ImageId": "${aws:ImageId}",
      "InstanceId": "${aws:InstanceId}"
    },
    "aggregation_dimensions": [
      [
        "InstanceId"
      ],
      []
    ],
    "metrics_collected": {
      "LogicalDisk": {
        "measurement": [
          {
            "name": "% Free Space",
            "unit": "Percent"
          }
        ],
        "resources": [
          "/",
          "C:\\ProgramData\\containerd"
        ]
      },
      "Memory": {
        "measurement": [
          {
            "name": "Available MBytes",
            "unit": "Megabytes"
          }
        ]
      },
      "statsd": {
        "metrics_aggregation_interval": 30,
        "metrics_collection_interval": 10,
        "service_address": ":8125"
      },
      "procstat": [
        {
          "exe": "containerd",
          "measurement": [
            "cpu_usage",
            "memory_rss"
          ]
        }
      ]
    }
  }
}

Environment

OS: Windows Server 2019 and Windows Server 2022

jefchien commented 1 year ago

Hi @rawahars,

Thanks for reporting this issue. One workaround for the delayed IMDS availability at startup is to set the newly available imds_retries section (see https://github.com/aws/amazon-cloudwatch-agent/issues/803#issuecomment-1749342400), which can potentially allow the agent to retry during startup until IMDS is up.

Changing the translator to panic instead of exiting with an exit code of 1 is a behavior change that can potentially impact existing customers in unexpected ways.

github-actions[bot] commented 7 months ago

This issue was marked stale due to lack of activity.

github-actions[bot] commented 3 months ago

Closing this because it has stalled. Feel free to reopen if this issue is still relevant, or to ping the collaborator who labeled it stalled if you have any questions.