elastic / integrations


Enabled Cloudwatch Logs input with a disabled stream causes all Cloudwatch Logs integrations to stop working after agent restart #9934

Closed keiransteele-phocas closed 2 months ago

keiransteele-phocas commented 4 months ago

This relates to support case #01619569; I have been able to reproduce the issue using the steps below.

Problem

We were having issues where all logs from Cloudwatch collected by agents running in AWS would stop at the same time. There were no errors in the logs or diagnostics to indicate the cause. When we disabled recently added integrations, the problem resolved itself. The disabled integrations were similar to the existing ones, apart from having the stream on the input disabled. Simply creating an integration with the stream disabled doesn't cause the issue; the agent also needs to be restarted afterwards.

Background

We have Elastic agents running in AWS Fargate, all on the same policy, dedicated to retrieving logs from AWS services. There are about 40 Cloudwatch Logs integrations due to the combination of AWS accounts and regions. We tried to improve the developer experience for adding integrations by providing a Terraform module that creates the integration using the elasticstack_fleet_integration_policy resource. The Terraform resource in the module turned out to be misconfigured, with the input enabled and the stream disabled; it's not possible to configure it like this in the Fleet UI, but it is possible via the API.

AWS Cloudwatch Integration Configurations

Terraform Resource

```hcl
input {
  enabled  = true
  input_id = "cloudwatch-aws-cloudwatch"
  streams_json = jsonencode({
    "aws.cloudwatch_logs" : {
      "enabled" : false,
      "vars" : {
        "start_position" : "beginning",
        "api_timeput" : "120s",
        "processors" : "",
        "scan_frequency" : "1m",
        "tags" : [
          "forwarded",
          "aws-cloudwatch-logs",
          "${local.account_name[0]}"
        ],
        "log_group_arn" : "arn:aws:logs:${data.aws_region.current.name}:${data.aws_caller_identity.current.account_id}:log-group:${var.log_group_path}/${var.log_group_name}:*",
        "api_sleep" : "200ms",
        "data_stream.dataset" : "aws_${local.log_type[0]}_${local.environment_type}",
        "log_streams" : [],
        "preserve_original_event" : false
      }
    }
  })
}
```
Elastic Console API Request (truncated)

```json
POST kbn:/api/fleet/package_policies
{
  "policy_id": "",
  "package": {
    "name": "aws",
    "version": "2.15.2",
    "experimental_data_stream_features": []
  },
  "name": "cloudwatch-log-test-failure",
  "description": "",
  "namespace": "default",
  "inputs": {
    "cloudwatch-aws-cloudwatch": {
      "enabled": true,
      "streams": {
        "aws.cloudwatch_logs": {
          "enabled": false,
          "vars": {
            "log_group_arn": "arn:aws:logs:us-west-2::log-group:/elastic/cloudwatch/failure-test:*",
            "log_streams": [],
            "start_position": "beginning",
            "scan_frequency": "1m",
            "api_timeput": "120s",
            "api_sleep": "200ms",
            "tags": [
              "forwarded",
              "aws-cloudwatch-logs"
            ],
            "preserve_original_event": false,
            "data_stream.dataset": "cloudwatch_log_failure_testing"
          }
        }
      }
    }
  },
  "vars": {
    "access_key_id": "",
    "default_region": ""
  }
}
```
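
For contrast, the configuration that the Fleet UI produces, and which does not trigger the problem, has both the input and the stream enabled. Below is a trimmed sketch of just the inputs section (stream vars omitted; the other fields are the same as in the request above):

```json
{
  "inputs": {
    "cloudwatch-aws-cloudwatch": {
      "enabled": true,
      "streams": {
        "aws.cloudwatch_logs": {
          "enabled": true
        }
      }
    }
  }
}
```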

Reproduction Steps

  1. A fresh Elastic Agent running on an EC2 instance with the latest Amazon Linux 2 AMI
  2. Create two log groups:
    1. The first one is to push logs into, to show the failure
    2. The second one doesn't have anything in it; it's just there to provide a different log group for the second integration
    3. I named the log groups /elastic/cloudwatch/failure-test-logs and /elastic/cloudwatch/failure-test
  3. Add an IAM policy to the EC2 instance profile to allow it to retrieve logs:

     ```json
     {
       "Version": "2012-10-17",
       "Statement": [
         {
           "Effect": "Allow",
           "Action": [
             "logs:DescribeLogGroups",
             "ec2:DescribeRegions"
           ],
           "Resource": "*"
         },
         {
           "Effect": "Allow",
           "Action": [
             "logs:GetLogEvents",
             "logs:FilterLogEvents"
           ],
           "Resource": [
             "arn:aws:logs:us-west-2::log-group:/elastic/cloudwatch/failure-test*",
             "arn:aws:logs:us-west-2::log-group:/elastic/cloudwatch/failure-test*:log-stream:*"
           ]
         }
       ]
     }
     ```
  4. Create a clean Fleet Policy with only system logging and metrics enabled
  5. Add an AWS Integration to the new Fleet Policy with only Cloudwatch logs enabled. I named the first integration cloudwatch-log-test-logs. No auth settings are configured, as it will use the instance profile. Fill out the log group ARN and set a dataset name; the dataset name is optional, but I set it to cloudwatch_log_failure_testing to make the logs easier to find and clean up.
  6. Preview and copy the API request before saving the integration
  7. You should now be able to add logs to the new log group, which I did through the AWS console, and view them in Discover by filtering on the dataset name.
  8. Paste the API request copied earlier into the Elastic Dev Tools Console, and modify the name and the ARN of the Cloudwatch log group.
  9. Toggle the streams.enabled variable from true to false:

      ```json
      "streams": {
        "aws.cloudwatch_logs": {
          "enabled": false
        }
      }
      ```
  10. Send the request
  11. You should now see another integration on the policy with the Cloudwatch Logs input enabled and the stream toggle below it disabled (screenshot attached).
  12. Add some more logs to the first log group to verify they're still coming through.
  13. Forcefully restart the agent; I used `sudo systemctl restart elastic-agent`
  14. Add more logs to the first log group; they will no longer be collected (one way to check is sketched below)
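
One way to confirm whether events are still arriving is to query the test dataset from Dev Tools. This is a sketch that assumes the default namespace, so the backing data stream is named logs-cloudwatch_log_failure_testing-default:

```json
GET logs-cloudwatch_log_failure_testing-default/_search
{
  "size": 1,
  "sort": [
    { "@timestamp": "desc" }
  ]
}
```

Before the restart the latest hit keeps up with the logs you push; after the restart it stops advancing.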

It appears that the Cloudwatch input is not able to start after the agent is restarted.

Before the restart, you can see the messages below in the logs each minute when log collection occurs; after the restart, these messages are no longer present.

{"log.level":"info","@timestamp":"2024-05-21T23:37:04.289Z","message":"aws-cloudwatch input worker for log group: '/elastic/cloudwatch/failure-test-logs' has started","component":{"binary":"filebeat","dataset":"elastic_agent.filebeat","id":"aws-cloudwatch-default","type":"aws-cloudwatch"},"log":{"source":"aws-cloudwatch-default"},"log.logger":"input.aws-cloudwatch.cloudwatch_poller","log.origin":{"file.line":204,"file.name":"awscloudwatch/input.go","function":"github.com/elastic/beats/v7/x-pack/filebeat/input/awscloudwatch.(*cloudwatchInput).Receive.func1"},"service.name":"filebeat","id":"aws-cloudwatch-aws.cloudwatch_logs-aae83b0b-9bbc-439d-919d-577e7ff92e9d","ecs.version":"1.6.0","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2024-05-21T23:37:04.563Z","message":"aws-cloudwatch input worker for log group '/elastic/cloudwatch/failure-test-logs' has stopped.","component":{"binary":"filebeat","dataset":"elastic_agent.filebeat","id":"aws-cloudwatch-default","type":"aws-cloudwatch"},"log":{"source":"aws-cloudwatch-default"},"log.logger":"input.aws-cloudwatch.cloudwatch_poller","log.origin":{"file.line":200,"file.name":"awscloudwatch/input.go","function":"github.com/elastic/beats/v7/x-pack/filebeat/input/awscloudwatch.(*cloudwatchInput).Receive.func1.1"},"service.name":"filebeat","id":"aws-cloudwatch-aws.cloudwatch_logs-aae83b0b-9bbc-439d-919d-577e7ff92e9d","ecs.version":"1.6.0","ecs.version":"1.6.0"}

From what I can tell, there are no logs that indicate why the input is unable to start, or why one misconfigured input would cause all the other inputs to fail.

I will attach diagnostics from my reproduction to the support case. I initiated the restart of the agent to induce the failure at "@timestamp":"2024-05-21T23:42:08.611Z"

agithomas commented 4 months ago

Pinging @elastic/obs-ds-hosted-services

kaiyan-sheng commented 4 months ago

Hi @keiransteele-phocas, did you try disabling the input as well as the stream? I was able to reproduce this problem by using the API to disable the aws.cloudwatch_logs stream. Once I disabled the input and restarted the agent, logs were ingested again.

keiransteele-phocas commented 3 months ago

> Hi @keiransteele-phocas, did you try disabling the input as well as the stream? I was able to reproduce this problem by using the API to disable the aws.cloudwatch_logs stream. Once I disabled the input and restarted the agent, logs were ingested again.

Hi @kaiyan-sheng, yes I did disable the input and that does fix the issue. I believe it's a bug that having an enabled input with a disabled stream causes all other integrations with the same input to fail.
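
For reference, here is a minimal sketch of a create request that does not trigger the problem, with the input disabled as well as the stream. It uses the same simplified package policy format as the request earlier in this issue; the policy name, empty policy_id, and variable values are placeholders rather than the exact payload we used:

```json
POST kbn:/api/fleet/package_policies
{
  "policy_id": "",
  "package": {
    "name": "aws",
    "version": "2.15.2"
  },
  "name": "cloudwatch-log-test-disabled-input",
  "namespace": "default",
  "inputs": {
    "cloudwatch-aws-cloudwatch": {
      "enabled": false,
      "streams": {
        "aws.cloudwatch_logs": {
          "enabled": false,
          "vars": {
            "log_group_arn": "arn:aws:logs:us-west-2::log-group:/elastic/cloudwatch/failure-test:*",
            "data_stream.dataset": "cloudwatch_log_failure_testing"
          }
        }
      }
    }
  }
}
```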

kaiyan-sheng commented 3 months ago

@keiransteele-phocas Thank you for confirming that disabling the input fixes the issue, and I agree this behavior is a bug. In my opinion, either we should not allow the input to be enabled when the only stream underneath it is disabled, or this misconfiguration should not cause the other inputs of the same type to fail.

Will comment back once I find out which team should own this issue.

jsoriano commented 3 months ago

> In my opinion, either we should not allow the input to be enabled when the only stream underneath it is disabled, or this misconfiguration should not cause the other inputs of the same type to fail.
>
> Will comment back once I find out which team should own this issue.

I guess this is us :slightly_smiling_face:

@kpollich what do you think about this proposal? The API should probably return some kind of 4xx error for this kind of inconsistency between disabled data streams and enabled inputs.
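
As a purely illustrative sketch (not an implemented response), the rejection could use Kibana's usual error body shape; the status and message text below are hypothetical:

```json
{
  "statusCode": 400,
  "error": "Bad Request",
  "message": "Input cloudwatch-aws-cloudwatch cannot be enabled when all of its streams are disabled"
}
```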

kpollich commented 3 months ago

> The API should probably return some kind of 4xx error for this kind of inconsistency between disabled data streams and enabled inputs.

+1 from me. Adding to our board.

elasticmachine commented 3 months ago

Pinging @elastic/fleet (Team:Fleet)

criamico commented 2 months ago

I opened a PR, and I see several tests failing (not owned by ingest). I fear that this could be a breaking change for some users who use the update and create endpoints with inputs enabled/disabled in any combination. @kpollich, should we mark it as breaking? I'm also not sure about backporting.

kpollich commented 2 months ago

This validation does seem like it could be considered breaking, yes. Seems like the tests we have are a legitimate example where we'd introduce a breakage into currently supported usage of this API.

Is there a way we can do this without validation instead, to avoid the breaking change? Could we automatically disable the parent input when all its streams are disabled in the API? e.g.

```json
PUT kbn:/api/fleet/package_policies/84ede2c9-80a0-4f01-b353-0f8ef873faaf
{
  "inputs": [
    {
      "type": "aws-cloudwatch",
      "policy_template": "cloudwatch",
      "enabled": true,
      "streams": [
        {
          "enabled": false,
          "data_stream": {
            "type": "logs",
            "dataset": "aws.cloudwatch_logs"
          }
        }
      ]
    }
  ]
}
```

This API request would set inputs[0].enabled to false automatically since all of its child streams are disabled. I think this fixes the bug in question without introducing a breaking change, as an input with no enabled streams is functionally useless and shouldn't be relied upon for any sort of behavior.
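
In other words, for the request above, the stored package policy would end up with something like the following inputs section (a sketch, assuming only the enabled flag on the input is rewritten):

```json
{
  "inputs": [
    {
      "type": "aws-cloudwatch",
      "policy_template": "cloudwatch",
      "enabled": false,
      "streams": [
        {
          "enabled": false,
          "data_stream": {
            "type": "logs",
            "dataset": "aws.cloudwatch_logs"
          }
        }
      ]
    }
  ]
}
```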

criamico commented 2 months ago

I updated my PR to "switch off" the input when all its streams are disabled, as discussed above. I left a comment in the PR as I found a problem with the cloud security posture tests; it seems that this change would be breaking at least for them.