Closed: keiransteele-phocas closed this issue 2 months ago
Pinging @elastic/obs-ds-hosted-services
Hi @keiransteele-phocas, did you try disabling the input as well as the stream? I was able to reproduce this problem by using the API to disable the `aws.cloudwatch_logs` stream. Once I disabled the input and restarted the agent, logs were ingested again.
Hi @kaiyan-sheng, yes, I did disable the input and that does fix the issue. I believe this is a bug: an enabled input with a disabled stream causes all other integrations that use the same input to fail.
@keiransteele-phocas Thank you for confirming that disabling the input fixes the issue. I agree this behavior is a bug. In my opinion, either we should not allow an input to be enabled when its only stream is disabled, or this configuration should not cause other inputs of the same type to fail.
I will comment back once I find out which team should own this issue.
I guess this is us :slightly_smiling_face:
@kpollich what do you think about this proposal? The API should probably return some kind of 4xx error when there is this kind of inconsistency between disabled data streams and enabled inputs.
+1 from me. Adding to our board.
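The proposed check could look something like this. This is a hypothetical sketch, not Fleet's actual validation code; the shapes and names are assumptions based on the API payloads in this thread:

```typescript
// Minimal shapes mirroring the package policy payloads shown in this thread.
interface Stream {
  enabled: boolean;
}

interface Input {
  type: string;
  enabled: boolean;
  streams: Stream[];
}

// Returns validation error messages; an HTTP handler could turn a non-empty
// result into a 400 response instead of persisting the policy.
function validateInputs(inputs: Input[]): string[] {
  const errors: string[] = [];
  for (const input of inputs) {
    const allStreamsDisabled =
      input.streams.length > 0 && input.streams.every((s) => !s.enabled);
    if (input.enabled && allStreamsDisabled) {
      errors.push(
        `Input "${input.type}" is enabled but all of its streams are disabled`
      );
    }
  }
  return errors;
}

// The misconfiguration reported in this issue would be rejected:
const errors = validateInputs([
  { type: 'aws-cloudwatch', enabled: true, streams: [{ enabled: false }] },
]);
console.log(errors.length); // 1
```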
Pinging @elastic/fleet (Team:Fleet)
I opened a PR, and I see several tests failing (not owned by Ingest). I fear that this could be a breaking change for users who call the update and create endpoints with inputs and streams enabled/disabled in any combination. @kpollich, should we mark it as breaking? I'm also not sure about backporting.
This validation does seem like it could be considered breaking, yes. Seems like the tests we have are a legitimate example where we'd introduce a breakage into currently supported usage of this API.
Is there a way we can do this without validation instead, to avoid the breaking change? Could we automatically disable the parent input when all its streams are disabled in the API? e.g.
```json
PUT kbn:/api/fleet/package_policies/84ede2c9-80a0-4f01-b353-0f8ef873faaf
{
  "inputs": [
    {
      "type": "aws-cloudwatch",
      "policy_template": "cloudwatch",
      "enabled": true,
      "streams": [
        {
          "enabled": false,
          "data_stream": {
            "type": "logs",
            "dataset": "aws.cloudwatch_logs"
          }
        }
      ]
    }
  ]
}
```
This API request would set `inputs[0].enabled` to `false` automatically, since all of its child streams are disabled. I think this fixes the bug in question without introducing a breaking change, as an input with no enabled streams is functionally useless and shouldn't be relied upon for any sort of behavior.
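That normalization could be sketched as follows. Again, this is hypothetical code rather than the actual Fleet implementation; the types are assumptions based on the request body above:

```typescript
interface Stream {
  enabled: boolean;
}

interface Input {
  type: string;
  enabled: boolean;
  streams: Stream[];
}

// Before persisting a package policy, silently disable any input whose child
// streams are all disabled, instead of rejecting the request with a 4xx.
function normalizeInputs(inputs: Input[]): Input[] {
  return inputs.map((input) => {
    const hasEnabledStream = input.streams.some((s) => s.enabled);
    if (input.enabled && input.streams.length > 0 && !hasEnabledStream) {
      return { ...input, enabled: false };
    }
    return input;
  });
}

// The PUT request above would be rewritten so inputs[0].enabled becomes false:
const [normalized] = normalizeInputs([
  { type: 'aws-cloudwatch', enabled: true, streams: [{ enabled: false }] },
]);
console.log(normalized.enabled); // false
```

A design note: normalizing on write avoids the breaking change, but it also means the stored policy no longer matches what the client sent, so clients doing read-modify-write (like the Terraform provider) should tolerate the rewritten `enabled` flag.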
This relates to support case #01619569 which I have been able to reproduce using the steps below.
Problem
We were having issues where all CloudWatch logs collected by agents running in AWS would stop at the same time. There were no errors in the logs or diagnostics to indicate the cause. Disabling recently added integrations resolved the problem. The integrations that were disabled were similar to the others, apart from the stream on the input being disabled. Simply creating an integration with the stream disabled doesn't cause the issue; the agent also needs to be restarted afterwards.
Background
We have Elastic Agents running in AWS Fargate, all on the same policy, dedicated to retrieving logs from AWS services. There are about 40 integrations for CloudWatch logs, due to the combination of AWS accounts and regions. We tried to improve the developer experience for adding integrations by providing a Terraform module that creates the integration using the `elasticstack_fleet_integration_policy` resource. The Terraform resource in the module turned out to be misconfigured, with the input enabled and the stream disabled; it's not possible to configure it like this in the Fleet UI, but it is possible via the API.

AWS Cloudwatch Integration Configurations
Terraform Resource
```hcl
input {
  enabled  = true
  input_id = "cloudwatch-aws-cloudwatch"
  streams_json = jsonencode({
    "aws.cloudwatch_logs" : {
      "enabled" : false,
      "vars" : {
        "start_position" : "beginning",
        "api_timeput" : "120s",
        "processors" : "",
        "scan_frequency" : "1m",
        "tags" : [
          "forwarded",
          "aws-cloudwatch-logs",
          "${local.account_name[0]}"
        ],
        "log_group_arn" : "arn:aws:logs:${data.aws_region.current.name}:${data.aws_caller_identity.current.account_id}:log-group:${var.log_group_path}/${var.log_group_name}:*",
        "api_sleep" : "200ms",
        "data_stream.dataset" : "aws_${local.log_type[0]}_${local.environment_type}",
        "log_streams" : [],
        "preserve_original_event" : false
      }
    }
  })
}
```

Elastic Console API Request (truncated)
```json
POST kbn:/api/fleet/package_policies
{
  "policy_id": "",
  "package": {
    "name": "aws",
    "version": "2.15.2",
    "experimental_data_stream_features": []
  },
  "name": "cloudwatch-log-test-failure",
  "description": "",
  "namespace": "default",
  "inputs": {
    "cloudwatch-aws-cloudwatch": {
      "enabled": true,
      "streams": {
        "aws.cloudwatch_logs": {
          "enabled": false,
          "vars": {
            "log_group_arn": "arn:aws:logs:us-west-2::log-group:/elastic/cloudwatch/failure-test:*",
            "log_streams": [],
            "start_position": "beginning",
            "scan_frequency": "1m",
            "api_timeput": "120s",
            "api_sleep": "200ms",
            "tags": [
              "forwarded",
              "aws-cloudwatch-logs"
            ],
            "preserve_original_event": false,
            "data_stream.dataset": "cloudwatch_log_failure_testing"
          }
        }
      }
    }
  },
  "vars": {
    "access_key_id": "",
    "default_region": ""
  }
}
```

Reproduction Steps
1. Create the CloudWatch log groups `/elastic/cloudwatch/failure-test-logs` and `/elastic/cloudwatch/failure-test`.
2. Create the integration `cloudwatch-log-test-logs`; no auth settings are configured as it will use the instance profile. Fill out the log group ARN and set a dataset name, which is optional, but I set it to `cloudwatch_log_failure_testing` to make the logs easier to find and clean up.
3. Change the `streams.enabled` variable from `true` to `false`.
4. Restart the agent: `sudo systemctl restart elastic-agent`.
It appears that the CloudWatch input is not able to start after the agent is restarted.
You can see the messages below in the logs each minute when log collection occurs, but after the restart these messages are no longer present.
From what I can tell, there are no logs that indicate why the input is unable to start, or why one misconfigured input would cause all the other inputs to fail.
I will attach diagnostics from my reproduction to the support case. I initiated the restart of the agent to induce the failure at `"@timestamp":"2024-05-21T23:42:08.611Z"`.