Open amolnater-qasource opened 1 month ago
Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)
@muskangulati-qasource Please review.
Secondary review is done for this ticket!
Looking at agent-info.yaml, I see this is a privileged/admin agent:
agent_id: 881c5687-32af-4bf9-b62f-4b74f2f688ec
headers: {}
log_level: info
snapshot: true
unprivileged: false
version: 8.16.0
This is also coming from the winlog input. Tagging @nfritts and @elastic/sec-windows-platform.
input-winlog-default-winlog-windows-4ea5f67a-48fc-41ea-b586-2a29eac6423a:
  message: 'Encountered channel not found error when opening Windows Event Log: The specified channel could not be found.'
  payload:
    streams:
      winlog-windows.forwarded-4ea5f67a-48fc-41ea-b586-2a29eac6423a:
        error: ""
        status: HEALTHY
      winlog-windows.powershell-4ea5f67a-48fc-41ea-b586-2a29eac6423a:
        error: ""
        status: HEALTHY
      winlog-windows.powershell_operational-4ea5f67a-48fc-41ea-b586-2a29eac6423a:
        error: ""
        status: HEALTHY
      winlog-windows.sysmon_operational-4ea5f67a-48fc-41ea-b586-2a29eac6423a:
        error: 'Encountered channel not found error when opening Windows Event Log: The specified channel could not be found.'
        status: DEGRADED
This issue has arisen following changes in this PR (https://github.com/elastic/beats/pull/40163).
The default configuration for the Windows Integration has historically included the Sysmon Operational channel. Sysmon is not a core component of Windows; it's a SysInternals tool (https://learn.microsoft.com/en-us/sysinternals/downloads/sysmon) that users can download and use at their discretion. As such, the Windows Integration has historically failed to open the Sysmon Operational channel, but that didn't propagate a DEGRADED status until the recent PR.
Users can remedy the degraded status by either installing Sysmon, causing the channel to exist; or by deselecting the Sysmon Operational channel in the Windows Integration configuration...
Possible solutions:
DEGRADED status when a configured channel is not found.
DEGRADED status that could indicate an unexpectedly missing non-default channel.
Thoughts @cmacknz @nfritts @andrewkroh?
Change the Windows Integration default to not include Sysmon.
This sounds like the most correct path if Sysmon is not expected to be present the majority of the time. The counterargument is that this is a breaking change.
Something we did for some of the system metricsets that were in a similar situation is to keep the error message but report the status as healthy, since the input was working as well as it could with the configuration of the host system it was running on.
In an ideal world, we'd move Sysmon out of Windows and have it as a standalone integration but that'd be very disruptive for existing users and would likely impact rules, dashboards, etc.
As a quick fix, could we exclude Sysmon from our DEGRADED logic? So keep Sysmon within the Windows integration, but if a user doesn't have Sysmon installed, we don't trigger a DEGRADED status?
Can't Agent get it from https://live.sysinternals.com/Sysmon64.exe and install it if it's missing, when a policy with Sysmon data collection is added?
@jamiehynds
As a quick fix, could we exclude Sysmon from our DEGRADED logic? So keep Sysmon within the Windows integration, but if a user doesn't have Sysmon installed, we don't trigger a DEGRADED status?
I don't see a technical reason we couldn't just insert a check for whether we're trying to grab that particular channel near the code that changed. But that's filebeat code and not the Windows integration code. That'd be kind of an awkward place for the check in the long term and wouldn't scale well if we want to handle other things differently.
I noticed the PR that changed this was addressing https://github.com/elastic/beats/issues/39735, which related to wanting to see failures for channels when permission is denied, typically when Agent is installed unprivileged. I recreated that and now DO see the desired Access is denied error:
c:\>"c:\Program Files\Elastic\Agent\elastic-agent.exe" status
┌─ fleet
│  └─ status: (HEALTHY) Connected
└─ elastic-agent
   ├─ status: (DEGRADED) 1 or more components/units in a failed state
   ├─ system/metrics-default
   │  ├─ status: (HEALTHY) Healthy: communicating with pid '8300'
   │  └─ system/metrics-default-system/metrics-system-68951c20-696b-4518-b64c-63de7317ef29
   │     └─ status: (DEGRADED) Error fetching data for metricset system.diskio: disk io counters: cannot open new key in the registry in order to enable the performance counters: Access is denied.
   ├─ windows/metrics-default
   │  ├─ status: (HEALTHY) Healthy: communicating with pid '5976'
   │  └─ windows/metrics-default-windows/metrics-windows-e02b045d-e899-4ef1-b6ce-400be3d94119
   │     └─ status: (FAILED) 1 error: initialization of reader failed: failed to expand counter (query='\Process(*)\% Processor Time'): Unable to connect to the specified computer or the computer is offline.
   └─ winlog-default
      ├─ status: (HEALTHY) Healthy: communicating with pid '5416'
      ├─ winlog-default-winlog-system-68951c20-696b-4518-b64c-63de7317ef29
      │  └─ status: (DEGRADED) failed to open Windows Event Log channel "Security": Access is denied.
      └─ winlog-default-winlog-windows-e02b045d-e899-4ef1-b6ce-400be3d94119
         └─ status: (DEGRADED) Encountered channel not found error when opening Windows Event Log: The specified channel could not be found.
What strikes me there is that we have multiple types of failures for winlog-default: one that we newly want reported and that previously wasn't, and one that we don't want reported and that previously wasn't either.
I wonder if a good long-term solution would be for the integration to feed in some type of filter data along with each channel it wants. Some enum or struct that tells it to actually degrade on access denied errors for this channel, but ignore not found errors for this channel, or warn/log (but not DEGRADE) for some other type of error for some other channel. That way the integration can communicate how specific channel subscription failures should impact the integration's state. Trying to place that logic in code in filebeat seems awkward.
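To make the idea above concrete, here is a minimal Go sketch of a per-channel failure policy. Nothing in it exists in Filebeat or the Windows integration today: `ErrorKind`, `FailureAction`, `ChannelPolicy`, and `StatusFor` are all hypothetical names, and the default of degrading on unlisted errors is an assumption.

```go
package main

import "fmt"

// ErrorKind classifies why opening a channel failed (hypothetical).
type ErrorKind int

const (
	KindAccessDenied ErrorKind = iota
	KindChannelNotFound
	KindOther
)

// FailureAction says how a failure should affect the reported status (hypothetical).
type FailureAction int

const (
	ActionDegrade FailureAction = iota // report DEGRADED
	ActionWarn                         // log only, stay HEALTHY
	ActionIgnore                       // stay HEALTHY silently
)

// ChannelPolicy is what the integration would pass along with each channel.
type ChannelPolicy struct {
	Channel string
	OnError map[ErrorKind]FailureAction
}

// ActionFor returns the configured action; unlisted kinds degrade (assumed default).
func (p ChannelPolicy) ActionFor(kind ErrorKind) FailureAction {
	if action, ok := p.OnError[kind]; ok {
		return action
	}
	return ActionDegrade
}

// StatusFor maps a failure on a channel to the status the unit would report.
func StatusFor(p ChannelPolicy, kind ErrorKind) string {
	if p.ActionFor(kind) == ActionDegrade {
		return "DEGRADED"
	}
	return "HEALTHY"
}

func main() {
	sysmon := ChannelPolicy{
		Channel: "Microsoft-Windows-Sysmon/Operational",
		OnError: map[ErrorKind]FailureAction{KindChannelNotFound: ActionIgnore},
	}
	fmt.Println(StatusFor(sysmon, KindChannelNotFound)) // HEALTHY
	fmt.Println(StatusFor(sysmon, KindAccessDenied))    // DEGRADED
}
```

With a policy like this, a missing Sysmon channel would stay HEALTHY while an access denied error on the same channel would still degrade the unit, and the decision would live in the integration rather than in filebeat.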
I'm not sure that change could be ready for 8.16.0. Perhaps we should roll back the change that caused this issue and continue to tolerate the missing Access is denied error, as unprivileged appears to still be technically beta. Then we could incorporate the channel-specific failure handling from the Windows integration in the next release. Thoughts? @cmacknz @jamiehynds @nfritts
Having a per input way to turn off the "errors mark the Beat as degraded" would make sense vs just reverting the entire feature. This config could later expand into a list of specific errors to mute. For system/metrics we had similar ideas in https://github.com/elastic/beats/issues/40543 but it hasn't been implemented yet.
We have been more focused on fixing the specific errors, which in many cases have been actual bugs or permissions errors we were handling improperly. For system/metrics we also only get these errors when unprivileged.
The OTel collector process scraper allows muting specific categories of error, which is what we'd eventually want to emulate: https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/receiver/hostmetricsreceiver/README.md#process
@bjmcnic followed up separately and we agreed we should revert the winlog specific change here in 8.16 while we work on a proper fix for 8.17 considering that:
There isn't time to do a more in-depth fix, and the current state will probably lead to a flood of support cases. The revert PR is https://github.com/elastic/beats/pull/41468
https://github.com/elastic/beats/pull/41468 was merged; this is now reverted from 8.16.
Hi Team,
While testing on 8.17.0 SNAPSHOT, we have found this issue reproducible there too.
Observations:
Build details: VERSION: 8.17.0 SNAPSHOT BUILD: 80188 COMMIT: fdb16ae8cbdf4236db3696aa00d0bb98c943d864 Artifact Link: https://snapshots.elastic.co/8.17.0-7a041bf5/downloads/beats/elastic-agent/elastic-agent-8.17.0-SNAPSHOT-windows-x86_64.zip
Logs: elastic-agent-diagnostics-2024-11-18T09-03-35Z-00.zip
Please let us know if anything else is required from our end.
Thanks!
It's fixed in 8.16.0. Looks like the revert of the change was to the 8.16 branch, but hasn't hit main.
c:\>"c:\Program Files\Elastic\Agent\elastic-agent.exe" version
Binary: 8.16.0 (build: 3f07f2fd932f20e972399306d394763ade6b74b4 at 2024-11-07 13:33:43 +0000 UTC)
Daemon: 8.16.0 (build: 3f07f2fd932f20e972399306d394763ade6b74b4 at 2024-11-07 13:33:43 +0000 UTC)
c:\>"c:\Program Files\Elastic\Agent\elastic-agent.exe" status --output full
┌─ fleet
│  └─ status: (HEALTHY) Connected
└─ elastic-agent
   ├─ status: (HEALTHY) Running
   ├─ info
   │  ├─ id: 4eb139dd-aa42-4ec4-9463-39f1b7b37f60
   │  ├─ version: 8.16.0
   │  └─ commit: 3f07f2fd932f20e972399306d394763ade6b74b4
   ├─ beat/metrics-monitoring
   │  ├─ status: (HEALTHY) Healthy: communicating with pid '5128'
   │  ├─ beat/metrics-monitoring
   │  │  ├─ status: (HEALTHY) Healthy
   │  │  └─ type: OUTPUT
   │  └─ beat/metrics-monitoring-metrics-monitoring-beats
   │     ├─ status: (HEALTHY) Healthy
   │     └─ type: INPUT
   ├─ filestream-monitoring
   │  ├─ status: (HEALTHY) Healthy: communicating with pid '520'
   │  ├─ filestream-monitoring
   │  │  ├─ status: (HEALTHY) Healthy
   │  │  └─ type: OUTPUT
   │  └─ filestream-monitoring-filestream-monitoring-agent
   │     ├─ status: (HEALTHY) Healthy
   │     └─ type: INPUT
   ├─ http/metrics-monitoring
   │  ├─ status: (HEALTHY) Healthy: communicating with pid '3876'
   │  ├─ http/metrics-monitoring
   │  │  ├─ status: (HEALTHY) Healthy
   │  │  └─ type: OUTPUT
   │  └─ http/metrics-monitoring-metrics-monitoring-agent
   │     ├─ status: (HEALTHY) Healthy
   │     └─ type: INPUT
   ├─ log-default
   │  ├─ status: (HEALTHY) Healthy: communicating with pid '5944'
   │  ├─ log-default
   │  │  ├─ status: (HEALTHY) Healthy
   │  │  └─ type: OUTPUT
   │  └─ log-default-logfile-system-f57fba90-d162-43bb-8f6b-546015d84c78
   │     ├─ status: (HEALTHY) Healthy
   │     └─ type: INPUT
   ├─ system/metrics-default
   │  ├─ status: (HEALTHY) Healthy: communicating with pid '3500'
   │  ├─ system/metrics-default
   │  │  ├─ status: (HEALTHY) Healthy
   │  │  └─ type: OUTPUT
   │  └─ system/metrics-default-system/metrics-system-f57fba90-d162-43bb-8f6b-546015d84c78
   │     ├─ status: (HEALTHY) Healthy
   │     └─ type: INPUT
   ├─ windows/metrics-default
   │  ├─ status: (HEALTHY) Healthy: communicating with pid '2220'
   │  ├─ windows/metrics-default
   │  │  ├─ status: (HEALTHY) Healthy
   │  │  └─ type: OUTPUT
   │  └─ windows/metrics-default-windows/metrics-windows-e715bcd7-f597-4666-9a17-c75be66c9e02
   │     ├─ status: (HEALTHY) Healthy
   │     └─ type: INPUT
   └─ winlog-default
      ├─ status: (HEALTHY) Healthy: communicating with pid '4692'
      ├─ winlog-default
      │  ├─ status: (HEALTHY) Healthy
      │  └─ type: OUTPUT
      ├─ winlog-default-winlog-system-f57fba90-d162-43bb-8f6b-546015d84c78
      │  ├─ status: (HEALTHY) Healthy
      │  └─ type: INPUT
      └─ winlog-default-winlog-windows-e715bcd7-f597-4666-9a17-c75be66c9e02
         ├─ status: (HEALTHY) Healthy
         └─ type: INPUT
c:\>
Those two PRs will add the change to main and the 8.x branch
Kibana Build details:
Artifact: https://snapshots.elastic.co/8.16.0-39df64b4/downloads/beats/elastic-agent/elastic-agent-8.16.0-SNAPSHOT-windows-x86_64.zip
Host: Windows Server 2022 - Test Signing ON
Preconditions:
Steps to reproduce:
Encountered channel not found error
Expected Result: No error should be displayed on adding Windows integration to the Windows agent.
Logs: elastic-agent-diagnostics-2024-10-09T06-48-15Z-00.zip