elastic / elastic-agent

Elastic Agent - single, unified way to add monitoring for logs, metrics, and other types of data to a host.

Encountered channel not found error on adding Windows integration to the Windows agent. #5746

Open amolnater-qasource opened 1 month ago

amolnater-qasource commented 1 month ago

Kibana Build details:

VERSION: 8.16.0 SNAPSHOT
BUILD: 78938
COMMIT: 7b832691e8b07c67b411da95b0398a04711da864

Artifact: https://snapshots.elastic.co/8.16.0-39df64b4/downloads/beats/elastic-agent/elastic-agent-8.16.0-SNAPSHOT-windows-x86_64.zip

Host: Windows Server 2022 (Test Signing ON)

Preconditions:

  1. A Kibana 8.16.0 SNAPSHOT cloud environment should be available.
  2. Agent should be installed with a policy that has the System and Windows integrations.

Steps to reproduce:

  1. Navigate to the Agents tab.
  2. Observe that the Agent becomes unhealthy, then navigate to the policy details page.
  3. Observe the error for the Windows integration: Encountered channel not found error

Expected Result: No error should be displayed when the Windows integration is added to the Windows agent.

Logs: elastic-agent-diagnostics-2024-10-09T06-48-15Z-00.zip

elasticmachine commented 1 month ago

Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)

amolnater-qasource commented 1 month ago

@muskangulati-qasource Please review.

muskangulati-qasource commented 1 month ago

Secondary review is done for this ticket!

cmacknz commented 1 month ago

Looking in agent-info.yaml, I see this is a privileged/admin agent:

agent_id: 881c5687-32af-4bf9-b62f-4b74f2f688ec
headers: {}
log_level: info
snapshot: true
unprivileged: false
version: 8.16.0

I also see that this is coming from the winlog input. Tagging @nfritts and @elastic/sec-windows-platform.

            input-winlog-default-winlog-windows-4ea5f67a-48fc-41ea-b586-2a29eac6423a:
                message: 'Encountered channel not found error when opening Windows Event Log: The specified channel could not be found.'
                payload:
                    streams:
                        winlog-windows.forwarded-4ea5f67a-48fc-41ea-b586-2a29eac6423a:
                            error: ""
                            status: HEALTHY
                        winlog-windows.powershell-4ea5f67a-48fc-41ea-b586-2a29eac6423a:
                            error: ""
                            status: HEALTHY
                        winlog-windows.powershell_operational-4ea5f67a-48fc-41ea-b586-2a29eac6423a:
                            error: ""
                            status: HEALTHY
                        winlog-windows.sysmon_operational-4ea5f67a-48fc-41ea-b586-2a29eac6423a:
                            error: 'Encountered channel not found error when opening Windows Event Log: The specified channel could not be found.'
                            status: DEGRADED

bjmcnic commented 1 month ago

This issue has arisen following changes in this PR (https://github.com/elastic/beats/pull/40163).

The default configuration for the Windows Integration has historically included the Sysmon Operational channel. Sysmon is not a core component of Windows; it's a SysInternals tool (https://learn.microsoft.com/en-us/sysinternals/downloads/sysmon) that users can download and use at their discretion. As such, the Windows Integration has historically failed to open the Sysmon Operational channel on hosts without Sysmon, but that failure didn't propagate a DEGRADED status until the recent PR.

Users can remedy the degraded status by either installing Sysmon, causing the channel to exist; or by deselecting the Sysmon Operational channel in the Windows Integration configuration...
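
As a rough illustration of the first remedy, a host can be checked for the channel before the integration is enabled. The sketch below is not part of Agent. It shells out to the Windows `wevtutil gl` (get-log) command and assumes that a non-zero exit code means the channel is absent; `channel_exists`, `_run_wevtutil`, and the `run` parameter are hypothetical names introduced here.

```python
import subprocess
from typing import Callable, Optional, Sequence

# Channel name that Sysmon registers when it is installed.
SYSMON_CHANNEL = "Microsoft-Windows-Sysmon/Operational"


def _run_wevtutil(cmd: Sequence[str]) -> int:
    """Run the command quietly and return its exit code."""
    try:
        completed = subprocess.run(
            cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
        )
    except OSError:
        # wevtutil is not available (for example, not a Windows host).
        return 1
    return completed.returncode


def channel_exists(
    channel: str, run: Optional[Callable[[Sequence[str]], int]] = None
) -> bool:
    """Return True if `wevtutil gl <channel>` succeeds.

    Assumption: wevtutil exits non-zero when the channel cannot be found.
    `run` is injectable so the check can be exercised without Windows.
    """
    runner = run if run is not None else _run_wevtutil
    return runner(["wevtutil", "gl", channel]) == 0
```

Calling `channel_exists(SYSMON_CHANNEL)` on a host without Sysmon would be expected to return False, in which case the Sysmon Operational channel can be deselected in the integration.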

Possible solutions:

Thoughts @cmacknz @nfritts @andrewkroh ?

cmacknz commented 1 month ago

Change the Windows Integration default to not include Sysmon.

This sounds like the most correct path if Sysmon is not expected to be present the majority of the time. The counterargument is that this is a breaking change.

Something we did for some of the system metricsets that were in a similar situation is keep the error message but report the status as healthy, since the input was working as well as it could with the configuration of the host system it was running on.

jamiehynds commented 4 weeks ago

In an ideal world, we'd move Sysmon out of Windows and have it as a standalone integration but that'd be very disruptive for existing users and would likely impact rules, dashboards, etc.

As a quick fix, could we exclude Sysmon from our DEGRADED logic? So keep Sysmon within the Windows integration, but if a user doesn't have Sysmon installed, we don't trigger a DEGRADED status?

intxgo commented 3 weeks ago

Can't Agent download it from https://live.sysinternals.com/Sysmon64.exe and install it if it's missing when a policy with Sysmon data collection is added?

bjmcnic commented 3 weeks ago

@jamiehynds

As a quick fix, could we exclude Sysmon from our DEGRADED logic? So keep Sysmon within the Windows integration, but if a user doesn't have Sysmon installed, we don't trigger a DEGRADED status?

I don't see a technical reason we couldn't just insert a check for whether we're trying to grab that particular channel near the code that changed. But that's filebeat code and not the Windows integration code. That'd be kind of an awkward place for the check in the long term and wouldn't scale well if we want to handle other things differently.

I noticed the PR that changed this was addressing https://github.com/elastic/beats/issues/39735, which relates to wanting to see failures for channels when permission is denied, typically when Agent is installed unprivileged. I recreated that and now DO see the desired Access is denied error:

c:\>"c:\Program Files\Elastic\Agent\elastic-agent.exe" status
┌─ fleet
│  └─ status: (HEALTHY) Connected
└─ elastic-agent
   ├─ status: (DEGRADED) 1 or more components/units in a failed state
   ├─ system/metrics-default
   │  ├─ status: (HEALTHY) Healthy: communicating with pid '8300'
   │  └─ system/metrics-default-system/metrics-system-68951c20-696b-4518-b64c-63de7317ef29
   │     └─ status: (DEGRADED) Error fetching data for metricset system.diskio: disk io counters: cannot open new key in the registry in order to enable the performance counters: Access is denied.
   ├─ windows/metrics-default
   │  ├─ status: (HEALTHY) Healthy: communicating with pid '5976'
   │  └─ windows/metrics-default-windows/metrics-windows-e02b045d-e899-4ef1-b6ce-400be3d94119
   │     └─ status: (FAILED) 1 error: initialization of reader failed: failed to expand counter (query='\Process(*)\% Processor Time'): Unable to connect to the specified computer or the computer is offline.
   └─ winlog-default
      ├─ status: (HEALTHY) Healthy: communicating with pid '5416'
      ├─ winlog-default-winlog-system-68951c20-696b-4518-b64c-63de7317ef29
      │  └─ status: (DEGRADED) failed to open Windows Event Log channel "Security": Access is denied.
      └─ winlog-default-winlog-windows-e02b045d-e899-4ef1-b6ce-400be3d94119
         └─ status: (DEGRADED) Encountered channel not found error when opening Windows Event Log: The specified channel could not be found.

The thing that strikes me there is we have multiple types of failures for winlog-default. One we newly want reported that hadn't been, and one we don't want reported that previously hadn't been.

I wonder if a good long term solution would be for the integration to feed in some type of filter data along with each channel it wants. Some enum or struct that tells it to actually degrade on access denied errors for this channel, but ignore not found errors for this channel, or warn/log (but not DEGRADE) for some other type of error for some other channel. Such that the integration can communicate how specific channel subscription failures should impact the integration's state. Trying to place that logic in code in filebeat seems awkward.
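
A minimal sketch of that idea, written in Python for brevity rather than in the Go used by Beats. `ErrorKind`, `Action`, `ChannelPolicy`, and `stream_status` are hypothetical names, not existing Beats or integration APIs, and the three error kinds are only the ones discussed in this thread.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, Optional


class ErrorKind(Enum):
    ACCESS_DENIED = "access_denied"
    CHANNEL_NOT_FOUND = "channel_not_found"
    OTHER = "other"


class Action(Enum):
    DEGRADE = "degrade"  # report the stream as DEGRADED
    WARN = "warn"        # log a warning, stay HEALTHY
    IGNORE = "ignore"    # stay HEALTHY silently


@dataclass
class ChannelPolicy:
    """Per-channel failure handling supplied by the integration."""
    channel: str
    on_error: Dict[ErrorKind, Action] = field(default_factory=dict)
    default: Action = Action.DEGRADE

    def action_for(self, kind: ErrorKind) -> Action:
        return self.on_error.get(kind, self.default)


def stream_status(policy: ChannelPolicy, error: Optional[ErrorKind]) -> str:
    """Map a channel subscription result to the status the input reports."""
    if error is None:
        return "HEALTHY"
    if policy.action_for(error) is Action.DEGRADE:
        return "DEGRADED"
    return "HEALTHY"
```

With a policy such as `ChannelPolicy("Microsoft-Windows-Sysmon/Operational", {ErrorKind.CHANNEL_NOT_FOUND: Action.IGNORE})`, a missing Sysmon channel would stay HEALTHY while an access denied error on the same channel would still degrade. A channel with no overrides, such as Security, would degrade on any error.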

I'm not sure that change could be ready for 8.16.0. Perhaps we should roll back the change that caused this issue and continue to tolerate the missing Access is denied status, since unprivileged mode appears to still be technically beta. Then we could incorporate the channel-specific failure handling from the Windows integration in the next release. Thoughts? @cmacknz @jamiehynds @nfritts

cmacknz commented 3 weeks ago

Having a per-input way to turn off the "errors mark the Beat as degraded" behavior would make sense vs just reverting the entire feature. This config could later expand into a list of specific errors to mute. For system/metrics we had similar ideas in https://github.com/elastic/beats/issues/40543 but it hasn't been implemented yet.

We have been more focused on fixing the specific errors, which in many cases have been actual bugs or permissions errors we were handling improperly. For system/metrics we also only get these errors when unprivileged.

The OTel collector process scraper allows muting specific categories of error, which is what we'd eventually want to emulate: https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/receiver/hostmetricsreceiver/README.md#process
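
A per-input setting along those lines might behave like the sketch below. The keys `degrade_on_error` and `mute_errors` and the function `report_unit_state` are hypothetical and do not exist in Beats today; the error category strings are placeholders. The error message is kept even when the status stays healthy, similar to what was done for the system metricsets.

```python
from typing import Optional, Tuple


def report_unit_state(
    input_cfg: dict, error_category: Optional[str], message: str = ""
) -> Tuple[str, str]:
    """Return (status, message) for an input unit.

    input_cfg keys (both hypothetical):
      degrade_on_error: bool, defaults to True
      mute_errors: list of error categories that never degrade the unit
    """
    if error_category is None:
        return ("HEALTHY", "Healthy")
    degrade = input_cfg.get("degrade_on_error", True)
    muted = error_category in input_cfg.get("mute_errors", [])
    status = "DEGRADED" if degrade and not muted else "HEALTHY"
    # The message is preserved either way so the error stays visible.
    return (status, message)
```

Starting with only the boolean and adding the mute list later would keep the first version small while leaving room for the more granular behavior.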

cmacknz commented 3 weeks ago

@bjmcnic followed up separately and we agreed we should revert the winlog specific change here in 8.16 while we work on a proper fix for 8.17 considering that:

  1. This affects privileged agents for winlog in the default configuration, AKA everybody on Windows.
  2. The final 8.16.0 BC is Thursday.
  3. The linked PR is specific to the winlog input.

There isn't time to do a more in-depth fix, and the current state will probably lead to a flood of support cases. The revert PR is in https://github.com/elastic/beats/pull/41468

cmacknz commented 3 weeks ago

https://github.com/elastic/beats/pull/41468 was merged, so this is now reverted from 8.16.

amolnater-qasource commented 5 days ago

Hi Team,

While testing on 8.17.0 SNAPSHOT, we have found this issue reproducible there too.

Observations:

Build details:

VERSION: 8.17.0 SNAPSHOT
BUILD: 80188
COMMIT: fdb16ae8cbdf4236db3696aa00d0bb98c943d864

Artifact: https://snapshots.elastic.co/8.17.0-7a041bf5/downloads/beats/elastic-agent/elastic-agent-8.17.0-SNAPSHOT-windows-x86_64.zip

Logs: elastic-agent-diagnostics-2024-11-18T09-03-35Z-00.zip

Please let us know if anything else is required from our end.

Thanks!

bjmcnic commented 4 days ago

It's fixed in 8.16.0. It looks like the revert was applied to the 8.16 branch but hasn't hit main yet.

c:\>"c:\Program Files\Elastic\Agent\elastic-agent.exe" version
Binary: 8.16.0 (build: 3f07f2fd932f20e972399306d394763ade6b74b4 at 2024-11-07 13:33:43 +0000 UTC)
Daemon: 8.16.0 (build: 3f07f2fd932f20e972399306d394763ade6b74b4 at 2024-11-07 13:33:43 +0000 UTC)

c:\>"c:\Program Files\Elastic\Agent\elastic-agent.exe" status --output full
┌─ fleet
│  └─ status: (HEALTHY) Connected
└─ elastic-agent
   ├─ status: (HEALTHY) Running
   ├─ info
   │  ├─ id: 4eb139dd-aa42-4ec4-9463-39f1b7b37f60
   │  ├─ version: 8.16.0
   │  └─ commit: 3f07f2fd932f20e972399306d394763ade6b74b4
   ├─ beat/metrics-monitoring
   │  ├─ status: (HEALTHY) Healthy: communicating with pid '5128'
   │  ├─ beat/metrics-monitoring
   │  │  ├─ status: (HEALTHY) Healthy
   │  │  └─ type: OUTPUT
   │  └─ beat/metrics-monitoring-metrics-monitoring-beats
   │     ├─ status: (HEALTHY) Healthy
   │     └─ type: INPUT
   ├─ filestream-monitoring
   │  ├─ status: (HEALTHY) Healthy: communicating with pid '520'
   │  ├─ filestream-monitoring
   │  │  ├─ status: (HEALTHY) Healthy
   │  │  └─ type: OUTPUT
   │  └─ filestream-monitoring-filestream-monitoring-agent
   │     ├─ status: (HEALTHY) Healthy
   │     └─ type: INPUT
   ├─ http/metrics-monitoring
   │  ├─ status: (HEALTHY) Healthy: communicating with pid '3876'
   │  ├─ http/metrics-monitoring
   │  │  ├─ status: (HEALTHY) Healthy
   │  │  └─ type: OUTPUT
   │  └─ http/metrics-monitoring-metrics-monitoring-agent
   │     ├─ status: (HEALTHY) Healthy
   │     └─ type: INPUT
   ├─ log-default
   │  ├─ status: (HEALTHY) Healthy: communicating with pid '5944'
   │  ├─ log-default
   │  │  ├─ status: (HEALTHY) Healthy
   │  │  └─ type: OUTPUT
   │  └─ log-default-logfile-system-f57fba90-d162-43bb-8f6b-546015d84c78
   │     ├─ status: (HEALTHY) Healthy
   │     └─ type: INPUT
   ├─ system/metrics-default
   │  ├─ status: (HEALTHY) Healthy: communicating with pid '3500'
   │  ├─ system/metrics-default
   │  │  ├─ status: (HEALTHY) Healthy
   │  │  └─ type: OUTPUT
   │  └─ system/metrics-default-system/metrics-system-f57fba90-d162-43bb-8f6b-546015d84c78
   │     ├─ status: (HEALTHY) Healthy
   │     └─ type: INPUT
   ├─ windows/metrics-default
   │  ├─ status: (HEALTHY) Healthy: communicating with pid '2220'
   │  ├─ windows/metrics-default
   │  │  ├─ status: (HEALTHY) Healthy
   │  │  └─ type: OUTPUT
   │  └─ windows/metrics-default-windows/metrics-windows-e715bcd7-f597-4666-9a17-c75be66c9e02
   │     ├─ status: (HEALTHY) Healthy
   │     └─ type: INPUT
   └─ winlog-default
      ├─ status: (HEALTHY) Healthy: communicating with pid '4692'
      ├─ winlog-default
      │  ├─ status: (HEALTHY) Healthy
      │  └─ type: OUTPUT
      ├─ winlog-default-winlog-system-f57fba90-d162-43bb-8f6b-546015d84c78
      │  ├─ status: (HEALTHY) Healthy
      │  └─ type: INPUT
      └─ winlog-default-winlog-windows-e715bcd7-f597-4666-9a17-c75be66c9e02
         ├─ status: (HEALTHY) Healthy
         └─ type: INPUT

c:\>

cmacknz commented 4 days ago

Those two PRs will add the change to main and the 8.x branch