elastic / kibana

Your window into the Elastic Stack
https://www.elastic.co/products/kibana

[Meta] [Fleet] [Agent] [Integrations] Elastic Agent Logs Errors Review #91804

Open dikshachauhan-qasource opened 3 years ago

dikshachauhan-qasource commented 3 years ago

The team is starting a meta issue to record the known errors found in the logs when starting Elastic Agent and Elastic Endpoint / Metricbeat / Filebeat.

We will keep the lists here, all together, so they are easier to review.

When a tester finds a new 'error' in the logs, let us log it as a separate issue and tag it in this description, so that this 'meta' issue can collect disparate sources of errors that may be related. This will help us map what we see in the logs, log individual issues to track, prevent noise from new issues, and prevent issues from slipping by. Feedback on the project is welcome.

Further intentions: we intend to keep the comments section as clean as we can. Anyone is welcome to post comments, but we will review them, likely remove them, and move the errors up into the description (for commenters who don't have edit access). We look forward to issue commentary and links to bugs (one per error message).

| Dataset | Version seen | Relating keywords | Other notes / OS | Log error |
| --- | --- | --- | --- | --- |
| Agent | 7.10.x | Agent upgrade | all OS - elastic/beats/issues/23850 | failed to load markeropen C:\Program Files\Elastic\Agent\data.update-marker: The system cannot find the file specified |
| Agent | 7.11 | on changing agent logging level | ? | failed to dispatch actions, error: acknowledge 0 actions '[]' for elastic-agent '8adbc8c0-6b99-11eb-9735-1d60013fe6b1' failed: fail to ack to fleet: Post "https://a611e402a7d5476ea8a937892b05eb01.europe-west1.gcp.cloud.es.io:443/api/fleet/agents/8adbc8c0-6b99-11eb-9735-1d60013fe6b1/acks?": context canceled |
| Agent | 7.12 | Agent fleet page | All OSs | [elastic_agent][error] Could not communicate with Checking API will retry, error: Status code: 503, Kibana returned an error: Service Unavailable, message: socket hang up |
| Agent | 7.12 | Agent fleet page | All OSs | [elastic_agent][error] Could not communicate with Checking API will retry, error: Status code: 503, Kibana returned an error: Service Unavailable, message: Response aborted while reading the body |
| Agent | 7.12 BC2 | Endpoint integration not deployed successfully in agent policy | Debian OS Agent and Win 2008 Agent | [elastic_agent][error] failed to dispatch actions, error: fail to generate program configuration: InjectStreamProcessorRule: processors is not a list |
| Agent | 7.12BC3 | Agent Logs Tab [NO DEPENDENCY ON ENDPOINT SECURITY] | Windows 10 specific | [elastic_agent][error] failed writing connection info to spawned application: failed to write connection information: write 1: The pipe has been ended |
| Agent | 7.12BC5 & 7.14.2 BC1 | Agent installation with endpoint | observed on Windows | [elastic_agent][error] Failed to render configuration with latest context from composable controller: operator: failed to execute step sc-run, error: context canceled: context canceled |
| Agent | 7.13BC3 | Agent installation | observed on All OSs | [elastic_agent][error] Could not communicate with Checking API will retry, error: fail to checkin to fleet: Post "https://mainqa-atlcolo-10-0-6-140.eng.endgames.local:8220/api/fleet/agents/96c02dc2-8624-43f9-aedc-3f2dc8f02d74/checkin?": net/http: request canceled (Client.Timeout exceeded while awaiting headers) |
| Agent | 7.13BC3 | Agent installation | observed on All OSs | [elastic_agent][error] 2021-04-29T04:41:46-04:00: type: 'ERROR': sub_type: 'FAILED' message: Application: fleet-server--7.13.0[96c02dc2-8624-43f9-aedc-3f2dc8f02d74]: State changed to CRASHED: exited with code: 2 |
| Agent | 7.15BC5 | When Endpoint added in agent policy | on mac OS | [elastic_agent.endpoint_security][error] Interfaces.cpp:123 Interface name too long to get MAC address; name [stf0], namelen [4], sdl_alen [0], sdl_nlen [4] (see net/if_dl.h -> sdl_data) |
| Agent | 7.14.2 BC1 | without Endpoint | on Linux and Mac | [elastic_agent][error] 2021-09-16T06:40:21Z - message: Application: metricbeat--7.14.2[220787bf-d5a9-4b7d-ab5e-fe8a4a0aee8e]: State changed to FAILED: context canceled - type: 'ERROR' - sub_type: 'FAILED' |
| Agent | 7.14.2 BC1 | without Endpoint | on Linux | [elastic_agent][error] Failed to render configuration with latest context from composable controller: operator: failed to execute step sc-run, error: context canceled: context canceled |
| Agent | 7.14.2 BC1, 8.0 snapshot, 7.17 BC2 | Agent/ Machine reboot | on All OS | [elastic_agent][error] Could not communicate with fleet-server Checking API will retry, error: fail to checkin to fleet-server: Post "https://d0ec8aeec2cc4f91a234389e8d507ac4.fleet.europe-west1.gcp.cloud.es.io:443/api/fleet/agents/80010e62-3de9-4634-92a9-560d7e7eeca9/checkin?": context canceled |
| Agent | 7.14.2 BC1 | on reassigning policy without Endpoint | on Linux | [elastic_agent][error] 2021-09-16T08:03:46-04:00 - message: Application: filebeat--7.14.2[d65b548b-6f6d-4b2c-b4e4-16cc3c3311ef]: State changed to FAILED: 1 error occurred:* 2 errors: Error creating runner from config: Can only start an input when all related states are finished: {Id: native::17339927-64768, Finished: false, Fileinfo: &{secure 1098 384 {19043336 63767390409 0x5556183a4320} {64768 17339927 1 33152 0 0 0 0 1098 4096 8 {1631792852 216038931} {1631793609 19043336} {1631793609 19043336} [0 0 0]}}, Source: /var/log/secure, Offset: 1098, Timestamp: 2021-09-16 08:02:32.464767172 -0400 EDT m=+14.149141116, TTL: -1ns, Type: log, Meta: map[], FileStateOS: 17339927-64768}; Error creating runner from config: Can only start an input when all related states are finished: {Id: native::17339916-64768, Finished: false, Fileinfo: &{messages 8105 384 {425064563 63767390529 0x5556183a4320} {64768 17339916 1 33152 0 0 0 0 8105 4096 16 {1631792852 228039232} {1631793729 425064563} {1631793729 425064563} [0 0 0]}}, Source: /var/log/messages, Offset: 8105, Timestamp: 2021-09-16 08:02:32.464767172 -0400 EDT m=+14.149141116, TTL: -1ns, Type: log, Meta: map[], FileStateOS: 17339916-64768}- type: 'ERROR' - sub_type: 'FAILED' |
| Agent | 8.0 snapshot | On reassigning policy with Endpoint | on linux OS | [elastic_agent][error] failed to commit acker after acknowledging action with id '%!s(func() string=0x55d9b69ccc20)' |
| Agent | 7.17 BC2 | Machine reboot | on aarch64 agent | [elastic_agent][error] context canceled |
| Agent | 8.2 BC4 | Reboot/Adding ES/Changing logging level | All OS's | [elastic_agent][error] Could not communicate with fleet-server Checking API will retry, error: fail to checkin to fleet-server: Post "https://d559b897002c4071b982ca8bdd9873a7.fleet.us-central1.gcp.foundit.no:443/api/fleet/agents/a1d5b842-0318-4e17-845f-f8f5c4988012/checkin?": context canceled |

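
Several rows in the list above appear to be the same error with only the agent ID, Fleet host, or timestamp changed (for example the two "fail to checkin to fleet-server ... context canceled" entries). A minimal Python sketch of one way to spot such repeats is to replace the variable parts with placeholders and compare the results. The regular expressions and the `signature` helper are assumptions made for illustration, not part of any Elastic tooling, and they only cover the patterns visible in this list.

```python
import re

# Assumed patterns for the variable parts seen in the list above (not exhaustive).
_TIMESTAMP = re.compile(r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(?:Z|[+-]\d{2}:\d{2})")
_URL_HOST = re.compile(r"https://[^/\s\"]+")
_UUID = re.compile(r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}")


def signature(message: str) -> str:
    """Return the message with timestamps, URL hosts, and agent IDs masked."""
    message = _TIMESTAMP.sub("<ts>", message)
    message = _URL_HOST.sub("https://<host>", message)
    message = _UUID.sub("<id>", message)
    return message


# The two check-in errors recorded for 7.14.2 BC1 and 8.2 BC4.
first = (
    '[elastic_agent][error] Could not communicate with fleet-server Checking API '
    'will retry, error: fail to checkin to fleet-server: Post '
    '"https://d0ec8aeec2cc4f91a234389e8d507ac4.fleet.europe-west1.gcp.cloud.es.io:443'
    '/api/fleet/agents/80010e62-3de9-4634-92a9-560d7e7eeca9/checkin?": context canceled'
)
second = (
    '[elastic_agent][error] Could not communicate with fleet-server Checking API '
    'will retry, error: fail to checkin to fleet-server: Post '
    '"https://d559b897002c4071b982ca8bdd9873a7.fleet.us-central1.gcp.foundit.no:443'
    '/api/fleet/agents/a1d5b842-0318-4e17-845f-f8f5c4988012/checkin?": context canceled'
)

# Both messages reduce to the same masked text, so they can share one entry.
print(signature(first) == signature(second))
```

Masking before comparing keeps the list short: a tester only needs to add a row when the masked text is new.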
elasticmachine commented 3 years ago

Pinging @elastic/fleet (Team:Fleet)

elasticmachine commented 3 years ago

Pinging @elastic/security-solution (Team: SecuritySolution)

dikshachauhan-qasource commented 3 years ago

Related to https://github.com/elastic/kibana/issues/90891

amolnater-qasource commented 3 years ago

Hi @EricDavisX, we have reported 3 issues for the "errors" in the logs that we were able to reproduce. Below are the links to the reported tickets:

Please let us know if anything else is required from our end. Thanks QAS

dikshachauhan-qasource commented 3 years ago

Hi @EricDavisX ,

Today we checked out Charlie's environment and observed error logs there, which we have reported in the issue below.

https://github.com/elastic/beats/issues/27099

Thanks QAS

dikshachauhan-qasource commented 3 years ago

Hi @EricDavisX

While testing 7.14.2 BC1 today, we observed the error logs recorded above; the table has been updated with them.

We have also logged https://github.com/elastic/beats/issues/27968 for an issue observed specifically on Linux.

Thanks QAS

EricDavisX commented 3 years ago

Tagging this with another log citation / clean-up issue: https://github.com/elastic/beats/issues/28000

amolnater-qasource commented 2 years ago

Hi @EricDavisX, we have tested Agent error logs on the 7.16.0 BC-5 Kibana cloud-production environment. We did not observe any major error logs for elastic-agent on installation or reboot.

OSs covered:

Only the error log below was observed on Windows and macOS agents on reboot:

15:48:27.660
elastic_agent
[elastic_agent][error] Could not communicate with fleet-server Checking API will retry, error: fail to checkin to fleet-server: Post "https://f2a2ed2f5f8e4c42a03974480e3cf6ea.fleet.europe-west1.gcp.cloud.es.io:443/api/fleet/agents/fda78eec-3833-49b2-ad63-cb9bd4303e45/checkin?": context canceled

Please let us know if anything else is required from our end. Thanks

EricDavisX commented 2 years ago

I think this is still a useful exercise to perform. Ideally we'd leave an Agent up for a period of time (hours or more), perform various non-destructive testing / usage on it, and then check the error logs (and add issues for any new discrete error log messages we see).

I am un-assigning myself, but Diksha and Sagar will continue on the testing here for each minor / major release.
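
The "check the error logs for new discrete messages" step could be partly scripted. The sketch below is only an illustration under stated assumptions: the `[error]` tag format is taken from the entries in the description, while `new_error_lines`, the `known_fragments` list, and the `[info]` sample line are hypothetical.

```python
from typing import Iterable, Iterator

ERROR_TAG = "[error]"  # tag format as it appears in the description's log entries


def new_error_lines(lines: Iterable[str], known_fragments: Iterable[str]) -> Iterator[str]:
    """Yield error lines that contain none of the already-recorded fragments."""
    known = list(known_fragments)
    for line in lines:
        if ERROR_TAG not in line:
            continue  # skip non-error lines
        if any(fragment in line for fragment in known):
            continue  # already tracked in the meta issue
        yield line.strip()


# Hypothetical list of fragments for errors that are already recorded.
known_fragments = ["context canceled", "socket hang up"]

sample_log = [
    "[elastic_agent][error] Could not communicate with Checking API will retry, "
    "error: Status code: 503, Kibana returned an error: Service Unavailable, "
    "message: socket hang up",
    "[elastic_agent][info] hypothetical informational line",
    "[elastic_agent][error] failed to dispatch actions, error: fail to generate "
    "program configuration: InjectStreamProcessorRule: processors is not a list",
]

unseen = list(new_error_lines(sample_log, known_fragments))
for line in unseen:
    print(line)
```

Substring matching is deliberately crude; it is enough to flag candidates for a tester to review before logging a new issue.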

dikshachauhan-qasource commented 2 years ago

Hi Team,

Today we tested agent error logs on different OSs with the 8.0.0 snapshot build and found a few errors. We have updated the table with entries for the 8.0 snapshot build.

Tested on Build: BUILD: 49040 COMMIT: 155e06787e48de9a8de4345d86a826e95edf32ec ARTIFACT: https://snapshots.elastic.co/8.0.0-1df952d9/summary-8.0.0-SNAPSHOT.html

Thanks QAS

dikshachauhan-qasource commented 2 years ago

Hi @EricDavisX

We have tested the error log collection scenarios on the 7.17 BC2 build and updated the table above with entries for that version.

BUILD: 46488 COMMIT: a6fd029464413f6979099d7a3d4232c5194a269d ARTIFACT: https://staging.elastic.co/7.17.0-1bd53ff7/summary-7.17.0.html

Please let us know if more info is required.

Thanks QAS