Graylog2 / collector-sidecar

Manage log collectors through Graylog
https://www.graylog.org/

Sidecar does not correctly detect a stalled filebeat journald input #471

Open nroach44 opened 1 year ago

nroach44 commented 1 year ago

Problem description

When the journald files are corrupted (or the journald input otherwise fails), the sidecar does not notice, and no error is reported back to the Graylog server.

It's worth noting that filebeat doesn't exit when this occurs; it just stops the journald input. I'm pretty confident this isn't a situation the sidecar normally has to handle.

Possible upstream issue: https://github.com/elastic/beats/issues/32782
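Since filebeat stays alive while the input is dead, process supervision alone can't catch this. Filebeat does expose a local HTTP stats endpoint (`http.enabled: true` in the config, port 5066 by default), so one possible external workaround is a watchdog that checks whether events are still being acked. A rough sketch, not part of the sidecar itself; the 60-second threshold is an arbitrary assumption, and a genuinely quiet host would also trip it:

```python
import json
import time
import urllib.request

# Assumes http.enabled: true in the filebeat config (default port 5066).
STATS_URL = "http://localhost:5066/stats"


def acked_events(stats: dict) -> int:
    # Cumulative count of events acknowledged by the configured output,
    # as reported in filebeat's /stats JSON.
    return stats["libbeat"]["output"]["events"]["acked"]


def fetch_stats(url: str = STATS_URL) -> dict:
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)


def looks_stalled(interval_s: float = 60.0) -> bool:
    # If the acked counter does not grow over the interval,
    # the input may be stuck (or the host may simply be idle).
    before = acked_events(fetch_stats())
    time.sleep(interval_s)
    after = acked_events(fetch_stats())
    return after <= before
```

A systemd timer could run this periodically and restart the collector, or at least log a warning, when `looks_stalled()` returns true.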

Steps to reproduce the problem

  1. Have corrupted journal files:

journalctl --verify

PASS: /var/log/journal/cd7f7844c032416dafc4ea25fcfb0871/user-1000@64472e512c6c4c438219d1d337f19579-00000000000b7015-0005e7ab092b31d6.journal                       
2411ea0: Invalid entry item (18/21 offset: 000000                                                                                                                  
2411ea0: Invalid object contents: Bad message                                                                                                                      
File corruption detected at /var/log/journal/cd7f7844c032416dafc4ea25fcfb0871/system@0005e87773c4cec2-8daec554d00ac2a0.journal~:2411ea0 (of 41943040 bytes, 90%).  
FAIL: /var/log/journal/cd7f7844c032416dafc4ea25fcfb0871/system@0005e87773c4cec2-8daec554d00ac2a0.journal~ (Bad message)
PASS: /var/log/journal/cd7f7844c032416dafc4ea25fcfb0871/system@0005e7f6999d17a1-0a70d8fce6f2baff.journal~                          
  2. Have a sidecar installed on the server, with something like the following set up as a filebeat config assigned to the sidecar:
# Needed for Graylog
fields_under_root: true
fields.collector_node_id: ${sidecar.nodeName}
fields.gl2_source_collector: ${sidecar.nodeId}

filebeat.inputs:
- type: journald
  id: everything

output.logstash:
  enabled: true
  slow_start: true
  bulk_max_size: 512
  hosts: ["graylog.domain:1234"]
  backoff.init: 10
  backoff.max: 300

logging:
  level: warning
  to_files: false
  to_syslog: true
  json: false

path:
  data: /var/lib/graylog-sidecar/collectors/filebeat/data
  logs: /var/lib/graylog-sidecar/collectors/filebeat/log
  home: /usr/share/filebeat
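(As an aside: to make an external watchdog possible at all, the config above could additionally expose filebeat's local monitoring endpoint. This is a hypothetical addition, not part of the reproduction:)

```yaml
# Hypothetical addition: serve filebeat's stats locally so an external
# check can tell whether events are still flowing.
http:
  enabled: true
  host: localhost
  port: 5066
```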
  3. Observe that no log entries make it to the server
  4. Observe the filebeat output in the journal:
May 19 19:55:16 hostname systemd[1]: Started Wrapper service for Graylog controlled collector.
May 19 19:55:16 hostname graylog-sidecar[25062]: time="2023-05-19T19:55:16+08:00" level=info msg="Using node-id: <UUID>"
May 19 19:55:16 hostname graylog-sidecar[25062]: time="2023-05-19T19:55:16+08:00" level=info msg="No node name was configured, falling back to hostname"
May 19 19:55:16 hostname graylog-sidecar[25062]: time="2023-05-19T19:55:16+08:00" level=info msg="Starting signal distributor"
May 19 19:55:16 hostname graylog-sidecar[25062]: time="2023-05-19T19:55:16+08:00" level=info msg="Adding process runner for: filebeat-63a12208827d252d2f7931ca"
May 19 19:55:16 hostname graylog-sidecar[25062]: time="2023-05-19T19:55:16+08:00" level=info msg="[filebeat-63a12208827d252d2f7931ca] Configuration change detected, rewriting configuration file."
May 19 19:55:16 hostname filebeat[25072]: 2023-05-19T19:55:16.175+0800 WARN map[file.line:175 file.name:beater/filebeat.go] Filebeat is unable to load the ingest pipelines for the configured modules because the Elasticsearch output is not configured/enabled. If you have already loaded the ingest pipelines or are using Logstash pipelines, you can ignore this warning. {"ecs.version": "1.6.0"}
May 19 19:55:16 hostname graylog-sidecar[25062]: time="2023-05-19T19:55:16+08:00" level=info msg="[filebeat-63a12208827d252d2f7931ca] Starting (exec driver)"
May 19 19:55:16 hostname filebeat[25080]: 2023-05-19T19:55:16.237+0800 WARN map[file.line:175 file.name:beater/filebeat.go] Filebeat is unable to load the ingest pipelines for the configured modules because the Elasticsearch output is not configured/enabled. If you have already loaded the ingest pipelines or are using Logstash pipelines, you can ignore this warning. {"ecs.version": "1.6.0"}
May 19 19:55:16 hostname filebeat[25080]: 2023-05-19T19:55:16.287+0800 WARN map[file.line:307 file.name:beater/filebeat.go] Filebeat is unable to load the ingest pipelines for the configured modules because the Elasticsearch output is not configured/enabled. If you have already loaded the ingest pipelines or are using Logstash pipelines, you can ignore this warning. {"ecs.version": "1.6.0"}
May 19 19:55:16 hostname filebeat[25080]: 2023-05-19T19:55:16.287+0800 WARN [input] map[file.line:102 file.name:v2/loader.go] EXPERIMENTAL: The journald input is experimental        {"ecs.version": "1.6.0"}
May 19 19:55:16 hostname filebeat[25080]: 2023-05-19T19:55:16.324+0800 ERROR [input.journald] map[file.line:124 file.name:compat/compat.go] Input 'journald' failed with: input.go:130: input everything failed (id=everything)
                                                  failed to read message field: bad message        {"ecs.version": "1.6.0"}
  5. Observe that the collector status still shows as "Running"
  6. Remove the corrupt file, restart the service and view the collected logs
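Since `journalctl --verify` flags the corrupt files (as in the output above), a pre-check before (re)starting the collector could identify them automatically. A minimal sketch; it assumes the `FAIL: <path> (<reason>)` line format shown in the sample output:

```python
import subprocess


def failed_journal_files(verify_output: str) -> list[str]:
    """Return paths of journal files that FAILed `journalctl --verify`."""
    fails = []
    for line in verify_output.splitlines():
        if line.startswith("FAIL:"):
            # e.g. "FAIL: /var/log/journal/.../system@....journal~ (Bad message)"
            path = line[len("FAIL:"):].strip().split(" (")[0]
            fails.append(path)
    return fails


def verify_journals() -> list[str]:
    # journalctl --verify exits non-zero when any file fails verification;
    # check both streams, since the exact output stream may vary.
    result = subprocess.run(
        ["journalctl", "--verify"], capture_output=True, text=True
    )
    return failed_journal_files(result.stdout + result.stderr)
```

Any paths returned could then be moved aside before the service is restarted, mirroring the manual fix in the last step.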

Environment