GoogleCloudPlatform / ops-agent

Apache License 2.0

Lua syntax error in ops agent fluent bit for Ops agent 2.14 #575

Closed: jimdoescode closed this issue 2 years ago

jimdoescode commented 2 years ago

The logging agent fails to start and logs a Lua error.

Output from running: sudo systemctl status google-cloud-ops-agent"*"

● google-cloud-ops-agent-fluent-bit.service - Google Cloud Ops Agent - Logging Agent
     Loaded: loaded (/lib/systemd/system/google-cloud-ops-agent-fluent-bit.service; static; vendor preset: enabled)
     Active: failed (Result: exit-code) since Tue 2022-05-03 12:25:21 UTC; 20s ago
    Process: 209041 ExecStartPre=/opt/google-cloud-ops-agent/libexec/google_cloud_ops_agent_engine -service=fluentbit -in /etc/google-cloud-ops-agent/config.yaml -logs ${LOGS_DIRECTORY} -state ${STAT>
    Process: 209045 ExecStart=/opt/google-cloud-ops-agent/subagents/fluent-bit/bin/fluent-bit --config ${RUNTIME_DIRECTORY}/fluent_bit_main.conf --parser ${RUNTIME_DIRECTORY}/fluent_bit_parser.conf ->
   Main PID: 209045 (code=exited, status=255/EXCEPTION)

May 03 12:25:21 staging-webserver-1 systemd[1]: google-cloud-ops-agent-fluent-bit.service: Main process exited, code=exited, status=255/EXCEPTION
May 03 12:25:21 staging-webserver-1 systemd[1]: google-cloud-ops-agent-fluent-bit.service: Failed with result 'exit-code'.
May 03 12:25:21 staging-webserver-1 systemd[1]: google-cloud-ops-agent-fluent-bit.service: Scheduled restart job, restart counter is at 11.
May 03 12:25:21 staging-webserver-1 systemd[1]: Stopped Google Cloud Ops Agent - Logging Agent.
May 03 12:25:21 staging-webserver-1 systemd[1]: google-cloud-ops-agent-fluent-bit.service: Start request repeated too quickly.
May 03 12:25:21 staging-webserver-1 systemd[1]: google-cloud-ops-agent-fluent-bit.service: Failed with result 'exit-code'.
May 03 12:25:21 staging-webserver-1 systemd[1]: Failed to start Google Cloud Ops Agent - Logging Agent.

Log from /var/log/google-cloud-ops-agent/subagents/logging-module.log

[2022/05/03 12:25:21] [ info] [input:storage_backlog:storage_backlog.6] queue memory limit: 47.7M
[2022/05/03 12:25:21] [error] [luajit] error loading script: ...ps-agent-fluent-bit/e7e11249be03393f6a60033c65ad69bf.lua:7: 'end' expected (to close 'if' at line 4) near 'return'
[2022/05/03 12:25:21] [error] Failed initialize filter lua.5
[2022/05/03 12:25:21] [ info] [input] pausing fluentbit_metrics.0
[2022/05/03 12:25:21] [ info] [input] pausing tail.1
[2022/05/03 12:25:21] [ info] [input] pausing tail.2
[2022/05/03 12:25:21] [ info] [input] pausing tail.3
[2022/05/03 12:25:21] [ info] [input] pausing tail.4
[2022/05/03 12:25:21] [ info] [input] pausing tail.5
[2022/05/03 12:25:21] [ info] [input] pausing storage_backlog.6

jimdoescode commented 2 years ago

Reverting to 2.13 corrects the issue.

franciscovalentecastro commented 2 years ago

I encountered this error (or one very similar) recently:

[2022/05/03 12:25:21] [error] [luajit] error loading script: ...ps-agent-fluent-bit/e7e11249be03393f6a60033c65ad69bf.lua:7: 'end' expected (to close 'if' at line 4) near 'return'

I made a fix (#568), which will be released with version 2.15.

jimdoescode commented 2 years ago

@franciscovalentecastro I was looking at that PR, as it did seem like it might solve the issue. Unfortunately, my Lua knowledge isn't as good as my knowledge of most other languages 😅

I'll wait for 2.15 to drop and confirm. Thanks for the effort!

quentinmit commented 2 years ago

@jimdoescode I'm curious what config file you are using - I don't recognize the md5 hash of the generated script, and Francisco's change should only affect non-default configs. Can you paste a copy of your config?

jimdoescode commented 2 years ago

# Copyright 2020 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#      http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# <== Enter custom agent configurations in this file.
# See https://cloud.google.com/stackdriver/docs/solutions/agents/ops-agent/configuration
# for more details.
logging:
    receivers:
        stage_application_monolog:
            type: files
            include_paths:
                - /var/opt/application/logs/app.log
        application_workers:
            type: files
            include_paths:
                - /var/opt/application/logs/workers.log
        apache_access:
            type: apache_access
        apache_errors:
            type: apache_error
    processors:
        hide_pulse_requests:
            type: exclude_logs
            match_any:
                - 'httpRequest.requestUrl = "/pulse.html"'
        monolog_processor:
            type: parse_json
            field: message
            time_key: timestamp
            time_format: "%Y-%m-%dT%H:%M:%S.%L"
    service:
        pipelines:
            default_pipeline:
                receivers:
                    - stage_application_monolog
                processors:
                    - monolog_processor
            misc_pipeline:
                receivers:
                    - application_workers
            apache_access:
                receivers:
                    - apache_access
                processors:
                    - hide_pulse_requests
            apache_errors:
                receivers:
                    - apache_errors
metrics:
    receivers:
        hostmetrics:
            type: hostmetrics
            collection_interval: 30s
        apache_metrics:
            type: apache
    processors:
        metrics_filter:
            type: exclude_metrics
            metrics_pattern: []
    service:
        pipelines:
            default_pipeline:
                receivers: [hostmetrics]
                processors: [metrics_filter]
            apache:
                receivers:
                    - apache_metrics

The app log and worker logs contain JSON-formatted log lines. The pulse.html file is just a load balancer health check.

quentinmit commented 2 years ago

Gotcha, this is #568 then. It was triggered by your use of httpRequest.requestUrl in exclude_logs. (Matching on fields that are not under jsonPayload is broken without #568.)
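
For anyone who cannot revert to 2.13 or wait for 2.15, one possible interim workaround follows from the diagnosis above: stop referencing the exclude_logs processor that matches on httpRequest.requestUrl. The fragment below is only a sketch based on the config posted in this thread. It was not tested or confirmed by anyone here, and it assumes that a processor which is defined but not referenced by any pipeline is not turned into a Lua filter.

```yaml
# Hypothetical interim change for Ops Agent 2.14 (unconfirmed assumption):
# leave the hide_pulse_requests definition under logging.processors as-is,
# but stop referencing it from the apache_access pipeline.
logging:
    service:
        pipelines:
            apache_access:
                receivers:
                    - apache_access
                # processors:
                #     - hide_pulse_requests   # restore after upgrading to 2.15
```

The trade-off is that the /pulse.html health-check requests are ingested again until the processor is restored.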

jimdoescode commented 2 years ago

Awesome, I'll close this then and wait for 2.15!

Thanks for the clarification and all your hard work!