fluent / fluent-bit

Fast and Lightweight Logs and Metrics processor for Linux, BSD, OSX and Windows
https://fluentbit.io
Apache License 2.0
5.82k stars 1.58k forks source link

[Windows 2016] Fluent Bit service enters a flush loop when trying to stop the service #4300

Closed qingling128 closed 2 years ago

qingling128 commented 2 years ago

Bug Report

Describe the bug When stopping / restarting Fluent Bit service, occasionally it enters a flush loop and hangs (In the experiment we observed it for 10+ minutes. It is still trying to flush log entries, even though we leave the grace period as 5 seconds. For the context, we are using file based buffering, so flush_at_shutdown should default to false, which means it is not expected for Fluent Bit to try flushing log entries prior to shutdown.

Expected behavior

Fluent Bit stops within a reasonable time (aka close to the grace config value) after gettingservice stop` signal.

Sample logs

[2021/11/09 02:17:13] [ warn] [engine] service will stop in 5 seconds
[2021/11/09 02:17:18] [ info] [engine] service stopped
[2021/11/09 02:17:18] [ warn] [engine] shutdown delayed, grace period has finished but some tasks are still running.
[2021/11/09 02:17:18] [ info] [task] winlog/winlog.0 has 0 pending task(s):
[2021/11/09 02:17:18] [ info] [task] tail/tail.1 has 1 pending task(s):
[2021/11/09 02:17:18] [ info] [task]   task_id=0 still running on route(s): stackdriver/stackdriver.1 
[2021/11/09 02:17:18] [ info] [task] storage_backlog/storage_backlog.2 has 0 pending task(s):
[2021/11/09 02:17:18] [ info] [task] emitter/emitter_for_rewrite_tag.6 has 2 pending task(s):
[2021/11/09 02:17:18] [ info] [task]   task_id=1 still running on route(s): stackdriver/stackdriver.0 
[2021/11/09 02:17:18] [ info] [task]   task_id=2 still running on route(s): stackdriver/stackdriver.0 
[2021/11/09 02:17:18] [ warn] [engine] service will stop in 5 seconds
[2021/11/09 02:17:23] [error] [upstream] connection #2096 to logging.googleapis.com:443 timed out after 10 seconds
[2021/11/09 02:17:23] [ info] [engine] service stopped
[2021/11/09 02:17:23] [ warn] [engine] shutdown delayed, grace period has finished but some tasks are still running.
[2021/11/09 02:17:23] [ info] [task] winlog/winlog.0 has 0 pending task(s):
[2021/11/09 02:17:23] [ info] [task] tail/tail.1 has 1 pending task(s):
[2021/11/09 02:17:23] [ info] [task]   task_id=0 still running on route(s): stackdriver/stackdriver.1 
[2021/11/09 02:17:23] [ info] [task] storage_backlog/storage_backlog.2 has 0 pending task(s):
[2021/11/09 02:17:23] [ info] [task] emitter/emitter_for_rewrite_tag.6 has 2 pending task(s):
[2021/11/09 02:17:23] [ info] [task]   task_id=1 still running on route(s): stackdriver/stackdriver.0 
[2021/11/09 02:17:23] [ info] [task]   task_id=2 still running on route(s): stackdriver/stackdriver.0 
[2021/11/09 02:17:23] [ warn] [engine] service will stop in 5 seconds
[2021/11/09 02:17:28] [ info] [engine] service stopped
[2021/11/09 02:17:28] [ warn] [engine] shutdown delayed, grace period has finished but some tasks are still running.
[2021/11/09 02:17:28] [ info] [task] winlog/winlog.0 has 0 pending task(s):
[2021/11/09 02:17:28] [ info] [task] tail/tail.1 has 1 pending task(s):
[2021/11/09 02:17:28] [ info] [task]   task_id=0 still running on route(s): stackdriver/stackdriver.1 
[2021/11/09 02:17:28] [ info] [task] storage_backlog/storage_backlog.2 has 0 pending task(s):
[2021/11/09 02:17:28] [ info] [task] emitter/emitter_for_rewrite_tag.6 has 2 pending task(s):
[2021/11/09 02:17:28] [ info] [task]   task_id=1 still running on route(s): stackdriver/stackdriver.0 
[2021/11/09 02:17:28] [ info] [task]   task_id=2 still running on route(s): stackdriver/stackdriver.0 
[2021/11/09 02:17:28] [ warn] [engine] service will stop in 5 seconds
[2021/11/09 02:17:33] [ info] [engine] service stopped
[2021/11/09 02:17:33] [ warn] [engine] shutdown delayed, grace period has finished but some tasks are still running.
[2021/11/09 02:17:33] [ info] [task] winlog/winlog.0 has 0 pending task(s):
[2021/11/09 02:17:33] [ info] [task] tail/tail.1 has 1 pending task(s):
[2021/11/09 02:17:33] [ info] [task]   task_id=0 still running on route(s): stackdriver/stackdriver.1 
[2021/11/09 02:17:33] [ info] [task] storage_backlog/storage_backlog.2 has 0 pending task(s):
[2021/11/09 02:17:33] [ info] [task] emitter/emitter_for_rewrite_tag.6 has 2 pending task(s):
[2021/11/09 02:17:33] [ info] [task]   task_id=1 still running on route(s): stackdriver/stackdriver.0 
[2021/11/09 02:17:33] [ info] [task]   task_id=2 still running on route(s): stackdriver/stackdriver.0 
[2021/11/09 02:17:33] [ warn] [engine] service will stop in 5 seconds
[2021/11/09 02:17:38] [ info] [engine] service stopped
[2021/11/09 02:17:38] [ warn] [engine] shutdown delayed, grace period has finished but some tasks are still running.
[2021/11/09 02:17:38] [ info] [task] winlog/winlog.0 has 0 pending task(s):
[2021/11/09 02:17:38] [ info] [task] tail/tail.1 has 1 pending task(s):
[2021/11/09 02:17:38] [ info] [task]   task_id=0 still running on route(s): stackdriver/stackdriver.1 
[2021/11/09 02:17:38] [ info] [task] storage_backlog/storage_backlog.2 has 0 pending task(s):
[2021/11/09 02:17:38] [ info] [task] emitter/emitter_for_rewrite_tag.6 has 2 pending task(s):
[2021/11/09 02:17:38] [ info] [task]   task_id=1 still running on route(s): stackdriver/stackdriver.0 
[2021/11/09 02:17:38] [ info] [task]   task_id=2 still running on route(s): stackdriver/stackdriver.0 
[2021/11/09 02:17:38] [ warn] [engine] service will stop in 5 seconds
[2021/11/09 02:17:43] [ info] [engine] service stopped
[2021/11/09 02:17:43] [ warn] [engine] shutdown delayed, grace period has finished but some tasks are still running.
[2021/11/09 02:17:43] [ info] [task] winlog/winlog.0 has 0 pending task(s):
[2021/11/09 02:17:43] [ info] [task] tail/tail.1 has 1 pending task(s):
[2021/11/09 02:17:43] [ info] [task]   task_id=0 still running on route(s): stackdriver/stackdriver.1 
[2021/11/09 02:17:43] [ info] [task] storage_backlog/storage_backlog.2 has 0 pending task(s):
[2021/11/09 02:17:43] [ info] [task] emitter/emitter_for_rewrite_tag.6 has 2 pending task(s):
[2021/11/09 02:17:43] [ info] [task]   task_id=1 still running on route(s): stackdriver/stackdriver.0 
[2021/11/09 02:17:43] [ info] [task]   task_id=2 still running on route(s): stackdriver/stackdriver.0 
[2021/11/09 02:17:43] [ warn] [engine] service will stop in 5 seconds
[2021/11/09 02:17:48] [ info] [engine] service stopped
[2021/11/09 02:17:48] [ warn] [engine] shutdown delayed, grace period has finished but some tasks are still running.
[2021/11/09 02:17:48] [ info] [task] winlog/winlog.0 has 0 pending task(s):
[2021/11/09 02:17:48] [ info] [task] tail/tail.1 has 1 pending task(s):
[2021/11/09 02:17:48] [ info] [task]   task_id=0 still running on route(s): stackdriver/stackdriver.1 
[2021/11/09 02:17:48] [ info] [task] storage_backlog/storage_backlog.2 has 0 pending task(s):
[2021/11/09 02:17:48] [ info] [task] emitter/emitter_for_rewrite_tag.6 has 2 pending task(s):
[2021/11/09 02:17:48] [ info] [task]   task_id=1 still running on route(s): stackdriver/stackdriver.0 
[2021/11/09 02:17:48] [ info] [task]   task_id=2 still running on route(s): stackdriver/stackdriver.0 
[2021/11/09 02:17:48] [ warn] [engine] service will stop in 5 seconds
[2021/11/09 02:17:53] [ info] [engine] service stopped
[2021/11/09 02:17:53] [ warn] [engine] shutdown delayed, grace period has finished but some tasks are still running.
[2021/11/09 02:17:53] [ info] [task] winlog/winlog.0 has 0 pending task(s):
[2021/11/09 02:17:53] [ info] [task] tail/tail.1 has 1 pending task(s):
[2021/11/09 02:17:53] [ info] [task]   task_id=0 still running on route(s): stackdriver/stackdriver.1 
[2021/11/09 02:17:53] [ info] [task] storage_backlog/storage_backlog.2 has 0 pending task(s):
[2021/11/09 02:17:53] [ info] [task] emitter/emitter_for_rewrite_tag.6 has 2 pending task(s):
[2021/11/09 02:17:53] [ info] [task]   task_id=1 still running on route(s): stackdriver/stackdriver.0 
[2021/11/09 02:17:53] [ info] [task]   task_id=2 still running on route(s): stackdriver/stackdriver.0 
[2021/11/09 02:17:53] [ warn] [engine] service will stop in 5 seconds

Your Environment

Additional context

qingling128 commented 2 years ago

Manually reproduced the issue with:

gcloud compute instances create --image-project windows-cloud --image windows-server-2016-dc-v20211012 windows-2016-test-1108-3-round2

// Connect to that VM, then run: 
(New-Object Net.WebClient).DownloadFile("https://dl.google.com/cloudagents/add-google-cloud-ops-agent-repo.ps1", "C:\Users\lingshi\add-google-cloud-ops-agent-repo.ps1")

PS C:\Users\lingshi> $env:REPO_SUFFIX="20211108-3"; .\add-google-cloud-ops-agent-repo.ps1 -AlsoInstall

// To enable debug logging, change C:\Program Files\Google\Cloud Operations\Ops Agent\config\config.yaml to
logging:
  service:
    log_level: debug

// To show fluent bit logs:
notepad C:\ProgramData\Google\Cloud Operations\Ops Agent\log\logging-module.log

Logs:

[2021/11/10 02:55:05] [ info] [engine] started (pid=4880)
[2021/11/10 02:55:05] [ info] [storage] version=1.1.5, initializing...
[2021/11/10 02:55:05] [ info] [storage] root path 'C:\ProgramData\Google\Cloud Operations\Ops Agent\run\buffers'
[2021/11/10 02:55:05] [ info] [storage] normal synchronization mode, checksum enabled, max_chunks_up=128
[2021/11/10 02:55:05] [ info] [storage] backlog input plugin: storage_backlog.2
[2021/11/10 02:55:05] [ info] [cmetrics] version=0.2.2
[2021/11/10 02:55:05] [ info] [input:storage_backlog:storage_backlog.2] queue memory limit: 47.7M
[2021/11/10 02:55:05] [ info] [output:stackdriver:stackdriver.0] metadata_server set to http://metadata.google.internal
[2021/11/10 02:55:05] [ warn] [output:stackdriver:stackdriver.0] client_email is not defined, using a default one
[2021/11/10 02:55:05] [ warn] [output:stackdriver:stackdriver.0] private_key is not defined, fetching it from metadata server
[2021/11/10 02:55:05] [ info] [output:stackdriver:stackdriver.1] metadata_server set to http://metadata.google.internal
[2021/11/10 02:55:05] [ warn] [output:stackdriver:stackdriver.1] client_email is not defined, using a default one
[2021/11/10 02:55:05] [ warn] [output:stackdriver:stackdriver.1] private_key is not defined, fetching it from metadata server
[2021/11/10 02:55:05] [ info] [output:stackdriver:stackdriver.0] worker #0 started
[2021/11/10 02:55:05] [ info] [output:stackdriver:stackdriver.0] worker #1 started
[2021/11/10 02:55:05] [ info] [output:stackdriver:stackdriver.0] worker #2 started
[2021/11/10 02:55:05] [ info] [output:stackdriver:stackdriver.0] worker #3 started
[2021/11/10 02:55:05] [ info] [output:stackdriver:stackdriver.0] worker #4 started
[2021/11/10 02:55:05] [ info] [output:stackdriver:stackdriver.0] worker #5 started
[2021/11/10 02:55:05] [ info] [output:stackdriver:stackdriver.0] worker #6 started
[2021/11/10 02:55:05] [ info] [output:stackdriver:stackdriver.0] worker #7 started
[2021/11/10 02:55:05] [ info] [output:stackdriver:stackdriver.1] worker #0 started
[2021/11/10 02:55:05] [ info] [output:stackdriver:stackdriver.1] worker #1 started
[2021/11/10 02:55:05] [ info] [output:stackdriver:stackdriver.1] worker #2 started
[2021/11/10 02:55:05] [ info] [output:stackdriver:stackdriver.1] worker #3 started
[2021/11/10 02:55:05] [ info] [output:stackdriver:stackdriver.1] worker #4 started
[2021/11/10 02:55:05] [ info] [output:stackdriver:stackdriver.1] worker #5 started
[2021/11/10 02:55:05] [ info] [output:stackdriver:stackdriver.1] worker #6 started
[2021/11/10 02:55:05] [ info] [output:stackdriver:stackdriver.1] worker #7 started
[2021/11/10 02:55:05] [ info] [http_server] listen iface=0.0.0.0 tcp_port=2020
[2021/11/10 02:55:05] [ info] [sp] stream processor started
[2021/11/10 02:55:17] [error] [upstream] connection #1992 to logging.googleapis.com:443 timed out after 10 seconds
[2021/11/10 02:56:02] [error] [upstream] connection #792 to logging.googleapis.com:443 timed out after 10 seconds
[2021/11/10 02:56:06] [error] [upstream] connection #2004 to logging.googleapis.com:443 timed out after 10 seconds
[2021/11/10 02:56:16] [error] [tls] C:\work\submodules\fluent-bit\src\tls\mbedtls.c:380 NET - Sending information through the socket failed
[2021/11/10 02:56:16] [error] [tls] C:\work\submodules\fluent-bit\src\tls\mbedtls.c:380 NET - Sending information through the socket failed
[2021/11/10 02:56:16] [ warn] [engine] failed to flush chunk '4880-1636512966.889335000.flb', retry in 8 seconds: task_id=3, input=tail.1 > output=stackdriver.1 (out_id=1)
[2021/11/10 02:56:16] [ warn] [engine] failed to flush chunk '4880-1636512966.212639200.flb', retry in 6 seconds: task_id=4, input=emitter_for_rewrite_tag.6 > output=stackdriver.0 (out_id=0)
[2021/11/10 02:57:09] [error] [upstream] connection #2112 to logging.googleapis.com:443 timed out after 10 seconds
[2021/11/10 02:57:18] [error] [upstream] connection #2124 to logging.googleapis.com:443 timed out after 10 seconds
[2021/11/10 02:57:20] [error] [upstream] connection #2168 to logging.googleapis.com:443 timed out after 10 seconds
[2021/11/10 02:57:21] [error] [upstream] connection #2172 to logging.googleapis.com:443 timed out after 10 seconds
[2021/11/10 02:57:26] [error] [upstream] connection #2188 to logging.googleapis.com:443 timed out after 10 seconds
[2021/11/10 02:57:32] [error] [upstream] connection #2124 to logging.googleapis.com:443 timed out after 10 seconds
[2021/11/10 02:57:35] [error] [upstream] connection #2200 to logging.googleapis.com:443 timed out after 10 seconds
[2021/11/10 02:57:42] [error] [upstream] connection #2176 to logging.googleapis.com:443 timed out after 10 seconds
[2021/11/10 02:57:44] [error] [upstream] connection #2188 to logging.googleapis.com:443 timed out after 10 seconds
[2021/11/10 02:57:47] [error] [upstream] connection #2124 to logging.googleapis.com:443 timed out after 10 seconds
[2021/11/10 02:57:47] [error] [upstream] connection #2192 to logging.googleapis.com:443 timed out after 10 seconds
[2021/11/10 02:57:56] [error] [upstream] connection #2152 to logging.googleapis.com:443 timed out after 10 seconds
[2021/11/10 02:58:02] [error] [tls] C:\work\submodules\fluent-bit\src\tls\mbedtls.c:380 NET - Sending information through the socket failed
[2021/11/10 02:58:02] [ warn] [engine] failed to flush chunk '4880-1636513074.212264300.flb', retry in 9 seconds: task_id=15, input=emitter_for_rewrite_tag.6 > output=stackdriver.0 (out_id=0)
[2021/11/10 02:58:48] [error] [upstream] connection #2236 to logging.googleapis.com:443 timed out after 10 seconds
[2021/11/10 02:59:17] [error] [upstream] connection #2152 to logging.googleapis.com:443 timed out after 10 seconds
[2021/11/10 02:59:56] [ warn] [engine] service will stop in 5 seconds
[2021/11/10 03:00:01] [ info] [engine] service stopped
[2021/11/10 03:00:01] [ warn] [engine] shutdown delayed, grace period has finished but some tasks are still running.
[2021/11/10 03:00:01] [ info] [task] winlog/winlog.0 has 0 pending task(s):
[2021/11/10 03:00:01] [ info] [task] tail/tail.1 has 6 pending task(s):
[2021/11/10 03:00:01] [ info] [task]   task_id=0 still running on route(s): stackdriver/stackdriver.1 
[2021/11/10 03:00:01] [ info] [task]   task_id=5 still running on route(s): stackdriver/stackdriver.1 
[2021/11/10 03:00:01] [ info] [task]   task_id=8 still running on route(s): stackdriver/stackdriver.1 
[2021/11/10 03:00:01] [ info] [task]   task_id=11 still running on route(s): stackdriver/stackdriver.1 
[2021/11/10 03:00:01] [ info] [task]   task_id=12 still running on route(s): stackdriver/stackdriver.1 
[2021/11/10 03:00:01] [ info] [task]   task_id=14 still running on route(s): stackdriver/stackdriver.1 
[2021/11/10 03:00:01] [ info] [task] storage_backlog/storage_backlog.2 has 0 pending task(s):
[2021/11/10 03:00:01] [ info] [task] emitter/emitter_for_rewrite_tag.6 has 11 pending task(s):
[2021/11/10 03:00:01] [ info] [task]   task_id=1 still running on route(s): stackdriver/stackdriver.0 
[2021/11/10 03:00:01] [ info] [task]   task_id=2 still running on route(s): stackdriver/stackdriver.0 
[2021/11/10 03:00:01] [ info] [task]   task_id=3 still running on route(s): stackdriver/stackdriver.0 
[2021/11/10 03:00:01] [ info] [task]   task_id=4 still running on route(s): stackdriver/stackdriver.0 
[2021/11/10 03:00:01] [ info] [task]   task_id=6 still running on route(s): stackdriver/stackdriver.0 
[2021/11/10 03:00:01] [ info] [task]   task_id=7 still running on route(s): stackdriver/stackdriver.0 
[2021/11/10 03:00:01] [ info] [task]   task_id=9 still running on route(s): stackdriver/stackdriver.0 
[2021/11/10 03:00:01] [ info] [task]   task_id=10 still running on route(s): stackdriver/stackdriver.0 
[2021/11/10 03:00:01] [ info] [task]   task_id=13 still running on route(s): stackdriver/stackdriver.0 
[2021/11/10 03:00:01] [ info] [task]   task_id=15 still running on route(s): stackdriver/stackdriver.0 
[2021/11/10 03:00:01] [ info] [task]   task_id=16 still running on route(s): stackdriver/stackdriver.0 
[2021/11/10 03:00:01] [ warn] [engine] service will stop in 5 seconds
[2021/11/10 03:00:06] [ info] [engine] service stopped
[2021/11/10 03:00:06] [ warn] [engine] shutdown delayed, grace period has finished but some tasks are still running.
[2021/11/10 03:00:06] [ info] [task] winlog/winlog.0 has 0 pending task(s):
[2021/11/10 03:00:06] [ info] [task] tail/tail.1 has 6 pending task(s):
[2021/11/10 03:00:06] [ info] [task]   task_id=0 still running on route(s): stackdriver/stackdriver.1 
[2021/11/10 03:00:06] [ info] [task]   task_id=5 still running on route(s): stackdriver/stackdriver.1 
[2021/11/10 03:00:06] [ info] [task]   task_id=8 still running on route(s): stackdriver/stackdriver.1 
[2021/11/10 03:00:06] [ info] [task]   task_id=11 still running on route(s): stackdriver/stackdriver.1 
[2021/11/10 03:00:06] [ info] [task]   task_id=12 still running on route(s): stackdriver/stackdriver.1 
[2021/11/10 03:00:06] [ info] [task]   task_id=14 still running on route(s): stackdriver/stackdriver.1 
[2021/11/10 03:00:06] [ info] [task] storage_backlog/storage_backlog.2 has 0 pending task(s):
[2021/11/10 03:00:06] [ info] [task] emitter/emitter_for_rewrite_tag.6 has 11 pending task(s):
[2021/11/10 03:00:06] [ info] [task]   task_id=1 still running on route(s): stackdriver/stackdriver.0 
[2021/11/10 03:00:06] [ info] [task]   task_id=2 still running on route(s): stackdriver/stackdriver.0 
[2021/11/10 03:00:06] [ info] [task]   task_id=3 still running on route(s): stackdriver/stackdriver.0 
[2021/11/10 03:00:06] [ info] [task]   task_id=4 still running on route(s): stackdriver/stackdriver.0 
[2021/11/10 03:00:06] [ info] [task]   task_id=6 still running on route(s): stackdriver/stackdriver.0 
[2021/11/10 03:00:06] [ info] [task]   task_id=7 still running on route(s): stackdriver/stackdriver.0 
[2021/11/10 03:00:06] [ info] [task]   task_id=9 still running on route(s): stackdriver/stackdriver.0 
[2021/11/10 03:00:06] [ info] [task]   task_id=10 still running on route(s): stackdriver/stackdriver.0 
[2021/11/10 03:00:06] [ info] [task]   task_id=13 still running on route(s): stackdriver/stackdriver.0 
[2021/11/10 03:00:06] [ info] [task]   task_id=15 still running on route(s): stackdriver/stackdriver.0 
[2021/11/10 03:00:06] [ info] [task]   task_id=16 still running on route(s): stackdriver/stackdriver.0 
[2021/11/10 03:00:06] [ warn] [engine] service will stop in 5 seconds
[2021/11/10 03:00:11] [ info] [engine] service stopped
[2021/11/10 03:00:11] [ warn] [engine] shutdown delayed, grace period has finished but some tasks are still running.
[2021/11/10 03:00:11] [ info] [task] winlog/winlog.0 has 0 pending task(s):
[2021/11/10 03:00:11] [ info] [task] tail/tail.1 has 6 pending task(s):
[2021/11/10 03:00:11] [ info] [task]   task_id=0 still running on route(s): stackdriver/stackdriver.1 
[2021/11/10 03:00:11] [ info] [task]   task_id=5 still running on route(s): stackdriver/stackdriver.1 
[2021/11/10 03:00:11] [ info] [task]   task_id=8 still running on route(s): stackdriver/stackdriver.1 
[2021/11/10 03:00:11] [ info] [task]   task_id=11 still running on route(s): stackdriver/stackdriver.1 
[2021/11/10 03:00:11] [ info] [task]   task_id=12 still running on route(s): stackdriver/stackdriver.1 
[2021/11/10 03:00:11] [ info] [task]   task_id=14 still running on route(s): stackdriver/stackdriver.1 
[2021/11/10 03:00:11] [ info] [task] storage_backlog/storage_backlog.2 has 0 pending task(s):
[2021/11/10 03:00:11] [ info] [task] emitter/emitter_for_rewrite_tag.6 has 11 pending task(s):
[2021/11/10 03:00:11] [ info] [task]   task_id=1 still running on route(s): stackdriver/stackdriver.0 
[2021/11/10 03:00:11] [ info] [task]   task_id=2 still running on route(s): stackdriver/stackdriver.0 
[2021/11/10 03:00:11] [ info] [task]   task_id=3 still running on route(s): stackdriver/stackdriver.0 
[2021/11/10 03:00:11] [ info] [task]   task_id=4 still running on route(s): stackdriver/stackdriver.0 
[2021/11/10 03:00:11] [ info] [task]   task_id=6 still running on route(s): stackdriver/stackdriver.0 
[2021/11/10 03:00:11] [ info] [task]   task_id=7 still running on route(s): stackdriver/stackdriver.0 
[2021/11/10 03:00:11] [ info] [task]   task_id=9 still running on route(s): stackdriver/stackdriver.0 
[2021/11/10 03:00:11] [ info] [task]   task_id=10 still running on route(s): stackdriver/stackdriver.0 
[2021/11/10 03:00:11] [ info] [task]   task_id=13 still running on route(s): stackdriver/stackdriver.0 
[2021/11/10 03:00:11] [ info] [task]   task_id=15 still running on route(s): stackdriver/stackdriver.0 
[2021/11/10 03:00:11] [ info] [task]   task_id=16 still running on route(s): stackdriver/stackdriver.0 
[2021/11/10 03:00:11] [ warn] [engine] service will stop in 5 seconds
[2021/11/10 03:00:16] [ info] [engine] service stopped
[2021/11/10 03:00:16] [ warn] [engine] shutdown delayed, grace period has finished but some tasks are still running.
[2021/11/10 03:00:16] [ info] [task] winlog/winlog.0 has 0 pending task(s):
[2021/11/10 03:00:16] [ info] [task] tail/tail.1 has 6 pending task(s):
[2021/11/10 03:00:16] [ info] [task]   task_id=0 still running on route(s): stackdriver/stackdriver.1 
[2021/11/10 03:00:16] [ info] [task]   task_id=5 still running on route(s): stackdriver/stackdriver.1 
[2021/11/10 03:00:16] [ info] [task]   task_id=8 still running on route(s): stackdriver/stackdriver.1 
[2021/11/10 03:00:16] [ info] [task]   task_id=11 still running on route(s): stackdriver/stackdriver.1 
[2021/11/10 03:00:16] [ info] [task]   task_id=12 still running on route(s): stackdriver/stackdriver.1 
[2021/11/10 03:00:16] [ info] [task]   task_id=14 still running on route(s): stackdriver/stackdriver.1 
[2021/11/10 03:00:16] [ info] [task] storage_backlog/storage_backlog.2 has 0 pending task(s):
[2021/11/10 03:00:16] [ info] [task] emitter/emitter_for_rewrite_tag.6 has 11 pending task(s):
[2021/11/10 03:00:16] [ info] [task]   task_id=1 still running on route(s): stackdriver/stackdriver.0 
[2021/11/10 03:00:16] [ info] [task]   task_id=2 still running on route(s): stackdriver/stackdriver.0 
[2021/11/10 03:00:16] [ info] [task]   task_id=3 still running on route(s): stackdriver/stackdriver.0 
[2021/11/10 03:00:16] [ info] [task]   task_id=4 still running on route(s): stackdriver/stackdriver.0 
[2021/11/10 03:00:16] [ info] [task]   task_id=6 still running on route(s): stackdriver/stackdriver.0 
[2021/11/10 03:00:16] [ info] [task]   task_id=7 still running on route(s): stackdriver/stackdriver.0 
[2021/11/10 03:00:16] [ info] [task]   task_id=9 still running on route(s): stackdriver/stackdriver.0 
[2021/11/10 03:00:16] [ info] [task]   task_id=10 still running on route(s): stackdriver/stackdriver.0 
[2021/11/10 03:00:16] [ info] [task]   task_id=13 still running on route(s): stackdriver/stackdriver.0 
[2021/11/10 03:00:16] [ info] [task]   task_id=15 still running on route(s): stackdriver/stackdriver.0 
[2021/11/10 03:00:16] [ info] [task]   task_id=16 still running on route(s): stackdriver/stackdriver.0 
[2021/11/10 03:00:16] [ warn] [engine] service will stop in 5 seconds
[2021/11/10 03:00:21] [ info] [engine] service stopped
[2021/11/10 03:00:21] [ warn] [engine] shutdown delayed, grace period has finished but some tasks are still running.
[2021/11/10 03:00:21] [ info] [task] winlog/winlog.0 has 0 pending task(s):
[2021/11/10 03:00:21] [ info] [task] tail/tail.1 has 6 pending task(s):
[2021/11/10 03:00:21] [ info] [task]   task_id=0 still running on route(s): stackdriver/stackdriver.1 
[2021/11/10 03:00:21] [ info] [task]   task_id=5 still running on route(s): stackdriver/stackdriver.1 
[2021/11/10 03:00:21] [ info] [task]   task_id=8 still running on route(s): stackdriver/stackdriver.1 
[2021/11/10 03:00:21] [ info] [task]   task_id=11 still running on route(s): stackdriver/stackdriver.1 
[2021/11/10 03:00:21] [ info] [task]   task_id=12 still running on route(s): stackdriver/stackdriver.1 
[2021/11/10 03:00:21] [ info] [task]   task_id=14 still running on route(s): stackdriver/stackdriver.1 
[2021/11/10 03:00:21] [ info] [task] storage_backlog/storage_backlog.2 has 0 pending task(s):
[2021/11/10 03:00:21] [ info] [task] emitter/emitter_for_rewrite_tag.6 has 11 pending task(s):
[2021/11/10 03:00:21] [ info] [task]   task_id=1 still running on route(s): stackdriver/stackdriver.0 
[2021/11/10 03:00:21] [ info] [task]   task_id=2 still running on route(s): stackdriver/stackdriver.0 
[2021/11/10 03:00:21] [ info] [task]   task_id=3 still running on route(s): stackdriver/stackdriver.0 
[2021/11/10 03:00:21] [ info] [task]   task_id=4 still running on route(s): stackdriver/stackdriver.0 
[2021/11/10 03:00:21] [ info] [task]   task_id=6 still running on route(s): stackdriver/stackdriver.0 
[2021/11/10 03:00:21] [ info] [task]   task_id=7 still running on route(s): stackdriver/stackdriver.0 
[2021/11/10 03:00:21] [ info] [task]   task_id=9 still running on route(s): stackdriver/stackdriver.0 
[2021/11/10 03:00:21] [ info] [task]   task_id=10 still running on route(s): stackdriver/stackdriver.0 
[2021/11/10 03:00:21] [ info] [task]   task_id=13 still running on route(s): stackdriver/stackdriver.0 
[2021/11/10 03:00:21] [ info] [task]   task_id=15 still running on route(s): stackdriver/stackdriver.0 
[2021/11/10 03:00:21] [ info] [task]   task_id=16 still running on route(s): stackdriver/stackdriver.0 
[2021/11/10 03:00:21] [ warn] [engine] service will stop in 5 seconds
[2021/11/10 03:00:26] [ info] [engine] service stopped
[2021/11/10 03:00:26] [ warn] [engine] shutdown delayed, grace period has finished but some tasks are still running.
[2021/11/10 03:00:26] [ info] [task] winlog/winlog.0 has 0 pending task(s):
[2021/11/10 03:00:26] [ info] [task] tail/tail.1 has 6 pending task(s):
[2021/11/10 03:00:26] [ info] [task]   task_id=0 still running on route(s): stackdriver/stackdriver.1 
[2021/11/10 03:00:26] [ info] [task]   task_id=5 still running on route(s): stackdriver/stackdriver.1 
[2021/11/10 03:00:26] [ info] [task]   task_id=8 still running on route(s): stackdriver/stackdriver.1 
[2021/11/10 03:00:26] [ info] [task]   task_id=11 still running on route(s): stackdriver/stackdriver.1 
[2021/11/10 03:00:26] [ info] [task]   task_id=12 still running on route(s): stackdriver/stackdriver.1 
[2021/11/10 03:00:26] [ info] [task]   task_id=14 still running on route(s): stackdriver/stackdriver.1 
[2021/11/10 03:00:26] [ info] [task] storage_backlog/storage_backlog.2 has 0 pending task(s):
[2021/11/10 03:00:26] [ info] [task] emitter/emitter_for_rewrite_tag.6 has 11 pending task(s):
[2021/11/10 03:00:26] [ info] [task]   task_id=1 still running on route(s): stackdriver/stackdriver.0 
[2021/11/10 03:00:26] [ info] [task]   task_id=2 still running on route(s): stackdriver/stackdriver.0 
[2021/11/10 03:00:26] [ info] [task]   task_id=3 still running on route(s): stackdriver/stackdriver.0 
[2021/11/10 03:00:26] [ info] [task]   task_id=4 still running on route(s): stackdriver/stackdriver.0 
[2021/11/10 03:00:26] [ info] [task]   task_id=6 still running on route(s): stackdriver/stackdriver.0 
[2021/11/10 03:00:26] [ info] [task]   task_id=7 still running on route(s): stackdriver/stackdriver.0 
[2021/11/10 03:00:26] [ info] [task]   task_id=9 still running on route(s): stackdriver/stackdriver.0 
[2021/11/10 03:00:26] [ info] [task]   task_id=10 still running on route(s): stackdriver/stackdriver.0 
[2021/11/10 03:00:26] [ info] [task]   task_id=13 still running on route(s): stackdriver/stackdriver.0 
[2021/11/10 03:00:26] [ info] [task]   task_id=15 still running on route(s): stackdriver/stackdriver.0 
[2021/11/10 03:00:26] [ info] [task]   task_id=16 still running on route(s): stackdriver/stackdriver.0 
[2021/11/10 03:00:26] [ warn] [engine] service will stop in 5 seconds
[2021/11/10 03:00:31] [ info] [engine] service stopped
[2021/11/10 03:00:31] [ warn] [engine] shutdown delayed, grace period has finished but some tasks are still running.
[2021/11/10 03:00:31] [ info] [task] winlog/winlog.0 has 0 pending task(s):
[2021/11/10 03:00:31] [ info] [task] tail/tail.1 has 6 pending task(s):
[2021/11/10 03:00:31] [ info] [task]   task_id=0 still running on route(s): stackdriver/stackdriver.1 
[2021/11/10 03:00:31] [ info] [task]   task_id=5 still running on route(s): stackdriver/stackdriver.1 
[2021/11/10 03:00:31] [ info] [task]   task_id=8 still running on route(s): stackdriver/stackdriver.1 
[2021/11/10 03:00:31] [ info] [task]   task_id=11 still running on route(s): stackdriver/stackdriver.1 
[2021/11/10 03:00:31] [ info] [task]   task_id=12 still running on route(s): stackdriver/stackdriver.1 
[2021/11/10 03:00:31] [ info] [task]   task_id=14 still running on route(s): stackdriver/stackdriver.1 
[2021/11/10 03:00:31] [ info] [task] storage_backlog/storage_backlog.2 has 0 pending task(s):
[2021/11/10 03:00:31] [ info] [task] emitter/emitter_for_rewrite_tag.6 has 11 pending task(s):
[2021/11/10 03:00:31] [ info] [task]   task_id=1 still running on route(s): stackdriver/stackdriver.0 
[2021/11/10 03:00:31] [ info] [task]   task_id=2 still running on route(s): stackdriver/stackdriver.0 
[2021/11/10 03:00:31] [ info] [task]   task_id=3 still running on route(s): stackdriver/stackdriver.0 
[2021/11/10 03:00:31] [ info] [task]   task_id=4 still running on route(s): stackdriver/stackdriver.0 
[2021/11/10 03:00:31] [ info] [task]   task_id=6 still running on route(s): stackdriver/stackdriver.0 
[2021/11/10 03:00:31] [ info] [task]   task_id=7 still running on route(s): stackdriver/stackdriver.0 
[2021/11/10 03:00:31] [ info] [task]   task_id=9 still running on route(s): stackdriver/stackdriver.0 
[2021/11/10 03:00:31] [ info] [task]   task_id=10 still running on route(s): stackdriver/stackdriver.0 
[2021/11/10 03:00:31] [ info] [task]   task_id=13 still running on route(s): stackdriver/stackdriver.0 
[2021/11/10 03:00:31] [ info] [task]   task_id=15 still running on route(s): stackdriver/stackdriver.0 
[2021/11/10 03:00:31] [ info] [task]   task_id=16 still running on route(s): stackdriver/stackdriver.0 
[2021/11/10 03:00:31] [ warn] [engine] service will stop in 5 seconds
[2021/11/10 03:00:36] [ info] [engine] service stopped
[2021/11/10 03:00:36] [ warn] [engine] shutdown delayed, grace period has finished but some tasks are still running.
[2021/11/10 03:00:36] [ info] [task] winlog/winlog.0 has 0 pending task(s):
[2021/11/10 03:00:36] [ info] [task] tail/tail.1 has 6 pending task(s):
[2021/11/10 03:00:36] [ info] [task]   task_id=0 still running on route(s): stackdriver/stackdriver.1 
[2021/11/10 03:00:36] [ info] [task]   task_id=5 still running on route(s): stackdriver/stackdriver.1 
[2021/11/10 03:00:36] [ info] [task]   task_id=8 still running on route(s): stackdriver/stackdriver.1 
[2021/11/10 03:00:36] [ info] [task]   task_id=11 still running on route(s): stackdriver/stackdriver.1 
[2021/11/10 03:00:36] [ info] [task]   task_id=12 still running on route(s): stackdriver/stackdriver.1 
[2021/11/10 03:00:36] [ info] [task]   task_id=14 still running on route(s): stackdriver/stackdriver.1 
[2021/11/10 03:00:36] [ info] [task] storage_backlog/storage_backlog.2 has 0 pending task(s):
[2021/11/10 03:00:36] [ info] [task] emitter/emitter_for_rewrite_tag.6 has 11 pending task(s):
[2021/11/10 03:00:36] [ info] [task]   task_id=1 still running on route(s): stackdriver/stackdriver.0 
[2021/11/10 03:00:36] [ info] [task]   task_id=2 still running on route(s): stackdriver/stackdriver.0 
[2021/11/10 03:00:36] [ info] [task]   task_id=3 still running on route(s): stackdriver/stackdriver.0 
[2021/11/10 03:00:36] [ info] [task]   task_id=4 still running on route(s): stackdriver/stackdriver.0 
[2021/11/10 03:00:36] [ info] [task]   task_id=6 still running on route(s): stackdriver/stackdriver.0 
[2021/11/10 03:00:36] [ info] [task]   task_id=7 still running on route(s): stackdriver/stackdriver.0 
[2021/11/10 03:00:36] [ info] [task]   task_id=9 still running on route(s): stackdriver/stackdriver.0 
[2021/11/10 03:00:36] [ info] [task]   task_id=10 still running on route(s): stackdriver/stackdriver.0 
[2021/11/10 03:00:36] [ info] [task]   task_id=13 still running on route(s): stackdriver/stackdriver.0 
[2021/11/10 03:00:36] [ info] [task]   task_id=15 still running on route(s): stackdriver/stackdriver.0 
[2021/11/10 03:00:36] [ info] [task]   task_id=16 still running on route(s): stackdriver/stackdriver.0 
[2021/11/10 03:00:36] [ warn] [engine] service will stop in 5 seconds
[2021/11/10 03:00:41] [ info] [engine] service stopped
[2021/11/10 03:00:41] [ warn] [engine] shutdown delayed, grace period has finished but some tasks are still running.
// After this is just hangs. Forcefully killing the service does not work too.
PS C:\Windows\system32> ping logging.googleapis.com

Pinging logging.googleapis.com [172.217.212.95] with 32 bytes of data:
Reply from 172.217.212.95: bytes=32 time<1ms TTL=115
Reply from 172.217.212.95: bytes=32 time=1ms TTL=115
Reply from 172.217.212.95: bytes=32 time<1ms TTL=118
Reply from 172.217.212.95: bytes=32 time=1ms TTL=115

Ping statistics for 172.217.212.95:
    Packets: Sent = 4, Received = 4, Lost = 0 (0% loss),
Approximate round trip times in milli-seconds:
    Minimum = 0ms, Maximum = 1ms, Average = 0ms

Enabled debug logging and attached the full log: windows-2016-debug-logs.txt

yangvipguang commented 2 years ago

platform: Centos 7 fluent-bit: 1.8.9 image

I have the same issue as below image

I have to kill -9 PID

qingling128 commented 2 years ago

I'm still seeing this issue after the changes in https://github.com/GoogleCloudPlatform/ops-agent/pull/273 is applied.

Internal Build ID: 95dcfa4d-84ce-4eae-956a-2395d5e57854 Internal Test ID: 2c2d7c76-e941-42ea-ad67-254ebaf52f22

[2021/11/18 06:28:40] [ info] [engine] started (pid=3908)
[2021/11/18 06:28:40] [ info] [storage] version=1.1.5, initializing...
[2021/11/18 06:28:40] [ info] [storage] root path 'C:\ProgramData\Google\Cloud Operations\Ops Agent\run\buffers'
[2021/11/18 06:28:40] [ info] [storage] normal synchronization mode, checksum enabled, max_chunks_up=128
[2021/11/18 06:28:40] [ info] [storage] backlog input plugin: storage_backlog.2
[2021/11/18 06:28:40] [ info] [cmetrics] version=0.2.2
[2021/11/18 06:28:40] [ info] [input:storage_backlog:storage_backlog.2] queue memory limit: 47.7M
[2021/11/18 06:28:40] [ info] [output:stackdriver:stackdriver.0] metadata_server set to http://metadata.google.internal
[2021/11/18 06:28:40] [ warn] [output:stackdriver:stackdriver.0] client_email is not defined, using a default one
[2021/11/18 06:28:40] [ warn] [output:stackdriver:stackdriver.0] private_key is not defined, fetching it from metadata server
[2021/11/18 06:28:40] [ info] [output:stackdriver:stackdriver.1] metadata_server set to http://metadata.google.internal
[2021/11/18 06:28:40] [ info] [output:stackdriver:stackdriver.0] worker #6 started
[2021/11/18 06:28:40] [ warn] [output:stackdriver:stackdriver.1] client_email is not defined, using a default one
[2021/11/18 06:28:40] [ info] [output:stackdriver:stackdriver.0] worker #0 started
[2021/11/18 06:28:40] [ info] [output:stackdriver:stackdriver.0] worker #1 started
[2021/11/18 06:28:40] [ info] [output:stackdriver:stackdriver.0] worker #2 started
[2021/11/18 06:28:40] [ info] [output:stackdriver:stackdriver.0] worker #3 started
[2021/11/18 06:28:40] [ info] [output:stackdriver:stackdriver.0] worker #4 started
[2021/11/18 06:28:40] [ info] [output:stackdriver:stackdriver.0] worker #7 started
[2021/11/18 06:28:40] [ warn] [output:stackdriver:stackdriver.1] private_key is not defined, fetching it from metadata server
[2021/11/18 06:28:40] [ info] [output:stackdriver:stackdriver.0] worker #5 started
[2021/11/18 06:28:40] [ info] [http_server] listen iface=0.0.0.0 tcp_port=2020
[2021/11/18 06:28:40] [ info] [output:stackdriver:stackdriver.1] worker #0 started
[2021/11/18 06:28:40] [ info] [output:stackdriver:stackdriver.1] worker #1 started
[2021/11/18 06:28:40] [ info] [output:stackdriver:stackdriver.1] worker #2 started
[2021/11/18 06:28:40] [ info] [output:stackdriver:stackdriver.1] worker #3 started
[2021/11/18 06:28:40] [ info] [output:stackdriver:stackdriver.1] worker #4 started
[2021/11/18 06:28:40] [ info] [output:stackdriver:stackdriver.1] worker #5 started
[2021/11/18 06:28:40] [ info] [output:stackdriver:stackdriver.1] worker #6 started
[2021/11/18 06:28:40] [ info] [output:stackdriver:stackdriver.1] worker #7 started
[2021/11/18 06:28:40] [ info] [sp] stream processor started
[2021/11/18 06:28:42] [ warn] [engine] failed to flush chunk '3908-1637216920.721213400.flb', retry in 8 seconds: task_id=0, input=tail.1 > output=stackdriver.1 (out_id=1)
[2021/11/18 06:28:54] [error] [upstream] connection #1972 to logging.googleapis.com:443 timed out after 10 seconds
[2021/11/18 06:29:00] [error] [upstream] connection #1964 to logging.googleapis.com:443 timed out after 10 seconds
[2021/11/18 06:29:00] [error] [upstream] connection #2036 to logging.googleapis.com:443 timed out after 10 seconds
[2021/11/18 06:29:12] [ warn] [engine] service will stop in 5 seconds
[2021/11/18 06:29:17] [ info] [engine] service stopped
[2021/11/18 06:29:17] [ warn] [engine] shutdown delayed, grace period has finished but some tasks are still running.
[2021/11/18 06:29:17] [ info] [task] winlog/winlog.0 has 1 pending task(s):
[2021/11/18 06:29:17] [ info] [task]   task_id=2 still running on route(s): stackdriver/stackdriver.0 
[2021/11/18 06:29:17] [ info] [task] tail/tail.1 has 2 pending task(s):
[2021/11/18 06:29:17] [ info] [task]   task_id=0 still running on route(s): stackdriver/stackdriver.1 
[2021/11/18 06:29:17] [ info] [task]   task_id=1 still running on route(s): stackdriver/stackdriver.1 
[2021/11/18 06:29:17] [ info] [task] storage_backlog/storage_backlog.2 has 0 pending task(s):
[2021/11/18 06:29:17] [ warn] [engine] service will stop in 5 seconds
[2021/11/18 06:29:22] [ info] [engine] service stopped
[2021/11/18 06:29:22] [ warn] [engine] shutdown delayed, grace period has finished but some tasks are still running.
[2021/11/18 06:29:22] [ info] [task] winlog/winlog.0 has 1 pending task(s):
[2021/11/18 06:29:22] [ info] [task]   task_id=2 still running on route(s): stackdriver/stackdriver.0 
[2021/11/18 06:29:22] [ info] [task] tail/tail.1 has 2 pending task(s):
[2021/11/18 06:29:22] [ info] [task]   task_id=0 still running on route(s): stackdriver/stackdriver.1 
[2021/11/18 06:29:22] [ info] [task]   task_id=1 still running on route(s): stackdriver/stackdriver.1 
[2021/11/18 06:29:22] [ info] [task] storage_backlog/storage_backlog.2 has 0 pending task(s):
[2021/11/18 06:29:22] [ warn] [engine] service will stop in 5 seconds
[2021/11/18 06:29:27] [ info] [engine] service stopped
[2021/11/18 06:29:27] [ warn] [engine] shutdown delayed, grace period has finished but some tasks are still running.
[2021/11/18 06:29:27] [ info] [task] winlog/winlog.0 has 1 pending task(s):
[2021/11/18 06:29:27] [ info] [task]   task_id=2 still running on route(s): stackdriver/stackdriver.0 
[2021/11/18 06:29:27] [ info] [task] tail/tail.1 has 2 pending task(s):
[2021/11/18 06:29:27] [ info] [task]   task_id=0 still running on route(s): stackdriver/stackdriver.1 
[2021/11/18 06:29:27] [ info] [task]   task_id=1 still running on route(s): stackdriver/stackdriver.1 
[2021/11/18 06:29:27] [ info] [task] storage_backlog/storage_backlog.2 has 0 pending task(s):
[2021/11/18 06:29:27] [ warn] [engine] service will stop in 5 seconds
[2021/11/18 06:29:32] [ info] [engine] service stopped
[2021/11/18 06:29:32] [ warn] [engine] shutdown delayed, grace period has finished but some tasks are still running.
[2021/11/18 06:29:32] [ info] [task] winlog/winlog.0 has 1 pending task(s):
[2021/11/18 06:29:32] [ info] [task]   task_id=2 still running on route(s): stackdriver/stackdriver.0 
[2021/11/18 06:29:32] [ info] [task] tail/tail.1 has 2 pending task(s):
[2021/11/18 06:29:32] [ info] [task]   task_id=0 still running on route(s): stackdriver/stackdriver.1 
[2021/11/18 06:29:32] [ info] [task]   task_id=1 still running on route(s): stackdriver/stackdriver.1 
[2021/11/18 06:29:32] [ info] [task] storage_backlog/storage_backlog.2 has 0 pending task(s):
[2021/11/18 06:29:32] [ warn] [engine] service will stop in 5 seconds
[2021/11/18 06:29:37] [ info] [engine] service stopped
[2021/11/18 06:29:37] [ warn] [engine] shutdown delayed, grace period has finished but some tasks are still running.
[2021/11/18 06:29:37] [ info] [task] winlog/winlog.0 has 1 pending task(s):
[2021/11/18 06:29:37] [ info] [task]   task_id=2 still running on route(s): stackdriver/stackdriver.0 
[2021/11/18 06:29:37] [ info] [task] tail/tail.1 has 2 pending task(s):
[2021/11/18 06:29:37] [ info] [task]   task_id=0 still running on route(s): stackdriver/stackdriver.1 
[2021/11/18 06:29:37] [ info] [task]   task_id=1 still running on route(s): stackdriver/stackdriver.1 
[2021/11/18 06:29:37] [ info] [task] storage_backlog/storage_backlog.2 has 0 pending task(s):
[2021/11/18 06:29:37] [ warn] [engine] service will stop in 5 seconds
[2021/11/18 06:29:42] [ info] [engine] service stopped
[2021/11/18 06:29:42] [ warn] [engine] shutdown delayed, grace period has finished but some tasks are still running.
[2021/11/18 06:29:42] [ info] [task] winlog/winlog.0 has 1 pending task(s):
[2021/11/18 06:29:42] [ info] [task]   task_id=2 still running on route(s): stackdriver/stackdriver.0 
[2021/11/18 06:29:42] [ info] [task] tail/tail.1 has 2 pending task(s):
[2021/11/18 06:29:42] [ info] [task]   task_id=0 still running on route(s): stackdriver/stackdriver.1 
[2021/11/18 06:29:42] [ info] [task]   task_id=1 still running on route(s): stackdriver/stackdriver.1 
[2021/11/18 06:29:42] [ info] [task] storage_backlog/storage_backlog.2 has 0 pending task(s):
[2021/11/18 06:29:42] [ warn] [engine] service will stop in 5 seconds
[2021/11/18 06:29:47] [ info] [engine] service stopped
[2021/11/18 06:29:47] [ warn] [engine] shutdown delayed, grace period has finished but some tasks are still running.
[2021/11/18 06:29:47] [ info] [task] winlog/winlog.0 has 1 pending task(s):
[2021/11/18 06:29:47] [ info] [task]   task_id=2 still running on route(s): stackdriver/stackdriver.0 
[2021/11/18 06:29:47] [ info] [task] tail/tail.1 has 2 pending task(s):
[2021/11/18 06:29:47] [ info] [task]   task_id=0 still running on route(s): stackdriver/stackdriver.1 
[2021/11/18 06:29:47] [ info] [task]   task_id=1 still running on route(s): stackdriver/stackdriver.1 
[2021/11/18 06:29:47] [ info] [task] storage_backlog/storage_backlog.2 has 0 pending task(s):
[2021/11/18 06:29:47] [ warn] [engine] service will stop in 5 seconds
[2021/11/18 06:29:52] [ info] [engine] service stopped
[2021/11/18 06:29:52] [ warn] [engine] shutdown delayed, grace period has finished but some tasks are still running.
[2021/11/18 06:29:52] [ info] [task] winlog/winlog.0 has 1 pending task(s):
[2021/11/18 06:29:52] [ info] [task]   task_id=2 still running on route(s): stackdriver/stackdriver.0 
[2021/11/18 06:29:52] [ info] [task] tail/tail.1 has 2 pending task(s):
[2021/11/18 06:29:52] [ info] [task]   task_id=0 still running on route(s): stackdriver/stackdriver.1 
[2021/11/18 06:29:52] [ info] [task]   task_id=1 still running on route(s): stackdriver/stackdriver.1 
[2021/11/18 06:29:52] [ info] [task] storage_backlog/storage_backlog.2 has 0 pending task(s):
[2021/11/18 06:29:52] [ warn] [engine] service will stop in 5 seconds

I'm trying to reproduce it manually with debug log level turned on.

qingling128 commented 2 years ago

Reproduced manually:

  1. Create a Windows 2016 VM
    
    gcloud compute instances create \
    --image-project windows-cloud \
    --image windows-server-2016-dc-v20211012 \
    windows-2016-test-1118

gcloud compute reset-windows-password --quiet \ --project lingshi-sandbox \ --zone us-central1-a \ windows-2016-test-1118

// Commands below are only needed if you connect to Windows VMs via iap gcloud compute firewall-rules create allow-rdp-ingress-from-iap \ --direction=INGRESS \ --action=allow \ --rules=tcp:3389 \ --source-ranges=35.235.240.0/20

gcloud compute start-iap-tunnel \ --project lingshi-sandbox --zone us-central1-a \ --local-host-port=localhost:13010 \ windows-2016-test-1118 3389


2. Connect to the VM and install Ops Agent with Fluent Bit

googet -noconfirm install -sources https://packages.cloud.google.com/yuck/repos/google-cloud-ops-agent-windows-20211118-2 google-cloud-ops-agent


3. Enable debug logging

notepad C:\Program Files\Google\Cloud Operations\Ops Agent\config\config.yaml


Change it to the following

logging: service: log_level: debug


4. Restart the agent to load the config and observe that the agent won't stop even with `-Force`

Restart-Service google-cloud-ops-agent -Force Restart-Service : Time out has expired and the operation has not been completed. At line:1 char:1

  1. Shut down the VM and restart it to force the debug level logging setting to take effect, then check the Fluent Bit log
notepad C:\ProgramData\Google\Cloud Operations\Ops Agent\log\logging-module.log
[2021/11/18 17:50:52] [ warn] [engine] service will stop in 5 seconds
[2021/11/18 17:50:52] [debug] [output:stackdriver:stackdriver.0] [logging.googleapis.com/monitored_resource] not found in the payload
[2021/11/18 17:50:52] [debug] [http_client] not using http_proxy for header
[2021/11/18 17:50:52] [debug] [output:stackdriver:stackdriver.0] HTTP Status=200
[2021/11/18 17:50:52] [debug] [upstream] KA connection #1240 to logging.googleapis.com:443 is now available
[2021/11/18 17:50:52] [debug] [out coro] cb_destroy coro_id=2
[2021/11/18 17:50:52] [debug] [task] destroy task=0000011A0C7F4160 (task_id=11)
[2021/11/18 17:50:52] [debug] [output:stackdriver:stackdriver.1] HTTP Status=200
[2021/11/18 17:50:52] [debug] [upstream] KA connection #2040 to logging.googleapis.com:443 is now available
[2021/11/18 17:50:52] [debug] [out coro] cb_destroy coro_id=14
[2021/11/18 17:50:52] [debug] [task] destroy task=0000011A0C7F53F0 (task_id=12)
[2021/11/18 17:50:57] [ info] [engine] service stopped
[2021/11/18 17:50:57] [ warn] [engine] shutdown delayed, grace period has finished but some tasks are still running.
[2021/11/18 17:50:57] [ info] [task] winlog/winlog.0 has 5 pending task(s):
[2021/11/18 17:50:57] [ info] [task]   task_id=5 still running on route(s): stackdriver/stackdriver.0 
[2021/11/18 17:50:57] [ info] [task]   task_id=2 still running on route(s): stackdriver/stackdriver.0 
[2021/11/18 17:50:57] [ info] [task]   task_id=3 still running on route(s): stackdriver/stackdriver.0 
[2021/11/18 17:50:57] [ info] [task]   task_id=9 still running on route(s): stackdriver/stackdriver.0 
[2021/11/18 17:50:57] [ info] [task]   task_id=10 still running on route(s): stackdriver/stackdriver.0 
[2021/11/18 17:50:57] [ info] [task] tail/tail.1 has 5 pending task(s):
[2021/11/18 17:50:57] [ info] [task]   task_id=6 still running on route(s): stackdriver/stackdriver.1 
[2021/11/18 17:50:57] [ info] [task]   task_id=7 still running on route(s): stackdriver/stackdriver.1 
[2021/11/18 17:50:57] [ info] [task]   task_id=8 still running on route(s): stackdriver/stackdriver.1 
[2021/11/18 17:50:57] [ info] [task]   task_id=1 still running on route(s): stackdriver/stackdriver.1 
[2021/11/18 17:50:57] [ info] [task]   task_id=0 still running on route(s): stackdriver/stackdriver.1 
[2021/11/18 17:50:57] [ info] [task] storage_backlog/storage_backlog.2 has 1 pending task(s):
[2021/11/18 17:50:57] [ info] [task]   task_id=4 still running on route(s): stackdriver/stackdriver.1 
[2021/11/18 17:50:57] [ warn] [engine] service will stop in 5 seconds
[2021/11/18 17:50:59] [debug] [input:tail:tail.1] 0 new files found on path 'C:\ProgramData\Google\Cloud Operations\Ops Agent\log/logging-module.log'
[2021/11/18 17:51:02] [ info] [engine] service stopped
[2021/11/18 17:51:02] [ warn] [engine] shutdown delayed, grace period has finished but some tasks are still running.
[2021/11/18 17:51:02] [ info] [task] winlog/winlog.0 has 5 pending task(s):
[2021/11/18 17:51:02] [ info] [task]   task_id=5 still running on route(s): stackdriver/stackdriver.0 
[2021/11/18 17:51:02] [ info] [task]   task_id=2 still running on route(s): stackdriver/stackdriver.0 
[2021/11/18 17:51:02] [ info] [task]   task_id=3 still running on route(s): stackdriver/stackdriver.0 
[2021/11/18 17:51:02] [ info] [task]   task_id=9 still running on route(s): stackdriver/stackdriver.0 
[2021/11/18 17:51:02] [ info] [task]   task_id=10 still running on route(s): stackdriver/stackdriver.0 
[2021/11/18 17:51:02] [ info] [task] tail/tail.1 has 5 pending task(s):
[2021/11/18 17:51:02] [ info] [task]   task_id=6 still running on route(s): stackdriver/stackdriver.1 
[2021/11/18 17:51:02] [ info] [task]   task_id=7 still running on route(s): stackdriver/stackdriver.1 
[2021/11/18 17:51:02] [ info] [task]   task_id=8 still running on route(s): stackdriver/stackdriver.1 
[2021/11/18 17:51:02] [ info] [task]   task_id=1 still running on route(s): stackdriver/stackdriver.1 
[2021/11/18 17:51:02] [ info] [task]   task_id=0 still running on route(s): stackdriver/stackdriver.1 
[2021/11/18 17:51:02] [ info] [task] storage_backlog/storage_backlog.2 has 1 pending task(s):
[2021/11/18 17:51:02] [ info] [task]   task_id=4 still running on route(s): stackdriver/stackdriver.1 
[2021/11/18 17:51:02] [ warn] [engine] service will stop in 5 seconds
[2021/11/18 17:51:07] [ info] [engine] service stopped
[2021/11/18 17:51:07] [ warn] [engine] shutdown delayed, grace period has finished but some tasks are still running.
[2021/11/18 17:51:07] [ info] [task] winlog/winlog.0 has 5 pending task(s):
[2021/11/18 17:51:07] [ info] [task]   task_id=5 still running on route(s): stackdriver/stackdriver.0 
[2021/11/18 17:51:07] [ info] [task]   task_id=2 still running on route(s): stackdriver/stackdriver.0 
[2021/11/18 17:51:07] [ info] [task]   task_id=3 still running on route(s): stackdriver/stackdriver.0 
[2021/11/18 17:51:07] [ info] [task]   task_id=9 still running on route(s): stackdriver/stackdriver.0 
[2021/11/18 17:51:07] [ info] [task]   task_id=10 still running on route(s): stackdriver/stackdriver.0 
[2021/11/18 17:51:07] [ info] [task] tail/tail.1 has 5 pending task(s):
[2021/11/18 17:51:07] [ info] [task]   task_id=6 still running on route(s): stackdriver/stackdriver.1 
[2021/11/18 17:51:07] [ info] [task]   task_id=7 still running on route(s): stackdriver/stackdriver.1 
[2021/11/18 17:51:07] [ info] [task]   task_id=8 still running on route(s): stackdriver/stackdriver.1 
[2021/11/18 17:51:07] [ info] [task]   task_id=1 still running on route(s): stackdriver/stackdriver.1 
[2021/11/18 17:51:07] [ info] [task]   task_id=0 still running on route(s): stackdriver/stackdriver.1 
[2021/11/18 17:51:07] [ info] [task] storage_backlog/storage_backlog.2 has 1 pending task(s):
[2021/11/18 17:51:07] [ info] [task]   task_id=4 still running on route(s): stackdriver/stackdriver.1 
[2021/11/18 17:51:07] [ warn] [engine] service will stop in 5 seconds
[2021/11/18 17:51:12] [ info] [engine] service stopped
[2021/11/18 17:51:12] [ warn] [engine] shutdown delayed, grace period has finished but some tasks are still running.
[2021/11/18 17:51:12] [ info] [task] winlog/winlog.0 has 5 pending task(s):
[2021/11/18 17:51:12] [ info] [task]   task_id=5 still running on route(s): stackdriver/stackdriver.0 
[2021/11/18 17:51:12] [ info] [task]   task_id=2 still running on route(s): stackdriver/stackdriver.0 
[2021/11/18 17:51:12] [ info] [task]   task_id=3 still running on route(s): stackdriver/stackdriver.0 
[2021/11/18 17:51:12] [ info] [task]   task_id=9 still running on route(s): stackdriver/stackdriver.0 
[2021/11/18 17:51:12] [ info] [task]   task_id=10 still running on route(s): stackdriver/stackdriver.0 
[2021/11/18 17:51:12] [ info] [task] tail/tail.1 has 5 pending task(s):
[2021/11/18 17:51:12] [ info] [task]   task_id=6 still running on route(s): stackdriver/stackdriver.1 
[2021/11/18 17:51:12] [ info] [task]   task_id=7 still running on route(s): stackdriver/stackdriver.1 
[2021/11/18 17:51:12] [ info] [task]   task_id=8 still running on route(s): stackdriver/stackdriver.1 
[2021/11/18 17:51:12] [ info] [task]   task_id=1 still running on route(s): stackdriver/stackdriver.1 
[2021/11/18 17:51:12] [ info] [task]   task_id=0 still running on route(s): stackdriver/stackdriver.1 
[2021/11/18 17:51:12] [ info] [task] storage_backlog/storage_backlog.2 has 1 pending task(s):
[2021/11/18 17:51:12] [ info] [task]   task_id=4 still running on route(s): stackdriver/stackdriver.1 
[2021/11/18 17:51:12] [ warn] [engine] service will stop in 5 seconds
[2021/11/18 17:51:15] [debug] [upstream] drop keepalive connection #-1 to logging.googleapis.com:443 (keepalive idle timeout)
[2021/11/18 17:51:17] [debug] [upstream] drop keepalive connection #-1 to logging.googleapis.com:443 (keepalive idle timeout)
[2021/11/18 17:51:17] [debug] [upstream] drop keepalive connection #-1 to logging.googleapis.com:443 (keepalive idle timeout)
[2021/11/18 17:51:17] [ info] [engine] service stopped
[2021/11/18 17:51:17] [ warn] [engine] shutdown delayed, grace period has finished but some tasks are still running.
[2021/11/18 17:51:17] [ info] [task] winlog/winlog.0 has 5 pending task(s):
[2021/11/18 17:51:17] [ info] [task]   task_id=5 still running on route(s): stackdriver/stackdriver.0 
[2021/11/18 17:51:17] [ info] [task]   task_id=2 still running on route(s): stackdriver/stackdriver.0 
[2021/11/18 17:51:17] [ info] [task]   task_id=3 still running on route(s): stackdriver/stackdriver.0 
[2021/11/18 17:51:17] [ info] [task]   task_id=9 still running on route(s): stackdriver/stackdriver.0 
[2021/11/18 17:51:17] [ info] [task]   task_id=10 still running on route(s): stackdriver/stackdriver.0 
[2021/11/18 17:51:17] [ info] [task] tail/tail.1 has 5 pending task(s):
[2021/11/18 17:51:17] [ info] [task]   task_id=6 still running on route(s): stackdriver/stackdriver.1 
[2021/11/18 17:51:17] [ info] [task]   task_id=7 still running on route(s): stackdriver/stackdriver.1 
[2021/11/18 17:51:17] [ info] [task]   task_id=8 still running on route(s): stackdriver/stackdriver.1 
[2021/11/18 17:51:17] [ info] [task]   task_id=1 still running on route(s): stackdriver/stackdriver.1 
[2021/11/18 17:51:17] [ info] [task]   task_id=0 still running on route(s): stackdriver/stackdriver.1 
[2021/11/18 17:51:17] [ info] [task] storage_backlog/storage_backlog.2 has 1 pending task(s):
[2021/11/18 17:51:17] [ info] [task]   task_id=4 still running on route(s): stackdriver/stackdriver.1 
[2021/11/18 17:51:17] [ warn] [engine] service will stop in 5 seconds
[2021/11/18 17:51:18] [debug] [upstream] drop keepalive connection #-1 to logging.googleapis.com:443 (keepalive idle timeout)
[2021/11/18 17:51:18] [debug] [upstream] drop keepalive connection #-1 to logging.googleapis.com:443 (keepalive idle timeout)
[2021/11/18 17:51:20] [debug] [upstream] drop keepalive connection #-1 to logging.googleapis.com:443 (keepalive idle timeout)
[2021/11/18 17:51:20] [debug] [upstream] drop keepalive connection #-1 to logging.googleapis.com:443 (keepalive idle timeout)
[2021/11/18 17:51:21] [debug] [upstream] drop keepalive connection #-1 to logging.googleapis.com:443 (keepalive idle timeout)
[2021/11/18 17:51:22] [ info] [engine] service stopped
[2021/11/18 17:51:22] [ warn] [engine] shutdown delayed, grace period has finished but some tasks are still running.
[2021/11/18 17:51:22] [ info] [task] winlog/winlog.0 has 5 pending task(s):
[2021/11/18 17:51:22] [ info] [task]   task_id=5 still running on route(s): stackdriver/stackdriver.0 
[2021/11/18 17:51:22] [ info] [task]   task_id=2 still running on route(s): stackdriver/stackdriver.0 
[2021/11/18 17:51:22] [ info] [task]   task_id=3 still running on route(s): stackdriver/stackdriver.0 
[2021/11/18 17:51:22] [ info] [task]   task_id=9 still running on route(s): stackdriver/stackdriver.0 
[2021/11/18 17:51:22] [ info] [task]   task_id=10 still running on route(s): stackdriver/stackdriver.0 
[2021/11/18 17:51:22] [ info] [task] tail/tail.1 has 5 pending task(s):
[2021/11/18 17:51:22] [ info] [task]   task_id=6 still running on route(s): stackdriver/stackdriver.1 
[2021/11/18 17:51:22] [ info] [task]   task_id=7 still running on route(s): stackdriver/stackdriver.1 
[2021/11/18 17:51:22] [ info] [task]   task_id=8 still running on route(s): stackdriver/stackdriver.1 
[2021/11/18 17:51:22] [ info] [task]   task_id=1 still running on route(s): stackdriver/stackdriver.1 
[2021/11/18 17:51:22] [ info] [task]   task_id=0 still running on route(s): stackdriver/stackdriver.1 
[2021/11/18 17:51:22] [ info] [task] storage_backlog/storage_backlog.2 has 1 pending task(s):
[2021/11/18 17:51:22] [ info] [task]   task_id=4 still running on route(s): stackdriver/stackdriver.1 
[2021/11/18 17:51:22] [ warn] [engine] service will stop in 5 seconds
[2021/11/18 17:51:23] [debug] [upstream] drop keepalive connection #-1 to logging.googleapis.com:443 (keepalive idle timeout)
[2021/11/18 17:51:23] [debug] [upstream] drop keepalive connection #-1 to logging.googleapis.com:443 (keepalive idle timeout)
[2021/11/18 17:51:27] [ info] [engine] service stopped
[2021/11/18 17:51:27] [ warn] [engine] shutdown delayed, grace period has finished but some tasks are still running.
[2021/11/18 17:51:27] [ info] [task] winlog/winlog.0 has 5 pending task(s):
[2021/11/18 17:51:27] [ info] [task]   task_id=5 still running on route(s): stackdriver/stackdriver.0 
[2021/11/18 17:51:27] [ info] [task]   task_id=2 still running on route(s): stackdriver/stackdriver.0 
[2021/11/18 17:51:27] [ info] [task]   task_id=3 still running on route(s): stackdriver/stackdriver.0 
[2021/11/18 17:51:27] [ info] [task]   task_id=9 still running on route(s): stackdriver/stackdriver.0 
[2021/11/18 17:51:27] [ info] [task]   task_id=10 still running on route(s): stackdriver/stackdriver.0 
[2021/11/18 17:51:27] [ info] [task] tail/tail.1 has 5 pending task(s):
[2021/11/18 17:51:27] [ info] [task]   task_id=6 still running on route(s): stackdriver/stackdriver.1 
[2021/11/18 17:51:27] [ info] [task]   task_id=7 still running on route(s): stackdriver/stackdriver.1 
[2021/11/18 17:51:27] [ info] [task]   task_id=8 still running on route(s): stackdriver/stackdriver.1 
[2021/11/18 17:51:27] [ info] [task]   task_id=1 still running on route(s): stackdriver/stackdriver.1 
[2021/11/18 17:51:27] [ info] [task]   task_id=0 still running on route(s): stackdriver/stackdriver.1 
[2021/11/18 17:51:27] [ info] [task] storage_backlog/storage_backlog.2 has 1 pending task(s):
[2021/11/18 17:51:27] [ info] [task]   task_id=4 still running on route(s): stackdriver/stackdriver.1 
[2021/11/18 17:51:27] [ warn] [engine] service will stop in 5 seconds
jkschulz commented 2 years ago

Tested https://github.com/fluent/fluent-bit/pull/4355 and still seeing the issues.

  1. Built Ops Agent using fluent-bit@3b9a2e7REPO_SUFFIX=20211123-1. Installed on Windows 2016 VM

    PS C:Windows\system32> (New-Object Net.WebClient).DownloadFile("https://dl.google.com/cloudagents/add-google-cloud-ops-agent-repo.ps1", "${env:UserProfile}\add-google-cloud-ops-agent-repo.ps1")
    PS C:Windows\system32> $env:REPO_SUFFIX="20211123-1"; & $env:UserProfile\add-google-cloud-ops-agent-repo.ps1 -AlsoInstall
  2. Restarted agent.

    PS C:\Windows\system32> Restart-Service google-cloud-ops-agent -Force

This completed without the time out expiration once (maybe because I did it in the same minute that I started it?), but after the restart, started seeing these logs (full logs here):

[2021/11/23 03:03:22] [error] [upstream] connection #1924 to logging.googleapis.com:443 timed out after 10 seconds
[2021/11/23 03:03:22] [error] [io] connect event handler error
[2021/11/23 03:03:22] [ warn] [net] timeout detected between connection attempts
[2021/11/23 03:03:32] [error] [upstream] connection #1956 to logging.googleapis.com:443 timed out after 10 seconds
[2021/11/23 03:03:32] [error] [io] connect event handler error
[2021/11/23 03:03:32] [ warn] [net] timeout detected between connection attempts
[2021/11/23 03:04:13] [ warn] [engine] failed to flush chunk '1428-1637636650.493891500.flb', retry in 9 seconds: task_id=2, input=winlog.0 > output=stackdriver.0 (out_id=0)
[2021/11/23 03:06:50] [error] [upstream] connection #704 to logging.googleapis.com:443 timed out after 10 seconds
[2021/11/23 03:06:50] [error] [io] connect event handler error
[2021/11/23 03:06:50] [ warn] [net] timeout detected between connection attempts
[2021/11/23 03:06:52] [error] [upstream] connection #2036 to logging.googleapis.com:443 timed out after 10 seconds
[2021/11/23 03:06:52] [error] [io] connect event handler error
[2021/11/23 03:06:52] [ warn] [net] timeout detected between connection attempts
[2021/11/23 03:07:01] [error] [upstream] connection #2032 to logging.googleapis.com:443 timed out after 10 seconds
[2021/11/23 03:07:01] [error] [io] connect event handler error
[2021/11/23 03:07:01] [ warn] [net] timeout detected between connection attempts
[2021/11/23 03:07:02] [error] [upstream] connection #2052 to logging.googleapis.com:443 timed out after 10 seconds
[2021/11/23 03:07:02] [error] [io] connect event handler error
[2021/11/23 03:07:02] [ warn] [net] timeout detected between connection attempts
[2021/11/23 03:07:11] [error] [upstream] connection #2072 to logging.googleapis.com:443 timed out after 10 seconds
[2021/11/23 03:07:11] [error] [io] connect event handler error
  1. Tried to stop service. Encountered TimeoutException. Same thing with every Restart-Service attempt after too.
PS C:\Windows\system32> Stop-Service google-cloud-ops-agent -Force
Stop-Service : Time out has expired and the operation has not been completed.
At line:1 char:1
+ Stop-Service google-cloud-ops-agent -Force
+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    + CategoryInfo          : NotSpecified: (:) [Stop-Service], TimeoutException
    + FullyQualifiedErrorId : System.ServiceProcess.TimeoutException,Microsoft.PowerShell.Commands.StopServiceCommand
  1. Checked logs. They say the service has stopped (full logs here), but it hasn't.
[2021/11/23 03:17:05] [ warn] [engine] service will shutdown in max 5 seconds
[2021/11/23 03:17:06] [ info] [task] winlog/winlog.0 has 14 pending task(s):
[2021/11/23 03:17:06] [ info] [task]   task_id=0 still running on route(s): stackdriver/stackdriver.0 
[2021/11/23 03:17:06] [ info] [task]   task_id=2 still running on route(s): stackdriver/stackdriver.0 
[2021/11/23 03:17:06] [ info] [task]   task_id=3 still running on route(s): stackdriver/stackdriver.0 
[2021/11/23 03:17:06] [ info] [task]   task_id=8 still running on route(s): stackdriver/stackdriver.0 
[2021/11/23 03:17:06] [ info] [task]   task_id=9 still running on route(s): stackdriver/stackdriver.0 
[2021/11/23 03:17:06] [ info] [task]   task_id=11 still running on route(s): stackdriver/stackdriver.0 
[2021/11/23 03:17:06] [ info] [task]   task_id=14 still running on route(s): stackdriver/stackdriver.0 
[2021/11/23 03:17:06] [ info] [task]   task_id=15 still running on route(s): stackdriver/stackdriver.0 
[2021/11/23 03:17:06] [ info] [task]   task_id=16 still running on route(s): stackdriver/stackdriver.0 
[2021/11/23 03:17:06] [ info] [task]   task_id=18 still running on route(s): stackdriver/stackdriver.0 
[2021/11/23 03:17:06] [ info] [task]   task_id=19 still running on route(s): stackdriver/stackdriver.0 
[2021/11/23 03:17:06] [ info] [task]   task_id=21 still running on route(s): stackdriver/stackdriver.0 
[2021/11/23 03:17:06] [ info] [task]   task_id=23 still running on route(s): stackdriver/stackdriver.0 
[2021/11/23 03:17:06] [ info] [task]   task_id=28 still running on route(s): stackdriver/stackdriver.0 
[2021/11/23 03:17:06] [ info] [task] tail/tail.1 has 16 pending task(s):
[2021/11/23 03:17:06] [ info] [task]   task_id=1 still running on route(s): stackdriver/stackdriver.1 
[2021/11/23 03:17:06] [ info] [task]   task_id=4 still running on route(s): stackdriver/stackdriver.1 
[2021/11/23 03:17:06] [ info] [task]   task_id=5 still running on route(s): stackdriver/stackdriver.1 
[2021/11/23 03:17:06] [ info] [task]   task_id=6 still running on route(s): stackdriver/stackdriver.1 
[2021/11/23 03:17:06] [ info] [task]   task_id=7 still running on route(s): stackdriver/stackdriver.1 
[2021/11/23 03:17:06] [ info] [task]   task_id=10 still running on route(s): stackdriver/stackdriver.1 
[2021/11/23 03:17:06] [ info] [task]   task_id=12 still running on route(s): stackdriver/stackdriver.1 
[2021/11/23 03:17:06] [ info] [task]   task_id=13 still running on route(s): stackdriver/stackdriver.1 
[2021/11/23 03:17:06] [ info] [task]   task_id=17 still running on route(s): stackdriver/stackdriver.1 
[2021/11/23 03:17:06] [ info] [task]   task_id=20 still running on route(s): stackdriver/stackdriver.1 
[2021/11/23 03:17:06] [ info] [task]   task_id=22 still running on route(s): stackdriver/stackdriver.1 
[2021/11/23 03:17:06] [ info] [task]   task_id=24 still running on route(s): stackdriver/stackdriver.1 
[2021/11/23 03:17:06] [ info] [task]   task_id=25 still running on route(s): stackdriver/stackdriver.1 
[2021/11/23 03:17:06] [ info] [task]   task_id=26 still running on route(s): stackdriver/stackdriver.1 
[2021/11/23 03:17:06] [ info] [task]   task_id=27 still running on route(s): stackdriver/stackdriver.1 
[2021/11/23 03:17:06] [ info] [task]   task_id=29 still running on route(s): stackdriver/stackdriver.1 
[2021/11/23 03:17:06] [ info] [task] storage_backlog/storage_backlog.2 has 0 pending task(s):
[2021/11/23 03:17:18] [ info] [task] winlog/winlog.0 has 14 pending task(s):
[2021/11/23 03:17:18] [ info] [task]   task_id=0 still running on route(s): stackdriver/stackdriver.0 
[2021/11/23 03:17:18] [ info] [task]   task_id=2 still running on route(s): stackdriver/stackdriver.0 
[2021/11/23 03:17:18] [ info] [task]   task_id=3 still running on route(s): stackdriver/stackdriver.0 
[2021/11/23 03:17:18] [ info] [task]   task_id=8 still running on route(s): stackdriver/stackdriver.0 
[2021/11/23 03:17:18] [ info] [task]   task_id=9 still running on route(s): stackdriver/stackdriver.0 
[2021/11/23 03:17:18] [ info] [task]   task_id=11 still running on route(s): stackdriver/stackdriver.0 
[2021/11/23 03:17:18] [ info] [task]   task_id=14 still running on route(s): stackdriver/stackdriver.0 
[2021/11/23 03:17:18] [ info] [task]   task_id=15 still running on route(s): stackdriver/stackdriver.0 
[2021/11/23 03:17:18] [ info] [task]   task_id=16 still running on route(s): stackdriver/stackdriver.0 
[2021/11/23 03:17:18] [ info] [task]   task_id=18 still running on route(s): stackdriver/stackdriver.0 
[2021/11/23 03:17:18] [ info] [task]   task_id=19 still running on route(s): stackdriver/stackdriver.0 
[2021/11/23 03:17:18] [ info] [task]   task_id=21 still running on route(s): stackdriver/stackdriver.0 
[2021/11/23 03:17:18] [ info] [task]   task_id=23 still running on route(s): stackdriver/stackdriver.0 
[2021/11/23 03:17:18] [ info] [task]   task_id=28 still running on route(s): stackdriver/stackdriver.0 
[2021/11/23 03:17:18] [ info] [task] tail/tail.1 has 16 pending task(s):
[2021/11/23 03:17:18] [ info] [task]   task_id=1 still running on route(s): stackdriver/stackdriver.1 
[2021/11/23 03:17:18] [ info] [task]   task_id=4 still running on route(s): stackdriver/stackdriver.1 
[2021/11/23 03:17:18] [ info] [task]   task_id=5 still running on route(s): stackdriver/stackdriver.1 
[2021/11/23 03:17:18] [ info] [task]   task_id=6 still running on route(s): stackdriver/stackdriver.1 
[2021/11/23 03:17:18] [ info] [task]   task_id=7 still running on route(s): stackdriver/stackdriver.1 
[2021/11/23 03:17:18] [ info] [task]   task_id=10 still running on route(s): stackdriver/stackdriver.1 
[2021/11/23 03:17:18] [ info] [task]   task_id=12 still running on route(s): stackdriver/stackdriver.1 
[2021/11/23 03:17:18] [ info] [task]   task_id=13 still running on route(s): stackdriver/stackdriver.1 
[2021/11/23 03:17:18] [ info] [task]   task_id=17 still running on route(s): stackdriver/stackdriver.1 
[2021/11/23 03:17:18] [ info] [task]   task_id=20 still running on route(s): stackdriver/stackdriver.1 
[2021/11/23 03:17:18] [ info] [task]   task_id=22 still running on route(s): stackdriver/stackdriver.1 
[2021/11/23 03:17:18] [ info] [task]   task_id=24 still running on route(s): stackdriver/stackdriver.1 
[2021/11/23 03:17:18] [ info] [task]   task_id=25 still running on route(s): stackdriver/stackdriver.1 
[2021/11/23 03:17:18] [ info] [task]   task_id=26 still running on route(s): stackdriver/stackdriver.1 
[2021/11/23 03:17:18] [ info] [task]   task_id=27 still running on route(s): stackdriver/stackdriver.1 
[2021/11/23 03:17:18] [ info] [task]   task_id=29 still running on route(s): stackdriver/stackdriver.1 
[2021/11/23 03:17:18] [ info] [task] storage_backlog/storage_backlog.2 has 0 pending task(s):
[2021/11/23 03:17:18] [ info] [engine] service has stopped (30 pending tasks)
[2021/11/23 03:17:24] [ info] [output:stackdriver:stackdriver.0] thread worker #0 stopping...
  1. Confirmed that the service isn't stopped (stuck in STOP_PENDING).
C:\Windows\system32>sc queryex google-cloud-ops-agent-fluent-bit

SERVICE_NAME: google-cloud-ops-agent-fluent-bit
        TYPE               : 10  WIN32_OWN_PROCESS
        STATE              : 3  STOP_PENDING
                                (STOPPABLE, NOT_PAUSABLE, IGNORES_SHUTDOWN)
        WIN32_EXIT_CODE    : 0  (0x0)
        SERVICE_EXIT_CODE  : 0  (0x0)
        CHECKPOINT         : 0x0
        WAIT_HINT          : 0x7530
        PID                : 1428
        FLAGS              :
  1. Enabled debug logs and restarted VM so it'd take effect. Here are the debug logs after it restarted. I wasn't able to try running Restart-Service or anything because debug mode thrashed VM and I couldn't connect to it again.
jkschulz commented 2 years ago

Tested https://github.com/fluent/fluent-bit/pull/4362 -- still seeing the issues.

  1. Built Ops Agent using branch=upstream-deadlock/hash=fd1f909REPO_SUFFIX=20211124-1. Installed on Windows 2016 VM

    PS C:Windows\system32> (New-Object Net.WebClient).DownloadFile("https://dl.google.com/cloudagents/add-google-cloud-ops-agent-repo.ps1", "${env:UserProfile}\add-google-cloud-ops-agent-repo.ps1")
    PS C:Windows\system32> $env:REPO_SUFFIX="20211124-1"; & $env:UserProfile\add-google-cloud-ops-agent-repo.ps1 -AlsoInstall
  2. Restarted agent. Encountered TimeoutException.

    PS C:\Windows\system32> date
    Wednesday, November 24, 2021 5:14:51 AM
    PS C:\Windows\system32> Restart-Service google-cloud-ops-agent -Force
    Restart-Service : Time out has expired and the operation has not been completed.
    At line:1 char:1
    + Restart-Service google-cloud-ops-agent -Force
    + ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
        + CategoryInfo          : NotSpecified: (:) [Restart-Service], TimeoutException
        + FullyQualifiedErrorId : System.ServiceProcess.TimeoutException,Microsoft.PowerShell.Commands.RestartServiceCommand
  3. Checked logs. They say the service has stopped but it hasn't.

    [2021/11/24 05:13:52] [ info] [engine] started (pid=1416)
    [2021/11/24 05:13:52] [ info] [storage] version=1.1.5, initializing...
    [2021/11/24 05:13:52] [ info] [storage] root path 'C:\ProgramData\Google\Cloud Operations\Ops Agent\run\buffers'
    [2021/11/24 05:13:52] [ info] [storage] normal synchronization mode, checksum enabled, max_chunks_up=128
    [2021/11/24 05:13:52] [ info] [storage] backlog input plugin: storage_backlog.2
    [2021/11/24 05:13:52] [ info] [cmetrics] version=0.2.2
    [2021/11/24 05:13:52] [ info] [input:storage_backlog:storage_backlog.2] queue memory limit: 47.7M
    [2021/11/24 05:13:52] [ info] [output:stackdriver:stackdriver.0] metadata_server set to http://metadata.google.internal
    [2021/11/24 05:13:52] [ warn] [output:stackdriver:stackdriver.0] client_email is not defined, using a default one
    [2021/11/24 05:13:52] [ warn] [output:stackdriver:stackdriver.0] private_key is not defined, fetching it from metadata server
    [2021/11/24 05:13:52] [ info] [output:stackdriver:stackdriver.1] metadata_server set to http://metadata.google.internal
    [2021/11/24 05:13:52] [ info] [output:stackdriver:stackdriver.0] worker #3 started
    [2021/11/24 05:13:52] [ info] [output:stackdriver:stackdriver.0] worker #4 started
    [2021/11/24 05:13:52] [ info] [output:stackdriver:stackdriver.0] worker #5 started
    [2021/11/24 05:13:52] [ info] [output:stackdriver:stackdriver.0] worker #6 started
    [2021/11/24 05:13:52] [ warn] [output:stackdriver:stackdriver.1] client_email is not defined, using a default one
    [2021/11/24 05:13:52] [ info] [output:stackdriver:stackdriver.0] worker #0 started
    [2021/11/24 05:13:52] [ info] [output:stackdriver:stackdriver.0] worker #1 started
    [2021/11/24 05:13:52] [ info] [output:stackdriver:stackdriver.0] worker #2 started
    [2021/11/24 05:13:52] [ info] [output:stackdriver:stackdriver.0] worker #7 started
    [2021/11/24 05:13:52] [ warn] [output:stackdriver:stackdriver.1] private_key is not defined, fetching it from metadata server
    [2021/11/24 05:13:52] [ info] [http_server] listen iface=0.0.0.0 tcp_port=2020
    [2021/11/24 05:13:52] [ info] [output:stackdriver:stackdriver.1] worker #2 started
    [2021/11/24 05:13:52] [ info] [output:stackdriver:stackdriver.1] worker #3 started
    [2021/11/24 05:13:52] [ info] [output:stackdriver:stackdriver.1] worker #4 started
    [2021/11/24 05:13:52] [ info] [output:stackdriver:stackdriver.1] worker #5 started
    [2021/11/24 05:13:52] [ info] [output:stackdriver:stackdriver.1] worker #6 started
    [2021/11/24 05:13:52] [ info] [output:stackdriver:stackdriver.1] worker #7 started
    [2021/11/24 05:13:52] [ info] [sp] stream processor started
    [2021/11/24 05:13:52] [ info] [output:stackdriver:stackdriver.1] worker #0 started
    [2021/11/24 05:13:52] [ info] [output:stackdriver:stackdriver.1] worker #1 started
    [2021/11/24 05:14:34] [error] [upstream] connection #2016 to logging.googleapis.com:443 timed out after 10 seconds
    [2021/11/24 05:14:34] [ warn] [net] timeout detected between connection attempts
    [2021/11/24 05:14:44] [error] [upstream] connection #1600 to logging.googleapis.com:443 timed out after 10 seconds
    [2021/11/24 05:14:44] [ warn] [net] timeout detected between connection attempts
    [2021/11/24 05:14:52] [ warn] [engine] service will shutdown in max 5 seconds
    [2021/11/24 05:14:53] [ info] [task] winlog/winlog.0 has 2 pending task(s):
    [2021/11/24 05:14:53] [ info] [task]   task_id=0 still running on route(s): stackdriver/stackdriver.0 
    [2021/11/24 05:14:53] [ info] [task]   task_id=1 still running on route(s): stackdriver/stackdriver.0 
    [2021/11/24 05:14:53] [ info] [task] tail/tail.1 has 1 pending task(s):
    [2021/11/24 05:14:53] [ info] [task]   task_id=2 still running on route(s): stackdriver/stackdriver.1 
    [2021/11/24 05:14:53] [ info] [task] storage_backlog/storage_backlog.2 has 0 pending task(s):
    [2021/11/24 05:14:55] [ warn] [net] timeout detected between connection attempts
    [2021/11/24 05:14:57] [ info] [task] winlog/winlog.0 has 2 pending task(s):
    [2021/11/24 05:14:57] [ info] [task]   task_id=0 still running on route(s): stackdriver/stackdriver.0 
    [2021/11/24 05:14:57] [ info] [task]   task_id=1 still running on route(s): stackdriver/stackdriver.0 
    [2021/11/24 05:14:57] [ info] [task] tail/tail.1 has 1 pending task(s):
    [2021/11/24 05:14:57] [ info] [task]   task_id=2 still running on route(s): stackdriver/stackdriver.1 
    [2021/11/24 05:14:57] [ info] [task] storage_backlog/storage_backlog.2 has 0 pending task(s):
    [2021/11/24 05:14:57] [ info] [engine] service has stopped (3 pending tasks)
    [2021/11/24 05:14:57] [ info] [output:stackdriver:stackdriver.0] thread worker #0 stopping...
    [2021/11/24 05:14:57] [ info] [output:stackdriver:stackdriver.0] thread worker #0 stopped
    [2021/11/24 05:14:57] [ info] [output:stackdriver:stackdriver.0] thread worker #1 stopping...
  4. Confirmed that the service isn't stopped (stuck in STOP_PENDING).

    C:\Windows\system32>sc queryex google-cloud-ops-agent-fluent-bit
    
    SERVICE_NAME: google-cloud-ops-agent-fluent-bit
            TYPE               : 10  WIN32_OWN_PROCESS
            STATE              : 3  STOP_PENDING
                                    (STOPPABLE, NOT_PAUSABLE, IGNORES_SHUTDOWN)
            WIN32_EXIT_CODE    : 0  (0x0)
            SERVICE_EXIT_CODE  : 0  (0x0)
            CHECKPOINT         : 0x0
            WAIT_HINT          : 0x7530
            PID                : 1416
            FLAGS              :
jkschulz commented 2 years ago

Tested https://github.com/fluent/fluent-bit/commit/4c93755


Details

  1. Built Ops Agent using branch=leonardo-issue-4300-poc/hash=4c93755REPO_SUFFIX=20211125-1. Installed on Windows 2016 VM

    PS C:Windows\system32> (New-Object Net.WebClient).DownloadFile("https://dl.google.com/cloudagents/add-google-cloud-ops-agent-repo.ps1", "${env:UserProfile}\add-google-cloud-ops-agent-repo.ps1")
    PS C:Windows\system32> $env:REPO_SUFFIX="20211125-1"; & $env:UserProfile\add-google-cloud-ops-agent-repo.ps1 -AlsoInstall
  2. Restarted agent: successful!

    PS C:\Windows\system32> date; Restart-Service google-cloud-ops-agent -Force
    
    Thursday, November 25, 2021 2:19:55 AM
  3. Checked logs. After the restart, logs started growing with messages about connection timeouts and failed to flush chunk.

    • to open log file: notepad C:\ProgramData\Google\Cloud Operations\Ops Agent\log\logging-module.log
      • full logs: https://gist.github.com/jkschulz/65b5303530bdd936f7835fbfcfdde225
        [2021/11/25 02:39:52] [error] [upstream] connection #2044 to logging.googleapis.com:443 timed out after 10 seconds
        [2021/11/25 02:39:52] [ info] [upstream] resuming timed out coroutine=0000018D81384820
        [2021/11/25 02:39:52] [error] [io] connect event handler error
        [2021/11/25 02:39:52] [ warn] [net] timeout detected between connection attempts
        [2021/11/25 02:40:03] [error] [upstream] connection #1960 to logging.googleapis.com:443 timed out after 10 seconds
        [2021/11/25 02:40:03] [ info] [upstream] resuming timed out coroutine=0000018D80E13020
        [2021/11/25 02:40:03] [error] [io] connect event handler error
        [2021/11/25 02:40:03] [ warn] [net] timeout detected between connection attempts
        [2021/11/25 02:40:11] [ warn] [engine] failed to flush chunk '2112-1637808003.58775000.flb', retry in 11 seconds: task_id=0, input=tail.1 > output=stackdriver.1 (out_id=1)
        [2021/11/25 02:40:11] [ warn] [engine] failed to flush chunk '2112-1637808001.42068100.flb', retry in 9 seconds: task_id=1, input=winlog.0 > output=stackdriver.0 (out_id=0)
        [2021/11/25 02:41:33] [error] [upstream] connection #2412 to logging.googleapis.com:443 timed out after 10 seconds
  4. Enabled debug logs and restarted agent with cleared state. Debug logs: https://gist.github.com/jkschulz/5cf5ae93eb173b1761fa89698dc6cd4b

    • Changed config (notepad C:\Program Files\Google\Cloud Operations\Ops Agent\config\config.yaml) to:
      logging:
      service:
        log_level: debug
    • Reset agent
      PS C:\Windows\system32> Stop-Service google-cloud-ops-agent -Force
      PS C:\Windows\system32> rm "C:\ProgramData\Google\Cloud Operations\Ops Agent\log\logging-module.log"; rm -r "C:\ProgramData\Google\Cloud Operations\Ops Agent\run\buffers\"
      PS C:\Windows\system32> date; Restart-Service google-cloud-ops-agent -Force
      Thursday, November 25, 2021 3:17:52 AM
qingling128 commented 2 years ago

A trimmed down version of reproduction:

On a Windows 2016 VM (e.g. gcloud compute instances create --image-project windows-cloud --image windows-server-2016-dc-v20211012 test-vm)

// Clean up previous logs. // Make a buffer file folder. // Make a new log file. // Create a service // Start then restart the service a few times

rm "C:\Program Files\fluent-bit.log"

mkdir "C:\Program Files\buffers"

echo "" > "C:\Program Files\fluent-bit.log"

New-Service -Name "fluentbit" -BinaryPathName '"C:\Program Files\fluent-bit.exe" -c "C:\Program Files\fluent-bit.conf" --storage_path "C:\Program Files\buffers" --log_file "C:\Program Files\fluent-bit.log"'

Start-Service fluentbit
Restart-Service fluentbit
Restart-Service fluentbit
Restart-Service fluentbit
Restart-Service fluentbit

The Fluent Bit exe binary is built by first checking out Fluent Bit 1.8 branch at 90a3b3938787863f9284392a91d48e604e16aafd then built it from a container:

FROM base as fluentbit
COPY fluent-bit /work/fluent-bit
WORKDIR /work/fluent-bit/build
RUN cmake -G "'Visual Studio 16 2019'" -DCMAKE_BUILD_TYPE=RELWITHDEBINFO -DFLB_OUT_KINESIS_STREAMS=OFF ../;
RUN cmake --build . --config Release; `
    Copy-Item -Path bin/Release/fluent-bit.exe -Destination /work/out/bin/; `
    Copy-Item -Path bin/Release/fluent-bit.dll -Destination /work/out/bin/;

C:\Program Files\fluent-bit.conf looks like:

[SERVICE]
    Daemon                    off
    Flush                     1
    HTTP_Listen               0.0.0.0
    HTTP_PORT                 2020
    HTTP_Server               On
    Log_Level                 info
    storage.backlog.mem_limit 50M
    storage.checksum          on
    storage.max_chunks_up     128
    storage.metrics           on
    storage.sync              normal

[INPUT]
    Buffer_Chunk_Size 512k
    Buffer_Max_Size   5M
    DB                "C:\Program Files\buffers\fluent-bit"
    Key               message
    Mem_Buf_Limit     10M
    Name              tail
    Path              "C:\Program Files\fluent-bit.log"
    Read_from_Head    True
    Rotate_Wait       30
    Skip_Long_Lines   On
    Tag               ops-agent-fluent-bit
    storage.type      filesystem

[OUTPUT]
    Match_Regex       ^(ops-agent-fluent-bit)$
    Name              stackdriver
    Retry_Limit       3
    resource          gce_instance
    tls               On
    tls.verify        Off
    workers           8

Sample output looks like:

PS C:\Users\lingshi> rm "C:\Program Files\fluent-bit.log"
>>
>> mkdir "C:\Program Files\buffers"
>>
>> echo "" > "C:\Program Files\fluent-bit.log"
>>
>> New-Service -Name "fluent-bit-service" -BinaryPathName '"C:\Program Files\Google\Cloud Operations\Ops Agent\bin\fluent-bit.exe" -c "C:\Program Files\fluent-bit.conf" --storage_path "C:\Program Files\buffers" --log_file "C:\Program Files\fluent-bit.log"'
>>
>> Start-Service fluent-bit-service
>> Restart-Service fluent-bit-service
>> Restart-Service fluent-bit-service
>> Restart-Service fluent-bit-service
>> Restart-Service fluent-bit-service

Status   Name               DisplayName
------   ----               -----------
Stopped  fluent-bit-service fluent-bit-service
WARNING: Waiting for service 'fluent-bit-service (fluent-bit-service)' to stop...
WARNING: Waiting for service 'fluent-bit-service (fluent-bit-service)' to stop...
WARNING: Waiting for service 'fluent-bit-service (fluent-bit-service)' to stop...
WARNING: Waiting for service 'fluent-bit-service (fluent-bit-service)' to stop...
WARNING: Waiting for service 'fluent-bit-service (fluent-bit-service)' to stop...
WARNING: Waiting for service 'fluent-bit-service (fluent-bit-service)' to stop...
WARNING: Waiting for service 'fluent-bit-service (fluent-bit-service)' to stop...
WARNING: Waiting for service 'fluent-bit-service (fluent-bit-service)' to stop...
WARNING: Waiting for service 'fluent-bit-service (fluent-bit-service)' to stop...
WARNING: Waiting for service 'fluent-bit-service (fluent-bit-service)' to stop...
WARNING: Waiting for service 'fluent-bit-service (fluent-bit-service)' to stop...
WARNING: Waiting for service 'fluent-bit-service (fluent-bit-service)' to stop...
WARNING: Waiting for service 'fluent-bit-service (fluent-bit-service)' to stop...
WARNING: Waiting for service 'fluent-bit-service (fluent-bit-service)' to stop...
WARNING: Waiting for service 'fluent-bit-service (fluent-bit-service)' to stop...
WARNING: Waiting for service 'fluent-bit-service (fluent-bit-service)' to stop...
WARNING: Waiting for service 'fluent-bit-service (fluent-bit-service)' to stop...
WARNING: Waiting for service 'fluent-bit-service (fluent-bit-service)' to stop...
WARNING: Waiting for service 'fluent-bit-service (fluent-bit-service)' to stop...
WARNING: Waiting for service 'fluent-bit-service (fluent-bit-service)' to stop...
WARNING: Waiting for service 'fluent-bit-service (fluent-bit-service)' to stop...
WARNING: Waiting for service 'fluent-bit-service (fluent-bit-service)' to stop...
WARNING: Waiting for service 'fluent-bit-service (fluent-bit-service)' to stop...
WARNING: Waiting for service 'fluent-bit-service (fluent-bit-service)' to stop...

One very intersting finding is that if Fluent Bit is not tailing its own log, then the issue does not exist.

rm "C:\Program Files\fluent-bit-a-different-log-file.log"

mkdir "C:\Program Files\buffers"

echo "" > "C:\Program Files\fluent-bit-a-different-log-file.log"

New-Service -Name "fluent-bit-service2323" -BinaryPathName '"C:\Program Files\Google\Cloud Operations\Ops Agent\bin\fluent-bit.exe" -c "C:\Program Files\fluent-bit.conf" --storage_path "C:\Program Files\buffers" --log_file "C:\Program Files\fluent-bit-a-different-log-file.log"'

Start-Service fluent-bit-service2323
Restart-Service fluent-bit-service2323
Restart-Service fluent-bit-service2323
Restart-Service fluent-bit-service2323
Restart-Service fluent-bit-service2323
leonardo-albertovich commented 2 years ago

Status update :

In our tests we could not correlate the issue reproduction to the inclussion of the log file as an input, however, what we were able to prove with an acceptable level of certainty was that in order for fluent-bit to fail to shut down it needs to have paused coroutines that are not resumed and that is something that was fixed in 1.9.x with PR #4140 by me.

Additionally, we found that those timeout are caused by fluent-bit trying to connect to an IPv6 host when trying to flush which happens because c-ares returns both IPv4 and IPv6 results (because it detects IPv6 as being operational) and also because it doesn't reorder the results which means even if the queries are issued in a particular order the results are returned in the order in which they arrive.

There was a remaining constraint to be investigated which was why does it only happen in Windows 2016 and in order to validate this a series of instances were created and used to ping on of the IPv6 addresses returned for logging.googleapis.com. The results were :

Based on that we re-tested the previously added DNS resolver selector setting in fluent-bit which resulted in apparent success which means the native resolver does return IP viable addresses which suggests this can be used as an interim solution.

A better solution to retain the benefits of asynchronous DNS lookups would be adding familty type based result prioirization.

quentinmit commented 2 years ago

I believe you need to be doing the happy eyeballs algorithm on every platform; I think it's probably a fluke that you got different behavior on Windows 2016. It's possible for any platform to have non-working IPv6 service that blackholes traffic (and conversely as IPv6 gets more and more adopted that's becoming a problem on IPv4, too). https://en.wikipedia.org/wiki/Happy_Eyeballs

This is normally something that is built-in to an HTTP library.

leonardo-albertovich commented 2 years ago

That sounds interesting @quentinmit, I'm not entirely sure if it could be integrated in the current architecture without substantial changes though.

qingling128 commented 2 years ago

For now, we are using the legacy DNS resolver mode before new mode stables a bit more: https://github.com/GoogleCloudPlatform/ops-agent/pull/290.

The related config is:

[SERVICE]
    dns.resolver legacy

Related PRs to add the config above are:

qingling128 commented 2 years ago

Filed a separate issue for the IPV6 / IPV4 fix: https://github.com/fluent/fluent-bit/issues/4473.

leonardo-albertovich commented 2 years ago

Status update : One of the problems that causes this issue is coroutine hang due to connection timeouts. This is a defect that was corrected in 1.9.x with PRs #4140 and #4193 which upon timeout detection resumes the coroutine referenced by te connection instance if the event type is FLB_ENGINE_EV_THREAD (additionally it ensures the coroutine reference field in the connection instance only holds a coroutine reference if its' expecting to be awoken).

This approach worked well but I wasn't entirely convinced so I went back to analyze the problem hoping to be able to organically generate the socket activity required to trigger an event so the system resumed the coroutine automatically but unfortunately wasn't able to, partially because of how prepare_destroy_conn works but also because any actions performed on the socket at that point are performed in between event loop wait calls (ie. a close / closesocket call would normally be detected by the event loop if performed during a wait cycle but not otherwise).

With organic event generation out of the picture I wrote a proof of concept patch to add event synthesis / injection to the event loop system in order to use that to generate the closest possible to organic coroutine wakeup.

This PoC is available under the leonardo-event-injection-poc branch.

The upside of this method is that because we are not altering the "coroutine resume flow" we don't need to control the coroutine reference in the connection instance.

Opinions / reviews / ideas about the concept and potential solutions would be REALLY appreciated, in case anyone has any questions about the logic behind my ideas please don't hesitate to ask.

edsiper commented 2 years ago

Thanks for the detailed update.

Is there any chance that the event injection gets discarded if an mk_event_wait() is invoked ?

When injecting the event is added as a last item of the list:

https://github.com/fluent/fluent-bit/commit/c5879c787b98d9dc350b9f52f5f94e4631bd60a3#diff-1ac0a5ea810c1bbe9b0667e1bf15f338917c8d627832816daaf9ec80ece11774R314-R330

this is the iteration loop when consuming the list of events mk_event_foreach():

https://github.com/monkey/monkey/blob/2513ddf2dd8a352bc266e3248b1b367ad1fc8333/include/monkey/mk_core/mk_event_epoll.h#L31-L43

I am also thinking that this is quite similar to the other cleanup mechanism when processing something after the event loop round has finished:

https://github.com/fluent/fluent-bit/blob/master/src/flb_engine.c#L804-L807

should we implement a cleanup function instead ?

leonardo-albertovich commented 2 years ago

Any type of events are discarded when mk_event_wait is called.

The injected events are consumed on the same cycle they are injected on, at least with the standard compilation flags (in some specific cases when a variable is modified outside of the scope where it's being consumed the volatile attribute needs to be used on the variable definition to prevent the compiler from optimizing the memory read but that's not the case hre)

Adding a cleanup step where all of them are resumed would be the same as what's currently implemented which depends on the rest of the code playing nice, this alternative approach is meant to eliminate that risk.

edsiper commented 2 years ago

@leonardo-albertovich is there any downside of this implementation ?, if there is no one, is it possible to write a unit test for the problem in question ?

leonardo-albertovich commented 2 years ago

I think I may be able to put together a unit test, I'll get to it immediately.

leonardo-albertovich commented 2 years ago

I made PR #4605 with the test case, what the test does is :

Set up fluent bit with a grace period of 20 seconds, a dummy input plugin capped to one sample and a tcp output pointed to 128.1.1.1 (which could be improved) and a connection timeout of 5 seconds.

Then it starts the server, waits 10 seconds (to give it time to flush and time out while being safely under the grace period) and stops the server.

When the server stops the elapsed time from start to stop is measured and if it's higher than 15 seconds it assumes that the grace period was enforced.

In order to test this one would have to manually patch flb_upstream_conn_timeouts because at the moment master includes my original fix that resumes timed out coroutines.

edsiper commented 2 years ago

for #4605 make sure to submit the patch for upstream interface so the CI can pass

leonardo-albertovich commented 2 years ago

What does that mean? (the upstream interface thing) I just saw that ubuntu-18.04 failed the test with the patch so I'll tweak the timing a bit to see what exactly the issue is related to.