Closed qingling128 closed 2 years ago
Manually reproduced the issue with:
gcloud compute instances create --image-project windows-cloud --image windows-server-2016-dc-v20211012 windows-2016-test-1108-3-round2
// Connect to that VM, then run:
(New-Object Net.WebClient).DownloadFile("https://dl.google.com/cloudagents/add-google-cloud-ops-agent-repo.ps1", "C:\Users\lingshi\add-google-cloud-ops-agent-repo.ps1")
PS C:\Users\lingshi> $env:REPO_SUFFIX="20211108-3"; .\add-google-cloud-ops-agent-repo.ps1 -AlsoInstall
// To enable debug logging, change C:\Program Files\Google\Cloud Operations\Ops Agent\config\config.yaml to
logging:
service:
log_level: debug
// To show fluent bit logs:
notepad C:\ProgramData\Google\Cloud Operations\Ops Agent\log\logging-module.log
Logs:
[2021/11/10 02:55:05] [ info] [engine] started (pid=4880)
[2021/11/10 02:55:05] [ info] [storage] version=1.1.5, initializing...
[2021/11/10 02:55:05] [ info] [storage] root path 'C:\ProgramData\Google\Cloud Operations\Ops Agent\run\buffers'
[2021/11/10 02:55:05] [ info] [storage] normal synchronization mode, checksum enabled, max_chunks_up=128
[2021/11/10 02:55:05] [ info] [storage] backlog input plugin: storage_backlog.2
[2021/11/10 02:55:05] [ info] [cmetrics] version=0.2.2
[2021/11/10 02:55:05] [ info] [input:storage_backlog:storage_backlog.2] queue memory limit: 47.7M
[2021/11/10 02:55:05] [ info] [output:stackdriver:stackdriver.0] metadata_server set to http://metadata.google.internal
[2021/11/10 02:55:05] [ warn] [output:stackdriver:stackdriver.0] client_email is not defined, using a default one
[2021/11/10 02:55:05] [ warn] [output:stackdriver:stackdriver.0] private_key is not defined, fetching it from metadata server
[2021/11/10 02:55:05] [ info] [output:stackdriver:stackdriver.1] metadata_server set to http://metadata.google.internal
[2021/11/10 02:55:05] [ warn] [output:stackdriver:stackdriver.1] client_email is not defined, using a default one
[2021/11/10 02:55:05] [ warn] [output:stackdriver:stackdriver.1] private_key is not defined, fetching it from metadata server
[2021/11/10 02:55:05] [ info] [output:stackdriver:stackdriver.0] worker #0 started
[2021/11/10 02:55:05] [ info] [output:stackdriver:stackdriver.0] worker #1 started
[2021/11/10 02:55:05] [ info] [output:stackdriver:stackdriver.0] worker #2 started
[2021/11/10 02:55:05] [ info] [output:stackdriver:stackdriver.0] worker #3 started
[2021/11/10 02:55:05] [ info] [output:stackdriver:stackdriver.0] worker #4 started
[2021/11/10 02:55:05] [ info] [output:stackdriver:stackdriver.0] worker #5 started
[2021/11/10 02:55:05] [ info] [output:stackdriver:stackdriver.0] worker #6 started
[2021/11/10 02:55:05] [ info] [output:stackdriver:stackdriver.0] worker #7 started
[2021/11/10 02:55:05] [ info] [output:stackdriver:stackdriver.1] worker #0 started
[2021/11/10 02:55:05] [ info] [output:stackdriver:stackdriver.1] worker #1 started
[2021/11/10 02:55:05] [ info] [output:stackdriver:stackdriver.1] worker #2 started
[2021/11/10 02:55:05] [ info] [output:stackdriver:stackdriver.1] worker #3 started
[2021/11/10 02:55:05] [ info] [output:stackdriver:stackdriver.1] worker #4 started
[2021/11/10 02:55:05] [ info] [output:stackdriver:stackdriver.1] worker #5 started
[2021/11/10 02:55:05] [ info] [output:stackdriver:stackdriver.1] worker #6 started
[2021/11/10 02:55:05] [ info] [output:stackdriver:stackdriver.1] worker #7 started
[2021/11/10 02:55:05] [ info] [http_server] listen iface=0.0.0.0 tcp_port=2020
[2021/11/10 02:55:05] [ info] [sp] stream processor started
[2021/11/10 02:55:17] [error] [upstream] connection #1992 to logging.googleapis.com:443 timed out after 10 seconds
[2021/11/10 02:56:02] [error] [upstream] connection #792 to logging.googleapis.com:443 timed out after 10 seconds
[2021/11/10 02:56:06] [error] [upstream] connection #2004 to logging.googleapis.com:443 timed out after 10 seconds
[2021/11/10 02:56:16] [error] [tls] C:\work\submodules\fluent-bit\src\tls\mbedtls.c:380 NET - Sending information through the socket failed
[2021/11/10 02:56:16] [error] [tls] C:\work\submodules\fluent-bit\src\tls\mbedtls.c:380 NET - Sending information through the socket failed
[2021/11/10 02:56:16] [ warn] [engine] failed to flush chunk '4880-1636512966.889335000.flb', retry in 8 seconds: task_id=3, input=tail.1 > output=stackdriver.1 (out_id=1)
[2021/11/10 02:56:16] [ warn] [engine] failed to flush chunk '4880-1636512966.212639200.flb', retry in 6 seconds: task_id=4, input=emitter_for_rewrite_tag.6 > output=stackdriver.0 (out_id=0)
[2021/11/10 02:57:09] [error] [upstream] connection #2112 to logging.googleapis.com:443 timed out after 10 seconds
[2021/11/10 02:57:18] [error] [upstream] connection #2124 to logging.googleapis.com:443 timed out after 10 seconds
[2021/11/10 02:57:20] [error] [upstream] connection #2168 to logging.googleapis.com:443 timed out after 10 seconds
[2021/11/10 02:57:21] [error] [upstream] connection #2172 to logging.googleapis.com:443 timed out after 10 seconds
[2021/11/10 02:57:26] [error] [upstream] connection #2188 to logging.googleapis.com:443 timed out after 10 seconds
[2021/11/10 02:57:32] [error] [upstream] connection #2124 to logging.googleapis.com:443 timed out after 10 seconds
[2021/11/10 02:57:35] [error] [upstream] connection #2200 to logging.googleapis.com:443 timed out after 10 seconds
[2021/11/10 02:57:42] [error] [upstream] connection #2176 to logging.googleapis.com:443 timed out after 10 seconds
[2021/11/10 02:57:44] [error] [upstream] connection #2188 to logging.googleapis.com:443 timed out after 10 seconds
[2021/11/10 02:57:47] [error] [upstream] connection #2124 to logging.googleapis.com:443 timed out after 10 seconds
[2021/11/10 02:57:47] [error] [upstream] connection #2192 to logging.googleapis.com:443 timed out after 10 seconds
[2021/11/10 02:57:56] [error] [upstream] connection #2152 to logging.googleapis.com:443 timed out after 10 seconds
[2021/11/10 02:58:02] [error] [tls] C:\work\submodules\fluent-bit\src\tls\mbedtls.c:380 NET - Sending information through the socket failed
[2021/11/10 02:58:02] [ warn] [engine] failed to flush chunk '4880-1636513074.212264300.flb', retry in 9 seconds: task_id=15, input=emitter_for_rewrite_tag.6 > output=stackdriver.0 (out_id=0)
[2021/11/10 02:58:48] [error] [upstream] connection #2236 to logging.googleapis.com:443 timed out after 10 seconds
[2021/11/10 02:59:17] [error] [upstream] connection #2152 to logging.googleapis.com:443 timed out after 10 seconds
[2021/11/10 02:59:56] [ warn] [engine] service will stop in 5 seconds
[2021/11/10 03:00:01] [ info] [engine] service stopped
[2021/11/10 03:00:01] [ warn] [engine] shutdown delayed, grace period has finished but some tasks are still running.
[2021/11/10 03:00:01] [ info] [task] winlog/winlog.0 has 0 pending task(s):
[2021/11/10 03:00:01] [ info] [task] tail/tail.1 has 6 pending task(s):
[2021/11/10 03:00:01] [ info] [task] task_id=0 still running on route(s): stackdriver/stackdriver.1
[2021/11/10 03:00:01] [ info] [task] task_id=5 still running on route(s): stackdriver/stackdriver.1
[2021/11/10 03:00:01] [ info] [task] task_id=8 still running on route(s): stackdriver/stackdriver.1
[2021/11/10 03:00:01] [ info] [task] task_id=11 still running on route(s): stackdriver/stackdriver.1
[2021/11/10 03:00:01] [ info] [task] task_id=12 still running on route(s): stackdriver/stackdriver.1
[2021/11/10 03:00:01] [ info] [task] task_id=14 still running on route(s): stackdriver/stackdriver.1
[2021/11/10 03:00:01] [ info] [task] storage_backlog/storage_backlog.2 has 0 pending task(s):
[2021/11/10 03:00:01] [ info] [task] emitter/emitter_for_rewrite_tag.6 has 11 pending task(s):
[2021/11/10 03:00:01] [ info] [task] task_id=1 still running on route(s): stackdriver/stackdriver.0
[2021/11/10 03:00:01] [ info] [task] task_id=2 still running on route(s): stackdriver/stackdriver.0
[2021/11/10 03:00:01] [ info] [task] task_id=3 still running on route(s): stackdriver/stackdriver.0
[2021/11/10 03:00:01] [ info] [task] task_id=4 still running on route(s): stackdriver/stackdriver.0
[2021/11/10 03:00:01] [ info] [task] task_id=6 still running on route(s): stackdriver/stackdriver.0
[2021/11/10 03:00:01] [ info] [task] task_id=7 still running on route(s): stackdriver/stackdriver.0
[2021/11/10 03:00:01] [ info] [task] task_id=9 still running on route(s): stackdriver/stackdriver.0
[2021/11/10 03:00:01] [ info] [task] task_id=10 still running on route(s): stackdriver/stackdriver.0
[2021/11/10 03:00:01] [ info] [task] task_id=13 still running on route(s): stackdriver/stackdriver.0
[2021/11/10 03:00:01] [ info] [task] task_id=15 still running on route(s): stackdriver/stackdriver.0
[2021/11/10 03:00:01] [ info] [task] task_id=16 still running on route(s): stackdriver/stackdriver.0
[2021/11/10 03:00:01] [ warn] [engine] service will stop in 5 seconds
[2021/11/10 03:00:06] [ info] [engine] service stopped
[2021/11/10 03:00:06] [ warn] [engine] shutdown delayed, grace period has finished but some tasks are still running.
[2021/11/10 03:00:06] [ info] [task] winlog/winlog.0 has 0 pending task(s):
[2021/11/10 03:00:06] [ info] [task] tail/tail.1 has 6 pending task(s):
[2021/11/10 03:00:06] [ info] [task] task_id=0 still running on route(s): stackdriver/stackdriver.1
[2021/11/10 03:00:06] [ info] [task] task_id=5 still running on route(s): stackdriver/stackdriver.1
[2021/11/10 03:00:06] [ info] [task] task_id=8 still running on route(s): stackdriver/stackdriver.1
[2021/11/10 03:00:06] [ info] [task] task_id=11 still running on route(s): stackdriver/stackdriver.1
[2021/11/10 03:00:06] [ info] [task] task_id=12 still running on route(s): stackdriver/stackdriver.1
[2021/11/10 03:00:06] [ info] [task] task_id=14 still running on route(s): stackdriver/stackdriver.1
[2021/11/10 03:00:06] [ info] [task] storage_backlog/storage_backlog.2 has 0 pending task(s):
[2021/11/10 03:00:06] [ info] [task] emitter/emitter_for_rewrite_tag.6 has 11 pending task(s):
[2021/11/10 03:00:06] [ info] [task] task_id=1 still running on route(s): stackdriver/stackdriver.0
[2021/11/10 03:00:06] [ info] [task] task_id=2 still running on route(s): stackdriver/stackdriver.0
[2021/11/10 03:00:06] [ info] [task] task_id=3 still running on route(s): stackdriver/stackdriver.0
[2021/11/10 03:00:06] [ info] [task] task_id=4 still running on route(s): stackdriver/stackdriver.0
[2021/11/10 03:00:06] [ info] [task] task_id=6 still running on route(s): stackdriver/stackdriver.0
[2021/11/10 03:00:06] [ info] [task] task_id=7 still running on route(s): stackdriver/stackdriver.0
[2021/11/10 03:00:06] [ info] [task] task_id=9 still running on route(s): stackdriver/stackdriver.0
[2021/11/10 03:00:06] [ info] [task] task_id=10 still running on route(s): stackdriver/stackdriver.0
[2021/11/10 03:00:06] [ info] [task] task_id=13 still running on route(s): stackdriver/stackdriver.0
[2021/11/10 03:00:06] [ info] [task] task_id=15 still running on route(s): stackdriver/stackdriver.0
[2021/11/10 03:00:06] [ info] [task] task_id=16 still running on route(s): stackdriver/stackdriver.0
[2021/11/10 03:00:06] [ warn] [engine] service will stop in 5 seconds
[2021/11/10 03:00:11] [ info] [engine] service stopped
[2021/11/10 03:00:11] [ warn] [engine] shutdown delayed, grace period has finished but some tasks are still running.
[2021/11/10 03:00:11] [ info] [task] winlog/winlog.0 has 0 pending task(s):
[2021/11/10 03:00:11] [ info] [task] tail/tail.1 has 6 pending task(s):
[2021/11/10 03:00:11] [ info] [task] task_id=0 still running on route(s): stackdriver/stackdriver.1
[2021/11/10 03:00:11] [ info] [task] task_id=5 still running on route(s): stackdriver/stackdriver.1
[2021/11/10 03:00:11] [ info] [task] task_id=8 still running on route(s): stackdriver/stackdriver.1
[2021/11/10 03:00:11] [ info] [task] task_id=11 still running on route(s): stackdriver/stackdriver.1
[2021/11/10 03:00:11] [ info] [task] task_id=12 still running on route(s): stackdriver/stackdriver.1
[2021/11/10 03:00:11] [ info] [task] task_id=14 still running on route(s): stackdriver/stackdriver.1
[2021/11/10 03:00:11] [ info] [task] storage_backlog/storage_backlog.2 has 0 pending task(s):
[2021/11/10 03:00:11] [ info] [task] emitter/emitter_for_rewrite_tag.6 has 11 pending task(s):
[2021/11/10 03:00:11] [ info] [task] task_id=1 still running on route(s): stackdriver/stackdriver.0
[2021/11/10 03:00:11] [ info] [task] task_id=2 still running on route(s): stackdriver/stackdriver.0
[2021/11/10 03:00:11] [ info] [task] task_id=3 still running on route(s): stackdriver/stackdriver.0
[2021/11/10 03:00:11] [ info] [task] task_id=4 still running on route(s): stackdriver/stackdriver.0
[2021/11/10 03:00:11] [ info] [task] task_id=6 still running on route(s): stackdriver/stackdriver.0
[2021/11/10 03:00:11] [ info] [task] task_id=7 still running on route(s): stackdriver/stackdriver.0
[2021/11/10 03:00:11] [ info] [task] task_id=9 still running on route(s): stackdriver/stackdriver.0
[2021/11/10 03:00:11] [ info] [task] task_id=10 still running on route(s): stackdriver/stackdriver.0
[2021/11/10 03:00:11] [ info] [task] task_id=13 still running on route(s): stackdriver/stackdriver.0
[2021/11/10 03:00:11] [ info] [task] task_id=15 still running on route(s): stackdriver/stackdriver.0
[2021/11/10 03:00:11] [ info] [task] task_id=16 still running on route(s): stackdriver/stackdriver.0
[2021/11/10 03:00:11] [ warn] [engine] service will stop in 5 seconds
[2021/11/10 03:00:16] [ info] [engine] service stopped
[2021/11/10 03:00:16] [ warn] [engine] shutdown delayed, grace period has finished but some tasks are still running.
[2021/11/10 03:00:16] [ info] [task] winlog/winlog.0 has 0 pending task(s):
[2021/11/10 03:00:16] [ info] [task] tail/tail.1 has 6 pending task(s):
[2021/11/10 03:00:16] [ info] [task] task_id=0 still running on route(s): stackdriver/stackdriver.1
[2021/11/10 03:00:16] [ info] [task] task_id=5 still running on route(s): stackdriver/stackdriver.1
[2021/11/10 03:00:16] [ info] [task] task_id=8 still running on route(s): stackdriver/stackdriver.1
[2021/11/10 03:00:16] [ info] [task] task_id=11 still running on route(s): stackdriver/stackdriver.1
[2021/11/10 03:00:16] [ info] [task] task_id=12 still running on route(s): stackdriver/stackdriver.1
[2021/11/10 03:00:16] [ info] [task] task_id=14 still running on route(s): stackdriver/stackdriver.1
[2021/11/10 03:00:16] [ info] [task] storage_backlog/storage_backlog.2 has 0 pending task(s):
[2021/11/10 03:00:16] [ info] [task] emitter/emitter_for_rewrite_tag.6 has 11 pending task(s):
[2021/11/10 03:00:16] [ info] [task] task_id=1 still running on route(s): stackdriver/stackdriver.0
[2021/11/10 03:00:16] [ info] [task] task_id=2 still running on route(s): stackdriver/stackdriver.0
[2021/11/10 03:00:16] [ info] [task] task_id=3 still running on route(s): stackdriver/stackdriver.0
[2021/11/10 03:00:16] [ info] [task] task_id=4 still running on route(s): stackdriver/stackdriver.0
[2021/11/10 03:00:16] [ info] [task] task_id=6 still running on route(s): stackdriver/stackdriver.0
[2021/11/10 03:00:16] [ info] [task] task_id=7 still running on route(s): stackdriver/stackdriver.0
[2021/11/10 03:00:16] [ info] [task] task_id=9 still running on route(s): stackdriver/stackdriver.0
[2021/11/10 03:00:16] [ info] [task] task_id=10 still running on route(s): stackdriver/stackdriver.0
[2021/11/10 03:00:16] [ info] [task] task_id=13 still running on route(s): stackdriver/stackdriver.0
[2021/11/10 03:00:16] [ info] [task] task_id=15 still running on route(s): stackdriver/stackdriver.0
[2021/11/10 03:00:16] [ info] [task] task_id=16 still running on route(s): stackdriver/stackdriver.0
[2021/11/10 03:00:16] [ warn] [engine] service will stop in 5 seconds
[2021/11/10 03:00:21] [ info] [engine] service stopped
[2021/11/10 03:00:21] [ warn] [engine] shutdown delayed, grace period has finished but some tasks are still running.
[2021/11/10 03:00:21] [ info] [task] winlog/winlog.0 has 0 pending task(s):
[2021/11/10 03:00:21] [ info] [task] tail/tail.1 has 6 pending task(s):
[2021/11/10 03:00:21] [ info] [task] task_id=0 still running on route(s): stackdriver/stackdriver.1
[2021/11/10 03:00:21] [ info] [task] task_id=5 still running on route(s): stackdriver/stackdriver.1
[2021/11/10 03:00:21] [ info] [task] task_id=8 still running on route(s): stackdriver/stackdriver.1
[2021/11/10 03:00:21] [ info] [task] task_id=11 still running on route(s): stackdriver/stackdriver.1
[2021/11/10 03:00:21] [ info] [task] task_id=12 still running on route(s): stackdriver/stackdriver.1
[2021/11/10 03:00:21] [ info] [task] task_id=14 still running on route(s): stackdriver/stackdriver.1
[2021/11/10 03:00:21] [ info] [task] storage_backlog/storage_backlog.2 has 0 pending task(s):
[2021/11/10 03:00:21] [ info] [task] emitter/emitter_for_rewrite_tag.6 has 11 pending task(s):
[2021/11/10 03:00:21] [ info] [task] task_id=1 still running on route(s): stackdriver/stackdriver.0
[2021/11/10 03:00:21] [ info] [task] task_id=2 still running on route(s): stackdriver/stackdriver.0
[2021/11/10 03:00:21] [ info] [task] task_id=3 still running on route(s): stackdriver/stackdriver.0
[2021/11/10 03:00:21] [ info] [task] task_id=4 still running on route(s): stackdriver/stackdriver.0
[2021/11/10 03:00:21] [ info] [task] task_id=6 still running on route(s): stackdriver/stackdriver.0
[2021/11/10 03:00:21] [ info] [task] task_id=7 still running on route(s): stackdriver/stackdriver.0
[2021/11/10 03:00:21] [ info] [task] task_id=9 still running on route(s): stackdriver/stackdriver.0
[2021/11/10 03:00:21] [ info] [task] task_id=10 still running on route(s): stackdriver/stackdriver.0
[2021/11/10 03:00:21] [ info] [task] task_id=13 still running on route(s): stackdriver/stackdriver.0
[2021/11/10 03:00:21] [ info] [task] task_id=15 still running on route(s): stackdriver/stackdriver.0
[2021/11/10 03:00:21] [ info] [task] task_id=16 still running on route(s): stackdriver/stackdriver.0
[2021/11/10 03:00:21] [ warn] [engine] service will stop in 5 seconds
[2021/11/10 03:00:26] [ info] [engine] service stopped
[2021/11/10 03:00:26] [ warn] [engine] shutdown delayed, grace period has finished but some tasks are still running.
[2021/11/10 03:00:26] [ info] [task] winlog/winlog.0 has 0 pending task(s):
[2021/11/10 03:00:26] [ info] [task] tail/tail.1 has 6 pending task(s):
[2021/11/10 03:00:26] [ info] [task] task_id=0 still running on route(s): stackdriver/stackdriver.1
[2021/11/10 03:00:26] [ info] [task] task_id=5 still running on route(s): stackdriver/stackdriver.1
[2021/11/10 03:00:26] [ info] [task] task_id=8 still running on route(s): stackdriver/stackdriver.1
[2021/11/10 03:00:26] [ info] [task] task_id=11 still running on route(s): stackdriver/stackdriver.1
[2021/11/10 03:00:26] [ info] [task] task_id=12 still running on route(s): stackdriver/stackdriver.1
[2021/11/10 03:00:26] [ info] [task] task_id=14 still running on route(s): stackdriver/stackdriver.1
[2021/11/10 03:00:26] [ info] [task] storage_backlog/storage_backlog.2 has 0 pending task(s):
[2021/11/10 03:00:26] [ info] [task] emitter/emitter_for_rewrite_tag.6 has 11 pending task(s):
[2021/11/10 03:00:26] [ info] [task] task_id=1 still running on route(s): stackdriver/stackdriver.0
[2021/11/10 03:00:26] [ info] [task] task_id=2 still running on route(s): stackdriver/stackdriver.0
[2021/11/10 03:00:26] [ info] [task] task_id=3 still running on route(s): stackdriver/stackdriver.0
[2021/11/10 03:00:26] [ info] [task] task_id=4 still running on route(s): stackdriver/stackdriver.0
[2021/11/10 03:00:26] [ info] [task] task_id=6 still running on route(s): stackdriver/stackdriver.0
[2021/11/10 03:00:26] [ info] [task] task_id=7 still running on route(s): stackdriver/stackdriver.0
[2021/11/10 03:00:26] [ info] [task] task_id=9 still running on route(s): stackdriver/stackdriver.0
[2021/11/10 03:00:26] [ info] [task] task_id=10 still running on route(s): stackdriver/stackdriver.0
[2021/11/10 03:00:26] [ info] [task] task_id=13 still running on route(s): stackdriver/stackdriver.0
[2021/11/10 03:00:26] [ info] [task] task_id=15 still running on route(s): stackdriver/stackdriver.0
[2021/11/10 03:00:26] [ info] [task] task_id=16 still running on route(s): stackdriver/stackdriver.0
[2021/11/10 03:00:26] [ warn] [engine] service will stop in 5 seconds
[2021/11/10 03:00:31] [ info] [engine] service stopped
[2021/11/10 03:00:31] [ warn] [engine] shutdown delayed, grace period has finished but some tasks are still running.
[2021/11/10 03:00:31] [ info] [task] winlog/winlog.0 has 0 pending task(s):
[2021/11/10 03:00:31] [ info] [task] tail/tail.1 has 6 pending task(s):
[2021/11/10 03:00:31] [ info] [task] task_id=0 still running on route(s): stackdriver/stackdriver.1
[2021/11/10 03:00:31] [ info] [task] task_id=5 still running on route(s): stackdriver/stackdriver.1
[2021/11/10 03:00:31] [ info] [task] task_id=8 still running on route(s): stackdriver/stackdriver.1
[2021/11/10 03:00:31] [ info] [task] task_id=11 still running on route(s): stackdriver/stackdriver.1
[2021/11/10 03:00:31] [ info] [task] task_id=12 still running on route(s): stackdriver/stackdriver.1
[2021/11/10 03:00:31] [ info] [task] task_id=14 still running on route(s): stackdriver/stackdriver.1
[2021/11/10 03:00:31] [ info] [task] storage_backlog/storage_backlog.2 has 0 pending task(s):
[2021/11/10 03:00:31] [ info] [task] emitter/emitter_for_rewrite_tag.6 has 11 pending task(s):
[2021/11/10 03:00:31] [ info] [task] task_id=1 still running on route(s): stackdriver/stackdriver.0
[2021/11/10 03:00:31] [ info] [task] task_id=2 still running on route(s): stackdriver/stackdriver.0
[2021/11/10 03:00:31] [ info] [task] task_id=3 still running on route(s): stackdriver/stackdriver.0
[2021/11/10 03:00:31] [ info] [task] task_id=4 still running on route(s): stackdriver/stackdriver.0
[2021/11/10 03:00:31] [ info] [task] task_id=6 still running on route(s): stackdriver/stackdriver.0
[2021/11/10 03:00:31] [ info] [task] task_id=7 still running on route(s): stackdriver/stackdriver.0
[2021/11/10 03:00:31] [ info] [task] task_id=9 still running on route(s): stackdriver/stackdriver.0
[2021/11/10 03:00:31] [ info] [task] task_id=10 still running on route(s): stackdriver/stackdriver.0
[2021/11/10 03:00:31] [ info] [task] task_id=13 still running on route(s): stackdriver/stackdriver.0
[2021/11/10 03:00:31] [ info] [task] task_id=15 still running on route(s): stackdriver/stackdriver.0
[2021/11/10 03:00:31] [ info] [task] task_id=16 still running on route(s): stackdriver/stackdriver.0
[2021/11/10 03:00:31] [ warn] [engine] service will stop in 5 seconds
[2021/11/10 03:00:36] [ info] [engine] service stopped
[2021/11/10 03:00:36] [ warn] [engine] shutdown delayed, grace period has finished but some tasks are still running.
[2021/11/10 03:00:36] [ info] [task] winlog/winlog.0 has 0 pending task(s):
[2021/11/10 03:00:36] [ info] [task] tail/tail.1 has 6 pending task(s):
[2021/11/10 03:00:36] [ info] [task] task_id=0 still running on route(s): stackdriver/stackdriver.1
[2021/11/10 03:00:36] [ info] [task] task_id=5 still running on route(s): stackdriver/stackdriver.1
[2021/11/10 03:00:36] [ info] [task] task_id=8 still running on route(s): stackdriver/stackdriver.1
[2021/11/10 03:00:36] [ info] [task] task_id=11 still running on route(s): stackdriver/stackdriver.1
[2021/11/10 03:00:36] [ info] [task] task_id=12 still running on route(s): stackdriver/stackdriver.1
[2021/11/10 03:00:36] [ info] [task] task_id=14 still running on route(s): stackdriver/stackdriver.1
[2021/11/10 03:00:36] [ info] [task] storage_backlog/storage_backlog.2 has 0 pending task(s):
[2021/11/10 03:00:36] [ info] [task] emitter/emitter_for_rewrite_tag.6 has 11 pending task(s):
[2021/11/10 03:00:36] [ info] [task] task_id=1 still running on route(s): stackdriver/stackdriver.0
[2021/11/10 03:00:36] [ info] [task] task_id=2 still running on route(s): stackdriver/stackdriver.0
[2021/11/10 03:00:36] [ info] [task] task_id=3 still running on route(s): stackdriver/stackdriver.0
[2021/11/10 03:00:36] [ info] [task] task_id=4 still running on route(s): stackdriver/stackdriver.0
[2021/11/10 03:00:36] [ info] [task] task_id=6 still running on route(s): stackdriver/stackdriver.0
[2021/11/10 03:00:36] [ info] [task] task_id=7 still running on route(s): stackdriver/stackdriver.0
[2021/11/10 03:00:36] [ info] [task] task_id=9 still running on route(s): stackdriver/stackdriver.0
[2021/11/10 03:00:36] [ info] [task] task_id=10 still running on route(s): stackdriver/stackdriver.0
[2021/11/10 03:00:36] [ info] [task] task_id=13 still running on route(s): stackdriver/stackdriver.0
[2021/11/10 03:00:36] [ info] [task] task_id=15 still running on route(s): stackdriver/stackdriver.0
[2021/11/10 03:00:36] [ info] [task] task_id=16 still running on route(s): stackdriver/stackdriver.0
[2021/11/10 03:00:36] [ warn] [engine] service will stop in 5 seconds
[2021/11/10 03:00:41] [ info] [engine] service stopped
[2021/11/10 03:00:41] [ warn] [engine] shutdown delayed, grace period has finished but some tasks are still running.
// After this is just hangs. Forcefully killing the service does not work too.
PS C:\Windows\system32> ping logging.googleapis.com
Pinging logging.googleapis.com [172.217.212.95] with 32 bytes of data:
Reply from 172.217.212.95: bytes=32 time<1ms TTL=115
Reply from 172.217.212.95: bytes=32 time=1ms TTL=115
Reply from 172.217.212.95: bytes=32 time<1ms TTL=118
Reply from 172.217.212.95: bytes=32 time=1ms TTL=115
Ping statistics for 172.217.212.95:
Packets: Sent = 4, Received = 4, Lost = 0 (0% loss),
Approximate round trip times in milli-seconds:
Minimum = 0ms, Maximum = 1ms, Average = 0ms
Enabled debug logging and attached the full log: windows-2016-debug-logs.txt
platform: Centos 7 fluent-bit: 1.8.9
I have the same issue as below
I have to kill -9 PID
I'm still seeing this issue after the changes in https://github.com/GoogleCloudPlatform/ops-agent/pull/273 is applied.
Internal Build ID: 95dcfa4d-84ce-4eae-956a-2395d5e57854
Internal Test ID: 2c2d7c76-e941-42ea-ad67-254ebaf52f22
[2021/11/18 06:28:40] [ info] [engine] started (pid=3908)
[2021/11/18 06:28:40] [ info] [storage] version=1.1.5, initializing...
[2021/11/18 06:28:40] [ info] [storage] root path 'C:\ProgramData\Google\Cloud Operations\Ops Agent\run\buffers'
[2021/11/18 06:28:40] [ info] [storage] normal synchronization mode, checksum enabled, max_chunks_up=128
[2021/11/18 06:28:40] [ info] [storage] backlog input plugin: storage_backlog.2
[2021/11/18 06:28:40] [ info] [cmetrics] version=0.2.2
[2021/11/18 06:28:40] [ info] [input:storage_backlog:storage_backlog.2] queue memory limit: 47.7M
[2021/11/18 06:28:40] [ info] [output:stackdriver:stackdriver.0] metadata_server set to http://metadata.google.internal
[2021/11/18 06:28:40] [ warn] [output:stackdriver:stackdriver.0] client_email is not defined, using a default one
[2021/11/18 06:28:40] [ warn] [output:stackdriver:stackdriver.0] private_key is not defined, fetching it from metadata server
[2021/11/18 06:28:40] [ info] [output:stackdriver:stackdriver.1] metadata_server set to http://metadata.google.internal
[2021/11/18 06:28:40] [ info] [output:stackdriver:stackdriver.0] worker #6 started
[2021/11/18 06:28:40] [ warn] [output:stackdriver:stackdriver.1] client_email is not defined, using a default one
[2021/11/18 06:28:40] [ info] [output:stackdriver:stackdriver.0] worker #0 started
[2021/11/18 06:28:40] [ info] [output:stackdriver:stackdriver.0] worker #1 started
[2021/11/18 06:28:40] [ info] [output:stackdriver:stackdriver.0] worker #2 started
[2021/11/18 06:28:40] [ info] [output:stackdriver:stackdriver.0] worker #3 started
[2021/11/18 06:28:40] [ info] [output:stackdriver:stackdriver.0] worker #4 started
[2021/11/18 06:28:40] [ info] [output:stackdriver:stackdriver.0] worker #7 started
[2021/11/18 06:28:40] [ warn] [output:stackdriver:stackdriver.1] private_key is not defined, fetching it from metadata server
[2021/11/18 06:28:40] [ info] [output:stackdriver:stackdriver.0] worker #5 started
[2021/11/18 06:28:40] [ info] [http_server] listen iface=0.0.0.0 tcp_port=2020
[2021/11/18 06:28:40] [ info] [output:stackdriver:stackdriver.1] worker #0 started
[2021/11/18 06:28:40] [ info] [output:stackdriver:stackdriver.1] worker #1 started
[2021/11/18 06:28:40] [ info] [output:stackdriver:stackdriver.1] worker #2 started
[2021/11/18 06:28:40] [ info] [output:stackdriver:stackdriver.1] worker #3 started
[2021/11/18 06:28:40] [ info] [output:stackdriver:stackdriver.1] worker #4 started
[2021/11/18 06:28:40] [ info] [output:stackdriver:stackdriver.1] worker #5 started
[2021/11/18 06:28:40] [ info] [output:stackdriver:stackdriver.1] worker #6 started
[2021/11/18 06:28:40] [ info] [output:stackdriver:stackdriver.1] worker #7 started
[2021/11/18 06:28:40] [ info] [sp] stream processor started
[2021/11/18 06:28:42] [ warn] [engine] failed to flush chunk '3908-1637216920.721213400.flb', retry in 8 seconds: task_id=0, input=tail.1 > output=stackdriver.1 (out_id=1)
[2021/11/18 06:28:54] [error] [upstream] connection #1972 to logging.googleapis.com:443 timed out after 10 seconds
[2021/11/18 06:29:00] [error] [upstream] connection #1964 to logging.googleapis.com:443 timed out after 10 seconds
[2021/11/18 06:29:00] [error] [upstream] connection #2036 to logging.googleapis.com:443 timed out after 10 seconds
[2021/11/18 06:29:12] [ warn] [engine] service will stop in 5 seconds
[2021/11/18 06:29:17] [ info] [engine] service stopped
[2021/11/18 06:29:17] [ warn] [engine] shutdown delayed, grace period has finished but some tasks are still running.
[2021/11/18 06:29:17] [ info] [task] winlog/winlog.0 has 1 pending task(s):
[2021/11/18 06:29:17] [ info] [task] task_id=2 still running on route(s): stackdriver/stackdriver.0
[2021/11/18 06:29:17] [ info] [task] tail/tail.1 has 2 pending task(s):
[2021/11/18 06:29:17] [ info] [task] task_id=0 still running on route(s): stackdriver/stackdriver.1
[2021/11/18 06:29:17] [ info] [task] task_id=1 still running on route(s): stackdriver/stackdriver.1
[2021/11/18 06:29:17] [ info] [task] storage_backlog/storage_backlog.2 has 0 pending task(s):
[2021/11/18 06:29:17] [ warn] [engine] service will stop in 5 seconds
[2021/11/18 06:29:22] [ info] [engine] service stopped
[2021/11/18 06:29:22] [ warn] [engine] shutdown delayed, grace period has finished but some tasks are still running.
[2021/11/18 06:29:22] [ info] [task] winlog/winlog.0 has 1 pending task(s):
[2021/11/18 06:29:22] [ info] [task] task_id=2 still running on route(s): stackdriver/stackdriver.0
[2021/11/18 06:29:22] [ info] [task] tail/tail.1 has 2 pending task(s):
[2021/11/18 06:29:22] [ info] [task] task_id=0 still running on route(s): stackdriver/stackdriver.1
[2021/11/18 06:29:22] [ info] [task] task_id=1 still running on route(s): stackdriver/stackdriver.1
[2021/11/18 06:29:22] [ info] [task] storage_backlog/storage_backlog.2 has 0 pending task(s):
[2021/11/18 06:29:22] [ warn] [engine] service will stop in 5 seconds
[2021/11/18 06:29:27] [ info] [engine] service stopped
[2021/11/18 06:29:27] [ warn] [engine] shutdown delayed, grace period has finished but some tasks are still running.
[2021/11/18 06:29:27] [ info] [task] winlog/winlog.0 has 1 pending task(s):
[2021/11/18 06:29:27] [ info] [task] task_id=2 still running on route(s): stackdriver/stackdriver.0
[2021/11/18 06:29:27] [ info] [task] tail/tail.1 has 2 pending task(s):
[2021/11/18 06:29:27] [ info] [task] task_id=0 still running on route(s): stackdriver/stackdriver.1
[2021/11/18 06:29:27] [ info] [task] task_id=1 still running on route(s): stackdriver/stackdriver.1
[2021/11/18 06:29:27] [ info] [task] storage_backlog/storage_backlog.2 has 0 pending task(s):
[2021/11/18 06:29:27] [ warn] [engine] service will stop in 5 seconds
[2021/11/18 06:29:32] [ info] [engine] service stopped
[2021/11/18 06:29:32] [ warn] [engine] shutdown delayed, grace period has finished but some tasks are still running.
[2021/11/18 06:29:32] [ info] [task] winlog/winlog.0 has 1 pending task(s):
[2021/11/18 06:29:32] [ info] [task] task_id=2 still running on route(s): stackdriver/stackdriver.0
[2021/11/18 06:29:32] [ info] [task] tail/tail.1 has 2 pending task(s):
[2021/11/18 06:29:32] [ info] [task] task_id=0 still running on route(s): stackdriver/stackdriver.1
[2021/11/18 06:29:32] [ info] [task] task_id=1 still running on route(s): stackdriver/stackdriver.1
[2021/11/18 06:29:32] [ info] [task] storage_backlog/storage_backlog.2 has 0 pending task(s):
[2021/11/18 06:29:32] [ warn] [engine] service will stop in 5 seconds
[2021/11/18 06:29:37] [ info] [engine] service stopped
[2021/11/18 06:29:37] [ warn] [engine] shutdown delayed, grace period has finished but some tasks are still running.
[2021/11/18 06:29:37] [ info] [task] winlog/winlog.0 has 1 pending task(s):
[2021/11/18 06:29:37] [ info] [task] task_id=2 still running on route(s): stackdriver/stackdriver.0
[2021/11/18 06:29:37] [ info] [task] tail/tail.1 has 2 pending task(s):
[2021/11/18 06:29:37] [ info] [task] task_id=0 still running on route(s): stackdriver/stackdriver.1
[2021/11/18 06:29:37] [ info] [task] task_id=1 still running on route(s): stackdriver/stackdriver.1
[2021/11/18 06:29:37] [ info] [task] storage_backlog/storage_backlog.2 has 0 pending task(s):
[2021/11/18 06:29:37] [ warn] [engine] service will stop in 5 seconds
[2021/11/18 06:29:42] [ info] [engine] service stopped
[2021/11/18 06:29:42] [ warn] [engine] shutdown delayed, grace period has finished but some tasks are still running.
[2021/11/18 06:29:42] [ info] [task] winlog/winlog.0 has 1 pending task(s):
[2021/11/18 06:29:42] [ info] [task] task_id=2 still running on route(s): stackdriver/stackdriver.0
[2021/11/18 06:29:42] [ info] [task] tail/tail.1 has 2 pending task(s):
[2021/11/18 06:29:42] [ info] [task] task_id=0 still running on route(s): stackdriver/stackdriver.1
[2021/11/18 06:29:42] [ info] [task] task_id=1 still running on route(s): stackdriver/stackdriver.1
[2021/11/18 06:29:42] [ info] [task] storage_backlog/storage_backlog.2 has 0 pending task(s):
[2021/11/18 06:29:42] [ warn] [engine] service will stop in 5 seconds
[2021/11/18 06:29:47] [ info] [engine] service stopped
[2021/11/18 06:29:47] [ warn] [engine] shutdown delayed, grace period has finished but some tasks are still running.
[2021/11/18 06:29:47] [ info] [task] winlog/winlog.0 has 1 pending task(s):
[2021/11/18 06:29:47] [ info] [task] task_id=2 still running on route(s): stackdriver/stackdriver.0
[2021/11/18 06:29:47] [ info] [task] tail/tail.1 has 2 pending task(s):
[2021/11/18 06:29:47] [ info] [task] task_id=0 still running on route(s): stackdriver/stackdriver.1
[2021/11/18 06:29:47] [ info] [task] task_id=1 still running on route(s): stackdriver/stackdriver.1
[2021/11/18 06:29:47] [ info] [task] storage_backlog/storage_backlog.2 has 0 pending task(s):
[2021/11/18 06:29:47] [ warn] [engine] service will stop in 5 seconds
[2021/11/18 06:29:52] [ info] [engine] service stopped
[2021/11/18 06:29:52] [ warn] [engine] shutdown delayed, grace period has finished but some tasks are still running.
[2021/11/18 06:29:52] [ info] [task] winlog/winlog.0 has 1 pending task(s):
[2021/11/18 06:29:52] [ info] [task] task_id=2 still running on route(s): stackdriver/stackdriver.0
[2021/11/18 06:29:52] [ info] [task] tail/tail.1 has 2 pending task(s):
[2021/11/18 06:29:52] [ info] [task] task_id=0 still running on route(s): stackdriver/stackdriver.1
[2021/11/18 06:29:52] [ info] [task] task_id=1 still running on route(s): stackdriver/stackdriver.1
[2021/11/18 06:29:52] [ info] [task] storage_backlog/storage_backlog.2 has 0 pending task(s):
[2021/11/18 06:29:52] [ warn] [engine] service will stop in 5 seconds
I'm trying to reproduce it manually with debug log level turned on.
Reproduced manually:
gcloud compute instances create \
--image-project windows-cloud \
--image windows-server-2016-dc-v20211012 \
windows-2016-test-1118
gcloud compute reset-windows-password --quiet \ --project lingshi-sandbox \ --zone us-central1-a \ windows-2016-test-1118
// Commands below are only needed if you connect to Windows VMs via iap gcloud compute firewall-rules create allow-rdp-ingress-from-iap \ --direction=INGRESS \ --action=allow \ --rules=tcp:3389 \ --source-ranges=35.235.240.0/20
gcloud compute start-iap-tunnel \ --project lingshi-sandbox --zone us-central1-a \ --local-host-port=localhost:13010 \ windows-2016-test-1118 3389
2. Connect to the VM and install Ops Agent with Fluent Bit
googet -noconfirm install -sources https://packages.cloud.google.com/yuck/repos/google-cloud-ops-agent-windows-20211118-2 google-cloud-ops-agent
3. Enable debug logging
notepad C:\Program Files\Google\Cloud Operations\Ops Agent\config\config.yaml
Change it to the following
logging: service: log_level: debug
4. Restart the agent to load the config and observe that the agent won't stop even with `-Force`
Restart-Service google-cloud-ops-agent -Force Restart-Service : Time out has expired and the operation has not been completed. At line:1 char:1
+ CategoryInfo : NotSpecified: (:) [Restart-Service], TimeoutException
+ FullyQualifiedErrorId : System.ServiceProcess.TimeoutException,Microsoft.PowerShell.Commands.RestartServiceCommand
```
notepad C:\ProgramData\Google\Cloud Operations\Ops Agent\log\logging-module.log
[2021/11/18 17:50:52] [ warn] [engine] service will stop in 5 seconds
[2021/11/18 17:50:52] [debug] [output:stackdriver:stackdriver.0] [logging.googleapis.com/monitored_resource] not found in the payload
[2021/11/18 17:50:52] [debug] [http_client] not using http_proxy for header
[2021/11/18 17:50:52] [debug] [output:stackdriver:stackdriver.0] HTTP Status=200
[2021/11/18 17:50:52] [debug] [upstream] KA connection #1240 to logging.googleapis.com:443 is now available
[2021/11/18 17:50:52] [debug] [out coro] cb_destroy coro_id=2
[2021/11/18 17:50:52] [debug] [task] destroy task=0000011A0C7F4160 (task_id=11)
[2021/11/18 17:50:52] [debug] [output:stackdriver:stackdriver.1] HTTP Status=200
[2021/11/18 17:50:52] [debug] [upstream] KA connection #2040 to logging.googleapis.com:443 is now available
[2021/11/18 17:50:52] [debug] [out coro] cb_destroy coro_id=14
[2021/11/18 17:50:52] [debug] [task] destroy task=0000011A0C7F53F0 (task_id=12)
[2021/11/18 17:50:57] [ info] [engine] service stopped
[2021/11/18 17:50:57] [ warn] [engine] shutdown delayed, grace period has finished but some tasks are still running.
[2021/11/18 17:50:57] [ info] [task] winlog/winlog.0 has 5 pending task(s):
[2021/11/18 17:50:57] [ info] [task] task_id=5 still running on route(s): stackdriver/stackdriver.0
[2021/11/18 17:50:57] [ info] [task] task_id=2 still running on route(s): stackdriver/stackdriver.0
[2021/11/18 17:50:57] [ info] [task] task_id=3 still running on route(s): stackdriver/stackdriver.0
[2021/11/18 17:50:57] [ info] [task] task_id=9 still running on route(s): stackdriver/stackdriver.0
[2021/11/18 17:50:57] [ info] [task] task_id=10 still running on route(s): stackdriver/stackdriver.0
[2021/11/18 17:50:57] [ info] [task] tail/tail.1 has 5 pending task(s):
[2021/11/18 17:50:57] [ info] [task] task_id=6 still running on route(s): stackdriver/stackdriver.1
[2021/11/18 17:50:57] [ info] [task] task_id=7 still running on route(s): stackdriver/stackdriver.1
[2021/11/18 17:50:57] [ info] [task] task_id=8 still running on route(s): stackdriver/stackdriver.1
[2021/11/18 17:50:57] [ info] [task] task_id=1 still running on route(s): stackdriver/stackdriver.1
[2021/11/18 17:50:57] [ info] [task] task_id=0 still running on route(s): stackdriver/stackdriver.1
[2021/11/18 17:50:57] [ info] [task] storage_backlog/storage_backlog.2 has 1 pending task(s):
[2021/11/18 17:50:57] [ info] [task] task_id=4 still running on route(s): stackdriver/stackdriver.1
[2021/11/18 17:50:57] [ warn] [engine] service will stop in 5 seconds
[2021/11/18 17:50:59] [debug] [input:tail:tail.1] 0 new files found on path 'C:\ProgramData\Google\Cloud Operations\Ops Agent\log/logging-module.log'
[2021/11/18 17:51:02] [ info] [engine] service stopped
[2021/11/18 17:51:02] [ warn] [engine] shutdown delayed, grace period has finished but some tasks are still running.
[2021/11/18 17:51:02] [ info] [task] winlog/winlog.0 has 5 pending task(s):
[2021/11/18 17:51:02] [ info] [task] task_id=5 still running on route(s): stackdriver/stackdriver.0
[2021/11/18 17:51:02] [ info] [task] task_id=2 still running on route(s): stackdriver/stackdriver.0
[2021/11/18 17:51:02] [ info] [task] task_id=3 still running on route(s): stackdriver/stackdriver.0
[2021/11/18 17:51:02] [ info] [task] task_id=9 still running on route(s): stackdriver/stackdriver.0
[2021/11/18 17:51:02] [ info] [task] task_id=10 still running on route(s): stackdriver/stackdriver.0
[2021/11/18 17:51:02] [ info] [task] tail/tail.1 has 5 pending task(s):
[2021/11/18 17:51:02] [ info] [task] task_id=6 still running on route(s): stackdriver/stackdriver.1
[2021/11/18 17:51:02] [ info] [task] task_id=7 still running on route(s): stackdriver/stackdriver.1
[2021/11/18 17:51:02] [ info] [task] task_id=8 still running on route(s): stackdriver/stackdriver.1
[2021/11/18 17:51:02] [ info] [task] task_id=1 still running on route(s): stackdriver/stackdriver.1
[2021/11/18 17:51:02] [ info] [task] task_id=0 still running on route(s): stackdriver/stackdriver.1
[2021/11/18 17:51:02] [ info] [task] storage_backlog/storage_backlog.2 has 1 pending task(s):
[2021/11/18 17:51:02] [ info] [task] task_id=4 still running on route(s): stackdriver/stackdriver.1
[2021/11/18 17:51:02] [ warn] [engine] service will stop in 5 seconds
[2021/11/18 17:51:07] [ info] [engine] service stopped
[2021/11/18 17:51:07] [ warn] [engine] shutdown delayed, grace period has finished but some tasks are still running.
[2021/11/18 17:51:07] [ info] [task] winlog/winlog.0 has 5 pending task(s):
[2021/11/18 17:51:07] [ info] [task] task_id=5 still running on route(s): stackdriver/stackdriver.0
[2021/11/18 17:51:07] [ info] [task] task_id=2 still running on route(s): stackdriver/stackdriver.0
[2021/11/18 17:51:07] [ info] [task] task_id=3 still running on route(s): stackdriver/stackdriver.0
[2021/11/18 17:51:07] [ info] [task] task_id=9 still running on route(s): stackdriver/stackdriver.0
[2021/11/18 17:51:07] [ info] [task] task_id=10 still running on route(s): stackdriver/stackdriver.0
[2021/11/18 17:51:07] [ info] [task] tail/tail.1 has 5 pending task(s):
[2021/11/18 17:51:07] [ info] [task] task_id=6 still running on route(s): stackdriver/stackdriver.1
[2021/11/18 17:51:07] [ info] [task] task_id=7 still running on route(s): stackdriver/stackdriver.1
[2021/11/18 17:51:07] [ info] [task] task_id=8 still running on route(s): stackdriver/stackdriver.1
[2021/11/18 17:51:07] [ info] [task] task_id=1 still running on route(s): stackdriver/stackdriver.1
[2021/11/18 17:51:07] [ info] [task] task_id=0 still running on route(s): stackdriver/stackdriver.1
[2021/11/18 17:51:07] [ info] [task] storage_backlog/storage_backlog.2 has 1 pending task(s):
[2021/11/18 17:51:07] [ info] [task] task_id=4 still running on route(s): stackdriver/stackdriver.1
[2021/11/18 17:51:07] [ warn] [engine] service will stop in 5 seconds
[2021/11/18 17:51:12] [ info] [engine] service stopped
[2021/11/18 17:51:12] [ warn] [engine] shutdown delayed, grace period has finished but some tasks are still running.
[2021/11/18 17:51:12] [ info] [task] winlog/winlog.0 has 5 pending task(s):
[2021/11/18 17:51:12] [ info] [task] task_id=5 still running on route(s): stackdriver/stackdriver.0
[2021/11/18 17:51:12] [ info] [task] task_id=2 still running on route(s): stackdriver/stackdriver.0
[2021/11/18 17:51:12] [ info] [task] task_id=3 still running on route(s): stackdriver/stackdriver.0
[2021/11/18 17:51:12] [ info] [task] task_id=9 still running on route(s): stackdriver/stackdriver.0
[2021/11/18 17:51:12] [ info] [task] task_id=10 still running on route(s): stackdriver/stackdriver.0
[2021/11/18 17:51:12] [ info] [task] tail/tail.1 has 5 pending task(s):
[2021/11/18 17:51:12] [ info] [task] task_id=6 still running on route(s): stackdriver/stackdriver.1
[2021/11/18 17:51:12] [ info] [task] task_id=7 still running on route(s): stackdriver/stackdriver.1
[2021/11/18 17:51:12] [ info] [task] task_id=8 still running on route(s): stackdriver/stackdriver.1
[2021/11/18 17:51:12] [ info] [task] task_id=1 still running on route(s): stackdriver/stackdriver.1
[2021/11/18 17:51:12] [ info] [task] task_id=0 still running on route(s): stackdriver/stackdriver.1
[2021/11/18 17:51:12] [ info] [task] storage_backlog/storage_backlog.2 has 1 pending task(s):
[2021/11/18 17:51:12] [ info] [task] task_id=4 still running on route(s): stackdriver/stackdriver.1
[2021/11/18 17:51:12] [ warn] [engine] service will stop in 5 seconds
[2021/11/18 17:51:15] [debug] [upstream] drop keepalive connection #-1 to logging.googleapis.com:443 (keepalive idle timeout)
[2021/11/18 17:51:17] [debug] [upstream] drop keepalive connection #-1 to logging.googleapis.com:443 (keepalive idle timeout)
[2021/11/18 17:51:17] [debug] [upstream] drop keepalive connection #-1 to logging.googleapis.com:443 (keepalive idle timeout)
[2021/11/18 17:51:17] [ info] [engine] service stopped
[2021/11/18 17:51:17] [ warn] [engine] shutdown delayed, grace period has finished but some tasks are still running.
[2021/11/18 17:51:17] [ info] [task] winlog/winlog.0 has 5 pending task(s):
[2021/11/18 17:51:17] [ info] [task] task_id=5 still running on route(s): stackdriver/stackdriver.0
[2021/11/18 17:51:17] [ info] [task] task_id=2 still running on route(s): stackdriver/stackdriver.0
[2021/11/18 17:51:17] [ info] [task] task_id=3 still running on route(s): stackdriver/stackdriver.0
[2021/11/18 17:51:17] [ info] [task] task_id=9 still running on route(s): stackdriver/stackdriver.0
[2021/11/18 17:51:17] [ info] [task] task_id=10 still running on route(s): stackdriver/stackdriver.0
[2021/11/18 17:51:17] [ info] [task] tail/tail.1 has 5 pending task(s):
[2021/11/18 17:51:17] [ info] [task] task_id=6 still running on route(s): stackdriver/stackdriver.1
[2021/11/18 17:51:17] [ info] [task] task_id=7 still running on route(s): stackdriver/stackdriver.1
[2021/11/18 17:51:17] [ info] [task] task_id=8 still running on route(s): stackdriver/stackdriver.1
[2021/11/18 17:51:17] [ info] [task] task_id=1 still running on route(s): stackdriver/stackdriver.1
[2021/11/18 17:51:17] [ info] [task] task_id=0 still running on route(s): stackdriver/stackdriver.1
[2021/11/18 17:51:17] [ info] [task] storage_backlog/storage_backlog.2 has 1 pending task(s):
[2021/11/18 17:51:17] [ info] [task] task_id=4 still running on route(s): stackdriver/stackdriver.1
[2021/11/18 17:51:17] [ warn] [engine] service will stop in 5 seconds
[2021/11/18 17:51:18] [debug] [upstream] drop keepalive connection #-1 to logging.googleapis.com:443 (keepalive idle timeout)
[2021/11/18 17:51:18] [debug] [upstream] drop keepalive connection #-1 to logging.googleapis.com:443 (keepalive idle timeout)
[2021/11/18 17:51:20] [debug] [upstream] drop keepalive connection #-1 to logging.googleapis.com:443 (keepalive idle timeout)
[2021/11/18 17:51:20] [debug] [upstream] drop keepalive connection #-1 to logging.googleapis.com:443 (keepalive idle timeout)
[2021/11/18 17:51:21] [debug] [upstream] drop keepalive connection #-1 to logging.googleapis.com:443 (keepalive idle timeout)
[2021/11/18 17:51:22] [ info] [engine] service stopped
[2021/11/18 17:51:22] [ warn] [engine] shutdown delayed, grace period has finished but some tasks are still running.
[2021/11/18 17:51:22] [ info] [task] winlog/winlog.0 has 5 pending task(s):
[2021/11/18 17:51:22] [ info] [task] task_id=5 still running on route(s): stackdriver/stackdriver.0
[2021/11/18 17:51:22] [ info] [task] task_id=2 still running on route(s): stackdriver/stackdriver.0
[2021/11/18 17:51:22] [ info] [task] task_id=3 still running on route(s): stackdriver/stackdriver.0
[2021/11/18 17:51:22] [ info] [task] task_id=9 still running on route(s): stackdriver/stackdriver.0
[2021/11/18 17:51:22] [ info] [task] task_id=10 still running on route(s): stackdriver/stackdriver.0
[2021/11/18 17:51:22] [ info] [task] tail/tail.1 has 5 pending task(s):
[2021/11/18 17:51:22] [ info] [task] task_id=6 still running on route(s): stackdriver/stackdriver.1
[2021/11/18 17:51:22] [ info] [task] task_id=7 still running on route(s): stackdriver/stackdriver.1
[2021/11/18 17:51:22] [ info] [task] task_id=8 still running on route(s): stackdriver/stackdriver.1
[2021/11/18 17:51:22] [ info] [task] task_id=1 still running on route(s): stackdriver/stackdriver.1
[2021/11/18 17:51:22] [ info] [task] task_id=0 still running on route(s): stackdriver/stackdriver.1
[2021/11/18 17:51:22] [ info] [task] storage_backlog/storage_backlog.2 has 1 pending task(s):
[2021/11/18 17:51:22] [ info] [task] task_id=4 still running on route(s): stackdriver/stackdriver.1
[2021/11/18 17:51:22] [ warn] [engine] service will stop in 5 seconds
[2021/11/18 17:51:23] [debug] [upstream] drop keepalive connection #-1 to logging.googleapis.com:443 (keepalive idle timeout)
[2021/11/18 17:51:23] [debug] [upstream] drop keepalive connection #-1 to logging.googleapis.com:443 (keepalive idle timeout)
[2021/11/18 17:51:27] [ info] [engine] service stopped
[2021/11/18 17:51:27] [ warn] [engine] shutdown delayed, grace period has finished but some tasks are still running.
[2021/11/18 17:51:27] [ info] [task] winlog/winlog.0 has 5 pending task(s):
[2021/11/18 17:51:27] [ info] [task] task_id=5 still running on route(s): stackdriver/stackdriver.0
[2021/11/18 17:51:27] [ info] [task] task_id=2 still running on route(s): stackdriver/stackdriver.0
[2021/11/18 17:51:27] [ info] [task] task_id=3 still running on route(s): stackdriver/stackdriver.0
[2021/11/18 17:51:27] [ info] [task] task_id=9 still running on route(s): stackdriver/stackdriver.0
[2021/11/18 17:51:27] [ info] [task] task_id=10 still running on route(s): stackdriver/stackdriver.0
[2021/11/18 17:51:27] [ info] [task] tail/tail.1 has 5 pending task(s):
[2021/11/18 17:51:27] [ info] [task] task_id=6 still running on route(s): stackdriver/stackdriver.1
[2021/11/18 17:51:27] [ info] [task] task_id=7 still running on route(s): stackdriver/stackdriver.1
[2021/11/18 17:51:27] [ info] [task] task_id=8 still running on route(s): stackdriver/stackdriver.1
[2021/11/18 17:51:27] [ info] [task] task_id=1 still running on route(s): stackdriver/stackdriver.1
[2021/11/18 17:51:27] [ info] [task] task_id=0 still running on route(s): stackdriver/stackdriver.1
[2021/11/18 17:51:27] [ info] [task] storage_backlog/storage_backlog.2 has 1 pending task(s):
[2021/11/18 17:51:27] [ info] [task] task_id=4 still running on route(s): stackdriver/stackdriver.1
[2021/11/18 17:51:27] [ warn] [engine] service will stop in 5 seconds
Tested https://github.com/fluent/fluent-bit/pull/4355 and still seeing the issues.
Built Ops Agent using fluent-bit@3b9a2e7 → REPO_SUFFIX=20211123-1
. Installed on Windows 2016 VM
PS C:Windows\system32> (New-Object Net.WebClient).DownloadFile("https://dl.google.com/cloudagents/add-google-cloud-ops-agent-repo.ps1", "${env:UserProfile}\add-google-cloud-ops-agent-repo.ps1")
PS C:Windows\system32> $env:REPO_SUFFIX="20211123-1"; & $env:UserProfile\add-google-cloud-ops-agent-repo.ps1 -AlsoInstall
Restarted agent.
PS C:\Windows\system32> Restart-Service google-cloud-ops-agent -Force
This completed without the time out expiration once (maybe because I did it in the same minute that I started it?), but after the restart, started seeing these logs (full logs here):
[2021/11/23 03:03:22] [error] [upstream] connection #1924 to logging.googleapis.com:443 timed out after 10 seconds
[2021/11/23 03:03:22] [error] [io] connect event handler error
[2021/11/23 03:03:22] [ warn] [net] timeout detected between connection attempts
[2021/11/23 03:03:32] [error] [upstream] connection #1956 to logging.googleapis.com:443 timed out after 10 seconds
[2021/11/23 03:03:32] [error] [io] connect event handler error
[2021/11/23 03:03:32] [ warn] [net] timeout detected between connection attempts
[2021/11/23 03:04:13] [ warn] [engine] failed to flush chunk '1428-1637636650.493891500.flb', retry in 9 seconds: task_id=2, input=winlog.0 > output=stackdriver.0 (out_id=0)
[2021/11/23 03:06:50] [error] [upstream] connection #704 to logging.googleapis.com:443 timed out after 10 seconds
[2021/11/23 03:06:50] [error] [io] connect event handler error
[2021/11/23 03:06:50] [ warn] [net] timeout detected between connection attempts
[2021/11/23 03:06:52] [error] [upstream] connection #2036 to logging.googleapis.com:443 timed out after 10 seconds
[2021/11/23 03:06:52] [error] [io] connect event handler error
[2021/11/23 03:06:52] [ warn] [net] timeout detected between connection attempts
[2021/11/23 03:07:01] [error] [upstream] connection #2032 to logging.googleapis.com:443 timed out after 10 seconds
[2021/11/23 03:07:01] [error] [io] connect event handler error
[2021/11/23 03:07:01] [ warn] [net] timeout detected between connection attempts
[2021/11/23 03:07:02] [error] [upstream] connection #2052 to logging.googleapis.com:443 timed out after 10 seconds
[2021/11/23 03:07:02] [error] [io] connect event handler error
[2021/11/23 03:07:02] [ warn] [net] timeout detected between connection attempts
[2021/11/23 03:07:11] [error] [upstream] connection #2072 to logging.googleapis.com:443 timed out after 10 seconds
[2021/11/23 03:07:11] [error] [io] connect event handler error
Restart-Service
attempt after too.PS C:\Windows\system32> Stop-Service google-cloud-ops-agent -Force
Stop-Service : Time out has expired and the operation has not been completed.
At line:1 char:1
+ Stop-Service google-cloud-ops-agent -Force
+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+ CategoryInfo : NotSpecified: (:) [Stop-Service], TimeoutException
+ FullyQualifiedErrorId : System.ServiceProcess.TimeoutException,Microsoft.PowerShell.Commands.StopServiceCommand
[2021/11/23 03:17:05] [ warn] [engine] service will shutdown in max 5 seconds
[2021/11/23 03:17:06] [ info] [task] winlog/winlog.0 has 14 pending task(s):
[2021/11/23 03:17:06] [ info] [task] task_id=0 still running on route(s): stackdriver/stackdriver.0
[2021/11/23 03:17:06] [ info] [task] task_id=2 still running on route(s): stackdriver/stackdriver.0
[2021/11/23 03:17:06] [ info] [task] task_id=3 still running on route(s): stackdriver/stackdriver.0
[2021/11/23 03:17:06] [ info] [task] task_id=8 still running on route(s): stackdriver/stackdriver.0
[2021/11/23 03:17:06] [ info] [task] task_id=9 still running on route(s): stackdriver/stackdriver.0
[2021/11/23 03:17:06] [ info] [task] task_id=11 still running on route(s): stackdriver/stackdriver.0
[2021/11/23 03:17:06] [ info] [task] task_id=14 still running on route(s): stackdriver/stackdriver.0
[2021/11/23 03:17:06] [ info] [task] task_id=15 still running on route(s): stackdriver/stackdriver.0
[2021/11/23 03:17:06] [ info] [task] task_id=16 still running on route(s): stackdriver/stackdriver.0
[2021/11/23 03:17:06] [ info] [task] task_id=18 still running on route(s): stackdriver/stackdriver.0
[2021/11/23 03:17:06] [ info] [task] task_id=19 still running on route(s): stackdriver/stackdriver.0
[2021/11/23 03:17:06] [ info] [task] task_id=21 still running on route(s): stackdriver/stackdriver.0
[2021/11/23 03:17:06] [ info] [task] task_id=23 still running on route(s): stackdriver/stackdriver.0
[2021/11/23 03:17:06] [ info] [task] task_id=28 still running on route(s): stackdriver/stackdriver.0
[2021/11/23 03:17:06] [ info] [task] tail/tail.1 has 16 pending task(s):
[2021/11/23 03:17:06] [ info] [task] task_id=1 still running on route(s): stackdriver/stackdriver.1
[2021/11/23 03:17:06] [ info] [task] task_id=4 still running on route(s): stackdriver/stackdriver.1
[2021/11/23 03:17:06] [ info] [task] task_id=5 still running on route(s): stackdriver/stackdriver.1
[2021/11/23 03:17:06] [ info] [task] task_id=6 still running on route(s): stackdriver/stackdriver.1
[2021/11/23 03:17:06] [ info] [task] task_id=7 still running on route(s): stackdriver/stackdriver.1
[2021/11/23 03:17:06] [ info] [task] task_id=10 still running on route(s): stackdriver/stackdriver.1
[2021/11/23 03:17:06] [ info] [task] task_id=12 still running on route(s): stackdriver/stackdriver.1
[2021/11/23 03:17:06] [ info] [task] task_id=13 still running on route(s): stackdriver/stackdriver.1
[2021/11/23 03:17:06] [ info] [task] task_id=17 still running on route(s): stackdriver/stackdriver.1
[2021/11/23 03:17:06] [ info] [task] task_id=20 still running on route(s): stackdriver/stackdriver.1
[2021/11/23 03:17:06] [ info] [task] task_id=22 still running on route(s): stackdriver/stackdriver.1
[2021/11/23 03:17:06] [ info] [task] task_id=24 still running on route(s): stackdriver/stackdriver.1
[2021/11/23 03:17:06] [ info] [task] task_id=25 still running on route(s): stackdriver/stackdriver.1
[2021/11/23 03:17:06] [ info] [task] task_id=26 still running on route(s): stackdriver/stackdriver.1
[2021/11/23 03:17:06] [ info] [task] task_id=27 still running on route(s): stackdriver/stackdriver.1
[2021/11/23 03:17:06] [ info] [task] task_id=29 still running on route(s): stackdriver/stackdriver.1
[2021/11/23 03:17:06] [ info] [task] storage_backlog/storage_backlog.2 has 0 pending task(s):
[2021/11/23 03:17:18] [ info] [task] winlog/winlog.0 has 14 pending task(s):
[2021/11/23 03:17:18] [ info] [task] task_id=0 still running on route(s): stackdriver/stackdriver.0
[2021/11/23 03:17:18] [ info] [task] task_id=2 still running on route(s): stackdriver/stackdriver.0
[2021/11/23 03:17:18] [ info] [task] task_id=3 still running on route(s): stackdriver/stackdriver.0
[2021/11/23 03:17:18] [ info] [task] task_id=8 still running on route(s): stackdriver/stackdriver.0
[2021/11/23 03:17:18] [ info] [task] task_id=9 still running on route(s): stackdriver/stackdriver.0
[2021/11/23 03:17:18] [ info] [task] task_id=11 still running on route(s): stackdriver/stackdriver.0
[2021/11/23 03:17:18] [ info] [task] task_id=14 still running on route(s): stackdriver/stackdriver.0
[2021/11/23 03:17:18] [ info] [task] task_id=15 still running on route(s): stackdriver/stackdriver.0
[2021/11/23 03:17:18] [ info] [task] task_id=16 still running on route(s): stackdriver/stackdriver.0
[2021/11/23 03:17:18] [ info] [task] task_id=18 still running on route(s): stackdriver/stackdriver.0
[2021/11/23 03:17:18] [ info] [task] task_id=19 still running on route(s): stackdriver/stackdriver.0
[2021/11/23 03:17:18] [ info] [task] task_id=21 still running on route(s): stackdriver/stackdriver.0
[2021/11/23 03:17:18] [ info] [task] task_id=23 still running on route(s): stackdriver/stackdriver.0
[2021/11/23 03:17:18] [ info] [task] task_id=28 still running on route(s): stackdriver/stackdriver.0
[2021/11/23 03:17:18] [ info] [task] tail/tail.1 has 16 pending task(s):
[2021/11/23 03:17:18] [ info] [task] task_id=1 still running on route(s): stackdriver/stackdriver.1
[2021/11/23 03:17:18] [ info] [task] task_id=4 still running on route(s): stackdriver/stackdriver.1
[2021/11/23 03:17:18] [ info] [task] task_id=5 still running on route(s): stackdriver/stackdriver.1
[2021/11/23 03:17:18] [ info] [task] task_id=6 still running on route(s): stackdriver/stackdriver.1
[2021/11/23 03:17:18] [ info] [task] task_id=7 still running on route(s): stackdriver/stackdriver.1
[2021/11/23 03:17:18] [ info] [task] task_id=10 still running on route(s): stackdriver/stackdriver.1
[2021/11/23 03:17:18] [ info] [task] task_id=12 still running on route(s): stackdriver/stackdriver.1
[2021/11/23 03:17:18] [ info] [task] task_id=13 still running on route(s): stackdriver/stackdriver.1
[2021/11/23 03:17:18] [ info] [task] task_id=17 still running on route(s): stackdriver/stackdriver.1
[2021/11/23 03:17:18] [ info] [task] task_id=20 still running on route(s): stackdriver/stackdriver.1
[2021/11/23 03:17:18] [ info] [task] task_id=22 still running on route(s): stackdriver/stackdriver.1
[2021/11/23 03:17:18] [ info] [task] task_id=24 still running on route(s): stackdriver/stackdriver.1
[2021/11/23 03:17:18] [ info] [task] task_id=25 still running on route(s): stackdriver/stackdriver.1
[2021/11/23 03:17:18] [ info] [task] task_id=26 still running on route(s): stackdriver/stackdriver.1
[2021/11/23 03:17:18] [ info] [task] task_id=27 still running on route(s): stackdriver/stackdriver.1
[2021/11/23 03:17:18] [ info] [task] task_id=29 still running on route(s): stackdriver/stackdriver.1
[2021/11/23 03:17:18] [ info] [task] storage_backlog/storage_backlog.2 has 0 pending task(s):
[2021/11/23 03:17:18] [ info] [engine] service has stopped (30 pending tasks)
[2021/11/23 03:17:24] [ info] [output:stackdriver:stackdriver.0] thread worker #0 stopping...
STOP_PENDING
).C:\Windows\system32>sc queryex google-cloud-ops-agent-fluent-bit
SERVICE_NAME: google-cloud-ops-agent-fluent-bit
TYPE : 10 WIN32_OWN_PROCESS
STATE : 3 STOP_PENDING
(STOPPABLE, NOT_PAUSABLE, IGNORES_SHUTDOWN)
WIN32_EXIT_CODE : 0 (0x0)
SERVICE_EXIT_CODE : 0 (0x0)
CHECKPOINT : 0x0
WAIT_HINT : 0x7530
PID : 1428
FLAGS :
Restart-Service
or anything because debug mode thrashed VM and I couldn't connect to it again. Tested https://github.com/fluent/fluent-bit/pull/4362 -- still seeing the issues.
Built Ops Agent using branch=upstream-deadlock
/hash=fd1f909
→ REPO_SUFFIX=20211124-1
. Installed on Windows 2016 VM
PS C:Windows\system32> (New-Object Net.WebClient).DownloadFile("https://dl.google.com/cloudagents/add-google-cloud-ops-agent-repo.ps1", "${env:UserProfile}\add-google-cloud-ops-agent-repo.ps1")
PS C:Windows\system32> $env:REPO_SUFFIX="20211124-1"; & $env:UserProfile\add-google-cloud-ops-agent-repo.ps1 -AlsoInstall
Restarted agent. Encountered TimeoutException.
PS C:\Windows\system32> date
Wednesday, November 24, 2021 5:14:51 AM
PS C:\Windows\system32> Restart-Service google-cloud-ops-agent -Force
Restart-Service : Time out has expired and the operation has not been completed.
At line:1 char:1
+ Restart-Service google-cloud-ops-agent -Force
+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+ CategoryInfo : NotSpecified: (:) [Restart-Service], TimeoutException
+ FullyQualifiedErrorId : System.ServiceProcess.TimeoutException,Microsoft.PowerShell.Commands.RestartServiceCommand
Checked logs. They say the service has stopped but it hasn't.
[2021/11/24 05:13:52] [ info] [engine] started (pid=1416)
[2021/11/24 05:13:52] [ info] [storage] version=1.1.5, initializing...
[2021/11/24 05:13:52] [ info] [storage] root path 'C:\ProgramData\Google\Cloud Operations\Ops Agent\run\buffers'
[2021/11/24 05:13:52] [ info] [storage] normal synchronization mode, checksum enabled, max_chunks_up=128
[2021/11/24 05:13:52] [ info] [storage] backlog input plugin: storage_backlog.2
[2021/11/24 05:13:52] [ info] [cmetrics] version=0.2.2
[2021/11/24 05:13:52] [ info] [input:storage_backlog:storage_backlog.2] queue memory limit: 47.7M
[2021/11/24 05:13:52] [ info] [output:stackdriver:stackdriver.0] metadata_server set to http://metadata.google.internal
[2021/11/24 05:13:52] [ warn] [output:stackdriver:stackdriver.0] client_email is not defined, using a default one
[2021/11/24 05:13:52] [ warn] [output:stackdriver:stackdriver.0] private_key is not defined, fetching it from metadata server
[2021/11/24 05:13:52] [ info] [output:stackdriver:stackdriver.1] metadata_server set to http://metadata.google.internal
[2021/11/24 05:13:52] [ info] [output:stackdriver:stackdriver.0] worker #3 started
[2021/11/24 05:13:52] [ info] [output:stackdriver:stackdriver.0] worker #4 started
[2021/11/24 05:13:52] [ info] [output:stackdriver:stackdriver.0] worker #5 started
[2021/11/24 05:13:52] [ info] [output:stackdriver:stackdriver.0] worker #6 started
[2021/11/24 05:13:52] [ warn] [output:stackdriver:stackdriver.1] client_email is not defined, using a default one
[2021/11/24 05:13:52] [ info] [output:stackdriver:stackdriver.0] worker #0 started
[2021/11/24 05:13:52] [ info] [output:stackdriver:stackdriver.0] worker #1 started
[2021/11/24 05:13:52] [ info] [output:stackdriver:stackdriver.0] worker #2 started
[2021/11/24 05:13:52] [ info] [output:stackdriver:stackdriver.0] worker #7 started
[2021/11/24 05:13:52] [ warn] [output:stackdriver:stackdriver.1] private_key is not defined, fetching it from metadata server
[2021/11/24 05:13:52] [ info] [http_server] listen iface=0.0.0.0 tcp_port=2020
[2021/11/24 05:13:52] [ info] [output:stackdriver:stackdriver.1] worker #2 started
[2021/11/24 05:13:52] [ info] [output:stackdriver:stackdriver.1] worker #3 started
[2021/11/24 05:13:52] [ info] [output:stackdriver:stackdriver.1] worker #4 started
[2021/11/24 05:13:52] [ info] [output:stackdriver:stackdriver.1] worker #5 started
[2021/11/24 05:13:52] [ info] [output:stackdriver:stackdriver.1] worker #6 started
[2021/11/24 05:13:52] [ info] [output:stackdriver:stackdriver.1] worker #7 started
[2021/11/24 05:13:52] [ info] [sp] stream processor started
[2021/11/24 05:13:52] [ info] [output:stackdriver:stackdriver.1] worker #0 started
[2021/11/24 05:13:52] [ info] [output:stackdriver:stackdriver.1] worker #1 started
[2021/11/24 05:14:34] [error] [upstream] connection #2016 to logging.googleapis.com:443 timed out after 10 seconds
[2021/11/24 05:14:34] [ warn] [net] timeout detected between connection attempts
[2021/11/24 05:14:44] [error] [upstream] connection #1600 to logging.googleapis.com:443 timed out after 10 seconds
[2021/11/24 05:14:44] [ warn] [net] timeout detected between connection attempts
[2021/11/24 05:14:52] [ warn] [engine] service will shutdown in max 5 seconds
[2021/11/24 05:14:53] [ info] [task] winlog/winlog.0 has 2 pending task(s):
[2021/11/24 05:14:53] [ info] [task] task_id=0 still running on route(s): stackdriver/stackdriver.0
[2021/11/24 05:14:53] [ info] [task] task_id=1 still running on route(s): stackdriver/stackdriver.0
[2021/11/24 05:14:53] [ info] [task] tail/tail.1 has 1 pending task(s):
[2021/11/24 05:14:53] [ info] [task] task_id=2 still running on route(s): stackdriver/stackdriver.1
[2021/11/24 05:14:53] [ info] [task] storage_backlog/storage_backlog.2 has 0 pending task(s):
[2021/11/24 05:14:55] [ warn] [net] timeout detected between connection attempts
[2021/11/24 05:14:57] [ info] [task] winlog/winlog.0 has 2 pending task(s):
[2021/11/24 05:14:57] [ info] [task] task_id=0 still running on route(s): stackdriver/stackdriver.0
[2021/11/24 05:14:57] [ info] [task] task_id=1 still running on route(s): stackdriver/stackdriver.0
[2021/11/24 05:14:57] [ info] [task] tail/tail.1 has 1 pending task(s):
[2021/11/24 05:14:57] [ info] [task] task_id=2 still running on route(s): stackdriver/stackdriver.1
[2021/11/24 05:14:57] [ info] [task] storage_backlog/storage_backlog.2 has 0 pending task(s):
[2021/11/24 05:14:57] [ info] [engine] service has stopped (3 pending tasks)
[2021/11/24 05:14:57] [ info] [output:stackdriver:stackdriver.0] thread worker #0 stopping...
[2021/11/24 05:14:57] [ info] [output:stackdriver:stackdriver.0] thread worker #0 stopped
[2021/11/24 05:14:57] [ info] [output:stackdriver:stackdriver.0] thread worker #1 stopping...
Confirmed that the service isn't stopped (stuck in STOP_PENDING
).
C:\Windows\system32>sc queryex google-cloud-ops-agent-fluent-bit
SERVICE_NAME: google-cloud-ops-agent-fluent-bit
TYPE : 10 WIN32_OWN_PROCESS
STATE : 3 STOP_PENDING
(STOPPABLE, NOT_PAUSABLE, IGNORES_SHUTDOWN)
WIN32_EXIT_CODE : 0 (0x0)
SERVICE_EXIT_CODE : 0 (0x0)
CHECKPOINT : 0x0
WAIT_HINT : 0x7530
PID : 1416
FLAGS :
Tested https://github.com/fluent/fluent-bit/commit/4c93755
failed to flush chunk
in logs after restart. debug logs hereBuilt Ops Agent using branch=leonardo-issue-4300-poc
/hash=4c93755
→ REPO_SUFFIX=20211125-1
. Installed on Windows 2016 VM
PS C:Windows\system32> (New-Object Net.WebClient).DownloadFile("https://dl.google.com/cloudagents/add-google-cloud-ops-agent-repo.ps1", "${env:UserProfile}\add-google-cloud-ops-agent-repo.ps1")
PS C:Windows\system32> $env:REPO_SUFFIX="20211125-1"; & $env:UserProfile\add-google-cloud-ops-agent-repo.ps1 -AlsoInstall
Restarted agent: successful!
PS C:\Windows\system32> date; Restart-Service google-cloud-ops-agent -Force
Thursday, November 25, 2021 2:19:55 AM
Checked logs. After the restart, logs started growing with messages about connection timeouts and failed to flush chunk
.
notepad C:\ProgramData\Google\Cloud Operations\Ops Agent\log\logging-module.log
[2021/11/25 02:39:52] [error] [upstream] connection #2044 to logging.googleapis.com:443 timed out after 10 seconds
[2021/11/25 02:39:52] [ info] [upstream] resuming timed out coroutine=0000018D81384820
[2021/11/25 02:39:52] [error] [io] connect event handler error
[2021/11/25 02:39:52] [ warn] [net] timeout detected between connection attempts
[2021/11/25 02:40:03] [error] [upstream] connection #1960 to logging.googleapis.com:443 timed out after 10 seconds
[2021/11/25 02:40:03] [ info] [upstream] resuming timed out coroutine=0000018D80E13020
[2021/11/25 02:40:03] [error] [io] connect event handler error
[2021/11/25 02:40:03] [ warn] [net] timeout detected between connection attempts
[2021/11/25 02:40:11] [ warn] [engine] failed to flush chunk '2112-1637808003.58775000.flb', retry in 11 seconds: task_id=0, input=tail.1 > output=stackdriver.1 (out_id=1)
[2021/11/25 02:40:11] [ warn] [engine] failed to flush chunk '2112-1637808001.42068100.flb', retry in 9 seconds: task_id=1, input=winlog.0 > output=stackdriver.0 (out_id=0)
[2021/11/25 02:41:33] [error] [upstream] connection #2412 to logging.googleapis.com:443 timed out after 10 seconds
Enabled debug logs and restarted agent with cleared state. Debug logs: https://gist.github.com/jkschulz/5cf5ae93eb173b1761fa89698dc6cd4b
notepad C:\Program Files\Google\Cloud Operations\Ops Agent\config\config.yaml
) to:
logging:
service:
log_level: debug
PS C:\Windows\system32> Stop-Service google-cloud-ops-agent -Force
PS C:\Windows\system32> rm "C:\ProgramData\Google\Cloud Operations\Ops Agent\log\logging-module.log"; rm -r "C:\ProgramData\Google\Cloud Operations\Ops Agent\run\buffers\"
PS C:\Windows\system32> date; Restart-Service google-cloud-ops-agent -Force
Thursday, November 25, 2021 3:17:52 AM
A trimmed down version of reproduction:
On a Windows 2016 VM (e.g. gcloud compute instances create --image-project windows-cloud --image windows-server-2016-dc-v20211012 test-vm
)
// Clean up previous logs. // Make a buffer file folder. // Make a new log file. // Create a service // Start then restart the service a few times
rm "C:\Program Files\fluent-bit.log"
mkdir "C:\Program Files\buffers"
echo "" > "C:\Program Files\fluent-bit.log"
New-Service -Name "fluentbit" -BinaryPathName '"C:\Program Files\fluent-bit.exe" -c "C:\Program Files\fluent-bit.conf" --storage_path "C:\Program Files\buffers" --log_file "C:\Program Files\fluent-bit.log"'
Start-Service fluentbit
Restart-Service fluentbit
Restart-Service fluentbit
Restart-Service fluentbit
Restart-Service fluentbit
The Fluent Bit exe binary is built by first checking out Fluent Bit 1.8
branch at 90a3b3938787863f9284392a91d48e604e16aafd
then built it from a container:
FROM base as fluentbit
COPY fluent-bit /work/fluent-bit
WORKDIR /work/fluent-bit/build
RUN cmake -G "'Visual Studio 16 2019'" -DCMAKE_BUILD_TYPE=RELWITHDEBINFO -DFLB_OUT_KINESIS_STREAMS=OFF ../;
RUN cmake --build . --config Release; `
Copy-Item -Path bin/Release/fluent-bit.exe -Destination /work/out/bin/; `
Copy-Item -Path bin/Release/fluent-bit.dll -Destination /work/out/bin/;
C:\Program Files\fluent-bit.conf looks like:
[SERVICE]
Daemon off
Flush 1
HTTP_Listen 0.0.0.0
HTTP_PORT 2020
HTTP_Server On
Log_Level info
storage.backlog.mem_limit 50M
storage.checksum on
storage.max_chunks_up 128
storage.metrics on
storage.sync normal
[INPUT]
Buffer_Chunk_Size 512k
Buffer_Max_Size 5M
DB "C:\Program Files\buffers\fluent-bit"
Key message
Mem_Buf_Limit 10M
Name tail
Path "C:\Program Files\fluent-bit.log"
Read_from_Head True
Rotate_Wait 30
Skip_Long_Lines On
Tag ops-agent-fluent-bit
storage.type filesystem
[OUTPUT]
Match_Regex ^(ops-agent-fluent-bit)$
Name stackdriver
Retry_Limit 3
resource gce_instance
tls On
tls.verify Off
workers 8
Sample output looks like:
PS C:\Users\lingshi> rm "C:\Program Files\fluent-bit.log"
>>
>> mkdir "C:\Program Files\buffers"
>>
>> echo "" > "C:\Program Files\fluent-bit.log"
>>
>> New-Service -Name "fluent-bit-service" -BinaryPathName '"C:\Program Files\Google\Cloud Operations\Ops Agent\bin\fluent-bit.exe" -c "C:\Program Files\fluent-bit.conf" --storage_path "C:\Program Files\buffers" --log_file "C:\Program Files\fluent-bit.log"'
>>
>> Start-Service fluent-bit-service
>> Restart-Service fluent-bit-service
>> Restart-Service fluent-bit-service
>> Restart-Service fluent-bit-service
>> Restart-Service fluent-bit-service
Status Name DisplayName
------ ---- -----------
Stopped fluent-bit-service fluent-bit-service
WARNING: Waiting for service 'fluent-bit-service (fluent-bit-service)' to stop...
WARNING: Waiting for service 'fluent-bit-service (fluent-bit-service)' to stop...
WARNING: Waiting for service 'fluent-bit-service (fluent-bit-service)' to stop...
WARNING: Waiting for service 'fluent-bit-service (fluent-bit-service)' to stop...
WARNING: Waiting for service 'fluent-bit-service (fluent-bit-service)' to stop...
WARNING: Waiting for service 'fluent-bit-service (fluent-bit-service)' to stop...
WARNING: Waiting for service 'fluent-bit-service (fluent-bit-service)' to stop...
WARNING: Waiting for service 'fluent-bit-service (fluent-bit-service)' to stop...
WARNING: Waiting for service 'fluent-bit-service (fluent-bit-service)' to stop...
WARNING: Waiting for service 'fluent-bit-service (fluent-bit-service)' to stop...
WARNING: Waiting for service 'fluent-bit-service (fluent-bit-service)' to stop...
WARNING: Waiting for service 'fluent-bit-service (fluent-bit-service)' to stop...
WARNING: Waiting for service 'fluent-bit-service (fluent-bit-service)' to stop...
WARNING: Waiting for service 'fluent-bit-service (fluent-bit-service)' to stop...
WARNING: Waiting for service 'fluent-bit-service (fluent-bit-service)' to stop...
WARNING: Waiting for service 'fluent-bit-service (fluent-bit-service)' to stop...
WARNING: Waiting for service 'fluent-bit-service (fluent-bit-service)' to stop...
WARNING: Waiting for service 'fluent-bit-service (fluent-bit-service)' to stop...
WARNING: Waiting for service 'fluent-bit-service (fluent-bit-service)' to stop...
WARNING: Waiting for service 'fluent-bit-service (fluent-bit-service)' to stop...
WARNING: Waiting for service 'fluent-bit-service (fluent-bit-service)' to stop...
WARNING: Waiting for service 'fluent-bit-service (fluent-bit-service)' to stop...
WARNING: Waiting for service 'fluent-bit-service (fluent-bit-service)' to stop...
WARNING: Waiting for service 'fluent-bit-service (fluent-bit-service)' to stop...
One very intersting finding is that if Fluent Bit is not tailing its own log, then the issue does not exist.
rm "C:\Program Files\fluent-bit-a-different-log-file.log"
mkdir "C:\Program Files\buffers"
echo "" > "C:\Program Files\fluent-bit-a-different-log-file.log"
New-Service -Name "fluent-bit-service2323" -BinaryPathName '"C:\Program Files\Google\Cloud Operations\Ops Agent\bin\fluent-bit.exe" -c "C:\Program Files\fluent-bit.conf" --storage_path "C:\Program Files\buffers" --log_file "C:\Program Files\fluent-bit-a-different-log-file.log"'
Start-Service fluent-bit-service2323
Restart-Service fluent-bit-service2323
Restart-Service fluent-bit-service2323
Restart-Service fluent-bit-service2323
Restart-Service fluent-bit-service2323
Status update :
In our tests we could not correlate the issue reproduction to the inclussion of the log file as an input, however, what we were able to prove with an acceptable level of certainty was that in order for fluent-bit to fail to shut down it needs to have paused coroutines that are not resumed and that is something that was fixed in 1.9.x with PR #4140 by me.
Additionally, we found that those timeout are caused by fluent-bit trying to connect to an IPv6 host when trying to flush which happens because c-ares returns both IPv4 and IPv6 results (because it detects IPv6 as being operational) and also because it doesn't reorder the results which means even if the queries are issued in a particular order the results are returned in the order in which they arrive.
There was a remaining constraint to be investigated which was why does it only happen in Windows 2016 and in order to validate this a series of instances were created and used to ping on of the IPv6 addresses returned for logging.googleapis.com
. The results were :
Based on that we re-tested the previously added DNS resolver selector setting in fluent-bit which resulted in apparent success which means the native resolver does return IP viable addresses which suggests this can be used as an interim solution.
A better solution to retain the benefits of asynchronous DNS lookups would be adding familty type based result prioirization.
I believe you need to be doing the happy eyeballs algorithm on every platform; I think it's probably a fluke that you got different behavior on Windows 2016. It's possible for any platform to have non-working IPv6 service that blackholes traffic (and conversely as IPv6 gets more and more adopted that's becoming a problem on IPv4, too). https://en.wikipedia.org/wiki/Happy_Eyeballs
This is normally something that is built-in to an HTTP library.
That sounds interesting @quentinmit, I'm not entirely sure if it could be integrated in the current architecture without substantial changes though.
For now, we are using the legacy DNS resolver mode before new mode stables a bit more: https://github.com/GoogleCloudPlatform/ops-agent/pull/290.
The related config is:
[SERVICE]
dns.resolver legacy
Related PRs to add the config above are:
Filed a separate issue for the IPV6 / IPV4 fix: https://github.com/fluent/fluent-bit/issues/4473.
Status update : One of the problems that causes this issue is coroutine hang due to connection timeouts. This is a defect that was corrected in 1.9.x with PRs #4140 and #4193 which upon timeout detection resumes the coroutine referenced by te connection instance if the event type is FLB_ENGINE_EV_THREAD
(additionally it ensures the coroutine reference field in the connection instance only holds a coroutine reference if its' expecting to be awoken).
This approach worked well but I wasn't entirely convinced so I went back to analyze the problem hoping to be able to organically generate the socket activity required to trigger an event so the system resumed the coroutine automatically but unfortunately wasn't able to, partially because of how prepare_destroy_conn
works but also because any actions performed on the socket at that point are performed in between event loop wait calls (ie. a close
/ closesocket
call would normally be detected by the event loop if performed during a wait cycle but not otherwise).
With organic event generation out of the picture I wrote a proof of concept patch to add event synthesis / injection to the event loop system in order to use that to generate the closest possible to organic coroutine wakeup.
This PoC is available under the leonardo-event-injection-poc branch.
The upside of this method is that because we are not altering the "coroutine resume flow" we don't need to control the coroutine reference in the connection instance.
Opinions / reviews / ideas about the concept and potential solutions would be REALLY appreciated, in case anyone has any questions about the logic behind my ideas please don't hesitate to ask.
Thanks for the detailed update.
Is there any chance that the event injection gets discarded if an mk_event_wait() is invoked ?
When injecting the event is added as a last item of the list:
this is the iteration loop when consuming the list of events mk_event_foreach():
I am also thinking that this is quite similar to the other cleanup mechanism when processing something after the event loop round has finished:
https://github.com/fluent/fluent-bit/blob/master/src/flb_engine.c#L804-L807
should we implement a cleanup function instead ?
Any type of events are discarded when mk_event_wait
is called.
The injected events are consumed on the same cycle they are injected on, at least with the standard compilation flags (in some specific cases when a variable is modified outside of the scope where it's being consumed the volatile attribute needs to be used on the variable definition to prevent the compiler from optimizing the memory read but that's not the case hre)
Adding a cleanup step where all of them are resumed would be the same as what's currently implemented which depends on the rest of the code playing nice, this alternative approach is meant to eliminate that risk.
@leonardo-albertovich is there any downside of this implementation ?, if there is no one, is it possible to write a unit test for the problem in question ?
I think I may be able to put together a unit test, I'll get to it immediately.
I made PR #4605 with the test case, what the test does is :
Set up fluent bit with a grace period of 20 seconds, a dummy input plugin capped to one sample and a tcp output pointed to 128.1.1.1 (which could be improved) and a connection timeout of 5 seconds.
Then it starts the server, waits 10 seconds (to give it time to flush and time out while being safely under the grace period) and stops the server.
When the server stops the elapsed time from start to stop is measured and if it's higher than 15 seconds it assumes that the grace period was enforced.
In order to test this one would have to manually patch flb_upstream_conn_timeouts
because at the moment master includes my original fix that resumes timed out coroutines.
for #4605 make sure to submit the patch for upstream interface so the CI can pass
What does that mean? (the upstream interface thing) I just saw that ubuntu-18.04 failed the test with the patch so I'll tweak the timing a bit to see what exactly the issue is related to.
Bug Report
Describe the bug When stopping / restarting Fluent Bit service, occasionally it enters a flush loop and hangs (In the experiment we observed it for 10+ minutes. It is still trying to flush log entries, even though we leave the grace period as 5 seconds. For the context, we are using file based buffering, so
flush_at_shutdown
should default to false, which means it is not expected for Fluent Bit to try flushing log entries prior to shutdown.Expected behavior
Fluent Bit stops within a reasonable time (aka close to the
grace
config value) after getting
service stop` signal.Sample logs
Your Environment
Additional context