Closed jsubirat closed 3 years ago
We are seeing a very similar issue. We did not clear Application logs though. We see this with both https://github.com/fluent/fluent-bit/tree/v1.7.0-rc3 and https://github.com/fluent/fluent-bit/tree/v1.7.0-rc4. Logs from other channels arrived just fine, but not with Application
channels.
[2021/02/06 00:08:56] [ info] [engine] started (pid=4920)
[2021/02/06 00:08:56] [ info] [storage] version=1.1.0, initializing...
[2021/02/06 00:08:56] [ info] [storage] root path 'C:\ProgramData\Google\Cloud Operations\Ops Agent\run\buffers'
[2021/02/06 00:08:56] [ info] [storage] normal synchronization mode, checksum enabled, max_chunks_up=128
[2021/02/06 00:08:56] [ info] [storage] backlog input plugin: storage_backlog.3
[2021/02/06 00:08:56] [ info] [input:storage_backlog:storage_backlog.3] queue memory limit: 47.7M
[2021/02/06 00:08:56] [ info] [output:stackdriver:stackdriver.0] metadata_server set to http://metadata.google.internal
[2021/02/06 00:08:56] [ warn] [output:stackdriver:stackdriver.0] client_email is not defined, using a default one
[2021/02/06 00:08:56] [ warn] [output:stackdriver:stackdriver.0] private_key is not defined, fetching it from metadata server
[2021/02/06 00:08:56] [ info] [output:stackdriver:stackdriver.1] metadata_server set to http://metadata.google.internal
[2021/02/06 00:08:56] [ warn] [output:stackdriver:stackdriver.1] client_email is not defined, using a default one
[2021/02/06 00:08:56] [ warn] [output:stackdriver:stackdriver.1] private_key is not defined, fetching it from metadata server
[2021/02/06 00:08:56] [ info] [output:stackdriver:stackdriver.2] metadata_server set to http://metadata.google.internal
[2021/02/06 00:08:56] [ warn] [output:stackdriver:stackdriver.2] client_email is not defined, using a default one
[2021/02/06 00:08:56] [ warn] [output:stackdriver:stackdriver.2] private_key is not defined, fetching it from metadata server
[2021/02/06 00:08:56] [ info] [sp] stream processor started
[2021/02/06 00:08:57] [error] [in_winlog] cannot read 'Application' (122)
[2021/02/06 00:08:57] [error] [input:winlog:winlog.2] failed to read 'Application'
[2021/02/06 00:08:58] [error] [in_winlog] cannot read 'Application' (122)
[2021/02/06 00:08:58] [error] [input:winlog:winlog.2] failed to read 'Application'
[2021/02/06 00:08:59] [error] [in_winlog] cannot read 'Application' (122)
[2021/02/06 00:08:59] [error] [input:winlog:winlog.2] failed to read 'Application'
[2021/02/06 00:09:00] [error] [in_winlog] cannot read 'Application' (122)
[2021/02/06 00:09:00] [error] [input:winlog:winlog.2] failed to read 'Application'
[2021/02/06 00:09:01] [error] [in_winlog] cannot read 'Application' (122)
[2021/02/06 00:09:01] [error] [input:winlog:winlog.2] failed to read 'Application'
[2021/02/06 00:09:02] [error] [in_winlog] cannot read 'Application' (122)
[2021/02/06 00:09:02] [error] [input:winlog:winlog.2] failed to read 'Application'
[2021/02/06 00:09:03] [error] [in_winlog] cannot read 'Application' (122)
[2021/02/06 00:09:03] [error] [input:winlog:winlog.2] failed to read 'Application'
The related config we use is:
[SERVICE]
# https://docs.fluentbit.io/manual/administration/configuring-fluent-bit/configuration-file#config_section
# Flush logs every 1 second, even if the buffer is not full to minimize log entry arrival delay.
Flush 1
# We use systemd to manage Fluent Bit instead.
Daemon off
# Log_File is set by Fluent Bit systemd unit (e.g. /var/log/google-cloud-ops-agent/subagents/logging-module.log).
Log_Level info
# https://docs.fluentbit.io/manual/administration/monitoring
# Enable a built-in HTTP server that can be used to query internal information and monitor metrics of each running plugin.
HTTP_Server On
HTTP_Listen 0.0.0.0
HTTP_PORT 2020
# https://docs.fluentbit.io/manual/administration/buffering-and-storage#service-section-configuration
# storage.path is set by Fluent Bit systemd unit (e.g. /var/lib/google-cloud-ops-agent/fluent-bit/buffers).
storage.sync normal
# Enable the data integrity check when writing and reading data from the filesystem.
storage.checksum on
# The maximum amount of data to load into the memory when processing old chunks from the backlog that is from previous Fluent Bit processes (e.g. Fluent Bit may have crashed or restarted).
storage.backlog.mem_limit 50M
# Enable storage metrics in the built-in HTTP server.
storage.metrics on
# This is exclusive to filesystem storage type. It specifies the number of chunks (every chunk is a file) that can be up in memory.
# Every chunk is a file, so having it up in memory means having an open file descriptor. In case there are thousands of chunks,
# we don't want them to all be loaded into the memory.
storage.max_chunks_up 128
[INPUT]
# https://docs.fluentbit.io/manual/pipeline/inputs/windows-event-log
Tag default_pipeline.windows_event_log
Name winlog
Channels System,Application,Security
Interval_Sec 1
DB C:\ProgramData\Google\Cloud Operations\Ops Agent\run/buffers/default_pipeline_windows_event_log
[FILTER]
Name modify
Match default_pipeline.windows_event_log
Add logName windows_event_log
[FILTER]
Name rewrite_tag
Match default_pipeline.windows_event_log
Rule $logName .* $logName false
Emitter_Storage.type filesystem
Emitter_Mem_Buf_Limit 10M
[FILTER]
Name modify
Match windows_event_log
Remove logName
[OUTPUT]
# https://docs.fluentbit.io/manual/pipeline/outputs/stackdriver
Name stackdriver
resource gce_instance
Match windows_event_log
# https://docs.fluentbit.io/manual/administration/scheduling-and-retries
# After 3 retries, a given chunk will be discarded. So bad entries don't accidentally stay around forever.
Retry_Limit 3
# https://docs.fluentbit.io/manual/administration/security
# Enable TLS support.
tls On
# Do not force certificate validation.
tls.verify Off
I've tried the following:
I'm still seeing the errors consistently.
Turned out that the issue is consistent on Windows VMs that are optimized for containers. But on a regular Windows 2016 VM, logs from Application
channels arrived successfully at least initially.
I tried to clean the Application
channels on the regular Windows 2016 VM, and started to see this issue. I guess the Windows VMs that are optimized for containers may be constantly doing some file rotation kind of operations which have the same effect as clearing a channel.
@qingling128 I posted PR that addresses your issue to #3028.
Please reply to me if your issue still occurs with that patch applied.
Some Addendum: Actually it was my bug. When I originally implemented the pluign,
I misread the maximum size of buffer allowed to be 0x7fff
, while it's actually 0x7ffff
.
https://docs.microsoft.com/en-us/windows/win32/api/winbase/nf-winbase-readeventlogw
This is why in_winlog
was failing with error 122
(= ERROR_INSUFFICENT_BUFFER
).
It's just my silly mistake and #3028 should amend that slippage.
@jsubirat's original issue is a different one, which occurs when Clear-EventLog
erases
the event log concurrently while in_winlog
is reading. In that case, we need to close the
handle for the Event stream and reopen it.
I'll add a fix for that too next week...
That's awesome, @fujimotos ! In fact, we originally experienced the 122 error, and it was when trying to reproduce it that we found this other issue when concurrently clearing the event logs. Thanks a lot for fixing both! Can we expect another FluentBit version coming out next week containing both fixes? 🤞
@jsubirat FYI I submit #3033 as a bug fix for your issue. I expect that the fix makes into v1.7.x.
I just built a fluent-bit exe off rc5 which includes above fix. It has trouble with eventlogs. I saw below error msgs:
2021/02/10 19:32:10] [debug] [output:stackdriver:stackdriver.0] HTTP Status=200
[2021/02/10 19:32:10] [debug] [out coro] cb_destroy coro_id=16
[2021/02/10 19:32:10] [debug] [task] destroy task=00636338 (task_id=0)
[2021/02/10 19:32:11] [debug] [storage] [cio file] synced at: tail.0/2716-1612985530. 0.flb
[2021/02/10 19:32:11] [debug] [task] created task=00636498 id=0 OK
[2021/02/10 19:32:11] [debug] [input:winlog:winlog.2] read 524112 bytes from 'System'
[2021/02/10 19:32:11] [debug] [input:winlog:winlog.2] save channel<System record=38504 time=1610326792>
[2021/02/10 19:32:11] [error] [input chunk] no available chunk
[2021/02/10 19:32:11] [debug] [input:winlog:winlog.2] read 521072 bytes from 'Application'
[2021/02/10 19:32:11] [debug] [input:winlog:winlog.2] save channel<Application record=549337 time=1612418569>
[2021/02/10 19:32:11] [error] [input chunk] no available chunk
[2021/02/10 19:32:11] [debug] [input:winlog:winlog.2] read 523852 bytes from 'Security'
[2021/02/10 19:32:11] [debug] [input:winlog:winlog.2] save channel<Security record=31008 time=1609798764>
[2021/02/10 19:32:11] [error] [input chunk] no available chunk
[2021/02/10 19:32:11] [debug] [output:stackdriver:stackdriver.0] [logging.googleapis.com/monitored_resource] not found in the payload
[2021/02/10 19:32:11] [debug] [http_client] not using http_proxy for header
fluent-bit.conf
[SERVICE]
# https://docs.fluentbit.io/manual/administration/configuring-fluent-bit/configuration-file#config_section
# Flush logs every 1 second, even if the buffer is not full to minimize log entry arrival delay.
Flush 1
# Waits 120 seconds after receiving a SIGTERM before it shuts down to minimize log loss.
Grace 12
# We use systemd to manage Fluent Bit instead.
Daemon off
# Log_File is set by Fluent Bit systemd unit (e.g. /var/log/google-cloud-ops-agent/subagents/logging-module.log).
Log_Level debug
# https://docs.fluentbit.io/manual/administration/monitoring
# Enable a built-in HTTP server that can be used to query internal information and monitor metrics of each running plugin.
HTTP_Server On
HTTP_Listen 0.0.0.0
HTTP_PORT 2020
# https://docs.fluentbit.io/manual/administration/buffering-and-storage#service-section-configuration
# storage.path is set by Fluent Bit systemd unit (e.g. /var/lib/google-cloud-ops-agent/fluent-bit/buffers).
storage.sync normal
# Enable the data integrity check when writing and reading data from the filesystem.
storage.checksum on
# The maximum amount of data to load into the memory when processing old chunks from the backlog that is from previous Fluent Bit processes (e.g. Fluent Bit may have crashed or restarted).
storage.backlog.mem_limit 50M
# Enable storage metrics in the built-in HTTP server.
storage.metrics on
# This is exclusive to filesystem storage type. It specifies the number of chunks (every chunk is a file) that can be up in memory.
# Every chunk is a file, so having it up in memory means having an open file descriptor. In case there are thousands of chunks,
# we don't want them to all be loaded into the memory.
storage.max_chunks_up 128
[INPUT]
# https://docs.fluentbit.io/manual/pipeline/inputs/tail#config
Name tail
DB /var/lib/google-cloud-ops-agent/fluent-bit/buffers/ops-agent-fluent-bit
Path /var/log/google-cloud-ops-agent/subagents/logging-module.log
Tag ops-agent-fluent-bit
# Set the chunk limit conservatively to avoid exceeding the recommended chunk size of 5MB per write request.
Buffer_Chunk_Size 512k
# Set the max size a bit larger to accommodate for long log lines.
Buffer_Max_Size 5M
# When a message is unstructured (no parser applied), append it under a key named "message".
Key message
# Increase this to 30 seconds so log rotations are handled more gracefully.
Rotate_Wait 30
# Skip long lines instead of skipping the entire file when a long line exceeds buffer size.
Skip_Long_Lines On
# https://docs.fluentbit.io/manual/administration/buffering-and-storage#input-section-configuration
# Buffer in disk to improve reliability.
storage.type filesystem
# https://docs.fluentbit.io/manual/administration/backpressure#mem_buf_limit
# This controls how much data the input plugin can hold in memory once the data is ingested into the core.
# This is used to deal with backpressure scenarios (e.g: cannot flush data for some reason).
# When the input plugin hits "mem_buf_limit", because we have enabled filesystem storage type, mem_buf_limit acts
# as a hint to set "how much data can be up in memory", once the limit is reached it continues writing to disk.
Mem_Buf_Limit 10M
[INPUT]
# https://docs.fluentbit.io/manual/pipeline/inputs/tail#config
Name tail
DB /var/lib/google-cloud-ops-agent/fluent-bit/buffers/ops-agent-collectd
Path /var/log/google-cloud-ops-agent/subagents/metrics-module.log
Tag ops-agent-collectd
# Set the chunk limit conservatively to avoid exceeding the recommended chunk size of 5MB per write request.
Buffer_Chunk_Size 512k
# Set the max size a bit larger to accommodate for long log lines.
Buffer_Max_Size 5M
# When a message is unstructured (no parser applied), append it under a key named "message".
Key message
# Increase this to 30 seconds so log rotations are handled more gracefully.
Rotate_Wait 30
# Skip long lines instead of skipping the entire file when a long line exceeds buffer size.
Skip_Long_Lines On
# https://docs.fluentbit.io/manual/administration/buffering-and-storage#input-section-configuration
# Buffer in disk to improve reliability.
storage.type filesystem
# https://docs.fluentbit.io/manual/administration/backpressure#mem_buf_limit
# This controls how much data the input plugin can hold in memory once the data is ingested into the core.
# This is used to deal with backpressure scenarios (e.g: cannot flush data for some reason).
# When the input plugin hits "mem_buf_limit", because we have enabled filesystem storage type, mem_buf_limit acts
# as a hint to set "how much data can be up in memory", once the limit is reached it continues writing to disk.
Mem_Buf_Limit 10M
[INPUT]
# https://docs.fluentbit.io/manual/pipeline/inputs/windows-event-log
Tag default_pipeline.windows_event_log
Name winlog
Channels System,Application,Security
Interval_Sec 1
DB /var/lib/google-cloud-ops-agent/fluent-bit/buffers/default_pipeline_windows_event_log
[FILTER]
Name modify
Match default_pipeline.windows_event_log
Add logName windows_event_log
[FILTER]
Name rewrite_tag
Match default_pipeline.windows_event_log
Rule $logName .* $logName false
Emitter_Storage.type filesystem
Emitter_Mem_Buf_Limit 10M
[FILTER]
Name modify
Match windows_event_log
Remove logName
[OUTPUT]
# https://docs.fluentbit.io/manual/pipeline/outputs/stackdriver
Name stackdriver
resource gce_instance
Match ops-agent-fluent-bit
# https://docs.fluentbit.io/manual/administration/scheduling-and-retries
# After 3 retries, a given chunk will be discarded. So bad entries don't accidentally stay around forever.
Retry_Limit 3
# https://docs.fluentbit.io/manual/administration/security
# Enable TLS support.
tls On
# Do not force certificate validation.
tls.verify Off
[OUTPUT]
# https://docs.fluentbit.io/manual/pipeline/outputs/stackdriver
Name stackdriver
resource gce_instance
Match ops-agent-collectd
# https://docs.fluentbit.io/manual/administration/scheduling-and-retries
# After 3 retries, a given chunk will be discarded. So bad entries don't accidentally stay around forever.
Retry_Limit 3
# https://docs.fluentbit.io/manual/administration/security
# Enable TLS support.
tls On
# Do not force certificate validation.
tls.verify Off
[OUTPUT]
# https://docs.fluentbit.io/manual/pipeline/outputs/stackdriver
Name stackdriver
resource gce_instance
Match windows_event_log
# https://docs.fluentbit.io/manual/administration/scheduling-and-retries
# After 3 retries, a given chunk will be discarded. So bad entries don't accidentally stay around forever.
Retry_Limit 3
# https://docs.fluentbit.io/manual/administration/security
# Enable TLS support.
tls On
# Do not force certificate validation.
tls.verify Off
It seems the issue is with the filter plugins both seen in Linux and Windows, i created a separate github issue here: https://github.com/fluent/fluent-bit/issues/3065
This ticket is fixed on master HEAD. Now let's close it.
Bug Report
Describe the bug Whenever Fluent Bit is reading a winlog event channel (let's say
Application
), if that channel gets cleared then Fluent Bit becomes unable to read from that channel again (failed to read 'Application'
), even if new messages are sent there. The only way to solve it is by restarting it.To Reproduce
Clear the
Application
channel using Event Viewer:Start Fluent Bit with the following configuration:
[OUTPUT] Name stdout Match *
New-EventLog -LogName Application -Source "Josep Test" Write-EventLog -LogName Application -Source "Josep Test" -EntryType Information -EventId 16664 -Message "Hello world, will clear logs soon"