fluent / fluent-bit

Fast and Lightweight Logs and Metrics processor for Linux, BSD, OSX and Windows
https://fluentbit.io
Apache License 2.0
5.81k stars 1.58k forks source link

[in_winlog] Reading winlog events fails if logs are cleared using EventViewer #2873

Closed jsubirat closed 3 years ago

jsubirat commented 3 years ago

Bug Report

Describe the bug Whenever Fluent Bit is reading a winlog event channel (let's say Application), if that channel gets cleared then Fluent Bit becomes unable to read from that channel again (failed to read 'Application'), even if new messages are sent there. The only way to solve it is by restarting it.

To Reproduce

[OUTPUT] Name stdout Match *


- Send a message manually using PowerShell:

New-EventLog -LogName Application -Source "Josep Test" Write-EventLog -LogName Application -Source "Josep Test" -EntryType Information -EventId 16664 -Message "Hello world, will clear logs soon"



- Observe how the message is received and processed by Fluent Bit
<img width="965" alt="Screen Shot 2020-12-17 at 5 22 22 PM" src="https://user-images.githubusercontent.com/6243832/102514398-b454fd80-408c-11eb-8ac1-76aa82a0fbd6.png">

- Clear the messages from the `Application` channel again using Event Viewer.

- Observe how Fluent Bit starts reporting the following error repeatedly.
<img width="975" alt="Screen Shot 2020-12-17 at 5 22 36 PM" src="https://user-images.githubusercontent.com/6243832/102514426-c0d95600-408c-11eb-84e8-73b84eec9d93.png">

- Try sending any new message again. It should not be processed by Fluent Bit.

- Restart Fluent Bit. It should be back to normal.

**Expected behavior**
- Fluent Bit should keep reading messages even if a channel is cleared.

**Screenshots**
This is the screenshot when doing all the steps from the "To Reproduce" section:
<img width="968" alt="Screen Shot 2020-12-17 at 5 22 52 PM" src="https://user-images.githubusercontent.com/6243832/102514540-e2d2d880-408c-11eb-8f8d-7c7c9cc52f57.png">

**Your Environment**
<!--- Include as many relevant details about the environment you experienced the bug in -->
* Version used: 1.6.3
* Configuration: (provided above)
* Environment name and version (e.g. Kubernetes? What version?): -
* Server type and version: MacBook Pro
* Operating System and version: Windows 10 (via VMWare)
* Filters and plugins: `in_winlog`, `stdout`

**Additional context**
It would be really useful if Fluent Bit outputted more descriptive messages on why it failed to read the `Application` channel, or any channel. We are getting this message in other random scenarios, which we're not able to reproduce (yet). Having more information would therefore be really useful for us in diagnosing them.
qingling128 commented 3 years ago

We are seeing a very similar issue. We did not clear Application logs though. We see this with both https://github.com/fluent/fluent-bit/tree/v1.7.0-rc3 and https://github.com/fluent/fluent-bit/tree/v1.7.0-rc4. Logs from other channels arrived just fine, but not with Application channels.

[2021/02/06 00:08:56] [ info] [engine] started (pid=4920)
[2021/02/06 00:08:56] [ info] [storage] version=1.1.0, initializing...
[2021/02/06 00:08:56] [ info] [storage] root path 'C:\ProgramData\Google\Cloud Operations\Ops Agent\run\buffers'
[2021/02/06 00:08:56] [ info] [storage] normal synchronization mode, checksum enabled, max_chunks_up=128
[2021/02/06 00:08:56] [ info] [storage] backlog input plugin: storage_backlog.3
[2021/02/06 00:08:56] [ info] [input:storage_backlog:storage_backlog.3] queue memory limit: 47.7M
[2021/02/06 00:08:56] [ info] [output:stackdriver:stackdriver.0] metadata_server set to http://metadata.google.internal
[2021/02/06 00:08:56] [ warn] [output:stackdriver:stackdriver.0] client_email is not defined, using a default one
[2021/02/06 00:08:56] [ warn] [output:stackdriver:stackdriver.0] private_key is not defined, fetching it from metadata server
[2021/02/06 00:08:56] [ info] [output:stackdriver:stackdriver.1] metadata_server set to http://metadata.google.internal
[2021/02/06 00:08:56] [ warn] [output:stackdriver:stackdriver.1] client_email is not defined, using a default one
[2021/02/06 00:08:56] [ warn] [output:stackdriver:stackdriver.1] private_key is not defined, fetching it from metadata server
[2021/02/06 00:08:56] [ info] [output:stackdriver:stackdriver.2] metadata_server set to http://metadata.google.internal
[2021/02/06 00:08:56] [ warn] [output:stackdriver:stackdriver.2] client_email is not defined, using a default one
[2021/02/06 00:08:56] [ warn] [output:stackdriver:stackdriver.2] private_key is not defined, fetching it from metadata server
[2021/02/06 00:08:56] [ info] [sp] stream processor started
[2021/02/06 00:08:57] [error] [in_winlog] cannot read 'Application' (122)
[2021/02/06 00:08:57] [error] [input:winlog:winlog.2] failed to read 'Application'
[2021/02/06 00:08:58] [error] [in_winlog] cannot read 'Application' (122)
[2021/02/06 00:08:58] [error] [input:winlog:winlog.2] failed to read 'Application'
[2021/02/06 00:08:59] [error] [in_winlog] cannot read 'Application' (122)
[2021/02/06 00:08:59] [error] [input:winlog:winlog.2] failed to read 'Application'
[2021/02/06 00:09:00] [error] [in_winlog] cannot read 'Application' (122)
[2021/02/06 00:09:00] [error] [input:winlog:winlog.2] failed to read 'Application'
[2021/02/06 00:09:01] [error] [in_winlog] cannot read 'Application' (122)
[2021/02/06 00:09:01] [error] [input:winlog:winlog.2] failed to read 'Application'
[2021/02/06 00:09:02] [error] [in_winlog] cannot read 'Application' (122)
[2021/02/06 00:09:02] [error] [input:winlog:winlog.2] failed to read 'Application'
[2021/02/06 00:09:03] [error] [in_winlog] cannot read 'Application' (122)
[2021/02/06 00:09:03] [error] [input:winlog:winlog.2] failed to read 'Application'
qingling128 commented 3 years ago

The related config we use is:

[SERVICE]
    # https://docs.fluentbit.io/manual/administration/configuring-fluent-bit/configuration-file#config_section
    # Flush logs every 1 second, even if the buffer is not full to minimize log entry arrival delay.
    Flush      1
    # We use systemd to manage Fluent Bit instead.
    Daemon     off
    # Log_File is set by Fluent Bit systemd unit (e.g. /var/log/google-cloud-ops-agent/subagents/logging-module.log).
    Log_Level  info

    # https://docs.fluentbit.io/manual/administration/monitoring
    # Enable a built-in HTTP server that can be used to query internal information and monitor metrics of each running plugin.
    HTTP_Server  On
    HTTP_Listen  0.0.0.0
    HTTP_PORT    2020

    # https://docs.fluentbit.io/manual/administration/buffering-and-storage#service-section-configuration
    # storage.path is set by Fluent Bit systemd unit (e.g. /var/lib/google-cloud-ops-agent/fluent-bit/buffers).
    storage.sync               normal
    # Enable the data integrity check when writing and reading data from the filesystem.
    storage.checksum           on
    # The maximum amount of data to load into the memory when processing old chunks from the backlog that is from previous Fluent Bit processes (e.g. Fluent Bit may have crashed or restarted).
    storage.backlog.mem_limit  50M
    # Enable storage metrics in the built-in HTTP server.
    storage.metrics            on
    # This is exclusive to filesystem storage type. It specifies the number of chunks (every chunk is a file) that can be up in memory.
    # Every chunk is a file, so having it up in memory means having an open file descriptor. In case there are thousands of chunks,
    # we don't want them to all be loaded into the memory.
    storage.max_chunks_up      128

[INPUT]
    # https://docs.fluentbit.io/manual/pipeline/inputs/windows-event-log
    Tag            default_pipeline.windows_event_log
    Name           winlog
    Channels       System,Application,Security
    Interval_Sec   1
    DB             C:\ProgramData\Google\Cloud Operations\Ops Agent\run/buffers/default_pipeline_windows_event_log

[FILTER]
    Name  modify
    Match default_pipeline.windows_event_log
    Add   logName windows_event_log

[FILTER]
    Name                  rewrite_tag
    Match                 default_pipeline.windows_event_log
    Rule                  $logName .* $logName false
    Emitter_Storage.type  filesystem
    Emitter_Mem_Buf_Limit 10M

[FILTER]
    Name   modify
    Match  windows_event_log
    Remove logName

[OUTPUT]
    # https://docs.fluentbit.io/manual/pipeline/outputs/stackdriver
    Name     stackdriver
    resource gce_instance
    Match    windows_event_log

    # https://docs.fluentbit.io/manual/administration/scheduling-and-retries
    # After 3 retries, a given chunk will be discarded. So bad entries don't accidentally stay around forever.
    Retry_Limit  3

    # https://docs.fluentbit.io/manual/administration/security
    # Enable TLS support.
    tls         On
    # Do not force certificate validation.
    tls.verify  Off
qingling128 commented 3 years ago

I've tried the following:

I'm still seeing the errors consistently.

qingling128 commented 3 years ago

Turned out that the issue is consistent on Windows VMs that are optimized for containers. But on a regular Windows 2016 VM, logs from Application channels arrived successfully at least initially.

I tried to clean the Application channels on the regular Windows 2016 VM, and started to see this issue. I guess the Windows VMs that are optimized for containers may be constantly doing some file rotation kind of operations which have the same effect as clearing a channel.

fujimotos commented 3 years ago

@qingling128 I posted PR that addresses your issue to #3028.

Please reply to me if your issue still occurs with that patch applied.

fujimotos commented 3 years ago

Some Addendum: Actually it was my bug. When I originally implemented the pluign, I misread the maximum size of buffer allowed to be 0x7fff, while it's actually 0x7ffff.

https://docs.microsoft.com/en-us/windows/win32/api/winbase/nf-winbase-readeventlogw

This is why in_winlog was failing with error 122 (= ERROR_INSUFFICENT_BUFFER). It's just my silly mistake and #3028 should amend that slippage.

@jsubirat's original issue is a different one, which occurs when Clear-EventLog erases the event log concurrently while in_winlog is reading. In that case, we need to close the handle for the Event stream and reopen it.

I'll add a fix for that too next week...

jsubirat commented 3 years ago

That's awesome, @fujimotos ! In fact, we originally experienced the 122 error, and it was when trying to reproduce it that we found this other issue when concurrently clearing the event logs. Thanks a lot for fixing both! Can we expect another FluentBit version coming out next week containing both fixes? 🤞

fujimotos commented 3 years ago

@jsubirat FYI I submit #3033 as a bug fix for your issue. I expect that the fix makes into v1.7.x.

sophieyfang commented 3 years ago

I just built a fluent-bit exe off rc5 which includes above fix. It has trouble with eventlogs. I saw below error msgs:

2021/02/10 19:32:10] [debug] [output:stackdriver:stackdriver.0] HTTP Status=200
[2021/02/10 19:32:10] [debug] [out coro] cb_destroy coro_id=16
[2021/02/10 19:32:10] [debug] [task] destroy task=00636338 (task_id=0)
[2021/02/10 19:32:11] [debug] [storage] [cio file] synced at: tail.0/2716-1612985530.   0.flb
[2021/02/10 19:32:11] [debug] [task] created task=00636498 id=0 OK
[2021/02/10 19:32:11] [debug] [input:winlog:winlog.2] read 524112 bytes from 'System'
[2021/02/10 19:32:11] [debug] [input:winlog:winlog.2] save channel<System record=38504 time=1610326792>
[2021/02/10 19:32:11] [error] [input chunk] no available chunk
[2021/02/10 19:32:11] [debug] [input:winlog:winlog.2] read 521072 bytes from 'Application'
[2021/02/10 19:32:11] [debug] [input:winlog:winlog.2] save channel<Application record=549337 time=1612418569>
[2021/02/10 19:32:11] [error] [input chunk] no available chunk
[2021/02/10 19:32:11] [debug] [input:winlog:winlog.2] read 523852 bytes from 'Security'
[2021/02/10 19:32:11] [debug] [input:winlog:winlog.2] save channel<Security record=31008 time=1609798764>
[2021/02/10 19:32:11] [error] [input chunk] no available chunk
[2021/02/10 19:32:11] [debug] [output:stackdriver:stackdriver.0] [logging.googleapis.com/monitored_resource] not found in the payload
[2021/02/10 19:32:11] [debug] [http_client] not using http_proxy for header

fluent-bit.conf

[SERVICE]
    # https://docs.fluentbit.io/manual/administration/configuring-fluent-bit/configuration-file#config_section
    # Flush logs every 1 second, even if the buffer is not full to minimize log entry arrival delay.
    Flush      1
    # Waits 120 seconds after receiving a SIGTERM before it shuts down to minimize log loss.
    Grace      12
    # We use systemd to manage Fluent Bit instead.
    Daemon     off
    # Log_File is set by Fluent Bit systemd unit (e.g. /var/log/google-cloud-ops-agent/subagents/logging-module.log).
    Log_Level  debug
    # https://docs.fluentbit.io/manual/administration/monitoring
    # Enable a built-in HTTP server that can be used to query internal information and monitor metrics of each running plugin.
    HTTP_Server  On
    HTTP_Listen  0.0.0.0
    HTTP_PORT    2020

    # https://docs.fluentbit.io/manual/administration/buffering-and-storage#service-section-configuration
    # storage.path is set by Fluent Bit systemd unit (e.g. /var/lib/google-cloud-ops-agent/fluent-bit/buffers).
    storage.sync               normal
    # Enable the data integrity check when writing and reading data from the filesystem.
    storage.checksum           on
    # The maximum amount of data to load into the memory when processing old chunks from the backlog that is from previous Fluent Bit processes (e.g. Fluent Bit may have crashed or restarted).
    storage.backlog.mem_limit  50M
    # Enable storage metrics in the built-in HTTP server.
    storage.metrics            on
    # This is exclusive to filesystem storage type. It specifies the number of chunks (every chunk is a file) that can be up in memory.
    # Every chunk is a file, so having it up in memory means having an open file descriptor. In case there are thousands of chunks,
    # we don't want them to all be loaded into the memory.
    storage.max_chunks_up      128

[INPUT]
    # https://docs.fluentbit.io/manual/pipeline/inputs/tail#config
    Name               tail
    DB                 /var/lib/google-cloud-ops-agent/fluent-bit/buffers/ops-agent-fluent-bit
    Path               /var/log/google-cloud-ops-agent/subagents/logging-module.log
    Tag                ops-agent-fluent-bit
    # Set the chunk limit conservatively to avoid exceeding the recommended chunk size of 5MB per write request.
    Buffer_Chunk_Size  512k
    # Set the max size a bit larger to accommodate for long log lines.
    Buffer_Max_Size    5M
    # When a message is unstructured (no parser applied), append it under a key named "message".
    Key                message
    # Increase this to 30 seconds so log rotations are handled more gracefully.
    Rotate_Wait        30
    # Skip long lines instead of skipping the entire file when a long line exceeds buffer size.
    Skip_Long_Lines    On

    # https://docs.fluentbit.io/manual/administration/buffering-and-storage#input-section-configuration
    # Buffer in disk to improve reliability.
    storage.type       filesystem

    # https://docs.fluentbit.io/manual/administration/backpressure#mem_buf_limit
    # This controls how much data the input plugin can hold in memory once the data is ingested into the core.
    # This is used to deal with backpressure scenarios (e.g: cannot flush data for some reason).
    # When the input plugin hits "mem_buf_limit", because we have enabled filesystem storage type, mem_buf_limit acts
    # as a hint to set "how much data can be up in memory", once the limit is reached it continues writing to disk.
    Mem_Buf_Limit      10M

[INPUT]
    # https://docs.fluentbit.io/manual/pipeline/inputs/tail#config
    Name               tail
    DB                 /var/lib/google-cloud-ops-agent/fluent-bit/buffers/ops-agent-collectd
    Path               /var/log/google-cloud-ops-agent/subagents/metrics-module.log
    Tag                ops-agent-collectd
    # Set the chunk limit conservatively to avoid exceeding the recommended chunk size of 5MB per write request.
    Buffer_Chunk_Size  512k
    # Set the max size a bit larger to accommodate for long log lines.
    Buffer_Max_Size    5M
    # When a message is unstructured (no parser applied), append it under a key named "message".
    Key                message
    # Increase this to 30 seconds so log rotations are handled more gracefully.
    Rotate_Wait        30
    # Skip long lines instead of skipping the entire file when a long line exceeds buffer size.
    Skip_Long_Lines    On

    # https://docs.fluentbit.io/manual/administration/buffering-and-storage#input-section-configuration
    # Buffer in disk to improve reliability.
    storage.type       filesystem

    # https://docs.fluentbit.io/manual/administration/backpressure#mem_buf_limit
    # This controls how much data the input plugin can hold in memory once the data is ingested into the core.
    # This is used to deal with backpressure scenarios (e.g: cannot flush data for some reason).
    # When the input plugin hits "mem_buf_limit", because we have enabled filesystem storage type, mem_buf_limit acts
    # as a hint to set "how much data can be up in memory", once the limit is reached it continues writing to disk.
    Mem_Buf_Limit      10M

[INPUT]
    # https://docs.fluentbit.io/manual/pipeline/inputs/windows-event-log
    Tag            default_pipeline.windows_event_log
    Name           winlog
    Channels       System,Application,Security
    Interval_Sec   1
    DB             /var/lib/google-cloud-ops-agent/fluent-bit/buffers/default_pipeline_windows_event_log

[FILTER]
    Name  modify
    Match default_pipeline.windows_event_log
    Add   logName windows_event_log

[FILTER]
    Name                  rewrite_tag
    Match                 default_pipeline.windows_event_log
    Rule                  $logName .* $logName false
    Emitter_Storage.type  filesystem
    Emitter_Mem_Buf_Limit 10M

[FILTER]
    Name   modify
    Match  windows_event_log
    Remove logName

[OUTPUT]
    # https://docs.fluentbit.io/manual/pipeline/outputs/stackdriver
    Name     stackdriver
    resource gce_instance
    Match    ops-agent-fluent-bit

    # https://docs.fluentbit.io/manual/administration/scheduling-and-retries
    # After 3 retries, a given chunk will be discarded. So bad entries don't accidentally stay around forever.
    Retry_Limit  3

    # https://docs.fluentbit.io/manual/administration/security
    # Enable TLS support.
    tls         On
    # Do not force certificate validation.
    tls.verify  Off

[OUTPUT]
    # https://docs.fluentbit.io/manual/pipeline/outputs/stackdriver
    Name     stackdriver
    resource gce_instance
    Match    ops-agent-collectd

    # https://docs.fluentbit.io/manual/administration/scheduling-and-retries
    # After 3 retries, a given chunk will be discarded. So bad entries don't accidentally stay around forever.
    Retry_Limit  3

    # https://docs.fluentbit.io/manual/administration/security
    # Enable TLS support.
    tls         On
    # Do not force certificate validation.
    tls.verify  Off

[OUTPUT]
    # https://docs.fluentbit.io/manual/pipeline/outputs/stackdriver
    Name     stackdriver
    resource gce_instance
    Match    windows_event_log

    # https://docs.fluentbit.io/manual/administration/scheduling-and-retries
    # After 3 retries, a given chunk will be discarded. So bad entries don't accidentally stay around forever.
    Retry_Limit  3

    # https://docs.fluentbit.io/manual/administration/security
    # Enable TLS support.
    tls         On
    # Do not force certificate validation.
    tls.verify  Off
sophieyfang commented 3 years ago

It seems the issue is with the filter plugins both seen in Linux and Windows, i created a separate github issue here: https://github.com/fluent/fluent-bit/issues/3065

fujimotos commented 3 years ago

This ticket is fixed on master HEAD. Now let's close it.