fluent / fluent-plugin-windows-eventlog

Fluentd plugin to collect windows event logs
Apache License 2.0
33 stars 19 forks source link

Fluentd workers SIGSEGV for specific Windows events channels (security and powershell events) #106

Open iivvss opened 2 months ago

iivvss commented 2 months ago

Hi team,

I am experiencing regular crashes from some ruby workers collecting windows events using windows event forwarding and custom event channels.

I am seeing worker 2 (WEC_Security-OS-Security-Low-3 - Specific security event IDs) and 4 (WEC_AppAndServices-OS-PowerShell - Powershell events) and worker crashing in a loop:

2024-06-05 15:09:47 +0000 [error]: Worker 4 exited unexpectedly with signal SIGSEGV 2024-06-05 15:09:56 +0000 [error]: Worker 4 exited unexpectedly with signal SIGSEGV 2024-06-05 15:10:01 +0000 [error]: Worker 2 exited unexpectedly with signal SIGSEGV 2024-06-05 15:10:09 +0000 [error]: Worker 4 exited unexpectedly with signal SIGSEGV 2024-06-05 15:10:18 +0000 [error]: Worker 4 exited unexpectedly with signal SIGSEGV 2024-06-05 15:10:20 +0000 [error]: Worker 2 exited unexpectedly with signal SIGSEGV 2024-06-05 15:10:28 +0000 [error]: Worker 4 exited unexpectedly with signal SIGSEGV 2024-06-05 15:10:30 +0000 [error]: Worker 2 exited unexpectedly with signal SIGSEGV 2024-06-05 15:10:40 +0000 [error]: Worker 2 exited unexpectedly with signal SIGSEGV 2024-06-05 15:10:43 +0000 [error]: Worker 4 exited unexpectedly with signal SIGSEGV 2024-06-05 15:10:57 +0000 [error]: Worker 4 exited unexpectedly with signal SIGSEGV 2024-06-05 15:11:01 +0000 [error]: Worker 2 exited unexpectedly with signal SIGSEGV

I suspect the windows_eventlog2 plugin to reach a limit somewhere but it's not correlated to the number of events, as neither Security low 3 nor Powershell have the biggest number of events accross all other channels.

Here is the fluentd configuration I am working with:


<system>
    log_level info
    workers 7
    # restart_worker_interval 1s
    <log>
        rotate_size 1048576
    </log>
</system>

<worker 0>
    <source>
        # @log_level trace
        @type windows_eventlog2
        @id windows_eventlog2 #id must be unique
        channels WEC_AppAndServices-OS-AppLocker,WEC_AppAndServices-OS-DNSServer,WEC_AppAndServices-OS-RemoteAccess,WEC_AppAndServices-OS-ADFS,WEC_AppAndServices-OS-Application,WEC_Security-OS-Security-Medium
        read_existing_events false
        tag windows_event_logs
        rate_limit 1000
        read_interval 1s
        render_as_xml true
        <storage>
            persistent false #in memory storage
        </storage>
        <parse>
            @type none
        </parse>
    </source>

    <filter windows_event_logs>
        @type record_transformer
        enable_ruby
        <record>
            message ${record["message"].gsub("\r\n", "\\r\\n").gsub("\t", "\\t")}
        </record>
    </filter>

    <match windows_event_logs>
        @type forward
        send_timeout 60s
        recover_wait 10s
        hard_timeout 60s
        keepalive true
        keepalive_timeout 180
        #compress gzip
        <server>
            name device_name # Set  Name
            host x.x.x.x  of the  provisioned for the client
            port 10000
            weight 60
        </server>
        <buffer>
            @type memory
            flush_mode immediate #no flush_interval if mode is immediate
            chunk_limit_size 512m
            flush_thread_count 16
            total_limit_size 4GB #max total RAM allocation
        </buffer>
    </match>
</worker>

<worker 1>
    <source>
        # @log_level trace
        @type windows_eventlog2
        @id windows_eventlog21 #id must be unique
        channels WEC_Security-OS-Security-Low-2
        read_existing_events false
        tag windows_event_logs
        rate_limit 100
        read_interval 1s
        render_as_xml true
        <storage>
            persistent false
        </storage>
        <parse>
            @type none
        </parse>
    </source>

    <filter windows_event_logs>
        @type record_transformer
        enable_ruby
        <record>
            message ${record["message"].gsub("\r\n", "\\r\\n").gsub("\t", "\\t")}
        </record>
    </filter>

    <match windows_event_logs>
        @type forward
        send_timeout 60s
        recover_wait 10s
        hard_timeout 60s
        keepalive true
        keepalive_timeout 180
        #compress gzip
        <server>
            name device_name # Set  Name
            host x.x.x.x  of the  provisioned for the client
            port 10000
            weight 60
        </server>
        <buffer>
            @type memory
            flush_mode immediate #no flush_interval if mode is immediate
            chunk_limit_size 512m
            flush_thread_count 16
            total_limit_size 4GB #max total RAM allocation
        </buffer>
    </match>
</worker>

<worker 2>
    <source>
        # @log_level trace
        @type windows_eventlog2
        @id windows_eventlog22 #id must be unique
        channels WEC_Security-OS-Security-Low-3
        read_existing_events false
        tag windows_event_logs
        rate_limit 50
        read_interval 1s
        render_as_xml true
        <storage>
            persistent false
        </storage>
        <parse>
            @type none
        </parse>
    </source>

    <filter windows_event_logs>
        @type record_transformer
        enable_ruby
        <record>
            message ${record["message"].gsub("\r\n", "\\r\\n").gsub("\t", "\\t")}
        </record>
    </filter>

    <match windows_event_logs>
        @type forward
        send_timeout 60s
        recover_wait 10s
        hard_timeout 60s
        keepalive true
        keepalive_timeout 180
        #compress gzip
        <server>
            name device_name # Set  Name
            host x.x.x.x  of the  provisioned for the client
            port 10000
            weight 60
        </server>
        <buffer>
            @type memory
            flush_mode immediate #no flush_interval if mode is immediate
            chunk_limit_size 512m
            flush_thread_count 16
            total_limit_size 4GB #max total RAM allocation
        </buffer>
    </match>
</worker>

<worker 3>
    <source>
        @type windows_eventlog2
        @id windows_eventlog23 #id must be unique
        channels WEC_Security-OS-Security-High,WEC_System-OS-System,WEC_AppAndServices-OS-MicrosoftDefender
        read_existing_events false
        tag windows_event_logs
        rate_limit 1000
        read_interval 1s
        render_as_xml true
        <storage>
            persistent false
        </storage>
        <parse>
            @type none
        </parse>
    </source>

    <filter windows_event_logs>
        @type record_transformer
        enable_ruby
        <record>
            message ${record["message"].gsub("\r\n", "\\r\\n").gsub("\t", "\\t")}
        </record>
    </filter>

    <match windows_event_logs>
        @type forward
        send_timeout 60s
        recover_wait 10s
        hard_timeout 60s
        keepalive true
        keepalive_timeout 180
        #compress gzip
        <server>
            name device_name # Set  Name
            host x.x.x.x  of the  provisioned for the client
            port 10000
            weight 60
        </server>
        <buffer>
            @type memory
            flush_mode immediate #no flush_interval if mode is immediate
            chunk_limit_size 512m
            flush_thread_count 16
            total_limit_size 4GB #max total RAM allocation
        </buffer>
    </match>
</worker>

<worker 4>
    <source>
        @type windows_eventlog2
        @id windows_eventlog24 #id must be unique
        channels WEC_AppAndServices-OS-PowerShell
        read_existing_events false
        tag windows_event_logs
        rate_limit 50
        read_interval 1s
        render_as_xml true
        <storage>
            persistent false
        </storage>
        <parse>
            @type none
        </parse>
    </source>

    <filter windows_event_logs>
        @type record_transformer
        enable_ruby
        <record>
            message ${record["message"].gsub("\r\n", "\\r\\n").gsub("\t", "\\t")}
        </record>
    </filter>

    <match windows_event_logs>
        @type forward
        send_timeout 60s
        recover_wait 10s
        hard_timeout 60s
        keepalive true
        keepalive_timeout 180
        #compress gzip
        <server>
            name device_name # Set  Name
            host x.x.x.x  of the  provisioned for the client
            port 10000
            weight 60
        </server>
        <buffer>
            @type memory
            flush_mode immediate #no flush_interval if mode is immediate
            chunk_limit_size 512m
            flush_thread_count 16
            total_limit_size 4GB #max total RAM allocation
        </buffer>
    </match>
</worker>

<worker 5>
    <source>
        @type windows_eventlog2
        @id windows_eventlog25 #id must be unique
        channels WEC_Security-OS-Security-Low-1
        read_existing_events false
        tag windows_event_logs
        rate_limit 50
        read_interval 1s
        render_as_xml true
        <storage>
            persistent false
        </storage>
        <parse>
            @type none
        </parse>
    </source>

    <filter windows_event_logs>
        @type record_transformer
        enable_ruby
        <record>
            message ${record["message"].gsub("\r\n", "\\r\\n").gsub("\t", "\\t")}
        </record>
    </filter>

    <match windows_event_logs>
        @type forward
        send_timeout 60s
        recover_wait 10s
        hard_timeout 60s
        keepalive true
        keepalive_timeout 180
        #compress gzip
        <server>
            name device_name
            host x.x.x.x 
            port 10000
            weight 60
        </server>
        <buffer>
            @type memory
            flush_mode immediate #no flush_interval if mode is immediate
            chunk_limit_size 512m
            flush_thread_count 16
            total_limit_size 4GB #max total RAM allocation
        </buffer>
    </match>
</worker>

Could you help ?

Cheers,

Ivs

daipom commented 2 months ago

Hmm, I wonder what could be the cause... I know several cases that work without these problems, so maybe there is some unique cause. Could you please tell me the following info?

iivvss commented 2 months ago

Hi daipom,

Thanks for the quick answer. Sure, here is the information:

daipom commented 2 months ago

Thanks! I don't see anything wrong with the environment...

I am experiencing regular crashes

So, there's some reproducibility. We need to narrow down the cause.

Is it caused by a specific event log? (Like, it contains characters that are not supposed...)

Is there a particular setting we can change to avoid this issue? (Such as render_as_xml)

iivvss commented 2 months ago

So I have set therender_as_xml to false for workers 2 and 4 but I see no improvements (I had to remove the record transformer section also) :

<worker 2>
    <source>
        # @log_level trace
        @type windows_eventlog2
        @id windows_eventlog22 #id must be unique
        channels WEC_Security-OS-Security-Low-3
        read_existing_events false
        tag windows_event_logs
        rate_limit 50
        read_interval 1s
        render_as_xml false
        <storage>
            persistent false
        </storage>
        <parse>
            @type none
        </parse>
    </source>

    # <filter windows_event_logs>
        # @type record_transformer
        # enable_ruby
        # <record>
            # message ${record["message"].gsub("\r\n", "\\r\\n").gsub("\t", "\\t")}
        # </record>
    # </filter>

    <match windows_event_logs>
        @type forward
        send_timeout 60s
        recover_wait 10s
        hard_timeout 60s
        keepalive true
        keepalive_timeout 180
        #compress gzip
        <server>
            name <hostname># Set FPE Name
            host x.x.x.x 
            port 10000
            weight 60
        </server>
        <buffer>
            @type memory
            flush_mode immediate #no flush_interval if mode is immediate
            chunk_limit_size 512m
            flush_thread_count 16
            total_limit_size 4GB #max total RAM allocation
        </buffer>
    </match>
</worker>

I still see the workers crashing:

2024-06-10 14:23:36 +0000 [info]: spawn command to main:  cmdline=["E:/opt/fluent/bin/ruby.exe", "-Eascii-8bit:ascii-8bit", "E:/opt/fluent/bin/fluentd", "-c", "E:\\opt\\fluent\\etc\\fluent\\fluentd.conf", "-o", "E:\\opt\\fluent\\fluentd.log", "-x", "fluentdwinsvc", "--under-supervisor"]
2024-06-10 14:24:03 +0000 [error]: Worker 2 exited unexpectedly with signal SIGSEGV
2024-06-10 14:24:03 +0000 [error]: Worker 4 exited unexpectedly with signal SIGSEGV
2024-06-10 14:24:15 +0000 [error]: Worker 4 exited unexpectedly with signal SIGSEGV
2024-06-10 14:24:22 +0000 [error]: Worker 2 exited unexpectedly with signal SIGSEGV
2024-06-10 14:24:24 +0000 [error]: Worker 4 exited unexpectedly with signal SIGSEGV
2024-06-10 14:24:49 +0000 [error]: Worker 2 exited unexpectedly with signal SIGSEGV
2024-06-10 14:24:51 +0000 [error]: Worker 4 exited unexpectedly with signal SIGSEGV
2024-06-10 14:25:02 +0000 [error]: Worker 2 exited unexpectedly with signal SIGSEGV
2024-06-10 14:25:02 +0000 [error]: Worker 4 exited unexpectedly with signal SIGSEGV
2024-06-10 14:25:13 +0000 [error]: Worker 4 exited unexpectedly with signal SIGSEGV
2024-06-10 14:25:22 +0000 [error]: Worker 2 exited unexpectedly with signal SIGSEGV
2024-06-10 14:25:23 +0000 [error]: Worker 4 exited unexpectedly with signal SIGSEGV
2024-06-10 14:25:32 +0000 [error]: Worker 4 exited unexpectedly with signal SIGSEGV
2024-06-10 14:25:44 +0000 [error]: Worker 4 exited unexpectedly with signal SIGSEGV
2024-06-10 14:25:57 +0000 [error]: Worker 4 exited unexpectedly with signal SIGSEGV
2024-06-10 14:26:07 +0000 [error]: Worker 2 exited unexpectedly with signal SIGSEGV
2024-06-10 14:26:09 +0000 [error]: Worker 4 exited unexpectedly with signal SIGSEGV
2024-06-10 14:26:21 +0000 [error]: Worker 4 exited unexpectedly with signal SIGSEGV
2024-06-10 14:26:35 +0000 [error]: Worker 4 exited unexpectedly with signal SIGSEGV
2024-06-10 14:26:48 +0000 [error]: Worker 4 exited unexpectedly with signal SIGSEGV
2024-06-10 14:26:58 +0000 [error]: Worker 4 exited unexpectedly with signal SIGSEGV
2024-06-10 14:27:00 +0000 [error]: Worker 2 exited unexpectedly with signal SIGSEGV

Is it caused by a specific event log?

I couldn't pinpoint any specific windows event. Worker 2 and 4 are configured to read .evtx files containing specific subset of security and powershell events.

The WEC_Security-OS-Security-Low-3 evtx has a majority of 4627 "Group Membership" events. The WEC_AppAndServices-OS-PowerShell evtx file has majority of event id 400 and 600 events.

I see no particular event linked to the frequency of the worker crashes. I can try suppressing events from the subscriptions one by one but perhaps you have other ideas before I go this way :) (this will be tedious).