dmachard / go-dnscollector

Ingesting, pipelining, and enhancing your DNS logs with usage indicators, security analysis, and additional metadata.
MIT License
205 stars 45 forks source link

Endless wait loop between threads resulting in go-dnscollector (0.45+) to hang #729

Closed romainw closed 4 months ago

romainw commented 4 months ago

When upgrading from 0.41 to 0.45/0.46b, go-dnscollector is now hanging when processing files. The flow works well for tiny fstrm files, but as soon as the file reaches a certain size apparently, this creates a condition between the different threads causing an endless wait loop.

If we take the first 50 000 lines of the file below (dnstap-2024-06-03-19:56:54.fstrm), the file is successfully processed and then deleted. If we take the first 100 000 lines of the very same file (for a total of 30MB), go-dnscollector will hang whilst processing the file, and never complete this work or subsequently delete it:

/usr/bin/go-dnscollector -config /etc/dnscollector/config.yml  
INFO: 2024/06/04 16:39:40.307068 worker - [dnstap-server1] fileingestor - file ready to process /var/pdnssoc_input/server1/dnstap-2024-06-03-19:56:54.fstrm
INFO: 2024/06/04 16:39:40.307282 worker - [dnstap-server1] fileingestor - processing dnstap file [dnstap-2024-06-03-19:56:54.fstrm]

Running strace at this point will show this loop (note the "Connection timed out"):

strace: Process 59329 attached with 9 threads
[pid 59337] futex(0xc000099948, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 59336] futex(0xc000306148, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 59335] futex(0x2686ed8, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 59334] futex(0x26870e0, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 59333] futex(0xc000098148, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 59331] futex(0xc000072948, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 59330] restart_syscall(<... resuming interrupted read ...> <unfinished ...>
[pid 59332] futex(0xc000073148, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 59329] epoll_pwait(4, [], 128, 0, NULL, 0) = 0
[pid 59329] epoll_pwait(4, 

 <unfinished ...>
[pid 59330] <... restart_syscall resumed>) = -1 ETIMEDOUT (Connection timed out)
[pid 59330] nanosleep({tv_sec=0, tv_nsec=10000000},  <unfinished ...>
[pid 59329] <... epoll_pwait resumed>[], 128, 5768, NULL, 0) = 0
[pid 59329] futex(0xc000099948, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid 59337] <... futex resumed>)        = 0
[pid 59329] <... futex resumed>)        = 1
[pid 59337] futex(0xc000098148, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid 59329] epoll_pwait(4, [], 128, 0, NULL, 0) = 0
[pid 59337] <... futex resumed>)        = 1
[pid 59333] <... futex resumed>)        = 0
[pid 59329] epoll_pwait(4, [], 128, 0, NULL, 0) = 0
[pid 59337] epoll_pwait(4,  <unfinished ...>
[pid 59329] epoll_pwait(4,  <unfinished ...>
[pid 59337] <... epoll_pwait resumed>[], 128, 0, NULL, 0) = 0
[pid 59337] futex(0xc000099948, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 59333] futex(0xc000098148, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 59330] <... nanosleep resumed>NULL) = 0
[pid 59330] futex(0x20fd420, FUTEX_WAIT_PRIVATE, 0, {tv_sec=9, tv_nsec=994486541}) = -1 ETIMEDOUT (Connection timed out)
[pid 59330] futex(0xc000098148, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid 59333] <... futex resumed>)        = 0
[pid 59330] <... futex resumed>)        = 1
[pid 59333] sched_yield( <unfinished ...>
[pid 59330] nanosleep({tv_sec=0, tv_nsec=10000000},  <unfinished ...>
[pid 59333] <... sched_yield resumed>)  = 0
[pid 59333] futex(0x20fd338, FUTEX_WAKE_PRIVATE, 1) = 0
[pid 59333] epoll_pwait(4, [], 128, 0, NULL, 0) = 0
[pid 59333] futex(0xc000099948, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid 59337] <... futex resumed>)        = 0
[pid 59333] <... futex resumed>)        = 1
[pid 59337] futex(0xc000072948, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid 59333] nanosleep({tv_sec=0, tv_nsec=3000},  <unfinished ...>
[pid 59337] <... futex resumed>)        = 1
[pid 59331] <... futex resumed>)        = 0
[pid 59331] nanosleep({tv_sec=0, tv_nsec=3000}, NULL) = 0
[pid 59331] futex(0xc000072948, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 59333] <... nanosleep resumed>NULL) = 0
[pid 59333] futex(0xc000098148, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 59329] <... epoll_pwait resumed>[], 128, 9999, NULL, 0) = 0
[pid 59329] epoll_pwait(4, [], 128, 0, NULL, 0) = 0
[pid 59329] epoll_pwait(4,  <unfinished ...>
[pid 59330] <... nanosleep resumed>NULL) = 0
[pid 59330] nanosleep({tv_sec=0, tv_nsec=10000000}, NULL) = 0
[pid 59330] getpid()                    = 59329
[pid 59330] tgkill(59329, 59337, SIGURG) = 0
[pid 59337] --- SIGURG {si_signo=SIGURG, si_code=SI_TKILL, si_pid=59329, si_uid=0} ---
[pid 59330] nanosleep({tv_sec=0, tv_nsec=10000000},  <unfinished ...>
[pid 59337] rt_sigreturn({mask=[]})     = 824991665720
[pid 59330] <... nanosleep resumed>NULL) = 0
[pid 59330] getpid()                    = 59329
[pid 59330] tgkill(59329, 59337, SIGURG) = 0
[pid 59330] nanosleep({tv_sec=0, tv_nsec=10000000},  <unfinished ...>
[pid 59337] --- SIGURG {si_signo=SIGURG, si_code=SI_TKILL, si_pid=59329, si_uid=0} ---
[pid 59337] rt_sigreturn({mask=[]})     = 825108996752
[pid 59330] <... nanosleep resumed>NULL) = 0
[pid 59330] getpid()                    = 59329
[pid 59330] tgkill(59329, 59337, SIGURG <unfinished ...>
[pid 59337] --- SIGURG {si_signo=SIGURG, si_code=SI_TKILL, si_pid=59329, si_uid=0} ---
[pid 59330] <... tgkill resumed>)       = 0
[pid 59337] rt_sigreturn({mask=[]} <unfinished ...>
[pid 59330] nanosleep({tv_sec=0, tv_nsec=10000000},  <unfinished ...>
[pid 59337] <... rt_sigreturn resumed>) = 825078279520
[pid 59330] <... nanosleep resumed>NULL) = 0
[pid 59330] getpid()                    = 59329
[pid 59330] tgkill(59329, 59337, SIGURG <unfinished ...>
[pid 59337] --- SIGURG {si_signo=SIGURG, si_code=SI_TKILL, si_pid=59329, si_uid=0} ---
[pid 59330] <... tgkill resumed>)       = 0
[pid 59337] rt_sigreturn({mask=[]} <unfinished ...>
[pid 59330] nanosleep({tv_sec=0, tv_nsec=10000000},  <unfinished ...>
[pid 59337] <... rt_sigreturn resumed>) = 825046974064
[pid 59330] <... nanosleep resumed>NULL) = 0
[pid 59330] getpid()                    = 59329
[pid 59330] tgkill(59329, 59337, SIGURG) = 0
[pid 59337] --- SIGURG {si_signo=SIGURG, si_code=SI_TKILL, si_pid=59329, si_uid=0} ---
[pid 59330] nanosleep({tv_sec=0, tv_nsec=10000000},  <unfinished ...>
[pid 59337] rt_sigreturn({mask=[]})     = 824867546816
[pid 59330] <... nanosleep resumed>NULL) = 0
[pid 59330] getpid()                    = 59329
[pid 59330] tgkill(59329, 59337, SIGURG) = 0
[pid 59337] --- SIGURG {si_signo=SIGURG, si_code=SI_TKILL, si_pid=59329, si_uid=0} ---
[pid 59330] nanosleep({tv_sec=0, tv_nsec=10000000},  <unfinished ...>
[pid 59337] rt_sigreturn({mask=[]})     = 824757653032
[pid 59330] <... nanosleep resumed>NULL) = 0
[pid 59330] getpid()                    = 59329
[pid 59330] tgkill(59329, 59337, SIGURG <unfinished ...>
[pid 59337] --- SIGURG {si_signo=SIGURG, si_code=SI_TKILL, si_pid=59329, si_uid=0} ---
[pid 59330] <... tgkill resumed>)       = 0
[pid 59337] rt_sigreturn({mask=[]} <unfinished ...>
[pid 59330] nanosleep({tv_sec=0, tv_nsec=10000000},  <unfinished ...>
[pid 59337] <... rt_sigreturn resumed>) = 824645210520
[pid 59337] futex(0xc000098148, FUTEX_WAKE_PRIVATE, 1) = 1
[pid 59333] <... futex resumed>)        = 0
[pid 59333] futex(0xc000072948, FUTEX_WAKE_PRIVATE, 1) = 1
[pid 59331] <... futex resumed>)        = 0
[pid 59337] futex(0x20fd438, FUTEX_WAIT_PRIVATE, 0, {tv_sec=0, tv_nsec=100000} <unfinished ...>
[pid 59331] futex(0xc000072948, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 59337] <... futex resumed>)        = -1 ETIMEDOUT (Connection timed out)
[pid 59337] getpid( <unfinished ...>
[pid 59333] futex(0x20fd438, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid 59337] <... getpid resumed>)       = 59329
[pid 59333] <... futex resumed>)        = 0
[pid 59333] futex(0xc000072948, FUTEX_WAKE_PRIVATE, 1) = 1
[pid 59331] <... futex resumed>)        = 0
[pid 59337] tgkill(59329, 59333, SIGURG) = 0
[pid 59333] --- SIGURG {si_signo=SIGURG, si_code=SI_TKILL, si_pid=59329, si_uid=0} ---
[pid 59331] futex(0xc000072948, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 59337] futex(0xc000099948, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 59333] rt_sigreturn({mask=[]})     = 824815153384
[pid 59330] <... nanosleep resumed>NULL) = 0
[pid 59330] nanosleep({tv_sec=0, tv_nsec=10000000}, NULL) = 0
[pid 59330] getpid()                    = 59329
[pid 59330] tgkill(59329, 59333, SIGURG <unfinished ...>
[pid 59333] --- SIGURG {si_signo=SIGURG, si_code=SI_TKILL, si_pid=59329, si_uid=0} ---
[pid 59330] <... tgkill resumed>)       = 0
[pid 59330] nanosleep({tv_sec=0, tv_nsec=10000000},  <unfinished ...>
[pid 59333] rt_sigreturn({mask=[]})     = 824908148104
[pid 59333] futex(0xc000099948, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid 59337] <... futex resumed>)        = 0
[pid 59333] <... futex resumed>)        = 1
[pid 59337] futex(0xc000072948, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid 59333] futex(0xc000098148, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 59337] <... futex resumed>)        = 1
[pid 59331] <... futex resumed>)        = 0
[pid 59331] futex(0xc000072948, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 59337] epoll_pwait(4, [], 128, 0, NULL, 0) = 0
[pid 59337] futex(0xc000072948, FUTEX_WAKE_PRIVATE, 1) = 1
[pid 59331] <... futex resumed>)        = 0
[pid 59331] sched_yield( <unfinished ...>
[pid 59337] futex(0x20fd438, FUTEX_WAIT_PRIVATE, 0, {tv_sec=0, tv_nsec=100000} <unfinished ...>
[pid 59331] <... sched_yield resumed>)  = 0
[pid 59331] futex(0x20fd438, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid 59337] <... futex resumed>)        = 0
[pid 59331] <... futex resumed>)        = 1
[pid 59337] sched_yield()               = 0
[pid 59331] futex(0xc000098148, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid 59337] futex(0x20fd338, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid 59331] <... futex resumed>)        = 1
[pid 59337] <... futex resumed>)        = 0
[pid 59333] <... futex resumed>)        = 0
[pid 59331] close(9 <unfinished ...>
[pid 59333] futex(0xc000098148, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 59331] <... close resumed>)        = 0
[pid 59331] futex(0xc000072948, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 59337] futex(0xc000099948, FUTEX_WAIT_PRI
<snip>

Trying to exit the go-dnscollector at this point with CTRL+C is also hanging, requiring a full kill -9 of the process:

^CWARNING: 2024/06/04 16:50:38.928759 main - exiting...
INFO: 2024/06/04 16:50:38.928798 worker - [dnstap-server1] fileingestor - stopping monitor...
INFO: 2024/06/04 16:50:38.928811 worker - [dnstap-server1] fileingestor - monitor terminated
INFO: 2024/06/04 16:50:38.928822 worker - [dnstap-server1] fileingestor - stopping collect...
INFO: 2024/06/04 16:50:38.928831 worker - [dnstap-server1] fileingestor - stop to listen...
INFO: 2024/06/04 16:50:38.928970 worker - [dnstap-server1] dns processor - stopping monitor...
INFO: 2024/06/04 16:50:38.928986 worker - [dnstap-server1] dns processor - monitor terminated
INFO: 2024/06/04 16:50:38.928995 worker - [dnstap-server1] dns processor - stopping collect...
INFO: 2024/06/04 16:50:38.929004 worker - [dnstap-server1] dns processor - collection terminated
INFO: 2024/06/04 16:50:38.929017 worker - [dnstap-server1] (conn #0) dnstap processor - stopping monitor...
INFO: 2024/06/04 16:50:38.929025 worker - [dnstap-server1] (conn #0) dnstap processor - monitor terminated
INFO: 2024/06/04 16:50:38.929038 worker - [dnstap-server1] (conn #0) dnstap processor - stopping collect...

Just replacing the go-dnscollector 0.45/0.46b binary with version 0.41 "solves" this issue and the fstrm file is processed instantly and then successfully deleted.

dmachard commented 4 months ago

Thank you for reporting this issue.

Could you please provide more details, such as:

romainw commented 4 months ago

Many thanks for the quick reply.

################################################
# Collector configuration
################################################
global:
  #pid-file: "/var/dnscollector/collector.pid"
  trace:
    verbose: true
  server-identity: "pDNSSOC_C"
  text-format: "timestamp-rfc3339ns qr identity operation rcode queryip queryport protocol qname qtype name"
  # default text field delimiter
  text-format-delimiter: " "
  # default text field boundary
  text-format-boundary: "\""

pipelines:
 - name: dnstap-server1
   file-ingestor:
        watch-dir: /var/pdnssoc_input/server1
        watch-mode: dnstap
        delete-after: true
   transforms:
     filtering:
        log-queries: true
        log-replies: true
        drop-queryip-file: /root/dns_servers.txt
   routing-policy:
          forward: [ filelogdomains, filelogips, fileall ]      
 - name: filelogdomains
   logfile:
    file-path: /var/dnscollector/matches/matches_domains.json
    mode: json
   transforms:
     filtering:
      keep-fqdn-file: '/var/dnscollector/misp_domains.txt'

 - name: filelogips
   logfile:
        file-path: /var/dnscollector/matches/matches_ips.json
        mode: json
   transforms:
     filtering:
      keep-rdata-file: '/var/dnscollector/misp_ips.txt'

 - name: fileall
   logfile:
        file-path: /var/dnscollector/queries/queries.json
        mode: json
        flush-interval: 1
        max-size: 100
        max-files: 10
        chan-buffer-size: 65535
          #postrotate-command: "/var/dnscollector/postrotate_query.sh"
        postrotate-delete-success: true

(this is the configuration file upgraded for 0.46b)

The issue appeared with 0.45.0. In other words, 0.44.0 is the last working version for us.

Thanks again!

dmachard commented 4 months ago

Thank for sharing config, A refactoring of the transformers was done in v0.45.0, I suspect something wrong on it. Could you test without the filtering transform ?

Can you share your frstm file ? it will be more easy to reproduce in my side.

dmachard commented 4 months ago

I found the regression, the dnstap packet process is never started... with the ingestor. I will add a test for this use case to avoid this regression

romainw commented 4 months ago

wow, that was quick! Good work @dmachard!

I cannot easily share the dnstap file, which is coming from our production systems. However, I can/will try any test or beta release to confirm the issue is fixed.

Thanks again!

dmachard commented 4 months ago

fix pushed in master branch and release v0.46.0-beta2

romainw commented 4 months ago

This is just to confirm that the issue has been fixed successfully starting 0.46.0-beta2. Thank you!