No failure and buffering of data happening when syslog server (deployed through kubernetes) becomes unreachable

Version details:

td-agent-4.3.1
Syslog plugin gems: fluent-plugin-remote_syslog (1.0.0) remote_syslog_sender (1.2.2) syslog_protocol (0.9.2)

Background:

Fluentd and rsyslog are deployed through helm and are running as pods in kubernetes.
There is a k8s service 'rsyslog' created for the rsyslog pod.

Fluentd is configured to read a file and send to the rsyslog server. File buffer is used in the match section for on-disk buffering. Config looks like this

<source>
@type tail
path "/tmp/test.log"
pos_file /var/log/shivtest-retry-fluentd-container-log.pos
tag "test"
<parse>
@type json
</parse>
</source>
<match test>
@type remote_syslog
@log_level trace
host rsyslog.shiv-syslog.svc.cluster.local  
port 514
protocol tcp
timeout 2
<buffer>
 @type file
 path /tmp/buffer
 flush_interval 10s
 flush_mode interval
 flush_thread_count 1
 retry_forever true
 chunk_limit_size 2MB
 total_limit_size 100m
</buffer>
</match>

Issue: Fluentd successfully connects to the configured syslog endpoint and keeps pushing the records as per the flush interval. But when the k8s service of syslog server goes down (i.e. when the syslog pod gets deleted or goes down to 0/1), fluentd does not detect any failures in connection to syslog. It also keeps flushing all the chunks from the file buffer and does not retain anything in buffer - inspite of the destination being unreachable. Pls see that the syslog service is unreachable (as seen by other clients like logger, curl)

  $ logger "Hello world" -n rsyslog.shiv-syslog.svc.cluster.local -P 514  -T
    logger: failed to connect rsyslog.shiv-syslog.svc.cluster.local port 514
  $ curl telnet://rsyslog.shiv-syslog.svc.cluster.local:514
    curl: (7) Failed connect to rsyslog.shiv-syslog.svc.cluster.local:514; Connection refused

Fluentd logs: there are no errors in fluentd. New chunks keep getting created and keep getting cleared from the file buffer location. Trace level logging is enabled.

<struct Fluent::Plugin::Buffer::Metadata timekey=nil, tag=nil, variables=nil, seq=0>
2023-03-21 13:25:45 +0000 [trace]: #0 done to commit a chunk chunk="5f768fa89b48e9499eecee4380ecf53f"
2023-03-21 13:25:45 +0000 [trace]: #0 writing events into buffer instance=1980 metadata_size=1
2023-03-21 13:25:45 +0000 [debug]: #0 Created new chunk chunk_id="5f768fb22636dafcbf2df4c4593736a5" metadata=#<struct Fluent::Plugin::Buffer::Metadata timekey=nil, tag=nil, variables=nil, seq=0>
2023-03-21 13:25:45 +0000 [trace]: #0 chunk /tmp/buffer/buffer.b5f768fb22636dafcbf2df4c4593736a5.log size_added: 22 new_size: 22
2023-03-21 13:25:46 +0000 [trace]: #0 enqueueing all chunks in buffer instance=1980
2023-03-21 13:25:47 +0000 [trace]: #0 enqueueing all chunks in buffer instance=1980
2023-03-21 13:25:47 +0000 [trace]: #0 writing events into buffer instance=1980 metadata_size=1
2023-03-21 13:25:47 +0000 [trace]: #0 chunk /tmp/buffer/buffer.b5f768fb22636dafcbf2df4c4593736a5.log size_added: 22 new_size: 44
2023-03-21 13:25:47 +0000 [trace]: #0 enqueueing all chunks in buffer instance=1980
2023-03-21 13:25:48 +0000 [trace]: #0 enqueueing all chunks in buffer instance=1980
2023-03-21 13:25:49 +0000 [trace]: #0 writing events into buffer instance=1980 metadata_size=1
2023-03-21 13:25:49 +0000 [trace]: #0 chunk /tmp/buffer/buffer.b5f768fb22636dafcbf2df4c4593736a5.log size_added: 22 new_size: 66

So why are the chunks getting flushed from the buffer when the destination is unreachable?

Observation: - 1) An interesting observation is - When syslog server is not running as a k8s pod, and is running as a standalone service on linux (i.e. managed through systemctl), when the service is stopped (using systemctl stop rsyslog), we immediately see the following error in fluentd logs when it tries to flush the next chunk from its buffer to the syslog endpoint. As the flush fails due to connectivity, it retains the chunk in the file buffer and keeps retrying flush (as per the configuration).

2023-03-21 10:03:38 +0000 [warn]: #0 failed to flush the buffer. retry_times=0 next_retry_time=2023-03-21 10:03:39 +0000 chunk="5f766277ef08d8f6257ad093f0a07328" error_class=Errno::ECONNREFUSED error="Connection refused - connect(2) for \"rsyslog.shiv-syslog.svc.cluster.local\" port 514"
  2023-03-21 10:03:38 +0000 [warn]: #0 /opt/td-agent/lib/ruby/gems/2.7.0/gems/remote_syslog_sender-1.2.2/lib/remote_syslog_sender/tcp_sender.rb:56:in `initialize'
  2023-03-21 10:03:38 +0000 [warn]: #0 /opt/td-agent/lib/ruby/gems/2.7.0/gems/remote_syslog_sender-1.2.2/lib/remote_syslog_sender/tcp_sender.rb:56:in `new'
  2023-03-21 10:03:38 +0000 [warn]: #0 /opt/td-agent/lib/ruby/gems/2.7.0/gems/remote_syslog_sender-1.2.2/lib/remote_syslog_sender/tcp_sender.rb:56:in `block in connect'
  2023-03-21 10:03:38 +0000 [warn]: #0 /opt/td-agent/lib/ruby/gems/2.7.0/gems/remote_syslog_sender-1.2.2/lib/remote_syslog_sender/tcp_sender.rb:52:in `synchronize'
  2023-03-21 10:03:38 +0000 [warn]: #0 /opt/td-agent/lib/ruby/gems/2.7.0/gems/remote_syslog_sender-1.2.2/lib/remote_syslog_sender/tcp_sender.rb:52:in `connect'
  2023-03-21 10:03:38 +0000 [warn]: #0 /opt/td-agent/lib/ruby/gems/2.7.0/gems/remote_syslog_sender-1.2.2/lib/remote_syslog_sender/tcp_sender.rb:129:in `rescue in send_msg'
  2023-03-21 10:03:38 +0000 [warn]: #0 /opt/td-agent/lib/ruby/gems/2.7.0/gems/remote_syslog_sender-1.2.2/lib/remote_syslog_sender/tcp_sender.rb:108:in `send_msg'
  2023-03-21 10:03:38 +0000 [warn]: #0 /opt/td-agent/lib/ruby/gems/2.7.0/gems/remote_syslog_sender-1.2.2/lib/remote_syslog_sender/sender.rb:49:in `block in transmit'
  2023-03-21 10:03:38 +0000 [warn]: #0 /opt/td-agent/lib/ruby/gems/2.7.0/gems/remote_syslog_sender-1.2.2/lib/remote_syslog_sender/sender.rb:37:in `each'
  2023-03-21 10:03:38 +0000 [warn]: #0 /opt/td-agent/lib/ruby/gems/2.7.0/gems/remote_syslog_sender-1.2.2/lib/remote_syslog_sender/sender.rb:37:in `transmit'
  2023-03-21 10:03:38 +0000 [warn]: #0 /opt/td-agent/lib/ruby/gems/2.7.0/gems/fluent-plugin-remote_syslog-1.0.0/lib/fluent/plugin/out_remote_syslog.rb:105:in `block (2 levels) in write'
  2023-03-21 10:03:38 +0000 [warn]: #0 /opt/td-agent/lib/ruby/gems/2.7.0/gems/fluent-plugin-remote_syslog-1.0.0/lib/fluent/plugin/out_remote_syslog.rb:104:in `each_line'
  2023-03-21 10:03:38 +0000 [warn]: #0 /opt/td-agent/lib/ruby/gems/2.7.0/gems/fluent-plugin-remote_syslog-1.0.0/lib/fluent/plugin/out_remote_syslog.rb:104:in `block in write'
  2023-03-21 10:03:38 +0000 [warn]: #0 /opt/td-agent/lib/ruby/gems/2.7.0/gems/fluentd-1.14.6/lib/fluent/plugin/buffer/file_chunk.rb:171:in `open'
  2023-03-21 10:03:38 +0000 [warn]: #0 /opt/td-agent/lib/ruby/gems/2.7.0/gems/fluent-plugin-remote_syslog-1.0.0/lib/fluent/plugin/out_remote_syslog.rb:103:in `write'
  2023-03-21 10:03:38 +0000 [warn]: #0 /opt/td-agent/lib/ruby/gems/2.7.0/gems/fluentd-1.14.6/lib/fluent/plugin/output.rb:1179:in `try_flush'
  2023-03-21 10:03:38 +0000 [warn]: #0 /opt/td-agent/lib/ruby/gems/2.7.0/gems/fluentd-1.14.6/lib/fluent/plugin/output.rb:1500:in `flush_thread_run'
  2023-03-21 10:03:38 +0000 [warn]: #0 /opt/td-agent/lib/ruby/gems/2.7.0/gems/fluentd-1.14.6/lib/fluent/plugin/output.rb:499:in `block (2 levels) in start'
  2023-03-21 10:03:38 +0000 [warn]: #0 /opt/td-agent/lib/ruby/gems/2.7.0/gems/fluentd-1.14.6/lib/fluent/plugin_helper/thread.rb:78:in `block in thread_create'
2023-03-21 10:03:38 +0000 [trace]: #0 writing events into buffer instance=1980 metadata_size=1

Later when the rsyslog service is restarted and up again, it connects and successfully pushes all buffered chunks to the destination without any loss of data. For ex. 2023-03-21 10:23:24 +0000 [warn]: #0 retry succeeded. chunk_id="5f7664fef1ec6c112220332f2732de46" 2) For syslog running over kubernetes, we see that fluentd detects if there is a connectivity issue with syslog initially when fluentd starts running and there retry/buffering works fine as expected. But once fluentd successfully establishes a connection with syslog, and then if syslog destination goes unreachable, fluentd fails to detect connection errors.

Concern: Why is the loss of connectivity not identified in case of rsyslog server running as a kubernetes service? Pls let us know if there're any additional configs in syslog plugin that could help us achieve retry/buffering properly in this case too.

Thanks for reporting! I will check it.

Pls let us know if there're any additional configs in syslog plugin that could help us achieve retry/buffering properly in this case too.

I don't know how effective it is, but what about keep_alive, keep_alive_idle, keep_alive_cnt, keep_alive_intvl?

https://github.com/fluent-plugins-nursery/fluent-plugin-remote_syslog#configuration

Hi @daipom Yes, I have already tried adding the following parameters, but that didn't help.

  timeout 10
  timeout_exception true
  keep_alive true
  keep_alive_cnt 1
  keep_alive_intvl 5
  keep_alive_idle 5

Hi @daipom Latest update: Observation - On further check, we found that after around ~15minutes (of syslog being down), the connectivity error is detected in fluentd logs and it then tries to re-create the connection with syslog endpoint. It then starts retaining the chunks in buffer as the destination is not reachable. Once connection is successful, it retries sending the data present in its buffer at that time. But yes, with this, the concern is the data generated during this 15min window is lost as it was not buffered by the plugin.

Analysis - We believe this 15min timeout is related to tcp communication issue. It could be a result of the default value of tcp_retries2 (on linux) which is 15. _TCP retransmits an unacknowledged packet up to tcpretries2, after that, it gives up and notifies the application that the connection is lost. For ref - https://tech.instacart.com/the-vanishing-thread-and-postgresql-tcp-connection-parameters-93afc0e1208c

To solve this, we tried setting the TCP socket option TCP_USER_TIMEOUT at the time of socket creation with syslog so that the timeouts happen sooner. The change is made (to set the socker option) in the dependent ruby gem i.e. in tcp_sender.rb file in the remote_syslog_sender gem after the socket creation. https://github.com/reproio/remote_syslog_sender/blob/master/lib/remote_syslog_sender/tcp_sender.rb#L57

@tcp_socket.setsockopt(Socket::IPPROTO_TCP , Socket::TCP_USER_TIMEOUT, 6000);

With this change, we observed that in case of syslog going down, after 6 seconds, the connection loss was detected by fluentd and buffering/retry worked as expected.

Does this look like the right approach to solve this issue? Can we expect a fix in the plugin on these lines?

Thanks for finding it out!

I also find an interesting behavior of TCPSocket.

After stopping a TCP server, a client can send the data once without error. Of course, this data does NOT reach the server. (I can reproduce this when I launch a remote TCP server.)

Server: I used Fluentd as a TCP server with the following config

<source>
  @type tcp
  tag test
  <parse>
    @type none
  </parse>
  <transport tcp>
    linger_timeout 5
  </transport>
</source>

<match test.**>
  @type stdout
</match>

Client: irb

require "socket"
socket = TCPSocket.new("IP", 5170)
# The server is alive.
socket.write("fuga\n")
=> 5
# After stopping the server.
socket.write("fuga\n")
=> 5 # Surprisingly, the first one succeeds!!
socket.write("fuga\n")
in `write': Broken pipe (Errno::EPIPE)
    from (irb):43:in `<main>'
    from /home/daipom/.rbenv/versions/3.2.0/lib/ruby/gems/3.2.0/gems/irb-1.6.2/exe/irb:11:in `<top (required)>'
    from /home/daipom/.rbenv/versions/3.2.0/bin/irb:25:in `load'
    from /home/daipom/.rbenv/versions/3.2.0/bin/irb:25:in `<main>'

If I remove the setting linger_timeout as follows, an error occurs in the first sending after stopping the server.

<source>
  @type tcp
  tag test
  <parse>
    @type none
  </parse>
</source>

require "socket"
socket = TCPSocket.new("IP", 5170)
# The server is alive.
socket.write("fuga\n")
=> 5
# After stopping the server.
socket.write("fuga\n")
in `write': Connection reset by peer (Errno::ECONNRESET)
        from (irb):31:in `<main>'           
        from /home/daipom/.rbenv/versions/3.2.0/lib/ruby/gems/3.2.0/gems/irb-1.6.2/exe/irb:11:in `<top (required)>'
        from /home/daipom/.rbenv/versions/3.2.0/bin/irb:25:in `load'
        from /home/daipom/.rbenv/versions/3.2.0/bin/irb:25:in `<main>'

This difference is related to SO_LINGER option. If this value is larger than zero, FIN is sent on closing. If this value is zero, RST is sent on closing. (However, Windows always send FIN regardless of this option) If FIN is sent, the connection remains for a while, and the other side can still send the data.

I'm thinking this issue may have something to do with this specification of TCP.

The problem that sending data to a closed TCP socket doesn't fail seems to be a fairly well-known problem:

https://stackoverflow.com/questions/11436013/writing-to-a-closed-local-tcp-socket-not-failing

This is what I found so far.

Thanks for the information about TCP_USER_TIMEOUT!

To solve this, we tried setting the TCP socket option TCP_USER_TIMEOUT at the time of socket creation with syslog so that the timeouts happen sooner. With this change, we observed that in case of syslog going down, after 6 seconds, the connection loss was detected by fluentd and buffering/retry worked as expected. Does this look like the right approach to solve this issue? Can we expect a fix in the plugin on these lines?

Certainly, this could be a valid solution. But I think the ideal solution would be that writing to a closed socket should be a firm error. Please let me look into this a little more.

I now understand the mechanics of this problem.

There are 2 points we need to consider.

When a TCP server sends FIN on closing, we should consider the possibility of the server stopping before the client sends back FIN.
- This is the problem I was talking about in this comment: https://github.com/fluent-plugins-nursery/fluent-plugin-remote_syslog/issues/56#issuecomment-1481480691
- In this case, the client does not know that the server is down until after it has sent the data once, and that data will be lost.
- To prevent this, when the client receives FIN from the server, the client should close the socket and send back FIN to the server without sending any more data.
When a TCP server goes down without sending FIN, We need to make sure the client detects it and tries a new connection.
- If your k8s service of syslog server doesn't send FIN to the clients when it goes down, we need this.
- This is almost resolved with TCP_USER_TIMEOUT as @aggarwalShivani says. The client can detect the server down by it not returning ACK for a certain period of time i.e. TCP_USER_TIMEOUT.
- But this method is not perfect. The sending that could not confirm ACK doesn't become an error, so at least one data will be lost.

@aggarwalShivani Could you please confirm the following two points?

When your k8s service of syslog server goes down, does the server send FIN?
- Maybe the following filters on Wireshark will be helpful in finding out: tcp.srcport == 514 || tcp.dstport == 514.
Even if you use TCP_USER_TIMEOUT, is at least one record lost?

Hi @daipom Apologies for the delayed reply

Response when syslog server goes down a) 0 replicas - When all syslog pods are deleted i.e. there are no pods(endpoints) associated with the k8s service of syslog server, and if TCP communication is attempted to the syslog service, I see the following response in tcpdump -> The request gets routed to kube-dns service to resolve the dns of syslog svc. Later, it responds with port unreachable error. It does not respond with FIN.

08:35:53.461686 IP k8s-4-control-worker-edge-1 > k8s-4-control-worker-edge-1: ICMP rsyslog.shiv.svc.cluster.local tcp port facsys-ntp unreachable, length 60

b) syslog process is not running - However, when the syslog pod is running, but inside the pod, syslog process is killed, it responds with connection reset RST when there's an attempt to contact the syslog service.

19:42:43.993059 IP k8s-4-control-worker-edge-1.9136 > 192-168-165-1.rsyslog.shiv.svc.cluster.local.shell: Flags [S], seq 2061445071, win 26880, options [mss 8960,nop,nop,sackOK,nop,wscale 7], length 0
19:42:43.993076 IP 192-168-165-1.rsyslog.shiv.svc.cluster.local.shell > k8s-4-control-worker-edge-1.9136: Flags [R.], seq 0, ack 2061445072, win 0, length 0

So perhaps that is general k8s behaviour, if there is no endpoint for a service, kube-dns will be unable to resolve it and it would respond with port unreachable.

TCP_USER_TIMEOUT Yes, in some cases I observed that records from the first flush made to syslog server during its downtime, were sometimes lost. I do not have the setup currently, I can recheck. But I believe, if the flush_interval in fluentd buffer was lesser than TCP_USER_TIMEOUT, then the first set of chunks (records) were lost. But the flush_interval is longer and the flush happens only after TCP_USER_TIMEOUT time, then all those chunks get retained in the buffer and there was no loss.

Sorry for my delay, and thanks so much for your reply. I have not been able to take the time for this. I would like to examine this issue some more.

Hi @daipom,

Have you able to conclude on the alternative approach to fix this issue or is it okay to go ahead with the approach of "TCP_USER_TIMEOUT" ?

fluent-plugins-nursery / fluent-plugin-remote_syslog

No failure and buffering of data happening when syslog server (deployed through kubernetes) becomes unreachable #56