fluent-plugins-nursery / fluent-plugin-remote_syslog

Fluentd plugin for output to remote syslog service (e.g. Papertrail)
https://github.com/dlackty/fluent-plugin-remote_syslog
MIT License
68 stars, 53 forks

Timeout issue #58

Open mareban opened 1 year ago

mareban commented 1 year ago

Hello,

We are using fluent-plugin-remote_syslog to forward session events from a remote access tool. Events are routed to a dedicated syslog server according to an IP field (our IP plan). In other words, we have several sites that we reach through a remote access solution, and we want to forward session details to the syslog server of the site being accessed.

If one of the sites is down, fluentd seems to block and tries to connect to that site indefinitely, and nothing else is forwarded, even though a timeout parameter is set.

Is this a bug :-( ? If it is, do you plan to fix it?

If a timeout occurs for a site, and the timeout works, what happens to the events for the unreachable site? Are they lost, or are they kept in the buffer and resent when that site's syslog server is up again?

Thanks for your help.

daipom commented 1 year ago

Hi, thanks for your report.

> We are using fluent-plugin-remote_syslog to forward session events from a remote access tool. Events are routed to a dedicated syslog server according to an IP field (our IP plan). In other words, we have several sites that we reach through a remote access solution, and we want to forward session details to the syslog server of the site being accessed.

Could you share your plugin settings? Do you use the placeholder feature for host to send to different servers depending on the log contents?

> If one of the sites is down, fluentd seems to block and tries to connect to that site indefinitely, and nothing else is forwarded, even though a timeout parameter is set.

Do you mean that the timeout parameter doesn't work as expected? Which parameter do you use?

> Is this a bug :-( ? If it is, do you plan to fix it?

If it becomes clear that this is a bug in the plugin, I want to fix it. On the other hand, the cause may not be a bug in this plugin but a consequence of how TCP or other specifications behave. We need to clarify where the problem lies.

To do that, I want to reduce each problem to a simple case that can be reproduced in general.

> If a timeout occurs for a site, and the timeout works, what happens to the events for the unreachable site? Are they lost, or are they kept in the buffer and resent when that site's syslog server is up again?

This is a difficult problem. We need to consider this in terms of both TCP (Do you use TCP?) and the plugin's specifications.

In terms of the plugin specification, if a send fails, the plugin will try to resend according to the buffer retry settings.
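To give a rough idea of what "retry according to the buffer retry settings" can look like, here is a small Python sketch of an exponential backoff schedule. The parameter names are modeled on Fluentd's buffer options (`retry_wait`, `retry_exponential_backoff_base`, `retry_max_interval`), but the function itself and the default values used here are assumptions for illustration. It also ignores the randomization Fluentd can apply, so please check the Fluentd buffer documentation for the actual behavior.

```python
def retry_waits(retry_wait=1.0, backoff_base=2.0, max_interval=None, max_times=5):
    """Return the wait (in seconds) before each retry attempt.

    Hypothetical helper: the n-th retry waits retry_wait * backoff_base**n,
    optionally capped at max_interval. Randomization is deliberately left out.
    """
    waits = []
    for n in range(max_times):
        wait = retry_wait * (backoff_base ** n)
        if max_interval is not None:
            # Cap the wait so that long outages do not produce huge gaps.
            wait = min(wait, max_interval)
        waits.append(wait)
    return waits


if __name__ == "__main__":
    print(retry_waits())
    print(retry_waits(max_interval=5.0))
```

Under this model, a failed chunk stays in the buffer and is retried with growing intervals until the retry limit is reached, rather than being dropped on the first failure.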

However, in terms of TCP, there are some known problems. To close a TCP connection cleanly, both sides need to send a FIN to each other, but the server side often stops unilaterally before the client sends its FIN. (The client side should also close the socket and send back a FIN as soon as it receives the server's FIN, but I don't think this is often implemented, and it is not implemented in this plugin.)

In such a situation, the write can appear to succeed from the program's point of view (https://github.com/fluent-plugins-nursery/fluent-plugin-remote_syslog/blob/15470e27700938fa4d30d5fd86a3c839356ad5d1/lib/fluent/plugin/out_remote_syslog.rb#L107), while in fact the data was never delivered.

A similar problem is reported in https://stackoverflow.com/questions/11436013/writing-to-a-closed-local-tcp-socket-not-failing. This problem seems to be not limited to a specific programming language.
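The behavior can be reproduced without the plugin. Below is a minimal Python sketch (not the plugin's Ruby code) using only the standard `socket` module on the loopback interface. The function name `demo_write_after_peer_close` and the message payloads are made up for this illustration, and the exact error raised on the later write may differ between operating systems.

```python
import socket
import time


def demo_write_after_peer_close():
    """Write to a TCP socket whose peer has already closed the connection.

    Returns (first_ok, later_failed):
      first_ok     - the first write after the peer closed raised no error
      later_failed - a subsequent write eventually raised OSError
    """
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    server.listen(1)
    port = server.getsockname()[1]

    client = socket.create_connection(("127.0.0.1", port), timeout=2)
    conn, _ = server.accept()
    conn.close()      # the server side closes first and sends a FIN
    time.sleep(0.1)   # give the FIN time to reach the client

    first_ok = False
    later_failed = False
    try:
        try:
            # The kernel accepts the bytes, so no error is reported here,
            # even though nobody will ever read them.
            client.sendall(b"<14>first message\n")
            first_ok = True
        except OSError:
            first_ok = False

        # The peer answers the first write with a RST. Only writes made
        # after that RST has arrived fail (broken pipe / connection reset).
        for _ in range(20):
            time.sleep(0.05)
            try:
                client.sendall(b"<14>later message\n")
            except OSError:
                later_failed = True
                break
    finally:
        client.close()
        server.close()
    return first_ok, later_failed


if __name__ == "__main__":
    print(demo_write_after_peer_close())
```

The point of the sketch is that the first write after the peer has gone away is reported as a success, so a sender that relies only on the write result cannot tell that the message was lost.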

It is also discussed in #56.

mareban commented 1 year ago

Hi,

Thanks for your reply :-)

Yes, we are using the placeholder feature for host, and we redirect the last 24 hours of events to a specific site based on an IP plan.

We are using the timeout parameter. Do we need to use others as well, such as TCP keep-alive? We use the TCP protocol because the message length can be greater than 1024 bytes.

Sometimes a syslog server can be down, sometimes the server on the site can be decommissioned, sometimes it can be a filtering issue on the firewall, or the connection can be lost for whatever reason.

If there is a communication issue, are the events kept in the buffer, or are they removed (after the timeout and retries)? Does the plugin then move on to the next event for the unreachable site, or to another site if the next event is to be forwarded there?

Thanks for your help.