canonical / cos-proxy-operator

https://charmhub.io/cos-proxy

Vector keeps connections open indefinitely when connections are dropped from an external firewall #112

Closed dnegreira closed 9 months ago

dnegreira commented 10 months ago

Bug Description

Hi,

We found an issue at a customer site where the number of connections/file descriptors open on the vector process was so high that it could no longer allocate sockets for any other connections on that server. This happened because the connections were being closed by some external factor - usually a firewall killing long-standing connections, or other network issues - and since vector never received a FIN/RST or any other packet on those sockets, it kept the connections open indefinitely.
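
A quick way to observe the symptom on the affected machine is sketched below (assumptions: the process is named vector, and ss/pidof are available on the container):

```sh
# Count the sockets currently held by the vector process.
ss -ntpm | grep vector | wc -l

# Count all file descriptors open by vector (assumes a single vector process).
ls /proc/"$(pidof vector)"/fd | wc -l
```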

To Reproduce

I have only tested/verified this with cos-proxy running on LXD, as it is similar to the customer environment.

  1. Do a regular deployment of some applications related to COS/cos-proxy, with cos-proxy running in an LXD container.
  2. On the container, verify which other units/applications are connected to vector by using ss -ntpm | grep vector
  3. DROP all the connections towards port 5066 for a couple of minutes: iptables -A INPUT -p tcp --dport 5066 -j DROP
  4. Wait 5 minutes so that the connections are closed from the application side
  5. After you have verified that the connections were closed on the application side, remove the iptables rule and let the applications connect again: iptables -D INPUT 1
  6. With ss -ntpm | grep vector you should now see the old connections plus the new connections, and the old ones never get dropped (the steps are condensed into the sketch below).
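
For convenience, the reproduction steps roughly condense into the following shell session (a sketch to be run as root on the cos-proxy container; the five-minute wait and the rule position in step 5 are taken from the list above):

```sh
# Note the connections currently established to the vector logstash source (port 5066).
ss -ntpm | grep vector

# Silently drop all traffic towards port 5066, so vector never sees a FIN/RST.
iptables -A INPUT -p tcp --dport 5066 -j DROP

# Give the client side a few minutes to time out and close its end of the connections.
sleep 300

# Remove the DROP rule (assumed to be rule 1 in the INPUT chain) so clients reconnect.
iptables -D INPUT 1

# The old sockets are still listed alongside the newly established ones.
ss -ntpm | grep vector
```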

I have fixed this by adding the [keepalive](https://vector.dev/docs/reference/configuration/sources/socket/#keepalive) option under the logstash source definition and setting a timeout there.
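
A rough illustration of what such a source definition could look like, in vector's TOML syntax (the source name, listen address, and the 60-second value are assumptions for illustration; the real values are whatever cos-proxy renders into its vector configuration):

```toml
[sources.logstash]
type = "logstash"
address = "0.0.0.0:5066"

# Start sending TCP keepalive probes after 60 seconds of idleness, so sockets
# whose peer has silently disappeared are eventually detected as dead and closed.
[sources.logstash.keepalive]
time_secs = 60
```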

Short fix incoming.

Environment

I only tested this with cos-proxy running in an LXD container; the rest of the deployment doesn't really matter, as long as there are some applications related to COS that send logs to the logstash source.

Relevant log output

N/A

Additional context

No response