apache / pulsar

Apache Pulsar - distributed pub-sub messaging system
https://pulsar.apache.org/
Apache License 2.0

Too many TCP connections in CLOSE_WAIT status freeze broker v2.7.4 #14826

Open GBM-tamerm opened 2 years ago

GBM-tamerm commented 2 years ago

Describe the bug: We saw strange behavior: too many TCP connections in CLOSE_WAIT status on a Pulsar broker, causing disconnection exceptions and "Connection already closed" exceptions in Pulsar clients, as the broker stopped accepting connections and clients started failing.

The broker Java process itself is up and running but appears frozen: curl requests to its HTTP ports (e.g. the broker metrics endpoint) stop returning anything.

It only recovers when we restart the broker.

So it seems like a connection pool issue or leak, since the connections are kept alive. In particular, we can see log entries such as the one below:

[pulsar-client-io-1-1] INFO org.apache.pulsar.client.impl.ConnectionPool - [[id: 0x15a65d4f, L:/10.244.63.22:36068 - xxxxx:6651]] Connected to server Killed

The majority of the CLOSE_WAIT connections are between the broker and the bookies.

Also, I can see many occurrences of the exception below:

Failed to initialize managed ledger: org.apache.bookkeeper.mledger.ManagedLedgerException$MetadataNotFoundException: Managed ledger not found
java.util.concurrent.CompletionException: java.util.NoSuchElementException: No value present
        at java.util.concurrent.CompletableFuture.uniExceptionally(CompletableFuture.java:884) ~[?:1.8.0_322]
        at java.util.concurrent.CompletableFuture$UniExceptionally.tryFire(CompletableFuture.java:866) ~[?:1.8.0_322]
        at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990) ~[?:1.8.0_322]
Caused by: java.util.NoSuchElementException: No value present

To Reproduce: The issue started happening when Kubernetes clients connected to our cluster (running on VMs) and began producing messages at an average rate of 50k msg/s.

Expected behavior: The broker should not freeze, and it should close connections with clients when it receives a FIN/ACK.

Additional context: The Pulsar cluster is not a Kubernetes deployment; clients access the broker DNS directly, without a Pulsar proxy.

lhotari commented 2 years ago

Adding a link to the dev mailing list thread: https://lists.apache.org/thread/tsg8q6xc75605jrs66rvj2f3dhndo5k4

lhotari commented 2 years ago

@GBM-tamerm Please provide additional information about the environment, such as the OS and how the services are run.

Setting the open files limit depends on the OS and on how the service is run. Each TCP/IP connection consumes a file handle, which is why it's necessary to tune the open files limit.
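
As a quick sanity check, you can inspect the limit that is actually in effect for the running broker process. A minimal sketch, assuming the broker was started via Pulsar's standard launcher (the org.apache.pulsar.PulsarBrokerStarter pgrep pattern is an assumption):

# Find the broker PID via its (assumed) main class and print its effective open-files limit
BROKER_PID=$(pgrep -f org.apache.pulsar.PulsarBrokerStarter)
grep 'open files' /proc/$BROKER_PID/limits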

Example of setting the open files limit for a systemd service: https://github.com/openmessaging/benchmark/blob/89dce6d61c4444fa993ce36098e50ed5e124cb4a/driver-pulsar/deploy/ssd/templates/pulsar.service#L11 — it uses LimitNOFILE=300000 in the [Service] section. This is applied to all components (broker, bookkeeper, zookeeper).
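
With systemd, a drop-in file is a convenient way to raise the limit without editing the unit file itself. A minimal sketch, assuming the broker runs as a unit named pulsar-broker.service (the unit name is an assumption; repeat for the bookkeeper and zookeeper units):

# Create a drop-in raising the open-files limit for the (assumed) pulsar-broker.service unit
sudo mkdir -p /etc/systemd/system/pulsar-broker.service.d
sudo tee /etc/systemd/system/pulsar-broker.service.d/limits.conf <<'EOF'
[Service]
LimitNOFILE=300000
EOF
# Reload unit definitions and restart so the new limit takes effect
sudo systemctl daemon-reload
sudo systemctl restart pulsar-broker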

lhotari commented 2 years ago

After adjusting the open files limits for the services (and making sure that the settings are effective), I'd also recommend adjusting the TCP/IP keepalive settings. By default, a TCP/IP connection might linger in a half-closed state for up to 2 hours.

Again, modifying Linux kernel sysctl settings depends on the OS. For Ubuntu / systemd, settings can be placed in the /etc/sysctl.d directory. systemd has a service called systemd-sysctl which applies the settings when the service is restarted (systemctl restart systemd-sysctl).

One example of tuning the TCP/IP keepalive settings, reducing the timeout from 2 hours to 20 minutes:

# Tune TCP/IP keepalive settings
# http://tldp.org/HOWTO/TCP-Keepalive-HOWTO/usingkeepalive.html
net.ipv4.tcp_keepalive_time = 1200
net.ipv4.tcp_keepalive_intvl = 60
net.ipv4.tcp_keepalive_probes = 20

Script to configure this:

sudo tee /etc/sysctl.d/99-keepalive.conf <<'EOF'
# Tune TCP/IP keepalive settings
# http://tldp.org/HOWTO/TCP-Keepalive-HOWTO/usingkeepalive.html
net.ipv4.tcp_keepalive_time = 1200
net.ipv4.tcp_keepalive_intvl = 60
net.ipv4.tcp_keepalive_probes = 20
EOF
sudo systemctl restart systemd-sysctl
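
To verify the settings took effect, you can read the values back (a read-only check):

sysctl net.ipv4.tcp_keepalive_time net.ipv4.tcp_keepalive_intvl net.ipv4.tcp_keepalive_probes
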
lhotari commented 2 years ago

You can find out if there are a lot of connections in a half-closed state with netstat -tapn | grep CLOSE_WAIT or ss | grep CLOSE-WAIT.
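
To see where the half-closed connections are concentrated, you can also group them by remote peer. A minimal sketch, assuming the iproute2 version of ss (when filtering by state, the peer address is the 4th output column):

# Count CLOSE_WAIT sockets per remote peer address
ss -tn state close-wait | tail -n +2 | awk '{print $4}' | sort | uniq -c | sort -rn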

lhotari commented 2 years ago

One detail is that Pulsar doesn't use TCP/IP keepalive at the moment. I have created a PR to start a discussion about whether it would be useful; that is #14841. Pulsar does have an application-level keepalive solution in the Pulsar binary protocol.
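
For reference, the interval of that application-level keepalive is configurable on the broker. A minimal sketch of the broker.conf entry (30 seconds is the usual default; treat the exact value as an assumption for your version):

# broker.conf: how often to check whether connections are still alive
keepAliveIntervalSeconds=30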

GBM-tamerm commented 2 years ago

Thanks Lari, I added the additional details you asked for. We are not setting LimitNOFILE in our systemd service and leave it at the default:

systemctl show -p DefaultLimitNOFILE
DefaultLimitNOFILE=4096

I will try to run the netstat command when the issue happens again. Our production cluster is running an old version, 2.3.2, and does not seem to have the issue we see in 2.7.

Also, some clients report seeing the error below. Is that something we can tune in broker.conf?

org.apache.pulsar.client.api.PulsarClientException$LookupException: Reached max number of redirections

GBM-tamerm commented 2 years ago

Hi Lari ,

I applied your suggestions and they did not resolve the issue; the broker still freezes after a while. I think the most likely cause is a deadlock or a race between the broker and bookie client connections. I updated the issue description to reflect more details: too many TCP connections in CLOSE_WAIT status in a Pulsar broker cause disconnection exceptions and "Connection already closed" exceptions in Pulsar clients.

Since the majority of the CLOSE_WAIT connections are between the broker and the bookies, it indicates a deadlock or race is happening.

Attached is a jstack output that shows the deadlock: jstack_12.txt

https://jstack.review?https://gist.github.com/GBM-tamerm/a29b793db94702ea58da449927938cad
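
For anyone capturing the same evidence, a thread dump of the broker can be taken as follows. A minimal sketch: jstack ships with the JDK, and the pgrep pattern for the broker main class is an assumption:

# Dump all broker threads (including lock/deadlock info) to a timestamped file
jstack -l $(pgrep -f org.apache.pulsar.PulsarBrokerStarter) > jstack_$(date +%s).txt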

github-actions[bot] commented 2 years ago

The issue had no activity for 30 days, mark with Stale label.
