apache / pulsar

Apache Pulsar - distributed pub-sub messaging system
https://pulsar.apache.org/
Apache License 2.0

Topic lookup timeout and can't recover after broker crash #14183

Open zackliu opened 2 years ago

zackliu commented 2 years ago

Describe the bug I'm running Pulsar in Kubernetes, deployed with the Helm chart. After one broker pod crashed and came back, the whole cluster stopped working. I used pulsar-perf to publish messages and got the log shown below. I can confirm 10.0.129.19 is the Pulsar proxy IP exposed by the Kubernetes service; it's reachable from the client, and the broker's log indicates a client connection was established.

2022-02-07T09:40:03,110+0000 [pulsar-client-io-2-3] WARN org.apache.pulsar.client.impl.PulsarClientImpl - [chenyltopic2] Failed to get partitioned topic metadata: org.apache.pulsar.client.api.PulsarClientException$TimeoutException: Lookup request timeout {'durationMs': '30000', 'reqId':'1671309513597037281', 'remote':'10.0.129.19/10.0.129.19:6650', 'local':'/10.240.2.68:52414'}
2022-02-07T09:40:03,111+0000 [pulsar-client-io-2-3] WARN org.apache.pulsar.client.impl.ClientCnx - [id: 0xdb0f2346, L:/10.240.2.68:52414 - R:10.0.129.19/10.0.129.19:6650] Lookup request timeout {'durationMs': '30000', 'reqId':'1671309513597037281', 'remote':'10.0.129.19/10.0.129.19:6650', 'local':'/10.240.2.68:52414'}
2022-02-07T09:40:03,111+0000 [pulsar-perf-producer-exec-1-1] ERROR org.apache.pulsar.testclient.PerformanceProducer - Got error
java.util.concurrent.ExecutionException: org.apache.pulsar.client.api.PulsarClientException$TimeoutException: Lookup request timeout {'durationMs': '30000', 'reqId':'1671309513597037281', 'remote':'10.0.129.19/10.0.129.19:6650', 'local':'/10.240.2.68:52414'}
    at java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:357) ~[?:1.8.0_312]
    at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1908) ~[?:1.8.0_312]
    at org.apache.pulsar.testclient.PerformanceProducer.runProducer(PerformanceProducer.java:595) ~[org.apache.pulsar-pulsar-testclient-2.9.1.jar:2.9.1]
    at org.apache.pulsar.testclient.PerformanceProducer.lambda$main$1(PerformanceProducer.java:425) ~[org.apache.pulsar-pulsar-testclient-2.9.1.jar:2.9.1]
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [?:1.8.0_312]
    at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_312]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_312]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_312]
    at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) [io.netty-netty-common-4.1.72.Final.jar:4.1.72.Final]
    at java.lang.Thread.run(Thread.java:748) [?:1.8.0_312]
Caused by: org.apache.pulsar.client.api.PulsarClientException$TimeoutException: Lookup request timeout {'durationMs': '30000', 'reqId':'1671309513597037281', 'remote':'10.0.129.19/10.0.129.19:6650', 'local':'/10.240.2.68:52414'}
    at org.apache.pulsar.client.impl.ClientCnx.checkRequestTimeout(ClientCnx.java:1204) ~[org.apache.pulsar-pulsar-client-original-2.9.1.jar:2.9.1]
    at org.apache.pulsar.common.util.Runnables$CatchingAndLoggingRunnable.run(Runnables.java:53) ~[org.apache.pulsar-pulsar-common-2.9.1.jar:2.9.1]
    at io.netty.util.concurrent.PromiseTask.runTask(PromiseTask.java:98) ~[io.netty-netty-common-4.1.72.Final.jar:4.1.72.Final]
    at io.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:176) ~[io.netty-netty-common-4.1.72.Final.jar:4.1.72.Final]
    at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164) ~[io.netty-netty-common-4.1.72.Final.jar:4.1.72.Final]
    at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:469) ~[io.netty-netty-common-4.1.72.Final.jar:4.1.72.Final]
    at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:384) ~[io.netty-netty-transport-classes-epoll-4.1.72.Final.jar:4.1.72.Final]
    at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:986) ~[io.netty-netty-common-4.1.72.Final.jar:4.1.72.Final]
    at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) ~[io.netty-netty-common-4.1.72.Final.jar:4.1.72.Final]
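
For reference, the timeout above is raised by the partitioned-topic metadata lookup the client performs before creating a producer. A minimal Java sketch that exercises the same path (service URL, timeout, and topic name are copied from the log and purely illustrative):

```java
import java.util.concurrent.TimeUnit;

import org.apache.pulsar.client.api.Producer;
import org.apache.pulsar.client.api.PulsarClient;
import org.apache.pulsar.client.api.Schema;

public class LookupCheck {
    public static void main(String[] args) throws Exception {
        // Proxy address, timeout, and topic are taken from the log above; adjust for your cluster.
        try (PulsarClient client = PulsarClient.builder()
                .serviceUrl("pulsar://10.0.129.19:6650")
                .operationTimeout(30, TimeUnit.SECONDS) // matches the 30000 ms timeout seen in the log
                .build()) {
            // Creating the producer issues the partitioned-topic metadata lookup that
            // fails with PulsarClientException$TimeoutException in the report.
            try (Producer<String> producer = client.newProducer(Schema.STRING)
                    .topic("chenyltopic2")
                    .create()) {
                producer.send("lookup-ok");
                System.out.println("Lookup and publish succeeded");
            }
        }
    }
}
```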

To Reproduce The issue happened many times.

  1. Use pulsar-perf to publish some messages; at first it worked well.
  2. One broker then crashed with an OOM because I was sending messages too quickly (a rough sketch of that publish load follows this list).
  3. I saw many errors, so I terminated the perf process and started another one a few seconds later. At that point the issue appeared and I couldn't publish any messages because the topic lookup timed out. Trying a different topic name (one never used before) didn't help either.
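
To make step 2 concrete, here is a rough Java approximation of that publish load (topic name, payload size, and message count are illustrative; pulsar-perf does essentially the same thing at a configurable rate):

```java
import java.util.concurrent.TimeUnit;

import org.apache.pulsar.client.api.Producer;
import org.apache.pulsar.client.api.PulsarClient;
import org.apache.pulsar.client.api.Schema;

public class FloodProducer {
    public static void main(String[] args) throws Exception {
        PulsarClient client = PulsarClient.builder()
                .serviceUrl("pulsar://10.0.129.19:6650") // proxy address from the report
                .operationTimeout(30, TimeUnit.SECONDS)
                .build();

        Producer<byte[]> producer = client.newProducer(Schema.BYTES)
                .topic("persistent://public/default/chenyltopic2") // illustrative topic
                .blockIfQueueFull(true)
                .create();

        byte[] payload = new byte[1024];
        // Fire-and-forget async sends to push the broker hard, similar to running
        // pulsar-perf at a high rate; this is the kind of load that preceded the OOM.
        for (int i = 0; i < 10_000_000; i++) {
            producer.sendAsync(payload);
        }
        producer.flush();
        producer.close();
        client.close();
    }
}
```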

Expected behavior Everything recovers after a broker crash.


lhotari commented 2 years ago

One broker then crashed with an OOM because I was sending messages too quickly.

Did the broker process restart after the OOM? The Pulsar Helm chart should terminate and restart it if -XX:+ExitOnOutOfMemoryError is in the broker's JVM args; it's part of the default PULSAR_GC options in the Apache Pulsar Helm chart, https://github.com/apache/pulsar-helm-chart/blob/9613ee029290a23e512d5f247bef69faa6bf796a/charts/pulsar/values.yaml#L751 . Are you using the default JVM args that include -XX:+ExitOnOutOfMemoryError?
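
One way to confirm which flags a JVM actually started with is to inspect its runtime input arguments (or run jcmd <broker-pid> VM.command_line against the broker process). A minimal sketch, assuming you can run a small class with the same JVM options the broker container uses:

```java
import java.lang.management.ManagementFactory;
import java.util.List;

public class CheckOomFlag {
    public static void main(String[] args) {
        // Lists the input arguments of the current JVM; launch this with the same
        // PULSAR_MEM/PULSAR_GC options the broker uses to see what it ends up with.
        List<String> jvmArgs = ManagementFactory.getRuntimeMXBean().getInputArguments();
        boolean exitOnOom = jvmArgs.stream()
                .anyMatch(arg -> arg.contains("ExitOnOutOfMemoryError"));
        System.out.println("JVM args: " + jvmArgs);
        System.out.println("-XX:+ExitOnOutOfMemoryError present: " + exitOnOom);
    }
}
```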

Pulsar: 2.9.1

Can you reproduce on Pulsar 2.8.2 ?

github-actions[bot] commented 2 years ago

The issue had no activity for 30 days, mark with Stale label.

