apache / pulsar

Apache Pulsar - distributed pub-sub messaging system
https://pulsar.apache.org/
Apache License 2.0
14.1k stars 3.56k forks source link

PulsarClientException$BrokerMetadataException error reoccurs in release 2.8.1 #12455

Open Raven888888 opened 2 years ago

Raven888888 commented 2 years ago

Describe the bug Exactly same as #11774. Unable to start broker. Also similar to #10967.

Error log

17:01:09.826 [main] ERROR org.apache.pulsar.PulsarBrokerStarter - Failed to start pulsar service.
org.apache.pulsar.broker.PulsarServerException: java.lang.RuntimeException: java.lang.RuntimeException: Failed to initialize Metadata Manager
        at org.apache.pulsar.broker.PulsarService.start(PulsarService.java:787) ~[org.apache.pulsar-pulsar-broker-2.8.1.jar:2.8.1]
        at org.apache.pulsar.PulsarBrokerStarter$BrokerStarter.start(PulsarBrokerStarter.java:259) ~[org.apache.pulsar-pulsar-broker-2.8.1.jar:2.8.1]
        at org.apache.pulsar.PulsarBrokerStarter.main(PulsarBrokerStarter.java:331) [org.apache.pulsar-pulsar-broker-2.8.1.jar:2.8.1]
Caused by: java.lang.RuntimeException: java.lang.RuntimeException: Failed to initialize Metadata Manager
        at org.apache.pulsar.functions.worker.PulsarWorkerService.start(PulsarWorkerService.java:571) ~[org.apache.pulsar-pulsar-functions-worker-2.8.1.jar:2.8.1]
        at org.apache.pulsar.broker.PulsarService.startWorkerService(PulsarService.java:1569) ~[org.apache.pulsar-pulsar-broker-2.8.1.jar:2.8.1]
        at org.apache.pulsar.broker.PulsarService.start(PulsarService.java:759) ~[org.apache.pulsar-pulsar-broker-2.8.1.jar:2.8.1]
        ... 2 more
Caused by: java.lang.RuntimeException: Failed to initialize Metadata Manager
        at org.apache.pulsar.functions.worker.FunctionMetaDataManager.initialize(FunctionMetaDataManager.java:115) ~[org.apache.pulsar-pulsar-functions-worker-2.8.1.jar:2.8.1]
        at org.apache.pulsar.functions.worker.PulsarWorkerService.start(PulsarWorkerService.java:496) ~[org.apache.pulsar-pulsar-functions-worker-2.8.1.jar:2.8.1]
        at org.apache.pulsar.broker.PulsarService.startWorkerService(PulsarService.java:1569) ~[org.apache.pulsar-pulsar-broker-2.8.1.jar:2.8.1]
        at org.apache.pulsar.broker.PulsarService.start(PulsarService.java:759) ~[org.apache.pulsar-pulsar-broker-2.8.1.jar:2.8.1]
        ... 2 more
Caused by: org.apache.pulsar.client.api.PulsarClientException$BrokerMetadataException: The subscription c-pulsar-cluster-1-fw-192.168.3.3-8080-function-metadata-tailer-reader-7223e8a3e9 of the topic persistent://public/functions/metadata gets the last message id was failed
Failed to get batch size for entry java.lang.IllegalArgumentException: Invalid unknonwn tag type: 6
        at org.apache.pulsar.client.api.PulsarClientException.unwrap(PulsarClientException.java:987) ~[org.apache.pulsar-pulsar-client-api-2.8.1.jar:2.8.1]
        at org.apache.pulsar.client.impl.ConsumerImpl.hasMessageAvailable(ConsumerImpl.java:1869) ~[org.apache.pulsar-pulsar-client-original-2.8.1.jar:2.8.1]
        at org.apache.pulsar.client.impl.ReaderImpl.hasMessageAvailable(ReaderImpl.java:168) ~[org.apache.pulsar-pulsar-client-original-2.8.1.jar:2.8.1]
        at org.apache.pulsar.functions.worker.FunctionMetaDataManager.initialize(FunctionMetaDataManager.java:109) ~[org.apache.pulsar-pulsar-functions-worker-2.8.1.jar:2.8.1]
        at org.apache.pulsar.functions.worker.PulsarWorkerService.start(PulsarWorkerService.java:496) ~[org.apache.pulsar-pulsar-functions-worker-2.8.1.jar:2.8.1]
        at org.apache.pulsar.broker.PulsarService.startWorkerService(PulsarService.java:1569) ~[org.apache.pulsar-pulsar-broker-2.8.1.jar:2.8.1]
        at org.apache.pulsar.broker.PulsarService.start(PulsarService.java:759) ~[org.apache.pulsar-pulsar-broker-2.8.1.jar:2.8.1]
        ... 2 more

Expected behavior Able to restart and run broker without any error. Claimed to be fixed by #10968, but I still face the same issue.

Additional context Pulsar version: 2.8.1 Pulsar mode: cluster, 3 nodes KoP: enabled (#10950)

BewareMyPower commented 2 years ago

Could you provide your KoP version? And can you reproduce it?

Raven888888 commented 2 years ago

KoP: 2.8.1.4 I faced this issue after upgrading from pulsar 2.7.0 to 2.8.1. Will try to reproduce when I have the capacity.

BewareMyPower commented 2 years ago

upgrading from pulsar 2.7.0 to 2.8.1

Is it the similar issue like https://github.com/streamnative/kop/issues/765?

Raven888888 commented 2 years ago

Nope @BewareMyPower . That issue only happens in pulsar + kop 2.7.0, and it is due to how kop handles offset calculation as you mentioned. Once upgraded to 2.8.1, I have not seen the offset gap thus far.

This current issue somehow causes broker unable to start at all. Consequently, kop is also not functional.

BewareMyPower commented 2 years ago

Yeah, I also noticed it.

Caused by: org.apache.pulsar.client.api.PulsarClientException$BrokerMetadataException: The subscription c-pulsar-cluster-1-fw-192.168.3.3-8080-function-metadata-tailer-reader-7223e8a3e9 of the topic persistent://public/functions/metadata gets the last message id was failed
Failed to get batch size for entry java.lang.IllegalArgumentException: Invalid unknonwn tag type: 6
        at org.apache.pulsar.client.api.PulsarClientException.unwrap(PulsarClientException.java:987) ~[org.apache.pulsar-pulsar-client-api-2.8.1.jar:2.8.1]
        at org.apache.pulsar.client.impl.ConsumerImpl.hasMessageAvailable(ConsumerImpl.java:1869) ~[org.apache.pulsar-pulsar-client-original-2.8.1.jar:2.8.1]
        at org.apache.pulsar.client.impl.ReaderImpl.hasMessageAvailable(ReaderImpl.java:168) ~[org.apache.pulsar-pulsar-client-original-2.8.1.jar:2.8.1]
        at org.apache.pulsar.functions.worker.FunctionMetaDataManager.initialize(FunctionMetaDataManager.java:109) ~[org.apache.pulsar-pulsar-functions-worker-2.8.1.jar:2.8.1]
        at org.apache.pulsar.functions.worker.PulsarWorkerService.start(PulsarWorkerService.java:496) ~[org.apache.pulsar-pulsar-functions-worker-2.8.1.jar:2.8.1]
        at org.apache.pulsar.broker.PulsarService.startWorkerService(PulsarService.java:1569) ~[org.apache.pulsar-pulsar-broker-2.8.1.jar:2.8.1]
        at org.apache.pulsar.broker.PulsarService.start(PulsarService.java:759) ~[org.apache.pulsar-pulsar-broker-2.8.1.jar:2.8.1]

It looks like there's something wrong with topic persistent://public/functions/metadata. If you enabled Pulsar Functions before enabling KoP, this topic might not contain the BrokerEntryMetadata part. After enabling KoP, the reader of this topic might fail by ConsumerImpl.hasMessageAvailable.

I'm not familiar with Pulsar Functions, it might take some time to look into the reason.

Raven888888 commented 2 years ago

@BewareMyPower yes you are right. Pulsar function was enabled before enabling KoP. However, just a side note, back in pulsar 2.7.0, such setup wasn't an issue.

Correct me if I am wrong, pulsar functions are stateless, so I could just drop persistent://public/functions/metadata and re-install the pulsar function again?

BewareMyPower commented 2 years ago

It looks like I only replied via email so it didn't appear in this issue.

IMO, it's correct and you can delete this topic manually. Have you tried it?

Raven888888 commented 2 years ago

No I haven't @BewareMyPower , I will try it when I have the capacity, thanks.

github-actions[bot] commented 2 years ago

The issue had no activity for 30 days, mark with Stale label.

rkaw92 commented 2 years ago

Hey, for what it's worth, I'm seeing this in Pulsar 2.10.1 with KoP:

2022-07-28T12:54:19,013+0000 [BookKeeperClientWorker-OrderedExecutor-7-0] INFO  org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - [adtonos-platform-prod/platform/persistent/AvailablePlayoutRecognized-partition-12] Opened new cursor: NonDurableCursorImpl{ledger=adtonos-platform-prod/platform/persistent/AvailablePlayoutRecognized-partition-12, ackPos=42:-1, readPos=42:0}
2022-07-28T12:54:19,016+0000 [BookKeeperClientWorker-OrderedExecutor-9-0] ERROR io.streamnative.pulsar.handlers.kop.MessageFetchContext - [persistent://adtonos-platform-prod/platform/AvailablePlayoutRecognized-14] Failed to peekOffsetFromEntry from position 35:4: No BrokerEntryMetadata found
2022-07-28T12:54:19,016+0000 [BookKeeperClientWorker-OrderedExecutor-7-0] ERROR io.streamnative.pulsar.handlers.kop.MessageFetchContext - [persistent://adtonos-platform-prod/platform/AvailablePlayoutRecognized-12] Failed to peekOffsetFromEntry from position 42:4: No BrokerEntryMetadata found
2022-07-28T12:54:19,016+0000 [BookKeeperClientWorker-OrderedExecutor-9-0] ERROR io.streamnative.pulsar.handlers.kop.MessageFetchContext - Read entry error on (offset=774095, logStartOffset=-1, maxBytes=1048576)
io.streamnative.pulsar.handlers.kop.exceptions.MetadataCorruptedException$NoBrokerEntryMetadata: No BrokerEntryMetadata found
        at io.streamnative.pulsar.handlers.kop.utils.MessageMetadataUtils.peekOffset(MessageMetadataUtils.java:127) ~[?:?]
        at io.streamnative.pulsar.handlers.kop.utils.MessageMetadataUtils.peekOffsetFromEntry(MessageMetadataUtils.java:118) ~[?:?]
        at io.streamnative.pulsar.handlers.kop.MessageFetchContext$2.readEntriesComplete(MessageFetchContext.java:555) ~[?:?]
        at org.apache.bookkeeper.mledger.impl.OpReadEntry.lambda$checkReadCompletion$2(OpReadEntry.java:153) ~[org.apache.pulsar-managed-ledger-2.10.1.jar:2.10.1]
        at org.apache.bookkeeper.mledger.util.SafeRun$1.safeRun(SafeRun.java:32) [org.apache.pulsar-managed-ledger-2.10.1.jar:2.10.1]
        at org.apache.bookkeeper.common.util.SafeRunnable.run(SafeRunnable.java:36) [org.apache.bookkeeper-bookkeeper-common-4.14.5.jar:4.14.5]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
        at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) [io.netty-netty-common-4.1.77.Final.jar:4.1.77.Final]
        at java.lang.Thread.run(Thread.java:829) [?:?]
2022-07-28T12:54:19,016+0000 [BookKeeperClientWorker-OrderedExecutor-7-0] ERROR io.streamnative.pulsar.handlers.kop.MessageFetchContext - Read entry error on (offset=897452, logStartOffset=-1, maxBytes=1048576)
io.streamnative.pulsar.handlers.kop.exceptions.MetadataCorruptedException$NoBrokerEntryMetadata: No BrokerEntryMetadata found
        at io.streamnative.pulsar.handlers.kop.utils.MessageMetadataUtils.peekOffset(MessageMetadataUtils.java:127) ~[?:?]
        at io.streamnative.pulsar.handlers.kop.utils.MessageMetadataUtils.peekOffsetFromEntry(MessageMetadataUtils.java:118) ~[?:?]
        at io.streamnative.pulsar.handlers.kop.MessageFetchContext$2.readEntriesComplete(MessageFetchContext.java:555) ~[?:?]
        at org.apache.bookkeeper.mledger.impl.OpReadEntry.lambda$checkReadCompletion$2(OpReadEntry.java:153) ~[org.apache.pulsar-managed-ledger-2.10.1.jar:2.10.1]
        at org.apache.bookkeeper.mledger.util.SafeRun$1.safeRun(SafeRun.java:32) [org.apache.pulsar-managed-ledger-2.10.1.jar:2.10.1]
        at org.apache.bookkeeper.common.util.SafeRunnable.run(SafeRunnable.java:36) [org.apache.bookkeeper-bookkeeper-common-4.14.5.jar:4.14.5]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
        at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) [io.netty-netty-common-4.1.77.Final.jar:4.1.77.Final]
        at java.lang.Thread.run(Thread.java:829) [?:?]
2022-07-28T12:54:19,019+0000 [BookKeeperClientWorker-OrderedExecutor-10-0] INFO  org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - [adtonos-platform-prod/platform/persistent/AvailablePlayoutRecognized-partition-0] Unable to find position for predicate FindEntryByOffset{ 778995}. Use the first position 27:0 instead.

But I'm able to start the broker - this is at run-time.

It kind of works, but I'm getting a constant spam of this message in the logs.

fishisnow commented 1 year ago

the same problem i meet, the broker cluster is unavailable because of function worker. the error is "The subscription c-pulsar-fw-pulsar-broker-0.pulsar-broker.pulsar.svc.cluster.local.-8080-function-metadata-tailer-reader-28f8e65105 of the topic persistent://public/functions/metadata gets the last message id was failed"