Search before asking

[X] I searched in the issues and found nothing similar.
Version
2.8.1
Minimal reproduce step
Not reproduced yet.
What did you expect to see?
The offloaded data shouldn't be lost.
What did you see instead?
The error message we saw in the broker is caused by a consumer trying to read a non-existent ledger.
2022-08-16T09:22:04.791520867Z 09:22:04.789 [offloader-OrderedScheduler-0-0] ERROR org.apache.bookkeeper.mledger.offload.jcloud.impl.BlobStoreManagedLedgerOffloader - Failed readOffloaded:
java.lang.NullPointerException: null
	at org.apache.bookkeeper.mledger.offload.jcloud.impl.DataBlockUtils.lambda$static$0(DataBlockUtils.java:73) ~[?:?]
	at org.apache.bookkeeper.mledger.offload.jcloud.impl.BlobStoreBackedReadHandleImpl.open(BlobStoreBackedReadHandleImpl.java:237) ~[?:?]
	at org.apache.bookkeeper.mledger.offload.jcloud.impl.BlobStoreManagedLedgerOffloader.lambda$readOffloaded$3(BlobStoreManagedLedgerOffloader.java:506) ~[?:?]
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) [?:?]
	at com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:125) [com.google.guava-guava-30.1-jre.jar:?]
Details:
We saw that some ledgers were marked as offloaded but could not be found in S3.
For example, in the logs for ledger 1025044 we saw that the offload failed, and the offloader tried to clean up the previous offload information and then offload the ledger again. But we never saw a log entry for a successful offload.
The weird thing is that the logs show the ledger failed to offload, while the metadata shows the offload succeeded.
In Pulsar's offloader implementation, the offload process has three steps.
First, prepare the metadata. The offloader generates a UUID for the ledger to be offloaded and uses it to build the object names used in cloud storage. It then persists the offload context (including the UUID and offloader information such as the driver name, bucket, and so on) into the meta store (ZooKeeper).
Second, offload the ledger. The offloader writes the ledger to cloud storage; the index object is named UUID-ledger-index and the data object UUID-ledger-data.
Third, once the ledger is offloaded successfully, the offloader sets complete to true in LedgerInfo.OffloadContext and persists it into the meta store. When the persist succeeds, it updates the in-memory ledgers map to the latest status. From then on, a consumer that wants to read the ledger reads it from tiered storage.
If any step fails, the offloader cleans up the offloaded files and runs the whole process again.
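The three-step flow and its cleanup-on-failure path can be sketched as follows. This is a simplified illustration, not Pulsar's actual code: the MetaStore and CloudStorage interfaces, the key format, and the metadata paths are all hypothetical stand-ins.

```java
import java.util.UUID;

// Simplified sketch of the three-step offload flow described above.
// MetaStore, CloudStorage, and the object-key format are illustrative
// assumptions, not Pulsar's real classes.
public class OffloadSketch {
    public interface MetaStore {
        void persist(String path, byte[] context) throws Exception;
    }

    public interface CloudStorage {
        void put(String key, byte[] data);
        void delete(String key);
    }

    public static void offload(long ledgerId, MetaStore meta, CloudStorage storage,
                               byte[] ledgerData, byte[] indexData) throws Exception {
        // Step 1: prepare metadata -- generate a UUID and persist the offload
        // context (UUID, driver name, bucket, ...) into the meta store.
        UUID uuid = UUID.randomUUID();
        meta.persist("/managed-ledgers/...", ("uuid=" + uuid).getBytes());

        // Step 2: write the data and index objects to cloud storage.
        String dataKey = uuid + "-ledger-" + ledgerId;
        String indexKey = uuid + "-ledger-" + ledgerId + "-index";
        try {
            storage.put(dataKey, ledgerData);
            storage.put(indexKey, indexData);

            // Step 3: set complete=true and persist again. If this throws,
            // the catch block below deletes the objects and the whole
            // process is retried -- this is where the reported bug bites.
            meta.persist("/managed-ledgers/...", "complete=true".getBytes());
        } catch (Exception e) {
            storage.delete(dataKey);
            storage.delete(indexKey);
            throw e; // caller schedules a retry of the whole process
        }
    }
}
```

Note how cleanup assumes a thrown exception means the metadata update did not happen, which is exactly the assumption that breaks under ConnectionLoss.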
From the logs, we found that ZooKeeper had been unhealthy since 8:00. When the offloader ran the third step for this ledger, it failed with an exception:
Caused by: org.apache.pulsar.metadata.api.MetadataStoreException: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /managed-ledgers/navision/product/persistent/ordered.product_category.navision_model.events-partition-0
After that, the offloader could not get requests through to ZooKeeper, so the offload process was retried again and again; what we see in the logs is the ledger being offloaded repeatedly but never succeeding.
There is a case in which the client throws ConnectionLoss even though the request reached the server successfully: the connection was closed for some other reason, so the client never receives the response.
This explains why the metadata shows the offload complete flag as true while the objects cannot be found in S3: the metadata update succeeded on the server, but the client saw an error, so the offloader cleaned up the ledger's offloaded objects. The consumer then cannot get the messages.
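The race described above can be illustrated with a small sketch (hypothetical names, not Pulsar code): the server applies the metadata write, the client's connection drops before the acknowledgement arrives, the offloader treats step 3 as failed and deletes the objects, and the committed metadata is left pointing at nothing.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch of the ConnectionLoss race: the server applies the
// metadata update, but the client never sees the ack, so the retry/cleanup
// path deletes objects that the committed metadata still points at.
public class ConnectionLossRace {
    final Map<String, String> zkState = new HashMap<>(); // what ZooKeeper stored
    final Map<String, byte[]> s3 = new HashMap<>();      // offloaded objects

    // The server applies the write, then the "connection" drops before the
    // ack reaches the client.
    void persistCompleteFlag(String path) throws Exception {
        zkState.put(path, "complete=true");      // committed server-side
        throw new Exception("ConnectionLoss");   // client never learns it
    }

    public void run() {
        s3.put("uuid-ledger-1025044", new byte[]{1});
        s3.put("uuid-ledger-1025044-index", new byte[]{2});
        try {
            persistCompleteFlag("/managed-ledgers/.../ledger-1025044");
        } catch (Exception e) {
            // Offloader believes step 3 failed: clean up for a retry.
            s3.clear();
        }
        // Result: metadata says the offload completed, but S3 is empty --
        // a consumer reading the "offloaded" ledger hits the NPE above.
    }
}
```

After run(), the meta store says complete=true while S3 holds nothing, which matches what we observed in production.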
Anything else?
No response
Are you willing to submit a PR?