apache / pulsar

Apache Pulsar - distributed pub-sub messaging system
https://pulsar.apache.org/
Apache License 2.0
14.23k stars 3.58k forks source link

MaxRedirectException occurs when attempting to delete a subscription that was recently created and then had its topic destroyed #12552

Closed zbentley closed 2 years ago

zbentley commented 3 years ago

Describe the bug If I create a tenant/namespace/topic/subscription and then immediately (seconds later) delete and recreate the tenant/namespace/topic, attempting to delete the subscription fails with a 500 response code and a MaxRedirectException. I would expect subscription deletion to fail with a 404 as the subscription should not exist since I just deleted and recreated the tenant/namespace/topic.

To Reproduce This doesn't reproduce 100% of the time. It seems likely to happen more often when it is done on a brand new broker cluster that hasn't ever hosted topics/non-default tenants.

  1. Create a tenant.
  2. Create a namespace in that tenant
  3. Create a persistent, partitioned topic in that namespace.
  4. Attempt to delete a subscription in that topic with a given name, say "foo". DELETE should fail with a 404.
  5. Create that "foo" subscription in that topic and connect at least one consumer to that subscription (no messages need to be produced or consumed).
  6. Immediately thereafter, delete the topic/namespace/tenant.
  7. Do steps 1 through 4 again, with the same-named entities.
  8. Observe that step 4 sometimes (but not always) fails with a stacktrace like the one below.

Expected behavior

  1. Subscription deletion should fail with a 404 or a response that otherwise indicates "there is nothing to delete".
  2. This is probably more important: all operations related to deletion/creation of topics/namespaces/tenants conclude before the management API returns success codes to the user. This bug and the other similar ones I filed (see github links below) all seem to arise from CRUD operations with the management API being asynchronous: i.e. when I create/delete a tenant/topic/namespace, the actual side effects of that creation or deletion (e.g. adding/removing ledgers in BookKeeper, updating metadata in ZK) occur later, not during the API post. Not only is that bound to cause bugs like this, but it's also not what users expect; I would be happy to wait seconds or minutes for management API operations to complete in exchange for knowing that when they successfully complete that the thing I requested has actually been done.

Environment Same environment as https://github.com/apache/pulsar/issues/12551

Stacktrace

URL: DELETE http://pulsar-blt-sn-platform-proxy-headless.pulsar:8080/admin/v2/persistent/blt3/chariot_ns_test/chariot_topic_test/subscription/blt
Headers: Headers({'host': 'pulsar-blt-sn-platform-proxy-headless.pulsar:8080', 'accept': '*/*', 'accept-encoding': 'gzip, deflate', 'connection': 'keep-alive', 'user-agent': 'python-httpx/0.20.0'})
Payload: b''

 --- An unexpected error occurred in the server ---

Message: java.util.concurrent.CompletionException: org.apache.pulsar.client.admin.internal.http.AsyncHttpConnector$RetryException: Could not complete the operation. Number of retries has been exhausted. Failed reason: Maximum redirect reached: 5

Stacktrace:

org.apache.pulsar.client.admin.PulsarAdminException: java.util.concurrent.CompletionException: org.apache.pulsar.client.admin.internal.http.AsyncHttpConnector$RetryException: Could not complete the operation. Number of retries has been exhausted. Failed reason: Maximum redirect reached: 5
    at org.apache.pulsar.client.admin.internal.BaseResource.getApiException(BaseResource.java:247)
    at org.apache.pulsar.client.admin.internal.BaseResource$3.failed(BaseResource.java:184)
    at org.glassfish.jersey.client.JerseyInvocation$1.failed(JerseyInvocation.java:882)
    at org.glassfish.jersey.client.ClientRuntime.processFailure(ClientRuntime.java:247)
    at org.glassfish.jersey.client.ClientRuntime.processFailure(ClientRuntime.java:242)
    at org.glassfish.jersey.client.ClientRuntime.access$100(ClientRuntime.java:62)
    at org.glassfish.jersey.client.ClientRuntime$2.lambda$failure$1(ClientRuntime.java:178)
    at org.glassfish.jersey.internal.Errors$1.call(Errors.java:248)
    at org.glassfish.jersey.internal.Errors$1.call(Errors.java:244)
    at org.glassfish.jersey.internal.Errors.process(Errors.java:292)
    at org.glassfish.jersey.internal.Errors.process(Errors.java:274)
    at org.glassfish.jersey.internal.Errors.process(Errors.java:244)
    at org.glassfish.jersey.process.internal.RequestScope.runInScope(RequestScope.java:288)
    at org.glassfish.jersey.client.ClientRuntime$2.failure(ClientRuntime.java:178)
    at org.apache.pulsar.client.admin.internal.http.AsyncHttpConnector.lambda$apply$1(AsyncHttpConnector.java:204)
    at java.base/java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:859)
    at java.base/java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:837)
    at java.base/java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:506)
    at java.base/java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:2088)
    at org.apache.pulsar.client.admin.internal.http.AsyncHttpConnector.lambda$retryOperation$4(AsyncHttpConnector.java:263)
    at java.base/java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:859)
    at java.base/java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:837)
    at java.base/java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:506)
    at java.base/java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:2088)
    at org.asynchttpclient.netty.NettyResponseFuture.abort(NettyResponseFuture.java:273)
    at org.asynchttpclient.netty.request.NettyRequestSender.abort(NettyRequestSender.java:473)
    at org.asynchttpclient.netty.handler.HttpHandler.readFailed(HttpHandler.java:161)
    at org.asynchttpclient.netty.handler.HttpHandler.handleRead(HttpHandler.java:154)
    at org.asynchttpclient.netty.handler.AsyncHttpClientHandler.channelRead(AsyncHttpClientHandler.java:78)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
    at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
    at io.netty.channel.CombinedChannelDuplexHandler$DelegatingChannelHandlerContext.fireChannelRead(CombinedChannelDuplexHandler.java:436)
    at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:324)
    at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:296)
    at io.netty.channel.CombinedChannelDuplexHandler.channelRead(CombinedChannelDuplexHandler.java:251)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
    at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
    at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919)
    at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:166)
    at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:719)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:655)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:581)
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:493)
    at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:986)
    at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
    at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
    at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: java.util.concurrent.CompletionException: org.apache.pulsar.client.admin.internal.http.AsyncHttpConnector$RetryException: Could not complete the operation. Number of retries has been exhausted. Failed reason: Maximum redirect reached: 5
    at java.base/java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:331)
    at java.base/java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:346)
    at java.base/java.util.concurrent.CompletableFuture$OrApply.tryFire(CompletableFuture.java:1503)
    ... 39 more
Caused by: org.apache.pulsar.client.admin.internal.http.AsyncHttpConnector$RetryException: Could not complete the operation. Number of retries has been exhausted. Failed reason: Maximum redirect reached: 5
    at org.apache.pulsar.client.admin.internal.http.AsyncHttpConnector.lambda$retryOperation$4(AsyncHttpConnector.java:265)
    ... 36 more
Caused by: org.asynchttpclient.handler.MaxRedirectException: Maximum redirect reached: 5
zbentley commented 3 years ago

The following issues were all observed in response to similar testing: https://github.com/apache/pulsar-client-cpp/issues/86 https://github.com/apache/pulsar/issues/12556 https://github.com/apache/pulsar/issues/12555 https://github.com/apache/pulsar/issues/12554 https://github.com/apache/pulsar/issues/12553 https://github.com/apache/pulsar/issues/12552 https://github.com/apache/pulsar/issues/12551

The condition that caused these issues to occur appears to be interaction with various pulsar entities (e.g. creating/deleting things in the management API, or attempting to create consumers) immediately after those entities were created or immediately after entities with the same name were deleted.

I think the number of issues observed speaks to a defect in the management API functionality in general. Considering the severity of these issues (in many cases it is possible to force a topic/namespace into a permanently corrupted state), I hope a resolution can be found for the general/common root cause rather than fixing individual bug-inducing conditions.

I suspect that the common root cause is that many management API operations are asynchronous that should not be.

Ideally, the resolution of all of these issues would be the same: a management API operation--any operation--should not return successfully until all observable side effects of that operation across a Pulsar cluster (including brokers, proxies, bookies, and ZK) were completed. All caches of metadata (e.g. on all brokers/proxies in the cluster) related to the operation should be cleared, and all persistent state (including ledger deletion, bookie cleanup, ZooKeeper metadata, etc.) should be updated during management API operations, and not afterwards.

If that means that management API operations take many seconds or minutes, that's still vastly preferable to not knowing when it is safe to interact with a cluster again after performing "DDL"-type changes.

lhotari commented 2 years ago

The PRs apache/pulsar#13066 apache/pulsar#13069 might fix some of the problems related to this issue.

eolivelli commented 2 years ago

Ideally, the resolution of all of these issues would be the same: a management API operation--any operation--should not return successfully until all observable side effects of that operation across a Pulsar cluster (including brokers, proxies, bookies, and ZK) were completed. All caches of metadata (e.g. on all brokers/proxies in the cluster) related to the operation should be cleared, and all persistent state (including ledger deletion, bookie cleanup, ZooKeeper metadata, etc.) should be updated during management API operations, and not afterwards.

If that means that management API operations take many seconds or minutes, that's still vastly preferable to not knowing when it is safe to interact with a cluster again after performing "DDL"-type changes.

I totally agree.

lhotari commented 2 years ago

PR apache/pulsar#13066 has been cherry-picked to branch-2.8 . The fix apache/pulsar#13069 is in progress and I started a separate branch targeting 2.8.x , that is https://github.com/lhotari/pulsar/commits/lh-fix-topic-ownership-assignment-branch-2.8 .

lhotari commented 2 years ago

The PR for branch-2.8 is https://github.com/apache/pulsar/pull/13117 . The goal is to get this fix included in 2.8.2 which will be released next week.

lhotari commented 2 years ago

PR apache/pulsar#12650 also contains relevant fixes. This will also be part of 2.8.2 .

zbentley commented 2 years ago

After building at HEAD this morning, a lot of previous transient errors that occurred when deleting topics (not subscriptions) have since turned into this stacktrace on the client:

E           Message: org.apache.bookkeeper.mledger.ManagedLedgerException: Unexpected condition
E
E           Stacktrace:
E
E           java.util.concurrent.ExecutionException: org.apache.bookkeeper.mledger.ManagedLedgerException: Unexpected condition
E               at java.base/java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:395)
E               at java.base/java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1999)
E               at org.apache.pulsar.broker.admin.impl.PersistentTopicsBase.internalDeleteTopicForcefully(PersistentTopicsBase.java:304)
E               at org.apache.pulsar.broker.admin.impl.PersistentTopicsBase.internalDeleteTopic(PersistentTopicsBase.java:1007)
E               at org.apache.pulsar.broker.admin.v2.PersistentTopics.deleteTopic(PersistentTopics.java:894)
E               at jdk.internal.reflect.GeneratedMethodAccessor231.invoke(Unknown Source)
E               at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
E               at java.base/java.lang.reflect.Method.invoke(Method.java:566)
E               at org.glassfish.jersey.server.model.internal.ResourceMethodInvocationHandlerFactory.lambda$static$0(ResourceMethodInvocationHandlerFactory.java:52)
E               at org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher$1.run(AbstractJavaResourceMethodDispatcher.java:124)
E               at org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher.invoke(AbstractJavaResourceMethodDispatcher.java:167)
E               at org.glassfish.jersey.server.model.internal.JavaResourceMethodDispatcherProvider$VoidOutInvoker.doDispatch(JavaResourceMethodDispatcherProvider.java:159)
E               at org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher.dispatch(AbstractJavaResourceMethodDispatcher.java:79)
E               at org.glassfish.jersey.server.model.ResourceMethodInvoker.invoke(ResourceMethodInvoker.java:475)
E               at org.glassfish.jersey.server.model.ResourceMethodInvoker.apply(ResourceMethodInvoker.java:397)
E               at org.glassfish.jersey.server.model.ResourceMethodInvoker.apply(ResourceMethodInvoker.java:81)
E               at org.glassfish.jersey.server.ServerRuntime$1.run(ServerRuntime.java:255)
E               at org.glassfish.jersey.internal.Errors$1.call(Errors.java:248)
E               at org.glassfish.jersey.internal.Errors$1.call(Errors.java:244)
E               at org.glassfish.jersey.internal.Errors.process(Errors.java:292)
E               at org.glassfish.jersey.internal.Errors.process(Errors.java:274)
E               at org.glassfish.jersey.internal.Errors.process(Errors.java:244)
E               at org.glassfish.jersey.process.internal.RequestScope.runInScope(RequestScope.java:265)
E               at org.glassfish.jersey.server.ServerRuntime.process(ServerRuntime.java:234)
E               at org.glassfish.jersey.server.ApplicationHandler.handle(ApplicationHandler.java:680)
E               at org.glassfish.jersey.servlet.WebComponent.serviceImpl(WebComponent.java:394)
E               at org.glassfish.jersey.servlet.WebComponent.service(WebComponent.java:346)
E               at org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:366)
E               at org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:319)
E               at org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:205)
E               at org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:799)
E               at org.eclipse.jetty.servlet.ServletHandler$ChainEnd.doFilter(ServletHandler.java:1626)
E               at org.apache.pulsar.broker.web.ResponseHandlerFilter.doFilter(ResponseHandlerFilter.java:67)
E               at org.eclipse.jetty.servlet.FilterHolder.doFilter(FilterHolder.java:193)
E               at org.eclipse.jetty.servlet.ServletHandler$Chain.doFilter(ServletHandler.java:1601)
E               at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:548)
E               at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:233)
E               at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:1624)
E               at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:233)
E               at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1434)
E               at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:188)
E               at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:501)
E               at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:1594)
E               at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:186)
E               at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1349)
E               at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
E               at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:234)
E               at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:146)
E               at org.eclipse.jetty.server.handler.StatisticsHandler.handle(StatisticsHandler.java:179)
E               at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:127)
E               at org.eclipse.jetty.server.Server.handle(Server.java:516)
E               at org.eclipse.jetty.server.HttpChannel.lambda$handle$1(HttpChannel.java:388)
E               at org.eclipse.jetty.server.HttpChannel.dispatch(HttpChannel.java:633)
E               at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:380)
E               at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:277)
E               at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:311)
E               at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:105)
E               at org.eclipse.jetty.io.ChannelEndPoint$1.run(ChannelEndPoint.java:104)
E               at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.runTask(EatWhatYouKill.java:338)
E               at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.doProduce(EatWhatYouKill.java:315)
E               at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.tryProduce(EatWhatYouKill.java:173)
E               at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.run(EatWhatYouKill.java:131)
E               at org.eclipse.jetty.util.thread.ReservedThreadExecutor$ReservedThread.run(ReservedThreadExecutor.java:386)
E               at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
E               at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
E               at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
E               at java.base/java.lang.Thread.run(Thread.java:829)
E           Caused by: org.apache.bookkeeper.mledger.ManagedLedgerException: Unexpected condition

Should I file another bug? Or is that now expected behavior?

lhotari commented 2 years ago

Should I file another bug? Or is that now expected behavior?

@zbentley Yes, please file another issue.

zbentley commented 2 years ago

Done, thanks! https://github.com/apache/pulsar/issues/13172

github-actions[bot] commented 2 years ago

The issue had no activity for 30 days, mark with Stale label.

zbentley commented 2 years ago

This no longer repros on 2.9.1. Thanks for the fix!