envoyproxy / java-control-plane

Java implementation of an Envoy gRPC control plane
Apache License 2.0
293 stars 136 forks source link

SimpleCache setSnapshot fails when removing a cluster #79

Closed sschepens closed 5 years ago

sschepens commented 5 years ago

Setting a new snapshot that removes a cluster fails with the following Exception:

controlplane_1  | 2018-11-02 19:29:37 ERROR c.m.fury.controlplane.xds.XdsCache - xds cache rebuild failed
controlplane_1  | io.grpc.StatusRuntimeException: CANCELLED: call already cancelled
controlplane_1  |  at io.grpc.Status.asRuntimeException(Status.java:517)
controlplane_1  |  at io.grpc.stub.ServerCalls$ServerCallStreamObserverImpl.onNext(ServerCalls.java:335)
controlplane_1  |  at io.envoyproxy.controlplane.server.DiscoveryServer$7.send(DiscoveryServer.java:270)
controlplane_1  |  at io.envoyproxy.controlplane.server.DiscoveryServer$7.lambda$null$1(DiscoveryServer.java:214)
controlplane_1  |  at io.envoyproxy.controlplane.cache.Watch.respond(Watch.java:77)
controlplane_1  |  at io.envoyproxy.controlplane.cache.SimpleCache.respond(SimpleCache.java:283)
controlplane_1  |  at io.envoyproxy.controlplane.cache.SimpleCache.lambda$setSnapshot$2(SimpleCache.java:209)
controlplane_1  |  at io.envoyproxy.controlplane.cache.CacheStatusInfo.lambda$watchesRemoveIf$0(CacheStatusInfo.java:142)
controlplane_1  |  at java.base/java.util.Collection.removeIf(Collection.java:492)
controlplane_1  |  at io.envoyproxy.controlplane.cache.CacheStatusInfo.watchesRemoveIf(CacheStatusInfo.java:142)
controlplane_1  |  at io.envoyproxy.controlplane.cache.SimpleCache.setSnapshot(SimpleCache.java:200)
controlplane_1  |  at myapp.controlplane.xds.XdsCache.rebuild(XdsCache.java:125)
controlplane_1  |  at myapp.controlplane.xds.XdsCache.lambda$new$1(XdsCache.java:57)
controlplane_1  |  at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:514)
controlplane_1  |  at java.base/java.util.concurrent.FutureTask.runAndReset(FutureTask.java:305)
controlplane_1  |  at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:305)
controlplane_1  |  at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1135)
controlplane_1  |  at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
controlplane_1  |  at java.base/java.lang.Thread.run(Thread.java:844)

It would seem that when removing a cluster, envoy cancels the EDS stream asociated with that cluster, but the control plane still tries to synchronously respond to that stream and fails because it has already been cancelled. I would expect this scenario to be handled better, at least not fail the setSnapshot as this prevents further responses to be sent.