Open amamounelsayed opened 4 years ago
cc @yojagad Related issue: https://github.com/Azure/azure-functions-host/issues/5233
@amamounelsayed - do you remember what the repro for this was? I'm forgetting why the worker was down but hadn't restarted. Was it that the gRPC server was down?
@mhoeger the worker had an unhandled exception and exited. When it exited with the -1 code, the host did not force a restart, which is OK, but the main issue is that the host was still accepting requests.
Note, worker down means that the gRPC client is down. The gRPC server is on the host side.
@amamounelsayed - I might be wrong, but wasn't this behavior because the gRPC server was shutting down and not the client? I'm just trying to think - I thought we saw this after we merged the exit code -1 PR. And if that's the case, as soon as the worker went down, we would have seen the worker self-heal with a restart (instead of accepting until timeout).
I thought the behavior we saw was that (1) something goes fatally wrong within gRPC communication, (2) we don't see a WorkerError because the process never exits, so RpcWorkerChannel is still accepting requests as if everything is ok. This includes publishing StreamingMessage's as you linked. (3) In the case where communication broke down, we would see that no messages are available, and the finally block on the grpc server would execute. This disposes the event subscriptions on StreamingMessage.
Altogether, I think this would cause RpcWorkerChannel to still accept invocations and publish StreamingMessage's with InvocationRequests without the FunctionRpcService actively listening for those events. So I think this should be tracking a fix for how we handle errors in FunctionRpcService and not the gRPC client/workers?
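To make the failure mode described above concrete, here is a minimal sketch in Python (not the host's C# code). `EventBus`, `RpcService`, and `WorkerChannel` are hypothetical stand-ins for the host's event manager, FunctionRpcService, and RpcWorkerChannel. The delivered-handler count returned by `publish` is an assumption added only to make the dropped message visible.

```python
from typing import Callable, Dict, List


class EventBus:
    """Tiny in-process publish/subscribe bus (hypothetical stand-in)."""

    def __init__(self) -> None:
        self._handlers: Dict[int, Callable[[str], None]] = {}
        self._next_id = 0

    def subscribe(self, handler: Callable[[str], None]) -> Callable[[], None]:
        token = self._next_id
        self._next_id += 1
        self._handlers[token] = handler

        def dispose() -> None:
            # Removing the handler means later publishes reach nobody.
            self._handlers.pop(token, None)

        return dispose

    def publish(self, message: str) -> int:
        handlers = list(self._handlers.values())
        for handler in handlers:
            handler(message)
        # Assumption for illustration: report how many handlers saw the message.
        return len(handlers)


class RpcService:
    """Stand-in for the gRPC service that forwards outbound messages."""

    def __init__(self, bus: EventBus) -> None:
        self.forwarded: List[str] = []
        self._dispose = bus.subscribe(self.forwarded.append)

    def stream_ended(self) -> None:
        # Mirrors the idea of a finally block that disposes subscriptions.
        self._dispose()


class WorkerChannel:
    """Stand-in for the channel that keeps accepting invocations."""

    def __init__(self, bus: EventBus) -> None:
        self._bus = bus
        self.dropped: List[str] = []

    def send_invocation_request(self, invocation_id: str) -> bool:
        delivered = self._bus.publish(f"InvocationRequest:{invocation_id}")
        if delivered == 0:
            # Nobody is listening, so the request is silently lost.
            self.dropped.append(invocation_id)
        return delivered > 0


bus = EventBus()
service = RpcService(bus)
channel = WorkerChannel(bus)

first_ok = channel.send_invocation_request("a")   # service is still subscribed
service.stream_ended()                            # subscriptions disposed
second_ok = channel.send_invocation_request("b")  # published, but nobody listens
```

In this sketch the channel has no way of knowing that the service stopped listening unless it checks for it, which is the hung state being discussed.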
@mhoeger there are two cases. In the first, the worker (which acts as the gRPC client) exits, and the host still accepts requests until the timeout. This is the main issue that we were facing before adding the fix to restart the worker.
In the second case, after we added the fix to restart the worker, processing continues, but some requests that had already been sent to the worker are lost when the worker restarts, and these requests will cause the host to restart.
So back to this issue, replying to each of your points:
(1) "something goes fatally wrong within gRPC communication": the worker goes down.
(2) "we don't see a WorkerError because the process never exits, so RpcWorkerChannel is still accepting requests as if everything is ok. this includes publishing StreamingMessage's as you linked": the process exits, but the host still calls SendStreamingMessage.
(3) "In the case where communication broke down, we would see that no messages are available, and the finally block on the grpc server would execute. This disposes the event subscriptions on StreamingMessage": I do not think this is called, as we did not see https://github.com/Azure/azure-functions-host/blob/398f8264f2d579a71d31ba3f59a442afbbdef643/src/WebJobs.Script/Workers/Rpc/FunctionRpcService.cs#L63 called. But it is possible that https://github.com/Azure/azure-functions-host/blob/398f8264f2d579a71d31ba3f59a442afbbdef643/src/WebJobs.Script/Workers/Rpc/FunctionRpcService.cs#L84-L89 is called and we just did not log it.
Thank you both for the continued discussion on this. So the root cause looks like an unhandled exception in FunctionRpcService that results in a hung state. @mhoeger - if you agree, can you please open an issue for this and assign it to me? Since there are a lot of details here, just linking this discussion to the new issue will suffice. I will investigate ASAP.
While @mhoeger and I were working on #5361, we found that although the worker is down, the host still accepts requests until the function timeout.
When the worker is down, the host still accepts requests, and SendInvocationRequest and SendStreamingMessage are still called:
https://github.com/Azure/azure-functions-host/blob/3eb520c7e7946dd2ebaca6fa00ef2562d86fde00/src/WebJobs.Script/Workers/Rpc/RpcWorkerChannel.cs#L328
However, the OutboundEvent is not triggered from
https://github.com/Azure/azure-functions-host/blob/398f8264f2d579a71d31ba3f59a442afbbdef643/src/WebJobs.Script/Workers/Rpc/FunctionRpcService.cs#L61
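As an illustration of the symptom only, the following Python sketch (not the host's C# code) shows an invocation that is accepted while the worker is down and therefore waits until its timeout. The `Channel` class, the `worker_alive` flag, and the `fail_fast` option are all hypothetical. The fail-fast path is included purely as a contrast and is not the fix used in the host.

```python
import asyncio
from typing import Dict, List


class WorkerDownError(Exception):
    """Raised in the hypothetical fail-fast path when the worker is gone."""


class Channel:
    def __init__(self) -> None:
        self.worker_alive = True
        self._pending: Dict[str, asyncio.Future] = {}

    async def invoke(self, invocation_id: str, timeout: float,
                     fail_fast: bool = False) -> str:
        if fail_fast and not self.worker_alive:
            # Contrast case: refuse the request instead of queuing it.
            raise WorkerDownError(invocation_id)
        future = asyncio.get_running_loop().create_future()
        self._pending[invocation_id] = future
        try:
            # With no worker to answer, this only ends when the timeout fires.
            return await asyncio.wait_for(future, timeout)
        finally:
            self._pending.pop(invocation_id, None)


async def demo() -> List[str]:
    channel = Channel()
    channel.worker_alive = False  # simulate the worker having exited
    outcomes: List[str] = []
    try:
        await channel.invoke("a", timeout=0.05)
    except asyncio.TimeoutError:
        outcomes.append("timed out")
    try:
        await channel.invoke("b", timeout=0.05, fail_fast=True)
    except WorkerDownError:
        outcomes.append("rejected immediately")
    return outcomes


outcomes = asyncio.run(demo())
```

The first call corresponds to the behavior reported here: the request is accepted and the caller only learns about the problem when the function timeout expires.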