Open alrod opened 3 years ago
Documentation on Channel State API: https://github.com/grpc/grpc/blob/master/doc/connectivity-semantics-and-api.md#channel-state-api
@fabiocav - please find an owner.
@pragnagopa @fabiocav can we please get an ETA on this? Even an estimation will suffice. Thanks a lot team.
@TeplrGuy this has been assigned to sprint 114. We'll continue to update the issue as we make progress.
@alrod Looked into the functions logs for the error mentioned in the attached CRIs (14 UNAVAILABLE: failed to connect to all addresses) and I can see that this error is reported from the node.js language worker. Queried logs for the last 3 days in CUS and all the entries are coming from node.js worker.
Channel is a connection abstraction on the client side. A channel instance is needed on the client side to establish a connection to a gRPC host/server so that a gRPC client/stub instance can be created for further communication with the server. On a node.js client, the channel state check should be done using the getConnectivityState or watchConnectivityState APIs.
I think the next action item here is to investigate the node.js language worker implementation to see why it is getting the connectivity error. I did a quick scan of the node.js worker repo and I do not see the above-mentioned APIs being used. I tried to repro this error locally with a node.js language worker (v14.16.0), but was unsuccessful in doing so (this could be a race condition issue).
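For illustration, here is a rough sketch of how a channel state check built on those two calls might look. It is not taken from the node.js worker. `ChannelLike` is a hand-written structural subset that I am assuming matches the shape of the channel object in `@grpc/grpc-js`, the `ConnectivityState` enum is a local stand-in for the five states in the gRPC connectivity doc, and `waitForReady` is a hypothetical helper name.

```typescript
// Local stand-in for the five channel states in the gRPC connectivity doc.
enum ConnectivityState {
  IDLE,
  CONNECTING,
  READY,
  TRANSIENT_FAILURE,
  SHUTDOWN,
}

// Assumed shape of the two channel state APIs (modeled on @grpc/grpc-js).
interface ChannelLike {
  getConnectivityState(tryToConnect: boolean): ConnectivityState;
  watchConnectivityState(
    currentState: ConnectivityState,
    deadline: Date | number,
    callback: (error?: Error) => void
  ): void;
}

// Hypothetical helper: reports true once the channel is READY, and false if
// the channel is shut down or the deadline passes without reaching READY.
function waitForReady(
  channel: ChannelLike,
  deadline: Date,
  done: (ready: boolean) => void
): void {
  // Passing true asks an idle channel to start connecting.
  const state = channel.getConnectivityState(true);
  if (state === ConnectivityState.READY) {
    done(true);
    return;
  }
  if (state === ConnectivityState.SHUTDOWN) {
    done(false);
    return;
  }
  // Wait for the state to move away from the one we just observed.
  channel.watchConnectivityState(state, deadline, (error) => {
    if (error) {
      // The deadline expired before the state changed.
      done(false);
      return;
    }
    // The state changed, so check it again.
    waitForReady(channel, deadline, done);
  });
}
```

A worker could call something like `waitForReady(channel, new Date(Date.now() + 5000), ready => ...)` before creating the client/stub, and log or retry when `ready` is false. The 5 second deadline is only an example value.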
Transferring this to node.js worker repo for next steps.
Reopening the issue. This fix will help the function host recover from the "failed to connect to all addresses" gRPC error: https://github.com/Azure/azure-functions-host/pull/7979
We still cannot reproduce the "14 UNAVAILABLE: failed to connect to all addresses" error, but the fix mentioned above will improve automatic recovery after the error.
Fixing race during language worker start: https://github.com/Azure/azure-functions-host/commit/5fe77113e30dd05da4ae675c97cd80956ea44a7d
@alrod was this fixed in the linked PR/commit? Or is there still remaining work?
@ejizba, we did some work in the function host to ensure a worker is recovered after "14 UNAVAILABLE: failed to connect to all addresses".
We still want to keep this issue open to gather more details or repro steps, as it's not clear what leads the worker to the error.
On restarting a worker language channel (worker crash or timeout), we need to check the health of the gRPC server and shut down the host itself if the gRPC server is unhealthy.
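The restart rule above could be sketched roughly like this. It only illustrates the decision order and is not the host's actual code. `HostActions`, `onWorkerChannelRestart`, and all of the method names are hypothetical, and how the health check itself is done is left open.

```typescript
// What happened as a result of a worker channel restart request.
type RestartOutcome = "worker-restarted" | "host-shutdown";

// Hypothetical hooks into the host; none of these names come from the real code.
interface HostActions {
  isGrpcServerHealthy(): boolean;
  restartWorker(): void;
  shutdownHost(reason: string): void;
}

// On a worker crash or timeout: check the gRPC server first, and only
// restart the worker if the server it has to connect to is healthy.
function onWorkerChannelRestart(host: HostActions): RestartOutcome {
  if (!host.isGrpcServerHealthy()) {
    // A new worker could not connect anyway, so recycle the host itself.
    host.shutdownHost("gRPC server is unhealthy");
    return "host-shutdown";
  }
  host.restartWorker();
  return "worker-restarted";
}
```

Checking the server before starting a new worker avoids a loop where each replacement worker fails with the same connectivity error.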
CRI1 CRI2 https://stackoverflow.com/questions/59823424/grpc-14-unavailable-failed-to-connect-to-all-addresses