Azure / azure-functions-nodejs-worker

The Node.js worker for the Azure Functions runtime - https://functions.azure.com
MIT License

"14 UNAVAILABLE: failed to connect to all addresses" exception is thrown by language worker #482

Open alrod opened 3 years ago

alrod commented 3 years ago

On restarting a worker language channel (after a worker crash or timeout), we need to check the health of the gRPC server and shut down the host itself if the gRPC server is unhealthy.

CRI1 CRI2 https://stackoverflow.com/questions/59823424/grpc-14-unavailable-failed-to-connect-to-all-addresses

pragnagopa commented 3 years ago

Documentation on Channel State API: https://github.com/grpc/grpc/blob/master/doc/connectivity-semantics-and-api.md#channel-state-api

@fabiocav - please find an owner.

TeplrGuy commented 3 years ago

@pragnagopa @fabiocav can we please get an ETA on this? Even an estimation will suffice. Thanks a lot team.

fabiocav commented 2 years ago

@TeplrGuy this has been assigned to sprint 114. We'll continue to update the issue as we make progress.

kshyju commented 2 years ago

@alrod Looked into the functions logs for the error mentioned in the attached CRIs (14 UNAVAILABLE: failed to connect to all addresses) and I can see that this error is reported from the node.js language worker. Queried logs for the last 3 days in CUS and all the entries are coming from node.js worker.

A channel is a connection abstraction on the client side. The client needs a channel instance to establish a connection to a gRPC host/server, so that a gRPC client/stub instance can be created for further communication with the server. On a Node.js client, the channel state check should be done using the getConnectivityState or watchConnectivityState APIs. (Link to docs)
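
For illustration, here is a minimal TypeScript sketch of such a channel state check. It is not the worker's actual code. `ChannelLike` is a locally declared subset of the `Channel` interface from `@grpc/grpc-js`, the `State` values mirror that package's `ConnectivityState` enum, and `waitForChannelReady` is a hypothetical helper name introduced only for this example.

```typescript
// Numeric values mirror the ConnectivityState enum exported by @grpc/grpc-js.
const State = {
  IDLE: 0,
  CONNECTING: 1,
  READY: 2,
  TRANSIENT_FAILURE: 3,
  SHUTDOWN: 4,
} as const;

// Structural subset of the @grpc/grpc-js Channel interface, declared locally
// so the sketch compiles without the package installed.
interface ChannelLike {
  getConnectivityState(tryToConnect: boolean): number;
  watchConnectivityState(
    currentState: number,
    deadline: Date | number,
    callback: (error?: Error) => void
  ): void;
}

// Hypothetical helper (not part of the worker): reports true once the channel
// is READY, false if it is SHUTDOWN or the deadline passes first.
function waitForChannelReady(
  channel: ChannelLike,
  deadline: Date,
  done: (ready: boolean) => void
): void {
  // Passing true asks an IDLE channel to start connecting.
  const state = channel.getConnectivityState(true);
  if (state === State.READY) {
    done(true);
    return;
  }
  if (state === State.SHUTDOWN) {
    done(false);
    return;
  }
  // Wait for the state to move away from `state`; the callback receives an
  // error if the deadline expires before any change.
  channel.watchConnectivityState(state, deadline, (error) => {
    if (error) {
      done(false);
      return;
    }
    waitForChannelReady(channel, deadline, done);
  });
}
```

With a real grpc-js client, the channel could be obtained from `client.getChannel()` and passed in with a deadline such as `new Date(Date.now() + 5000)`; the five-second value is an arbitrary assumption. A `false` result would be the signal to log the channel state before a call fails with "14 UNAVAILABLE".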

I think the next action item here is to investigate the Node.js language worker implementation to see why it is getting the connectivity error. I did a quick scan of the Node.js worker repo and I do not see the above-mentioned APIs being used. I tried to repro this error locally with a Node.js language worker (v14.16.0), but was unsuccessful (this could be a race condition issue).

Transferring this to node.js worker repo for next steps.

alrod commented 2 years ago

Reopening the issue. This fix will help the function host recover from the "failed to connect to all addresses" gRPC error: https://github.com/Azure/azure-functions-host/pull/7979

We still cannot reproduce the "14 UNAVAILABLE: failed to connect to all addresses" error, but the fix mentioned above will improve automatic recovery after the error.

alrod commented 2 years ago

Fixing race during language worker start: https://github.com/Azure/azure-functions-host/commit/5fe77113e30dd05da4ae675c97cd80956ea44a7d

ejizba commented 2 years ago

@alrod was this fixed in the linked PR/commit? Or is there still remaining work?

alrod commented 2 years ago

@ejizba, we did some work in the function host to ensure a worker is recovered after "14 UNAVAILABLE: failed to connect to all addresses".

We still want this issue to stay open so we can gather more details or repro steps, as it's not clear what leads the worker to the error.