Hello,
The workers in my Helm deployment fail to start up. There are three workers: two on nodes that have GPUs and one on a node that doesn't. Here is the error I find in the worker logs:
The master appears healthy and has the following logs:
Any ideas on how I can troubleshoot this? I can port-forward to the master service, but I'm not sure how to verify that the gRPC server is running. I've seen a log where the master service's hostname resolved to the correct IP address, but the connection still timed out. I'm looking for advice on what to check.
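
A plain TCP connect through the port-forward might be a reasonable first check. The sketch below is only that: the function name `can_connect`, the host `127.0.0.1`, and the port `8080` are placeholders (assumptions, not values from this deployment), and a successful connect shows only that something is listening on the port, not that it is actually the gRPC server.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the host and attempts a TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS resolution failures.
        return False


if __name__ == "__main__":
    # Placeholder target: the local end of a port-forward to the master service.
    # Replace 8080 with whatever port the master actually listens on.
    print(can_connect("127.0.0.1", 8080))
```

Running the same check from inside a worker pod against the master service's hostname would help distinguish "the server is not listening" from "the workers cannot reach it" (for example, because of a network policy).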
Thanks