Open willfurnass opened 3 years ago
This should not cause any issues. The watchdog's job is to reconcile the state between Slurm and AWS. Currently it only keeps track of things and has not yet learned to correct any issues. You are safe to disable and stop the service.
Are you able to submit jobs and have them start as you expect? If so then this problem can be ignored. If not, then this points towards an issue. It may be that you have some VMs that it's finding, trying to track and failing.
As to the cause of the problem: it seems that when talking to the API, it's not getting back one of the fields that it expects. I will look into this later.
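A minimal sketch of how a lookup could tolerate the missing field, assuming the instance dictionaries have the shape of EC2 `describe_instances` results. The helper name `private_ip` and the sample data are hypothetical and are not taken from the watchdog's `aws.py`:

```python
from typing import Any, Mapping, Optional


def private_ip(instance: Mapping[str, Any]) -> Optional[str]:
    """Return the instance's private IP, or None if the response has none."""
    # Use .get() so a missing top-level key does not raise KeyError.
    ip = instance.get("PrivateIpAddress")
    if ip:
        return ip
    # Fall back to the first network interface that carries an address.
    for nic in instance.get("NetworkInterfaces", []):
        ip = nic.get("PrivateIpAddress")
        if ip:
            return ip
    return None


# Hypothetical sample data, not the dictionary from this issue.
with_ip = {"InstanceId": "i-0123", "PrivateIpAddress": "10.0.0.5"}
without_ip = {"InstanceId": "i-0456", "NetworkInterfaces": []}

print(private_ip(with_ip))
print(private_ip(without_ip))
```

With something like this, an instance that has no address could be skipped or logged rather than raising an exception on every watchdog pass.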
Thanks Matt. I'll disable the svc for now to keep the system log cleaner. I have been having some issues starting nodes, which I thought could be related to this but have just realised I've hit my AWS instance limit for my chosen instance type. Doh!
Looks like the API response doesn't contain any `NetworkInterface` info, which seems odd.
That `response` dictionary doesn't contain a `PrivateIpAddress` key; the dictionary is as follows:
@milliams Any thoughts on this? Could this cause problems? Wanting to use CITC for teaching next week :)
(EDIT: line numbers for `aws.py` in the backtrace are slightly out due to some `print` calls I've added)