clusterinthecloud / support

If you need help with Cluster in the Cloud, this is the right place

citc-watchdog service repeatedly restarting #41

Open willfurnass opened 3 years ago

willfurnass commented 3 years ago
Jul  8 19:10:38 mgmt watchdog[328075]:  File "/opt/cloud_sdk/bin/watchdog", line 8, in <module>
Jul  8 19:10:38 mgmt watchdog[328075]:    sys.exit(main())
Jul  8 19:10:38 mgmt watchdog[328075]:  File "/opt/cloud_sdk/lib64/python3.8/site-packages/citc/watchdog.py", line 89, in main
Jul  8 19:10:38 mgmt watchdog[328075]:    cloud_nodes = utils.get_cloud_nodes()
Jul  8 19:10:38 mgmt watchdog[328075]:  File "/opt/cloud_sdk/lib64/python3.8/site-packages/citc/utils.py", line 27, in get_cloud_nodes
Jul  8 19:10:38 mgmt watchdog[328075]:    cloud_nodes = aws.AwsNode.all(ec2, nodespace)
Jul  8 19:10:38 mgmt watchdog[328075]:  File "/opt/cloud_sdk/lib64/python3.8/site-packages/citc/aws.py", line 93, in all
Jul  8 19:10:38 mgmt watchdog[328075]:    return [cls.from_response(instance) for instance in instances]
Jul  8 19:10:38 mgmt watchdog[328075]:  File "/opt/cloud_sdk/lib64/python3.8/site-packages/citc/aws.py", line 93, in <listcomp>
Jul  8 19:10:38 mgmt watchdog[328075]:    return [cls.from_response(instance) for instance in instances]
Jul  8 19:10:38 mgmt watchdog[328075]:  File "/opt/cloud_sdk/lib64/python3.8/site-packages/citc/aws.py", line 76, in from_response
Jul  8 19:10:38 mgmt watchdog[328075]:    ip = response["PrivateIpAddress"]
Jul  8 19:10:38 mgmt watchdog[328075]: KeyError: 'PrivateIpAddress'

That response dictionary doesn't contain a PrivateIpAddress key; the dictionary is as follows:

{'AmiLaunchIndex': 0, 'ImageId': 'ami-035ed6bae06963c37', 'InstanceId': 'i-0dbee8bd641226ae8', 'InstanceType': 't3.large', 'KeyName': 'ec2-user-ephemeron', 'LaunchTime': datetime.datetime(2021, 7, 8, 18, 12, 56, tzinfo=tzlocal()), 'Monitoring': {'State': 'disabled'}, 'Placement': {'AvailabilityZone': 'eu-west-1a', 'GroupName': '', 'Tenancy': 'default'}, 'PrivateDnsName': '', 'ProductCodes': [], 'PublicDnsName': '', 'State': {'Code': 48, 'Name': 'terminated'}, 'StateTransitionReason': 'User initiated (2021-07-08 18:35:05 GMT)', 'Architecture': 'x86_64', 'BlockDeviceMappings': [], 'ClientToken': '314e0316-4fad-4d66-9b0b-918590eab1de', 'EbsOptimized': False, 'EnaSupport': True, 'Hypervisor': 'xen', 'NetworkInterfaces': [], 'RootDeviceName': '/dev/sda1', 'RootDeviceType': 'ebs', 'SecurityGroups': [], 'StateReason': {'Code': 'Client.UserInitiatedShutdown', 'Message': 'Client.UserInitiatedShutdown: User initiated shutdown'}, 'Tags': [{'Key': 'Name', 'Value': 'ephemeron-t3-large-0003'}, {'Key': 'type', 'Value': 'compute'}, {'Key': 'cluster', 'Value': 'ephemeron'}], 'VirtualizationType': 'hvm', 'CpuOptions': {'CoreCount': 1, 'ThreadsPerCore': 2}, 'CapacityReservationSpecification': {'CapacityReservationPreference': 'open'}, 'HibernationOptions': {'Configured': False}, 'MetadataOptions': {'State': 'pending', 'HttpTokens': 'optional', 'HttpPutResponseHopLimit': 1, 'HttpEndpoint': 'enabled'}, 'EnclaveOptions': {'Enabled': False}}

@milliams Any thoughts on this? Could this cause problems? I'm hoping to use CITC for teaching next week :)

(EDIT: line numbers for aws.py in the backtrace are slightly out due to some print calls I've added)

milliams commented 3 years ago

This should not cause any issues. The watchdog's job is to reconcile the state between Slurm and AWS. Currently it only keeps track of things and has not yet learned to correct any issues. You are safe to disable and stop the service.
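To illustrate what "reconcile" means here, the following is a minimal sketch and not the actual watchdog code. It assumes each side can be reduced to a set of node names; the function `reconcile` and the node names in the example are hypothetical.

```python
from typing import Dict, Set


def reconcile(slurm_nodes: Set[str], cloud_nodes: Set[str]) -> Dict[str, Set[str]]:
    """Compare the nodes Slurm knows about with the nodes the cloud reports.

    Hypothetical helper: it only reports differences and corrects nothing,
    which matches the "keeps track but does not fix" behaviour described above.
    """
    return {
        "only_in_slurm": slurm_nodes - cloud_nodes,  # Slurm expects them, cloud has none
        "only_in_cloud": cloud_nodes - slurm_nodes,  # running in the cloud, unknown to Slurm
        "in_both": slurm_nodes & cloud_nodes,        # consistent on both sides
    }


# Made-up node names, for illustration only.
report = reconcile(
    {"node-a", "node-b"},
    {"node-b", "node-c"},
)
print(report)
```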

Are you able to submit jobs and have them start as you expect? If so then this problem can be ignored. If not, then this points towards an issue. It may be that you have some VMs that it's finding, trying to track and failing.

As for the cause of the problem: it seems that when talking to the API, the watchdog is not getting back one of the fields that it expects. I will look into this later.
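One way to tolerate a missing field is to read it with `dict.get` rather than indexing. The following is a sketch under assumptions, not the real `from_response` in `citc/aws.py`: the `NodeInfo` class and its fields are hypothetical, and the sample dictionary is a trimmed copy of the response posted above.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional


@dataclass
class NodeInfo:
    # Hypothetical stand-in for the real node class in citc/aws.py.
    name: Optional[str]
    state: str
    ip: Optional[str]


def from_response(response: Dict[str, Any]) -> NodeInfo:
    # Tags arrive as a list of {"Key": ..., "Value": ...} dicts.
    tags = {t["Key"]: t["Value"] for t in response.get("Tags", [])}
    return NodeInfo(
        name=tags.get("Name"),
        state=response["State"]["Name"],
        # .get() yields None instead of raising KeyError when the field is absent.
        ip=response.get("PrivateIpAddress"),
    )


# Trimmed copy of the response from the report above.
terminated = {
    "InstanceId": "i-0dbee8bd641226ae8",
    "State": {"Code": 48, "Name": "terminated"},
    "NetworkInterfaces": [],
    "Tags": [{"Key": "Name", "Value": "ephemeron-t3-large-0003"}],
}
node = from_response(terminated)
print(node)
```

Any caller would then have to cope with `ip` being `None`, so whether this is the right fix depends on how the rest of the watchdog uses the address.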

willfurnass commented 3 years ago

Thanks Matt. I'll disable the svc for now to keep the system log cleaner. I have been having some issues starting nodes, which I thought could be related to this but have just realised I've hit my AWS instance limit for my chosen instance type. Doh!

Looks like the API response doesn't contain any network interface info (NetworkInterfaces is an empty list), which seems odd.
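The instance in that response is in the `terminated` state, which may be why it has no interfaces or private IP. An alternative to parsing such entries is to skip them before they reach the parser. A minimal sketch, assuming the instance dictionaries have the shape shown above; `GONE_STATES` and `live_instances` are hypothetical names, not part of CitC:

```python
from typing import Any, Dict, Iterable, List

# EC2 states in which an instance is going away or gone.
# Treating both as "not a usable node" is an assumption of this sketch.
GONE_STATES = {"shutting-down", "terminated"}


def live_instances(instances: Iterable[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Drop instances that are shutting down or already terminated."""
    return [i for i in instances if i["State"]["Name"] not in GONE_STATES]


# The first entry is trimmed from the response above; the second is made up.
instances = [
    {"InstanceId": "i-0dbee8bd641226ae8", "State": {"Code": 48, "Name": "terminated"}},
    {"InstanceId": "i-example", "State": {"Code": 16, "Name": "running"},
     "PrivateIpAddress": "10.0.0.5"},
]
kept = live_instances(instances)
print([i["InstanceId"] for i in kept])
```

The same filtering can probably be done server-side instead, by passing an `instance-state-name` filter to the `describe_instances` call, though I haven't checked how the existing query in `aws.py` is built.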