unusual SGE error logs when spot instances in cluster are lost #912

Closed: keien closed this issue 5 years ago

keien commented 5 years ago


Bug description and how to reproduce: When AWS reclaims spot instances, the cleanup sometimes completes correctly and leaves no zombie nodes behind. In other cases it leaves zombie nodes that hold onto jobs in the r state.

Yesterday one of our users started a massive job involving 50+ p3.2xlarge spot instances, which are highly volatile. When I checked this morning, there were 15+ zombie nodes. I saw some unusual logs in /var/log/sqswatcher, so I thought I'd report them. See below:

Additional context:

2019-03-04 19:17:11,318 ERROR [sge:__runSgeCommand] Failed to run ['/opt/sge/bin/lx-amd64/qconf', '-ah']

2019-03-04 19:17:11,333 ERROR [sge:__runSgeCommand] Failed to run ['/opt/sge/bin/lx-amd64/qconf', '-as']

error: " " is the only character allowed between the attribute name and the value in line 2
error: error reading file: "/tmp/tmpozx5sP"
invalid format
2019-03-04 19:17:11,349 ERROR [sge:__runSgeCommand] Failed to run ['/opt/sge/bin/lx-amd64/qconf', '-Ae', '/tmp/tmpozx5sP']

2019-03-04 19:17:11,349 INFO [sge:addHost] Connecting to host:  iter: 0
2019-03-04 19:17:11,350 ERROR [sge:addHost] Socket error: [Errno -2] Name or service not known
2019-03-04 19:17:21,360 INFO [sge:addHost] Connecting to host:  iter: 1
2019-03-04 19:17:21,362 ERROR [sge:addHost] Socket error: [Errno -2] Name or service not known
2019-03-04 19:17:32,369 INFO [sge:addHost] Connecting to host:  iter: 2
2019-03-04 19:17:32,372 ERROR [sge:addHost] Socket error: [Errno -2] Name or service not known
2019-03-04 19:17:44,372 CRITICAL [sge:addHost] Unable to provison host
Traceback (most recent call last):
  File "/usr/bin/sqswatcher", line 11, in <module>
  File "/usr/lib/python2.7/site-packages/sqswatcher/", line 219, in main
    pollQueue(scheduler, q, t, proxy_config)
  File "/usr/lib/python2.7/site-packages/sqswatcher/", line 170, in pollQueue
    raise e
botocore.exceptions.ClientError: An error occurred (ValidationException) when calling the PutItem operation: One or more parameter values were invalid: An AttributeValue may not contain an empty string
2019-03-04 19:17:45,937 INFO [sqswatcher:main] sqswatcher startup
2019-03-04 19:17:46,188 INFO [sqswatcher:pollQueue] eventType=autoscaling:EC2_INSTANCE_TERMINATE
2019-03-04 19:17:46,188 INFO [sqswatcher:pollQueue] instanceId=i-0028ff6c36f76ad2c
2019-03-04 19:17:46,222 INFO [sge:removeHost] Removing ip-172-31-128-10
removed "" from administrative host list
modified "all.q" in cluster queue list
modified "@allhosts" in host group list
removed "" from execution host list
removed "" from submit host list

We had a bunch of these as spot instances were being taken away from us.
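
To gauge how often this happens, here is a minimal sketch that scans a log for the two symptoms visible above: an `addHost` attempt with a blank hostname, and `removed "" from ...` messages. The function name and regular expressions are my own and are not part of sqswatcher. The sketch also assumes that a healthy `addHost` line puts the hostname between `host:` and `iter:`, which the excerpt does not confirm.

```python
import re

# Patterns derived from the log excerpt in this report.
# Assumption: a normal line reads "Connecting to host: <hostname> iter: N",
# so "host:" followed only by whitespace before "iter:" means the name was empty.
EMPTY_HOST_CONNECT = re.compile(r"\[sge:addHost\] Connecting to host:\s+iter: \d+")
EMPTY_HOST_REMOVE = re.compile(r'removed "" from')


def find_empty_hostname_events(lines):
    """Return (line_number, text) pairs for lines that reference an empty hostname.

    Hypothetical helper for triage only; line numbers are 1-based.
    """
    hits = []
    for number, line in enumerate(lines, start=1):
        if EMPTY_HOST_CONNECT.search(line) or EMPTY_HOST_REMOVE.search(line):
            hits.append((number, line.rstrip("\n")))
    return hits


# Small sample: the first and third lines are taken from the excerpt,
# the last line is an assumed example of a healthy addHost message.
sample = [
    "2019-03-04 19:17:11,349 INFO [sge:addHost] Connecting to host:  iter: 0",
    "2019-03-04 19:17:11,350 ERROR [sge:addHost] Socket error: [Errno -2] Name or service not known",
    'removed "" from execution host list',
    "2019-03-04 19:17:11,349 INFO [sge:addHost] Connecting to host: ip-172-31-128-10 iter: 0",
]
sample_hits = find_empty_hostname_events(sample)
```

To run it against the real log, pass an open file object, for example `find_empty_hostname_events(open("/var/log/sqswatcher"))`. It only counts symptoms and does not diagnose why the hostname was empty.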

sean-smith commented 5 years ago

@keien Thanks for the bug report! I'm labelling this as a bug and we'll update this thread when we have a resolution.

enrico-usai commented 5 years ago

@keien I'm going to close this issue since it has already been solved by and released with version 2.2.1.

The same issue was already reported here:

Please let us know if you have any questions.