aws / aws-parallelcluster

AWS ParallelCluster is an AWS supported Open Source cluster management tool to deploy and manage HPC clusters in the AWS cloud.
https://github.com/aws/aws-parallelcluster
Apache License 2.0
826 stars, 312 forks

unusual SGE error logs when spot instances in cluster are lost #912

Closed keien closed 5 years ago

keien commented 5 years ago

Environment:

Bug description and how to reproduce: When AWS reclaims spot instances, the cleanup sometimes completes correctly and leaves no zombie nodes behind. In other cases it leaves zombie nodes that hold onto jobs in the r (running) state.

Yesterday one of our users started a massive job involving 50+ p3.2xlarge spot instances, which are highly volatile. When I checked this morning, there were 15+ zombie nodes. I saw some unusual logs in /var/log/sqswatcher, so I thought I'd report them. See below:

Additional context:

2019-03-04 19:17:11,318 ERROR [sge:__runSgeCommand] Failed to run ['/opt/sge/bin/lx-amd64/qconf', '-ah']

error: no option argument provided to "-as"
SGE 8.1.9
usage: qconf [options]
   [-aattr obj_nm attr_nm val obj_id_list]  add to a list attribute of an object
   [-Aattr obj_nm fname obj_id_list]        add to a list attribute of an object
   [-acal calendar_name]                    add a new calendar
   [-Acal fname]                            add a new calendar from file
   [-Ackpt fname]                           add a checkpointing interface definition from file
   [-aconf host_list]                       add configurations
   [-Aconf file_list]                       add configurations from file_list
   [-ae [exec_server_template]]             add an exec host using a template
   [-Ae fname]                              add an exec host from file
   [-ah hostname_list]                      add an administrative host
   [-ahgrp group]                           add new host group entry
   [-Ahgrp file]                            add new host group entry from file
   [-arqs [rqs_list]]                       add resource quota set(s)
   [-Arqs fname]                            add resource quota set(s) from file
   [-am user_list]                          add user to manager list
   [-ao user_list]                          add user to operator list
   [-ap pe-name]                            add a new parallel environment
   [-aprj]                                  add project
   [-Aprj fname]                            add project from file
   [-aq [queue_name]]                       add a new cluster queue
   [-Aq fname]                              add a queue from file
   [-ar ar_id]                              bind job to advance reservation
   [-as hostname_list]                      add a submit host
   [-astree]                                create/modify the sharetree
   [-Astree fname]                          create/modify the sharetree from file
   [-au user_list listname_list]            add user(s) to userset list(s)
   [-Au fname]                              add userset from file
   [-auser]                                 add user
   [-Auser fname]                           add user from file
   [-ckpt ckpt-name]                        request checkpoint method
   [-clear]                                 skip previous definitions for job
   [-clearusage [user_list]]                clear sharetree usage for user_list or all users/projects
   [-cq wc_queue_list]                      clean queue
   [-dattr obj_nm attr_nm val obj_id_list]  delete from a list attribute of an object
   [-Dattr obj_nm fname obj_id_list]        delete from a list attribute of an object
   [-dcal calendar_name]                    delete calendar
   [-dckpt ckpt_name]                       delete checkpointing interface definition
   [-dconf host_list]                       delete local configurations
   [-de host_list]                          delete exec host
   [-display display]                       set DISPLAY variable inside interactive job (not available for qrsh without command)
   [-dh host_list]                          delete administrative host
   [-dhgrp group]                           delete host group entry
   [-dl date_time]                          request a deadline initiation time
   [-drqs rqs_list]                         delete resource quota set(s)
   [-dm user_list]                          delete user from manager list
   [-do user_list]                          delete user from operator list
   [-dp pe-name]                            delete parallel environment
   [-dprj project_list]                     delete project
   [-dq wc_queue_list]                      delete queue
   [-ds host_list]                          delete submit host
   [-dstnode node_list]                     delete sharetree node(s)
   [-dstree]                                delete the sharetree
   [-du user_list listname_list]            delete user(s) from userset list(s)
   [-duser user_list]                       delete user(s)
   [-help]                                  print this help
   [-ke[j] host_list                        shutdown execution daemon(s)
   [-k{m|s}]                                shutdown master|scheduling thread
   [-kec evid_list]                         kill event client
   [-kt thread_name]                        kill qmaster thread
   [-Mattr obj_nm fname obj_id_list]        modify an attribute (or element in a sublist) of an object
   [-mc ]                                   modify complex attributes
   [-Mc fname]                              modify complex attributes from file
   [-mcal calendar_name]                    modify calendar
   [-Mcal fname]                            modify calendar from file
   [-mckpt ckpt_name]                       modify a checkpointing interface definition
   [-Mckpt fname]                           modify a checkpointing interface definition from file
   [-mconf [host_list|global]]              modify configurations
   [-Mconf file_list]                       modify configurations from file_list
   [-me server]                             modify exec server
   [-Me fname]                              modify exec server from file
   [-mhgrp group]                           modify host group entry
   [-Mhgrp file]                            modify host group entry from file
   [-mrqs [rqs_list]]                       modify resource quota set(s)
   [-Mrqs fname [rqs_list]]                 modify resource quota set(s) from file
   [-mp pe-name]                            modify a parallel environment
   [-Mp fname]                              modify a parallel environment from file
   [-mprj project]                          modify a project
   [-Mprj fname]                            modify project from file
   [-mq queue]                              modify a queue
   [-Mq fname]                              modify a queue from file
   [-msconf]                                modify scheduler configuration
   [-Msconf fname]                          modify scheduler configuration from file
   [-mstnode node_shares_list]              modify sharetree node(s)
   [-mstree]                                modify/create the sharetree
   [-Mstree fname]                          modify/create the sharetree from file
   [-mu listname_list]                      modify the given userset list
   [-Mu fname]                              modify userset from file
   [-muser user]                            modify a user
   [-Muser fname]                           modify a user from file
   [-purge obj_nm3 attr_nm objectname]      deletes attribute from object_instance
   [-R y[es]|n[o]]                          reservation desired
   [-rattr obj_nm attr_nm val obj_id_list]  replace a list attribute of an object
   [-Rattr obj_nm fname obj_id_list]        replace a list attribute of an object
   [-rsstnode node_list]                    show sharetree node(s) and its children
   [-sc]                                    show complex attributes
   [-scal calendar_name]                    show given calendar
   [-scall]                                 show a list of all calendar names
   [-sckpt ckpt_name]                       show checkpointing interface definition
   [-sckptl]                                show all checkpointing interface definitions
   [-sconf [host_list|global]]              show configurations
   [-sconfl]                                show a list of all local configurations
   [-sds]                                   show detached settings
   [-se server]                             show given exec server
   [-secl]                                  show event client list
   [-sel]                                   show a list of all exec servers
   [-sh]                                    show a list of all administrative hosts
   [-shgrp group]                           show host group
   [-shgrp_tree group]                      show host group and used hostgroups as tree
   [-shgrp_resolved group]                  show host group with resolved hostlist
   [-shgrpl]                                show host group list
   [-sm]                                    show a list of all managers
   [-so]                                    show a list of all operators
   [-sobjl obj_nm2 attr_nm val]             show objects which match the given value
   [-sp pe-name]                            show a parallel environment
   [-spl]                                   show all parallel environments
   [-sprj project]                          show a project
   [-sprjl]                                 show a list of all projects
   [-sq [wc_queue_list]]                    show the given queue
   [-sql]                                   show a list of all queues
   [-srqsl]                                 show resource quota set list
   [-ss]                                    show a list of all submit hosts
   [-ssconf]                                show scheduler configuration
   [-sstnode node_list]                     show sharetree node(s)
   [-sst]                                   show a formatted sharetree
   [-sstree]                                show the sharetree
   [-su listname_list]                      show the given userset list
   [-sul]                                   show a list of all userset lists
   [-suser user_list]                       show user(s)
   [-suserl]                                show a list of all users
   [-tc max_running_tasks]                  throttle the number of concurrent tasks
   [-tsm]                                   trigger scheduler monitoring
   [-verbose]                               verbose information output
   [-w e|w|n|v|p]                           verify mode (error|warning|none|just verify|poke) for jobs

complex_list            complex[,complex,...]
date_time               [[CC]YY]MMDDhhmm[.SS]
wc_queue_list           wc_queue[,wc_queue,...]
hostname_list           hostname[,hostname,...]
listname_list           listname[,listname,...]
rqs_list                rqs_name[,rqs_name,...]
node_list               node_path[,node_path,...]
node_path               [/]node_name[[/.]node_name...]
node_shares_list        node_path=shares[,node_path=shares,...]
user_list               user[,user,...]
obj_nm                  "queue"|"exechost"|"pe"|"ckpt"|"hostgroup"|"resource_quota"
attr_nm                 (see man pages)
obj_id_list             objectname [ objectname ...]
project_list            project[,project,...]
evid_list               all | evid[,evid,...]
host_list               all | hostname[,hostname,...]
obj_nm2                 "queue"|"queue_domain"|"queue_instance"|"exechost"
obj_nm3                 "queue"
ar_id                   advance reservation id
thread_name             "scheduler"|"jvm"
max_running_tasks       maximum number of simultaneously running tasks
2019-03-04 19:17:11,333 ERROR [sge:__runSgeCommand] Failed to run ['/opt/sge/bin/lx-amd64/qconf', '-as']

error: " " is the only character allowed between the attribute name and the value in line 2
error: error reading file: "/tmp/tmpozx5sP"
invalid format
2019-03-04 19:17:11,349 ERROR [sge:__runSgeCommand] Failed to run ['/opt/sge/bin/lx-amd64/qconf', '-Ae', '/tmp/tmpozx5sP']

2019-03-04 19:17:11,349 INFO [sge:addHost] Connecting to host:  iter: 0
2019-03-04 19:17:11,350 ERROR [sge:addHost] Socket error: [Errno -2] Name or service not known
2019-03-04 19:17:21,360 INFO [sge:addHost] Connecting to host:  iter: 1
2019-03-04 19:17:21,362 ERROR [sge:addHost] Socket error: [Errno -2] Name or service not known
2019-03-04 19:17:32,369 INFO [sge:addHost] Connecting to host:  iter: 2
2019-03-04 19:17:32,372 ERROR [sge:addHost] Socket error: [Errno -2] Name or service not known
2019-03-04 19:17:44,372 CRITICAL [sge:addHost] Unable to provison host
Traceback (most recent call last):
  File "/usr/bin/sqswatcher", line 11, in <module>
    sys.exit(main())
  File "/usr/lib/python2.7/site-packages/sqswatcher/sqswatcher.py", line 219, in main
    pollQueue(scheduler, q, t, proxy_config)
  File "/usr/lib/python2.7/site-packages/sqswatcher/sqswatcher.py", line 170, in pollQueue
    raise e
botocore.exceptions.ClientError: An error occurred (ValidationException) when calling the PutItem operation: One or more parameter values were invalid: An AttributeValue may not contain an empty string
2019-03-04 19:17:45,937 INFO [sqswatcher:main] sqswatcher startup
2019-03-04 19:17:46,188 INFO [sqswatcher:pollQueue] eventType=autoscaling:EC2_INSTANCE_TERMINATE
2019-03-04 19:17:46,188 INFO [sqswatcher:pollQueue] instanceId=i-0028ff6c36f76ad2c
2019-03-04 19:17:46,222 INFO [sge:removeHost] Removing ip-172-31-128-10
root@ip-172-31-128-194.us-west-2.compute.internal removed "ip-172-31-128-10.us-west-2.compute.internal" from administrative host list
root@ip-172-31-128-194.us-west-2.compute.internal modified "all.q" in cluster queue list
root@ip-172-31-128-194.us-west-2.compute.internal modified "@allhosts" in host group list
root@ip-172-31-128-194.us-west-2.compute.internal removed "ip-172-31-128-10.us-west-2.compute.internal" from execution host list
root@ip-172-31-128-194.us-west-2.compute.internal removed "ip-172-31-128-10.us-west-2.compute.internal" from submit host list

We had a bunch of these as spot instances were being taken away from us.
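
Reading the log above, the hostname passed to the SGE commands appears to have been empty: `qconf -ah` and `qconf -as` were run without a host argument, `Connecting to host:` logs a blank name, and the DynamoDB `PutItem` call rejected an empty string. The sketch below shows one way such a case could be guarded against. It is not the actual sqswatcher code or the fix that was shipped; the function names and the `instanceId`/`hostname` attribute names are hypothetical, chosen only for illustration.

```python
import logging

log = logging.getLogger("sqswatcher-sketch")

# Path taken from the log excerpt above.
QCONF = "/opt/sge/bin/lx-amd64/qconf"


def is_non_blank(value):
    """Return True only for a string with at least one non-whitespace character."""
    return isinstance(value, str) and bool(value.strip())


def build_add_host_commands(hostname, qconf=QCONF):
    """Build the qconf argument lists used to register a host (hypothetical helper).

    Returns an empty list when the hostname is blank, so the caller never
    runs `qconf -ah` or `qconf -as` without an argument.
    """
    if not is_non_blank(hostname):
        log.error("Refusing to add host with empty hostname: %r", hostname)
        return []
    host = hostname.strip()
    return [[qconf, "-ah", host], [qconf, "-as", host]]


def build_host_item(instance_id, hostname):
    """Build a DynamoDB-style item, or None if any value is blank.

    The attribute names are assumptions, not the real table schema.
    """
    if not (is_non_blank(instance_id) and is_non_blank(hostname)):
        log.error("Skipping item with blank value: %r, %r", instance_id, hostname)
        return None
    return {
        "instanceId": {"S": instance_id.strip()},
        "hostname": {"S": hostname.strip()},
    }
```

With a guard like this, a message that resolves to a blank hostname would be logged and skipped rather than raising out of the polling loop and forcing the daemon to restart, which is what the traceback above suggests happened.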

sean-smith commented 5 years ago

@keien Thanks for the bug report! I'm labelling this as a bug and we'll update this thread when we have a resolution.

enrico-usai commented 5 years ago

@keien I'm closing this issue since it has already been fixed by https://github.com/aws/aws-parallelcluster-node/pull/94 and released in version 2.2.1.

The same issue was previously reported in https://github.com/aws/aws-parallelcluster/issues/566 and https://github.com/aws/aws-parallelcluster/issues/743.

Please let us know if you have any questions.