Configuration:
CentOS 7 (elrepo kernel)
Slurm 16.05.10
etcd 3.1.9
slurm-spank-plugins-0.37
python-etcd 0.34
python-grpcio / python-grpcio_tools 1.20.1
python-protobuf-3.6.0
one login node and 3 worker nodes (with etcd on each worker)
I'm facing this problem with pcocc 0.4.0, 0.5.0, and 0.6.2, and it is the same with both the http and https protocols.
Slurm settings are configured as described in the (stable) installation instructions.
All requirements and dependencies are installed.
Functions such as importing a VM image or showing templates work well.
The cluster itself works well: there are no problems with other jobs, and the job for pcocc alloc is created successfully.
I would be glad for any help.
EXAMPLE TASK LOG:
pcocc -vvv alloc -c2 ubuntu:1
DEBUG:root:Loading system config
DEBUG:root:Loading user config
salloc: Granted job allocation 26
salloc: Waiting for resource configuration
salloc: Nodes tesla2 are ready for job
DEBUG:root:Loading system config
DEBUG:root:Loading user config
DEBUG:root:Starting pcocc launcher
DEBUG:root:Launching hypervisors
DEBUG:root:Starting etcd client
DEBUG:etcd.client:New etcd client created for https://tesla2:2379
INFO:urllib3.connectionpool:Starting new HTTPS connection (1): tesla2
DEBUG:urllib3.connectionpool:"GET /v2/machines HTTP/1.1" 200 61
DEBUG:etcd.client:Retrieved list of machines: [u'https://tesla1:2379', u'https://tesla2:2379', u'https://tesla3:2379']
DEBUG:etcd.client:Machines cache initialised to ['https://tesla3:2379', u'https://tesla1:2379']
INFO:root:Started etcd client
DEBUG:etcd.client:Issuing read for key /pcocc/cluster/26/state/hosts with args {'recurse': True}
DEBUG:urllib3.connectionpool:"GET /v2/keys/pcocc/cluster/26/state/hosts HTTP/1.1" 404 72
Configuring hosts... (waiting for batch manager)DEBUG:etcd.client:About to wait on key /pcocc/cluster/26/state/hosts, index 17
DEBUG:etcd.client:Issuing read for key /pcocc/cluster/26/state/hosts with args {'waitIndex': 17, 'recursive': True, 'timeout': 0, 'wait': True}
DEBUG:urllib3.connectionpool:"GET /v2/keys/pcocc/cluster/26/state/hosts?waitIndex=17&recursive=true&wait=true HTTP/1.1" 200 None
DEBUG:root:Loading system config
DEBUG:root:Loading user config
DEBUG:root:Starting etcd client
DEBUG:etcd.client:New etcd client created for https://tesla3:2379
INFO:urllib3.connectionpool:Starting new HTTPS connection (1): tesla3
DEBUG:urllib3.connectionpool:"GET /v2/machines HTTP/1.1" 200 61
DEBUG:etcd.client:Retrieved list of machines: [u'https://tesla1:2379', u'https://tesla2:2379', u'https://tesla3:2379']
DEBUG:etcd.client:Machines cache initialised to ['https://tesla2:2379', u'https://tesla1:2379']
INFO:root:Started etcd client
DEBUG:etcd.client:Issuing read for key /pcocc/cluster/users/bilenkorv/26/definition with args {}
DEBUG:urllib3.connectionpool:"GET /v2/keys/pcocc/cluster/users/bilenkorv/26/definition HTTP/1.1" 404 72
DEBUG:etcd.client:About to wait on key /pcocc/cluster/users/bilenkorv/26/definition, index 17
DEBUG:etcd.client:Issuing read for key /pcocc/cluster/users/bilenkorv/26/definition with args {'waitIndex': 17, 'recursive': True, 'timeout': 0, 'wait': True}
DEBUG:urllib3.connectionpool:"GET /v2/keys/pcocc/cluster/users/bilenkorv/26/definition?waitIndex=17&recursive=true&wait=true HTTP/1.1" 200 None
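From the log, the launcher appears to be stuck on a blocking etcd v2 watch: the read of /pcocc/cluster/26/state/hosts returns 404 (the key does not exist yet) and the client then waits on it indefinitely. A minimal sketch for rebuilding the same requests so the keys can be inspected by hand (the hostname, port, and certificate path are assumptions from my setup, not pcocc internals):

```python
# Rebuild the etcd v2 URLs that pcocc issues in the log above, so the
# key the launcher is blocked on can be queried manually with curl.
try:
    from urllib.parse import urlencode  # Python 3
except ImportError:
    from urllib import urlencode        # Python 2, as used by pcocc here

ETCD = "https://tesla2:2379"            # any etcd member works
KEY = "/pcocc/cluster/26/state/hosts"   # key taken from the log

# Plain read -- this is the request that returned 404 in the log,
# i.e. the batch manager side never wrote the key:
read_url = "{0}/v2/keys{1}".format(ETCD, KEY)

# Blocking watch -- this is the request the launcher then hangs on:
watch_url = "{0}/v2/keys{1}?{2}".format(
    ETCD, KEY,
    urlencode([("waitIndex", 17), ("recursive", "true"), ("wait", "true")]))

print(read_url)
print(watch_url)
```

Either URL can then be fetched with curl (e.g. `curl --cacert /path/to/etcd-ca.crt "$url"`, path being an assumption) to confirm whether the key ever appears, which helps distinguish an etcd permission/connectivity problem from the batch-manager side never publishing the host state.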
scontrol job status:
scontrol show job 26
JobId=26 JobName=pcocc
   UserId=bilenkorv(1626206999) GroupId=domain users(1626200513) MCS_label=N/A
   Priority=4294901737 Nice=0 Account=(null) QOS=normal
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=0 Reboot=0 ExitCode=0:0
   RunTime=00:00:46 TimeLimit=UNLIMITED TimeMin=N/A
   SubmitTime=2021-05-31T21:02:13 EligibleTime=2021-05-31T21:02:13
   StartTime=2021-05-31T21:02:13 EndTime=Unknown Deadline=N/A
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   Partition=vm AllocNode:Sid=gpu:10752
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=tesla2 BatchHost=tesla2
   NumNodes=1 NumCPUs=2 NumTasks=1 CPUs/Task=2 ReqB:S:C:T=0:0::
   TRES=cpu=2,mem=4000M,node=1
   Socks/Node= NtasksPerN:B:S:C=0:0::1 CoreSpec=*
   MinCPUsNode=2 MinMemoryCPU=2000M MinTmpDiskNode=0
   Features=(null) Gres=(null) Reservation=(null)
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=(null)
   WorkDir=/share
   Power=