cea-hpc / pcocc


Pcocc alloc stuck on Configuring hosts... (waiting for batch manager) #18

Closed: romanbilenko174 closed this issue 2 years ago

romanbilenko174 commented 3 years ago

Configuration:

- CentOS 7 (elrepo kernel)
- Slurm 16.05.10
- etcd 3.1.9
- slurm-spank-plugins 0.37
- python-etcd 0.34
- python-grpcio / python-grpcio_tools 1.20.1
- python-protobuf 3.6.0
- one login node and 3 worker nodes (etcd running on each worker)

I am facing this problem with pcocc 0.4.0, 0.5.0 and 0.6.2, and it is the same with both the http and https protocols. The Slurm settings are configured as described in the (stable) installation instructions, and all requirements and dependencies are installed. Functions such as importing a VM image or showing templates work well. The cluster itself is healthy: other jobs run without problems, and the allocation for pcocc alloc is created successfully.
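Since the hang is identical over http and https, basic etcd connectivity can be ruled out first by reproducing the read pcocc performs. A minimal sketch with python-etcd (the client pcocc uses): the hostname tesla2 and job id 26 are taken from the log below, and the ca_cert path is a placeholder for wherever the etcd CA certificate lives.

```python
import etcd

# Reproduce, outside of pcocc, the read the launcher issues against etcd.
# Host, port and job id are taken from the task log below; the ca_cert
# path is a placeholder, adjust it to your etcd CA certificate file.
client = etcd.Client(host='tesla2', port=2379, protocol='https',
                     allow_reconnect=True,
                     ca_cert='/etc/pcocc/etcd-ca.pem')

try:
    result = client.read('/pcocc/cluster/26/state/hosts', recursive=True)
    print(result)
except etcd.EtcdKeyNotFound:
    # This is the same 404 the launcher gets before it starts waiting
    print('key not written yet, but connectivity and TLS are fine')
```

Both a successful read and an EtcdKeyNotFound mean the connection and certificates work; only a connection or certificate error would point at the etcd setup itself.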

I will be glad for any help.

Example task log:


```
pcocc -vvv alloc -c2 ubuntu:1
DEBUG:root:Loading system config
DEBUG:root:Loading user config
salloc: Granted job allocation 26
salloc: Waiting for resource configuration
salloc: Nodes tesla2 are ready for job
DEBUG:root:Loading system config
DEBUG:root:Loading user config
DEBUG:root:Starting pcocc launcher
DEBUG:root:Launching hypervisors
DEBUG:root:Starting etcd client
DEBUG:etcd.client:New etcd client created for https://tesla2:2379
INFO:urllib3.connectionpool:Starting new HTTPS connection (1): tesla2
DEBUG:urllib3.connectionpool:"GET /v2/machines HTTP/1.1" 200 61
DEBUG:etcd.client:Retrieved list of machines: [u'https://tesla1:2379', u'https://tesla2:2379', u'https://tesla3:2379']
DEBUG:etcd.client:Machines cache initialised to ['https://tesla3:2379', u'https://tesla1:2379']
INFO:root:Started etcd client
DEBUG:etcd.client:Issuing read for key /pcocc/cluster/26/state/hosts with args {'recurse': True}
DEBUG:urllib3.connectionpool:"GET /v2/keys/pcocc/cluster/26/state/hosts HTTP/1.1" 404 72
Configuring hosts... (waiting for batch manager)
DEBUG:etcd.client:About to wait on key /pcocc/cluster/26/state/hosts, index 17
DEBUG:etcd.client:Issuing read for key /pcocc/cluster/26/state/hosts with args {'waitIndex': 17, 'recursive': True, 'timeout': 0, 'wait': True}
DEBUG:urllib3.connectionpool:"GET /v2/keys/pcocc/cluster/26/state/hosts?waitIndex=17&recursive=true&wait=true HTTP/1.1" 200 None
DEBUG:root:Loading system config
DEBUG:root:Loading user config
DEBUG:root:Starting etcd client
DEBUG:etcd.client:New etcd client created for https://tesla3:2379
INFO:urllib3.connectionpool:Starting new HTTPS connection (1): tesla3
DEBUG:urllib3.connectionpool:"GET /v2/machines HTTP/1.1" 200 61
DEBUG:etcd.client:Retrieved list of machines: [u'https://tesla1:2379', u'https://tesla2:2379', u'https://tesla3:2379']
DEBUG:etcd.client:Machines cache initialised to ['https://tesla2:2379', u'https://tesla1:2379']
INFO:root:Started etcd client
DEBUG:etcd.client:Issuing read for key /pcocc/cluster/users/bilenkorv/26/definition with args {}
DEBUG:urllib3.connectionpool:"GET /v2/keys/pcocc/cluster/users/bilenkorv/26/definition HTTP/1.1" 404 72
DEBUG:etcd.client:About to wait on key /pcocc/cluster/users/bilenkorv/26/definition, index 17
DEBUG:etcd.client:Issuing read for key /pcocc/cluster/users/bilenkorv/26/definition with args {'waitIndex': 17, 'recursive': True, 'timeout': 0, 'wait': True}
DEBUG:urllib3.connectionpool:"GET /v2/keys/pcocc/cluster/users/bilenkorv/26/definition?waitIndex=17&recursive=true&wait=true HTTP/1.1" 200 None
```
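The log shows two clients blocking: the launcher waits on /pcocc/cluster/26/state/hosts while a second process waits on /pcocc/cluster/users/bilenkorv/26/definition, and neither key is ever written. The same watches can be run outside pcocc to see whether anything writes these keys at all (a sketch with the same python-etcd client as above; the 60 second timeout is an arbitrary choice):

```python
import etcd

client = etcd.Client(host='tesla2', port=2379, protocol='https',
                     allow_reconnect=True)

for key in ('/pcocc/cluster/26/state/hosts',
            '/pcocc/cluster/users/bilenkorv/26/definition'):
    try:
        # Block until something writes below this key, at most 60 seconds
        event = client.watch(key, recursive=True, timeout=60)
        print(key, '->', event.action, event.key)
    except etcd.EtcdWatchTimedOut:
        print(key, '-> no write within 60 seconds')
```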


scontrol job status:


```
scontrol show job 26
JobId=26 JobName=pcocc
   UserId=bilenkorv(1626206999) GroupId=domain users(1626200513) MCS_label=N/A
   Priority=4294901737 Nice=0 Account=(null) QOS=normal
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=0 Reboot=0 ExitCode=0:0
   RunTime=00:00:46 TimeLimit=UNLIMITED TimeMin=N/A
   SubmitTime=2021-05-31T21:02:13 EligibleTime=2021-05-31T21:02:13
   StartTime=2021-05-31T21:02:13 EndTime=Unknown Deadline=N/A
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   Partition=vm AllocNode:Sid=gpu:10752
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=tesla2 BatchHost=tesla2
   NumNodes=1 NumCPUs=2 NumTasks=1 CPUs/Task=2 ReqB:S:C:T=0:0:*:*
   TRES=cpu=2,mem=4000M,node=1
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:1 CoreSpec=*
   MinCPUsNode=2 MinMemoryCPU=2000M MinTmpDiskNode=0
   Features=(null) Gres=(null) Reservation=(null)
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=(null)
   WorkDir=/share
   Power=
```
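Slurm reports the job RUNNING on tesla2 while the launcher keeps waiting, so it looks like whatever should publish the host state from the compute node never writes to etcd. Dumping everything stored under the job's prefix would confirm that (a sketch under the same client assumptions as above):

```python
import etcd

client = etcd.Client(host='tesla2', port=2379, protocol='https',
                     allow_reconnect=True)

try:
    # List every key pcocc has written for job 26 so far
    for node in client.read('/pcocc/cluster/26', recursive=True).leaves:
        print(node.key, '=', node.value)
except etcd.EtcdKeyNotFound:
    print('nothing written under /pcocc/cluster/26 yet')
```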