clusterinthecloud / terraform

Terraform config for Cluster in the Cloud
https://cluster-in-the-cloud.readthedocs.io
MIT License

Jobs infinitely pending (Resources) #53

Closed mikeoconnor0308 closed 4 years ago

mikeoconnor0308 commented 4 years ago

Hi,

I've been happily running a cluster for a couple of months now. In the last few days, however, jobs are stuck pending, apparently waiting for resources:

             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
               233   compute test.slm     mike PD       0:00      1 (Resources)

I've confirmed there are no compute nodes up:

PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
compute*     up   infinite      0    n/a

Is there any way to debug what's going on here? sosreport attached:

sosreport-mgmt-mikeoconnor0308-2019-11-20-wsfwtat.tar.zip

I appreciate that the version of citc I'm running is quite old (commit id: 91f5b5578ebe9e3118d5240987147ce1b4bdf5ed), but I'd rather not spin up the cluster again right now!

christopheredsall commented 4 years ago

Hi Mike,

Thanks for the sosreport; that makes this easy to track down.

Looking at /var/log/slurm/elastic.log (readable with sudo from the opc or citc user), we can see that everything was fine on the 18th:

2019-11-18 17:32:54,865 stopnode   INFO     Stopping vm-gpu2-1-ad3-0001
2019-11-18 17:32:55,242 stopnode   INFO      Stopped vm-gpu2-1-ad3-0001

But on the 19th you started getting 429 and then "500" errors from the Oracle back end:

2019-11-19 10:29:52,476 startnode  INFO     vm-gpu2-1-ad3-0001: Starting
2019-11-19 10:31:35,305 startnode  ERROR    vm-gpu2-1-ad3-0001:  problem launching instance: {'opc-request-id': '63FB30C015DE4C75A5CD004CFFCA2BC0/F04951D6CB9BEF1FFED95CEFF12E6036/0A9DD23BAC578D8E8D22FD1E45A4CF77', 'code': 'TooManyRequests', 'message': 'Too many requests for the user', 'status': 429}
2019-11-19 10:41:54,531 startnode  INFO     vm-gpu2-1-ad3-0002: Starting
2019-11-19 10:43:51,572 startnode  ERROR    vm-gpu2-1-ad3-0002:  problem launching instance: {'opc-request-id': 'CE6D915856AB435599876B0E21350970/BB28C9E44B515B9E2EF7BE3D65B422E1/2175F6F30DD5164235B3888EB4CE7E41', 'code': 'InternalError', 'message': 'Out of host capacity.', 'status': 500}
2019-11-19 10:53:56,606 startnode  INFO     vm-gpu2-1-ad3-0003: Starting
2019-11-19 10:56:07,377 startnode  ERROR    vm-gpu2-1-ad3-0003:  problem launching instance: {'opc-request-id': '6E5D22CF78AE48678A7987E6DC750BEA/7FF311BB5CBBD07E240DAA81C8E29760/574E91E3BFB2954F49DF7E16C2D68176', 'code': 'InternalError', 'message': 'Out of host capacity.', 'status': 500}
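A quick way to confirm this is an ongoing capacity problem rather than a one-off is to tally the error codes in that log. A minimal sketch (the log path comes from the comment above; the regex is an assumption based on the line format quoted here, so adjust it if your elastic.log differs):

```python
import re
from collections import Counter

# Assumed line format, based on the log excerpt above:
#   <timestamp> startnode  ERROR    <node>:  problem launching instance: {... 'code': '<Code>', ...}
ERROR_RE = re.compile(r"ERROR\s+\S+\s+problem launching instance:.*'code': '(\w+)'")

def tally_launch_errors(lines):
    """Count how often each back-end error code appears in elastic.log lines."""
    counts = Counter()
    for line in lines:
        m = ERROR_RE.search(line)
        if m:
            counts[m.group(1)] += 1
    return counts

# Example usage (run on the management node):
#   with open("/var/log/slurm/elastic.log") as f:
#       print(tally_launch_errors(f))
```

A pile of `InternalError` ("Out of host capacity") entries, as in the excerpt above, points at the cloud provider rather than at Slurm itself.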

You could try raising a service request with Oracle. I suspect they will suggest you try another "shape", another availability domain, or another region. The first two are easy to achieve by editing the limits.yaml file and rerunning finish. Changing region will involve tearing down and rebuilding the cluster (and moving all the data off before the terraform destroy).
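For illustration, a limits.yaml change along those lines might look like the sketch below. The shape name and availability-domain numbering here are assumptions inferred from the node names in the log (vm-gpu2-1-ad3-*); check your own limits.yaml for the exact format your version of citc uses:

```yaml
# Hypothetical limits.yaml sketch: shift capacity for the VM.GPU2.1 shape
# away from availability domain 3, which is reporting "Out of host capacity".
VM.GPU2.1:
  1: 2   # allow up to 2 nodes in AD 1 instead
  2: 0
  3: 0   # stop trying the exhausted AD
```

After editing, rerun finish so the change is picked up.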

christopheredsall commented 4 years ago

finish does restart slurmctld

(https://github.com/ACRC/slurm-ansible-playbook/blob/bf3e524ec7a51995e0c07c376184037cfd5e38b0/roles/finalise/files/finish.py#L29)

But that won't have any effect on the cloud service provider's capacity.

mikeoconnor0308 commented 4 years ago

Yeah, you're right, it was a capacity problem on Oracle's side.