Closed: mikeoconnor0308 closed this issue 4 years ago
Hi Mike,

Thanks for the sosreport; that makes it easy to track down. Looking at /var/log/slurm/elastic.log (readable with sudo from the opc or citc user) we can see that everything was fine on the 18th:
2019-11-18 17:32:54,865 stopnode INFO Stopping vm-gpu2-1-ad3-0001
2019-11-18 17:32:55,242 stopnode INFO Stopped vm-gpu2-1-ad3-0001
But on the 19th you started getting throttling ("429") and then "500" errors from the Oracle back end:
2019-11-19 10:29:52,476 startnode INFO vm-gpu2-1-ad3-0001: Starting
2019-11-19 10:31:35,305 startnode ERROR vm-gpu2-1-ad3-0001: problem launching instance: {'opc-request-id': '63FB30C015DE4C75A5CD004CFFCA2BC0/F04951D6CB9BEF1FFED95CEFF12E6036/0A9DD23BAC578D8E8D22FD1E45A4CF77', 'code': 'TooManyRequests', 'message': 'Too many requests for the user', 'status': 429}
2019-11-19 10:41:54,531 startnode INFO vm-gpu2-1-ad3-0002: Starting
2019-11-19 10:43:51,572 startnode ERROR vm-gpu2-1-ad3-0002: problem launching instance: {'opc-request-id': 'CE6D915856AB435599876B0E21350970/BB28C9E44B515B9E2EF7BE3D65B422E1/2175F6F30DD5164235B3888EB4CE7E41', 'code': 'InternalError', 'message': 'Out of host capacity.', 'status': 500}
2019-11-19 10:53:56,606 startnode INFO vm-gpu2-1-ad3-0003: Starting
2019-11-19 10:56:07,377 startnode ERROR vm-gpu2-1-ad3-0003: problem launching instance: {'opc-request-id': '6E5D22CF78AE48678A7987E6DC750BEA/7FF311BB5CBBD07E240DAA81C8E29760/574E91E3BFB2954F49DF7E16C2D68176', 'code': 'InternalError', 'message': 'Out of host capacity.', 'status': 500}
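To pull out just the failures, something like this works from the management node (a minimal sketch; the log path is the one above):

```shell
# Show only the error lines from the Slurm elastic log
sudo grep ERROR /var/log/slurm/elastic.log
```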
You could try raising a service request with Oracle. I suspect they will suggest you try another "shape", another availability domain or another region. The first two are easy to achieve by editing the limits.yaml file and rerunning finish. Changing region will involve tearing down and rebuilding the cluster (and moving all the data off before the terraform destroy).
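As an illustration of the first two options, the limits.yaml edit might look something like this (a sketch only; the shape name is inferred from the vm-gpu2-1-ad3-* node names, and the shape-to-AD mapping is an assumed layout, so check it against your existing file):

```yaml
# limits.yaml (assumed format): nodes allowed per shape, per availability domain
VM.GPU2.1:
  1: 2   # try AD 1 instead
  2: 0
  3: 0   # AD 3 is the one reporting "Out of host capacity"
```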
finish does restart slurmctld, but that won't have any effect on the cloud service provider's capacity.
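Once finish has run, the standard Slurm commands can confirm that slurmctld came back up and picked up the node definitions (nothing CitC-specific here):

```shell
# List every node Slurm knows about, with its state
sinfo -Nl
# BOOT_TIME shows when slurmctld last (re)started
scontrol show config | grep -i boot_time
```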
Yeah, you're right, it was capacity problems on Oracle.
Hi,
I've been happily running a cluster for a couple of months now. In the last few days, however, jobs have been getting stuck pending, apparently waiting for resources, and I've confirmed there are no compute nodes up.
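The checks were along these lines (a sketch of the standard Slurm status commands; the exact flags may have differed):

```shell
# Pending jobs with their reason codes ("Resources", "Priority", ...)
squeue --states=PENDING -o "%.10i %.9P %.20j %.8T %R"
# Node states across all partitions
sinfo
```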
Is there any way to debug what's going on here? sosreport attached:
sosreport-mgmt-mikeoconnor0308-2019-11-20-wsfwtat.tar.zip
I appreciate that the version of citc I'm running is quite old (commit id: 91f5b5578ebe9e3118d5240987147ce1b4bdf5ed), but I'd rather not spin the cluster up again right now!