thompsonphys opened this issue 5 years ago
Hi @thompsonphys,
The 500 error you see happens when Oracle have run out of physical machines to provide you with, regardless of whether your service limit is high enough.
The second error (the 429) should not happen if you created your cluster after the 31st of May, as that is when we added backoff and retry on 429 errors [ACRC/slurm-ansible-playbook#40]. If you made your cluster before that date, could you try recreating it?
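For reference, the pattern that change introduced looks roughly like this (a simplified sketch, not the exact code from the playbook; it assumes an oci.core.ComputeClient and a launch-details object built elsewhere):

# Simplified sketch of backoff-and-retry on 429s; not the playbook's exact code.
import time

import oci

def launch_with_retry(compute_client, launch_details, max_attempts=5):
    delay = 5  # seconds, doubled after each 429
    for attempt in range(max_attempts):
        try:
            return compute_client.launch_instance(launch_details)
        except oci.exceptions.ServiceError as e:
            # 429 "Too many requests for the user" is transient, so back off and retry;
            # 500 "Out of host capacity" (and anything else) is re-raised immediately.
            if e.status != 429 or attempt == max_attempts - 1:
                raise
            time.sleep(delay)
            delay *= 2

In other words, the 429s are retried with increasing delays, while capacity errors are still surfaced to the log.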
Hi @milliams,
Thanks for the quick response. After looking at our ansible logs, we believe that we're on the most up-to-date commit (we initialized the cluster last Friday, June 06):
Starting Ansible Pull at 2019-06-06 10:38:26
/usr/bin/ansible-pull --url=https://github.com/ACRC/slurm-ansible-playbook.git --checkout=3 --inventory=/root/hosts management.yml
[WARNING]: Could not match supplied host pattern, ignoring: mgmt
[WARNING]: Your git version is too old to fully support the depth argument.
Falling back to full checkouts.
mgmt.subnet.clustervcn.oraclevcn.com | CHANGED => {
"after": "2b0a76bd523a37cd60e43c343b6b6e3569519210",
"before": null,
"changed": true
}
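(In case it's useful, this is roughly how the "after" hash above can be checked against the head of the branch we pull from; the branch name "3" is taken from the --checkout flag above.)

# Rough sketch: compare the hash reported by ansible-pull with the head of
# the playbook branch on GitHub (branch "3" comes from the --checkout flag).
import subprocess

out = subprocess.run(
    ["git", "ls-remote", "https://github.com/ACRC/slurm-ansible-playbook.git", "3"],
    stdout=subprocess.PIPE, universal_newlines=True, check=True,
)
print(out.stdout)  # should show 2b0a76bd... if the checkout is up to date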
One thing to note, though: we were attempting to power the nodes up "manually" using, e.g.,
sudo scontrol update NodeName=bm-standard-e2-64-ad1-0001 State=POWER_UP
since they weren't activating automatically when a job was submitted; this manual approach could be the reason for the second error. Here's the elastic.log output that led to the error:
2019-06-10 10:58:02,536 startnode INFO bm-standard-e2-64-ad1-0002: Starting
2019-06-10 10:58:03,783 startnode INFO bm-standard-e2-64-ad1-0002: No VNIC attachment yet. Waiting...
2019-06-10 10:58:08,840 startnode INFO bm-standard-e2-64-ad1-0002: No VNIC attachment yet. Waiting...
2019-06-10 10:58:14,027 startnode INFO bm-standard-e2-64-ad1-0002: Private IP 10.1.0.5
2019-06-10 10:58:14,042 startnode INFO bm-standard-e2-64-ad1-0002: Started
2019-06-10 11:27:23,249 startnode INFO bm-standard-e2-64-ad1-0001: Starting
2019-06-10 11:27:27,248 startnode INFO bm-standard-e2-64-ad1-0003: Starting
2019-06-10 11:27:32,298 startnode INFO bm-standard-e2-64-ad1-0004: Starting
2019-06-10 11:29:14,584 startnode ERROR bm-standard-e2-64-ad1-0001: problem launching instance: {'opc-request-id': '8CC851AF29E440FBAAA3E1AA977DAF39/3D8570EDE9F365F4A251F66AC7E5C69D/2EA1C20544804FDC9862EBC3D8079656', 'code': 'InternalError', 'message': 'Out of host capacity.', 'status': 500}
2019-06-10 11:29:37,108 startnode ERROR bm-standard-e2-64-ad1-0003: problem launching instance: {'opc-request-id': 'E3D3A2D1DEB14B9C84CBB7FD6F2CA7B3/90862EA821FF46290B355B89CAE3A926/B4D05FEBD75749F988B1C201434A2A1C', 'code': 'InternalError', 'message': 'Out of host capacity.', 'status': 500}
2019-06-10 11:29:44,097 startnode ERROR bm-standard-e2-64-ad1-0004: problem launching instance: {'opc-request-id': '3C6061CCA87E48AEB2EA6CD54337095A/91C3637438037BA0656DBBB4C6059853/A4EFA370B17C6DEA35473DF758CE1F1E', 'code': 'InternalError', 'message': 'Out of host capacity.', 'status': 500}
2019-06-10 11:30:44,308 startnode INFO bm-standard-e2-64-ad1-0001: Starting
2019-06-10 11:32:07,976 startnode ERROR bm-standard-e2-64-ad1-0001: problem launching instance: {'opc-request-id': '07BF5FF7021E4BD5B7580DF99C44D23F/F122163E1641F3594B94D09F1EB83A9E/5077118583A2A8872A52AF2492160373', 'code': 'TooManyRequests', 'message': 'Too many requests for the user', 'status': 429}
The first node started with no difficulty since capacity was available on Oracle's end, but subsequent calls resulted in the 500 error and eventually the 429 error. This morning we were able to bring up the additional nodes using the same approach over the same timescale (calling all three back-to-back) and didn't encounter the 429 error.
When submitting jobs through Slurm, the AMD nodes we've specified in limits.yml are not automatically activating. We then follow the instructions on the elastic scaling page to manually call up a node and receive the 500 "Out of host capacity" error. After trying to launch three node instances, we also get the 429 "Too many requests" error.
We've actually had one success activating a node with this approach, but can't figure out why it worked in that particular case and not in others. In any case, we are well below the node limit in our AD. Any ideas?
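For context, a quick way to see what is actually running in the AD is the standard OCI Python SDK, along these lines (the compartment OCID and availability domain below are placeholders, and pagination is ignored for brevity):

# Rough sketch using the standard OCI Python SDK (not part of the playbook).
# The compartment OCID and availability domain are placeholders.
import oci

config = oci.config.from_file()  # reads ~/.oci/config by default
compute = oci.core.ComputeClient(config)

instances = compute.list_instances(
    compartment_id="ocid1.compartment.oc1..example",
    availability_domain="XXXX:US-ASHBURN-AD-1",
    lifecycle_state="RUNNING",
).data

print(f"{len(instances)} running instances in this AD")
for inst in instances:
    print(f"  {inst.display_name} ({inst.shape})")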