clusterinthecloud / terraform

Terraform config for Cluster in the Cloud
https://cluster-in-the-cloud.readthedocs.io
MIT License

Out of Host Capacity error on node activation #29

Open thompsonphys opened 5 years ago

thompsonphys commented 5 years ago

When submitting jobs through Slurm, the AMD nodes we've specified in limits.yml are not activating automatically. Following the instructions on the elastic scaling page, we then try to power up a node manually and receive this error:

2019-06-10 11:29:37,108 startnode  ERROR    bm-standard-e2-64-ad1-0003:  problem launching instance: {'opc-request-id': 'E3D3A2D1DEB14B9C84CBB7FD6F2CA7B3/90862EA821FF46290B355B89CAE3A926/B4D05FEBD75749F988B1C201434A2A1C', 'code': 'InternalError', 'message': 'Out of host capacity.', 'status': 500}

After trying to launch three node instances, we also get this error:

2019-06-10 11:32:07,976 startnode  ERROR    bm-standard-e2-64-ad1-0001:  problem launching instance: {'opc-request-id': '07BF5FF7021E4BD5B7580DF99C44D23F/F122163E1641F3594B94D09F1EB83A9E/5077118583A2A8872A52AF2492160373', 'code': 'TooManyRequests', 'message': 'Too many requests for the user', 'status': 429}

We've actually had one success activating a node this way, but we can't figure out why it worked in that particular case and not in the others. Otherwise we are well below the node limit for our availability domain (AD). Any ideas?

milliams commented 5 years ago

Hi @thompsonphys,

The 500 error you see happens when Oracle have run out of physical machines of the requested shape in that availability domain, regardless of whether your service limit is high enough.

The second error should not happen if you created your cluster more recently than the 31st of May, which is when we added backoff and retry on 429 errors (ACRC/slurm-ansible-playbook#40). If you made your cluster before that date, could you try recreating it?
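
For reference, the retry behaviour added there is along these lines: a minimal sketch of exponential backoff on HTTP 429 with the OCI Python SDK, not the playbook's actual code, and with function and variable names that are illustrative only:

import random
import time

import oci

def launch_with_backoff(compute, launch_details, max_retries=5):
    # compute is an oci.core.ComputeClient and launch_details an
    # oci.core.models.LaunchInstanceDetails; both names are illustrative.
    for attempt in range(max_retries):
        try:
            return compute.launch_instance(launch_details)
        except oci.exceptions.ServiceError as e:
            if e.status != 429:
                raise  # e.g. the 500 "Out of host capacity." is not retried
            # back off 1 s, 2 s, 4 s, ... plus jitter before the next attempt
            time.sleep(2 ** attempt + random.random())
    raise RuntimeError("still throttled after %d attempts" % max_retries)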

thompsonphys commented 5 years ago

Hi @milliams,

Thanks for the quick response. Looking at our Ansible logs, we believe we're on the most up-to-date commit (we initialized the cluster last Friday, June 06):

Starting Ansible Pull at 2019-06-06 10:38:26
/usr/bin/ansible-pull --url=https://github.com/ACRC/slurm-ansible-playbook.git --checkout=3 --inventory=/root/hosts management.yml
 [WARNING]: Could not match supplied host pattern, ignoring: mgmt
 [WARNING]: Your git version is too old to fully support the depth argument.
Falling back to full checkouts.
mgmt.subnet.clustervcn.oraclevcn.com | CHANGED => {
    "after": "2b0a76bd523a37cd60e43c343b6b6e3569519210", 
    "before": null, 
    "changed": true
}

One thing to note, though: we were attempting to power the nodes up "manually" using, e.g.,

sudo scontrol update NodeName=bm-standard-e2-64-ad1-0001 State=POWER_UP

since they weren't activating automatically when a job was submitted, and this could be the reason for the second error. Here's the elastic.log output that led to the error (for context on what State=POWER_UP actually triggers, see the sketch after the log):

2019-06-10 10:58:02,536 startnode  INFO     bm-standard-e2-64-ad1-0002: Starting
2019-06-10 10:58:03,783 startnode  INFO     bm-standard-e2-64-ad1-0002:  No VNIC attachment yet. Waiting...
2019-06-10 10:58:08,840 startnode  INFO     bm-standard-e2-64-ad1-0002:  No VNIC attachment yet. Waiting...
2019-06-10 10:58:14,027 startnode  INFO     bm-standard-e2-64-ad1-0002:   Private IP 10.1.0.5
2019-06-10 10:58:14,042 startnode  INFO     bm-standard-e2-64-ad1-0002:  Started
2019-06-10 11:27:23,249 startnode  INFO     bm-standard-e2-64-ad1-0001: Starting
2019-06-10 11:27:27,248 startnode  INFO     bm-standard-e2-64-ad1-0003: Starting
2019-06-10 11:27:32,298 startnode  INFO     bm-standard-e2-64-ad1-0004: Starting
2019-06-10 11:29:14,584 startnode  ERROR    bm-standard-e2-64-ad1-0001:  problem launching instance: {'opc-request-id': '8CC851AF29E440FBAAA3E1AA977DAF39/3D8570EDE9F365F4A251F66AC7E5C69D/2EA1C20544804FDC9862EBC3D8079656', 'code': 'InternalError', 'message': 'Out of host capacity.', 'status': 500}
2019-06-10 11:29:37,108 startnode  ERROR    bm-standard-e2-64-ad1-0003:  problem launching instance: {'opc-request-id': 'E3D3A2D1DEB14B9C84CBB7FD6F2CA7B3/90862EA821FF46290B355B89CAE3A926/B4D05FEBD75749F988B1C201434A2A1C', 'code': 'InternalError', 'message': 'Out of host capacity.', 'status': 500}
2019-06-10 11:29:44,097 startnode  ERROR    bm-standard-e2-64-ad1-0004:  problem launching instance: {'opc-request-id': '3C6061CCA87E48AEB2EA6CD54337095A/91C3637438037BA0656DBBB4C6059853/A4EFA370B17C6DEA35473DF758CE1F1E', 'code': 'InternalError', 'message': 'Out of host capacity.', 'status': 500}
2019-06-10 11:30:44,308 startnode  INFO     bm-standard-e2-64-ad1-0001: Starting
2019-06-10 11:32:07,976 startnode  ERROR    bm-standard-e2-64-ad1-0001:  problem launching instance: {'opc-request-id': '07BF5FF7021E4BD5B7580DF99C44D23F/F122163E1641F3594B94D09F1EB83A9E/5077118583A2A8872A52AF2492160373', 'code': 'TooManyRequests', 'message': 'Too many requests for the user', 'status': 429}
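
For context, setting State=POWER_UP asks slurmctld to run the cluster's configured ResumeProgram (the startnode script) for that node, which is the same path elastic scaling takes when it powers nodes up for a queued job. A minimal sketch of the relevant slurm.conf power-saving hooks, with paths and timings that are illustrative rather than this cluster's actual values:

# slurm.conf power-saving hooks (illustrative values)
ResumeProgram=/usr/local/bin/startnode    # launches the OCI instance for the node
SuspendProgram=/usr/local/bin/stopnode    # terminates the instance once idle
ResumeTimeout=600                         # seconds to wait for a node to boot
SuspendTime=300                           # idle seconds before powering down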

The first node started with no difficulty since capacity was available, but the subsequent calls resulted in the 500 errors and eventually the 429. This morning we were able to bring up the additional nodes using the same approach over the same timescale (calling all three back-to-back) and didn't encounter the 429 error.
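
In case it's useful to anyone hitting the same throttle: spacing out the manual power-ups avoids firing all the launch requests in a burst. A small sketch, with node names that are just our examples and a pause length that is a guess:

import subprocess
import time

# Power the nodes up one at a time, pausing between requests so the
# control plane isn't hit with a burst of launch calls at once.
for node in ["bm-standard-e2-64-ad1-0001",
             "bm-standard-e2-64-ad1-0002",
             "bm-standard-e2-64-ad1-0003"]:
    subprocess.run(["sudo", "scontrol", "update",
                    "NodeName=" + node, "State=POWER_UP"], check=True)
    time.sleep(60)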