Better handling for InsufficientInstanceCapacity error

hackdna commented 8 years ago

I just started a brand-new cluster using the Dev 11/01 flavor and tried to add a c4.4xlarge worker node from the CM console. The request failed with the following error message:

2016-11-07 16:50:23,834 ERROR            ec2:530  boto server error when starting an instance: BotoServerError: 500 Internal Server Error
<?xml version="1.0" encoding="UTF-8"?>
<Response><Errors><Error><Code>InsufficientInstanceCapacity</Code><Message>We currently do not have sufficient c4.4xlarge capacity in the Availability Zone you requested (us-east-1c). Our system will be working on provisioning additional capacity. You can currently get c4.4xlarge capacity by not specifying an Availability Zone in your request or choosing us-east-1d, us-east-1e.</Message></Error></Errors><RequestID>80586c44-65e6-4676-80f1-e96a47206379</RequestID></Response>

Also, the requested instance appears to be stuck as a blue square in the CM console, and the "Remove worker nodes" button is disabled. It would be great to either update the status in the UI or implement automatic retries.
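
For reference, the error body above carries both the error code and the zones EC2 suggests as alternatives, so either a UI status update or a retry could key off it. The following is a minimal sketch, not CloudMan code: `parse_ec2_error` is a hypothetical helper, and the zone regex is a simplification that assumes zone names shaped like `us-east-1d`.

```python
import re
import xml.etree.ElementTree as ET


def parse_ec2_error(body):
    """Return (code, message, suggested_zones) from an EC2 XML error body.

    Hypothetical helper for illustration; not part of CloudMan or boto.
    """
    if isinstance(body, str):
        body = body.encode("utf-8")  # parse as bytes so the XML declaration is honored
    root = ET.fromstring(body)
    code = root.findtext("Errors/Error/Code")
    message = root.findtext("Errors/Error/Message") or ""
    # The suggested zones follow the word "choosing" in the message text;
    # the zone that failed appears earlier and is deliberately skipped.
    _, _, tail = message.partition("choosing")
    zones = re.findall(r"[a-z]{2}-[a-z]+-\d[a-z]", tail)
    return code, message, zones


sample = (
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    "<Response><Errors><Error><Code>InsufficientInstanceCapacity</Code>"
    "<Message>We currently do not have sufficient c4.4xlarge capacity in the "
    "Availability Zone you requested (us-east-1c). Our system will be working "
    "on provisioning additional capacity. You can currently get c4.4xlarge "
    "capacity by not specifying an Availability Zone in your request or "
    "choosing us-east-1d, us-east-1e.</Message></Error></Errors>"
    "<RequestID>80586c44-65e6-4676-80f1-e96a47206379</RequestID></Response>"
)

code, message, zones = parse_ec2_error(sample)
print(code, zones)
```

Relying on the wording of the message is fragile, so a real implementation would probably branch on the error code alone and treat the zone list as a hint.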

afgane commented 8 years ago

Did you refresh the page? The blue square should go away. I've had this happen occasionally but never investigated what makes a node get stuck in the display, since a page refresh was a simple fix.

hackdna commented 8 years ago

Thanks, it did go away after I refreshed the page. What would happen if this error occurred when a new node was requested by autoscaling (AS)? Would the request be retried?

afgane commented 8 years ago

Yes, the request would be retried. Only the display sometimes gets stuck; the node itself is removed from the internal tracking list.
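
The behavior described here can be sketched roughly as follows. This is an assumption-laden illustration, not CloudMan's actual implementation: `WorkerTracker`, `request_worker`, and `autoscale_step` are hypothetical names, and the launch failure is modeled as a plain `RuntimeError`. The point is that a failed launch leaves no entry in the tracking list, so the next autoscaling check still sees too few workers and asks again.

```python
class WorkerTracker:
    """Hypothetical stand-in for an internal list of worker nodes."""

    def __init__(self):
        self.workers = []

    def request_worker(self, launch):
        """Track a pending worker; drop it again if the launch fails."""
        entry = {"id": None, "state": "pending"}
        self.workers.append(entry)
        try:
            entry["id"] = launch()
        except RuntimeError:
            # Remove exactly this entry so the failed request is forgotten.
            self.workers = [w for w in self.workers if w is not entry]
            return False
        entry["state"] = "running"
        return True


def autoscale_step(tracker, queued_jobs, max_workers, launch):
    """Request one worker if jobs are waiting and the cluster has room."""
    if queued_jobs > 0 and len(tracker.workers) < max_workers:
        return tracker.request_worker(launch)
    return False
```

Calling `autoscale_step` once per monitoring cycle with a launcher that fails the first time and succeeds the second time would leave the tracker empty after the first cycle and holding one running worker after the second.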

hackdna commented 8 years ago

Great, thanks. I was also able to verify this just now:

2016-11-07 17:48:19,315 DEBUG      autoscale:107  Autoscaling UP: 1 instance(s)
2016-11-07 17:48:19,315 DEBUG         master:1311 Adding 1 c4.4xlarge instance(s)
2016-11-07 17:48:19,315 INFO             ec2:439  Adding 1 on-demand instance(s)
2016-11-07 17:48:19,335 DEBUG            ec2:96   Getting instance object: Instance:i-0373174cfb6427406
2016-11-07 17:48:19,335 DEBUG            ec2:472  Starting instance(s) in VPC with the following command : ec2_conn.run_instances( image_id='ami-3be8cd2c', min_count='1', max_count='1', key_name='cloudman_key_pair', security_group_ids=['sg-8b3b99f3'], user_data(with sensitive info filtered out)=[static_images_dir: static/images
cluster_templates: [{'filesystem_templates': [{'archive_url': 'http://s3.amazonaws.com/cloudman/fs-archives/galaxyFS-20161101.tar.gz', 'type': u'volume', 'name': 'galaxy', 'roles': 'galaxyTools,galaxyData', 'size': u'200'}, {'mount_point': '/cvmfs/data.galaxyproject.org', 'type': 'cvmfs', 'name': 'galaxyIndices', 'roles': 'galaxyIndices'}], 'name': 'Galaxy'}, {'filesystem_templates': [{'name': 'galaxy'}], 'name': 'Data'}]
master_hostname_alt: ip-172-31-61-132
storage_type: volume
iops:
filesystem_templates: []
is_secure: True
cluster_storage_type: volume
s3_port: None
log_level: DEBUG
static_cache_time: 360
master_public_ip: 54.161.121.241
cluster_type: Galaxy
initial_cluster_type: Galaxy
static_scripts_dir: static/scripts
debug: true
master_ip: 172.31.61.132
cluster_name: cm-16.11-dev
machine_image_id: ami-3be8cd2c
role: worker
bucket_cluster: cm-085dd2d743c466efbf1af5854b35dca5
boot_script_path: /opt/cloudman/boot
master_hostname: ip-172-31-61-132.ec2.internal
ec2_conn_path: /
region_name: us-east-1
region_endpoint: ec2.amazonaws.com
ec2_port: None
static_favicon_dir: static/favicon.ico
deployment_version: 2
storage_size: 200
use_translogger: False
boot_script_name: cm_boot.py
services: [{'name': 'Postgres', 'roles': ['Postgres']}, {'name': 'ProFTPd', 'roles': ['ProFTPd']}, {'name': 'Slurmd', 'roles': ['Slurmd']}, {'name': 'Nginx', 'roles': ['Nginx']}, {'name': 'Supervisor', 'roles': ['Supervisor']}, {'name': 'Slurmctld', 'roles': ['Slurmctld', 'Job manager']}, {'name': 'NodeJSProxy', 'roles': ['NodeJSProxy']}, {'home': '/mnt/galaxy/galaxy-app', 'name': 'Galaxy', 'roles': ['Galaxy']}]
cloud_type: ec2
custom_image_id:
cloudman_file_name: cm.tar.gz
access_key: AKIAJ5CNUF5FSQP3UMVQ
global_conf: {'__file__': '/mnt/cm/cm_wsgi.ini', 'here': '/mnt/cm'}
filesystems: [{'kind': 'cvmfs', 'mount_point': '/cvmfs/data.galaxyproject.org', 'name': 'galaxyIndices', 'roles': ['galaxyIndices']}, {'kind': 'volume', 'mount_point': '/mnt/galaxy', 'name': 'galaxy', 'roles': ['galaxyTools', 'galaxyData'], 'ids': [u'vol-04b4ba5293e434aa4']}]
placement: us-east-1c
template_path: templates
cloud_name: Amazon - Virginia
static_dir: static
persistent_data_version: 3
cloudman_home: /mnt/cm
static_style_dir: static/style
bucket_default: cloudman-dev
custom_instance_type: c4.large
s3_host: s3.amazonaws.com
use_lint: false
s3_conn_path: /
static_enabled: True
worker_initial_count: ], instance_type='c4.4xlarge', placement='us-east-1c', subnet_id='subnet-8bd502a1', ebs_optimized='True')
2016-11-07 17:48:19,335 DEBUG            ec2:96   Getting instance object: Instance:i-0373174cfb6427406
2016-11-07 17:49:56,763 ERROR            ec2:530  boto server error when starting an instance: BotoServerError: 500 Internal Server Error
<?xml version="1.0" encoding="UTF-8"?>
<Response><Errors><Error><Code>InsufficientInstanceCapacity</Code><Message>We currently do not have sufficient c4.4xlarge capacity in the Availability Zone you requested (us-east-1c). Our system will be working on provisioning additional capacity. You can currently get c4.4xlarge capacity by not specifying an Availability Zone in your request or choosing us-east-1d, us-east-1e.</Message></Error></Errors><RequestID>547bba21-1c98-4143-a2cd-f701f787ab0d</RequestID></Response>
2016-11-07 17:49:57,260 DEBUG         master:2859 S&S: AS..OK; ClouderaManager..Unstarted; Cloudgene..Unstarted; Galaxy..OK; GalaxyReports..Shut down; Migration..Completed; Nginx..OK; NodeJSProxy..OK; PSS..Completed; Postgres..OK; ProFTPd..OK; Pulsar..Unstarted; Slurmctld..OK; Slurmd..OK; Supervisor..OK; galaxy FS..OK; galaxyIndices FS..OK; transient_nfs FS..OK;
2016-11-07 17:50:02,285 DEBUG      autoscale:151  Checking if cluster too SMALL: minute:50,idle:0,total workers:0,avail workers:0,min:0,max:2
2016-11-07 17:50:02,306 DEBUG      autoscale:179  Checking if slow job turnover: queued jobs: 1, avg runtime: 0
2016-11-07 17:50:02,306 DEBUG      autoscale:107  Autoscaling UP: 1 instance(s)
2016-11-07 17:50:02,306 DEBUG         master:1311 Adding 1 c4.4xlarge instance(s)
2016-11-07 17:50:02,307 INFO             ec2:439  Adding 1 on-demand instance(s)
2016-11-07 17:50:02,331 DEBUG            ec2:96   Getting instance object: Instance:i-0373174cfb6427406
2016-11-07 17:50:02,332 DEBUG            ec2:472  Starting instance(s) in VPC with the following command : ec2_conn.run_instances( image_id='ami-3be8cd2c', min_count='1', max_count='1', key_name='cloudman_key_pair', security_group_ids=['sg-8b3b99f3'], user_data(with sensitive info filtered out)=[static_images_dir: static/images
cluster_templates: [{'filesystem_templates': [{'archive_url': 'http://s3.amazonaws.com/cloudman/fs-archives/galaxyFS-20161101.tar.gz', 'type': u'volume', 'name': 'galaxy', 'roles': 'galaxyTools,galaxyData', 'size': u'200'}, {'mount_point': '/cvmfs/data.galaxyproject.org', 'type': 'cvmfs', 'name': 'galaxyIndices', 'roles': 'galaxyIndices'}], 'name': 'Galaxy'}, {'filesystem_templates': [{'name': 'galaxy'}], 'name': 'Data'}]
master_hostname_alt: ip-172-31-61-132
storage_type: volume
iops:
filesystem_templates: []
is_secure: True
cluster_storage_type: volume
s3_port: None
log_level: DEBUG
static_cache_time: 360
master_public_ip: 54.161.121.241
cluster_type: Galaxy
initial_cluster_type: Galaxy
static_scripts_dir: static/scripts
debug: true
master_ip: 172.31.61.132
cluster_name: cm-16.11-dev
machine_image_id: ami-3be8cd2c
role: worker
bucket_cluster: cm-085dd2d743c466efbf1af5854b35dca5
boot_script_path: /opt/cloudman/boot
master_hostname: ip-172-31-61-132.ec2.internal
ec2_conn_path: /
region_name: us-east-1
region_endpoint: ec2.amazonaws.com
ec2_port: None
static_favicon_dir: static/favicon.ico
deployment_version: 2
storage_size: 200
use_translogger: False
boot_script_name: cm_boot.py
services: [{'name': 'Postgres', 'roles': ['Postgres']}, {'name': 'ProFTPd', 'roles': ['ProFTPd']}, {'name': 'Slurmd', 'roles': ['Slurmd']}, {'name': 'Nginx', 'roles': ['Nginx']}, {'name': 'Supervisor', 'roles': ['Supervisor']}, {'name': 'Slurmctld', 'roles': ['Slurmctld', 'Job manager']}, {'name': 'NodeJSProxy', 'roles': ['NodeJSProxy']}, {'home': '/mnt/galaxy/galaxy-app', 'name': 'Galaxy', 'roles': ['Galaxy']}]
cloud_type: ec2
custom_image_id:
cloudman_file_name: cm.tar.gz
access_key: AKIAJ5CNUF5FSQP3UMVQ
global_conf: {'__file__': '/mnt/cm/cm_wsgi.ini', 'here': '/mnt/cm'}
filesystems: [{'kind': 'cvmfs', 'mount_point': '/cvmfs/data.galaxyproject.org', 'name': 'galaxyIndices', 'roles': ['galaxyIndices']}, {'kind': 'volume', 'mount_point': '/mnt/galaxy', 'name': 'galaxy', 'roles': ['galaxyTools', 'galaxyData'], 'ids': [u'vol-04b4ba5293e434aa4']}]
placement: us-east-1c
template_path: templates
cloud_name: Amazon - Virginia
static_dir: static
persistent_data_version: 3
cloudman_home: /mnt/cm
static_style_dir: static/style
bucket_default: cloudman-dev
custom_instance_type: c4.large
s3_host: s3.amazonaws.com
use_lint: false
s3_conn_path: /
static_enabled: True
worker_initial_count: ], instance_type='c4.4xlarge', placement='us-east-1c', subnet_id='subnet-8bd502a1', ebs_optimized='True')
2016-11-07 17:50:02,332 DEBUG            ec2:96   Getting instance object: Instance:i-0373174cfb6427406
2016-11-07 17:50:06,326 DEBUG            ec2:389  Adding tag 'clusterName:cm-16.11-dev' to resource 'i-03e88015316f0016a'
2016-11-07 17:50:06,409 DEBUG            ec2:389  Adding tag 'role:worker' to resource 'i-03e88015316f0016a'
2016-11-07 17:50:06,501 DEBUG            ec2:389  Adding tag 'Name:Worker: cm-16.11-dev' to resource 'i-03e88015316f0016a'
2016-11-07 17:50:06,605 DEBUG            ec2:522  Adding Instance Instance:i-03e88015316f0016a
2016-11-07 17:50:06,605 DEBUG            ec2:536  Started 1 instance(s)
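
In the log above, the second attempt in the same zone succeeded on the next autoscaling cycle. Since the error message also lists zones that do have capacity, another option would be to fall back to one of them. The sketch below shows that idea under stated assumptions: `InsufficientCapacity` and `launch_with_zone_fallback` are hypothetical names, and `launch` is any callable that takes a zone and either returns an instance ID or raises. Note that a subnet belongs to a single Availability Zone, so a request that pins `subnet_id` as the one above does could not simply swap the `placement` value; the same applies to any attached volumes.

```python
class InsufficientCapacity(Exception):
    """Hypothetical error raised when a zone has no capacity."""

    def __init__(self, zone, suggested_zones=()):
        super().__init__("no capacity in %s" % zone)
        self.zone = zone
        self.suggested_zones = list(suggested_zones)


def launch_with_zone_fallback(launch, preferred_zone, max_attempts=3):
    """Try the preferred zone first, then any zones the error suggests.

    Returns (instance_id, zones_tried); instance_id is None if every
    attempt failed or max_attempts was reached.
    """
    tried = []
    queue = [preferred_zone]
    while queue and len(tried) < max_attempts:
        zone = queue.pop(0)
        tried.append(zone)
        try:
            return launch(zone), tried
        except InsufficientCapacity as err:
            # Queue only zones we have not tried or queued already.
            for candidate in err.suggested_zones:
                if candidate not in tried and candidate not in queue:
                    queue.append(candidate)
    return None, tried


def fake_launch(zone):
    """Stand-in launcher: us-east-1c is full, everything else works."""
    if zone == "us-east-1c":
        raise InsufficientCapacity(zone, ["us-east-1d", "us-east-1e"])
    return "i-hypothetical-" + zone


print(launch_with_zone_fallback(fake_launch, "us-east-1c"))
```

Capping the number of attempts keeps a single request from cycling through zones indefinitely; anything left over can wait for the next autoscaling check.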

galaxyproject/cloudman, issue #64