hackdna closed this issue 5 years ago
Did you refresh the page? The blue square should go away. I've had this happen occasionally but never investigated what makes a node get stuck in the display, since a page refresh was a simple fix.
Thanks, it did go away after I refreshed the page. What would happen if this error occurred when AS (autoscaling) requested a new node? Would the request be retried?
Yes. Only the display sometimes gets stuck; the node is still removed from the internal tracking list.
Great, thanks. I was also able to verify this just now:
2016-11-07 17:48:19,315 DEBUG autoscale:107 Autoscaling UP: 1 instance(s)
2016-11-07 17:48:19,315 DEBUG master:1311 Adding 1 c4.4xlarge instance(s)
2016-11-07 17:48:19,315 INFO ec2:439 Adding 1 on-demand instance(s)
2016-11-07 17:48:19,335 DEBUG ec2:96 Getting instance object: Instance:i-0373174cfb6427406
2016-11-07 17:48:19,335 DEBUG ec2:472 Starting instance(s) in VPC with the following command : ec2_conn.run_instances( image_id='ami-3be8cd2c', min_count='1', max_count='1', key_name='cloudman_key_pair', security_group_ids=['sg-8b3b99f3'], user_data(with sensitive info filtered out)=[static_images_dir: static/images
cluster_templates: [{'filesystem_templates': [{'archive_url': 'http://s3.amazonaws.com/cloudman/fs-archives/galaxyFS-20161101.tar.gz', 'type': u'volume', 'name': 'galaxy', 'roles': 'galaxyTools,galaxyData', 'size': u'200'}, {'mount_point': '/cvmfs/data.galaxyproject.org', 'type': 'cvmfs', 'name': 'galaxyIndices', 'roles': 'galaxyIndices'}], 'name': 'Galaxy'}, {'filesystem_templates': [{'name': 'galaxy'}], 'name': 'Data'}]
master_hostname_alt: ip-172-31-61-132
storage_type: volume
iops:
filesystem_templates: []
is_secure: True
cluster_storage_type: volume
s3_port: None
log_level: DEBUG
static_cache_time: 360
master_public_ip: 54.161.121.241
cluster_type: Galaxy
initial_cluster_type: Galaxy
static_scripts_dir: static/scripts
debug: true
master_ip: 172.31.61.132
cluster_name: cm-16.11-dev
machine_image_id: ami-3be8cd2c
role: worker
bucket_cluster: cm-085dd2d743c466efbf1af5854b35dca5
boot_script_path: /opt/cloudman/boot
master_hostname: ip-172-31-61-132.ec2.internal
ec2_conn_path: /
region_name: us-east-1
region_endpoint: ec2.amazonaws.com
ec2_port: None
static_favicon_dir: static/favicon.ico
deployment_version: 2
storage_size: 200
use_translogger: False
boot_script_name: cm_boot.py
services: [{'name': 'Postgres', 'roles': ['Postgres']}, {'name': 'ProFTPd', 'roles': ['ProFTPd']}, {'name': 'Slurmd', 'roles': ['Slurmd']}, {'name': 'Nginx', 'roles': ['Nginx']}, {'name': 'Supervisor', 'roles': ['Supervisor']}, {'name': 'Slurmctld', 'roles': ['Slurmctld', 'Job manager']}, {'name': 'NodeJSProxy', 'roles': ['NodeJSProxy']}, {'home': '/mnt/galaxy/galaxy-app', 'name': 'Galaxy', 'roles': ['Galaxy']}]
cloud_type: ec2
custom_image_id:
cloudman_file_name: cm.tar.gz
access_key: AKIAJ5CNUF5FSQP3UMVQ
global_conf: {'__file__': '/mnt/cm/cm_wsgi.ini', 'here': '/mnt/cm'}
filesystems: [{'kind': 'cvmfs', 'mount_point': '/cvmfs/data.galaxyproject.org', 'name': 'galaxyIndices', 'roles': ['galaxyIndices']}, {'kind': 'volume', 'mount_point': '/mnt/galaxy', 'name': 'galaxy', 'roles': ['galaxyTools', 'galaxyData'], 'ids': [u'vol-04b4ba5293e434aa4']}]
placement: us-east-1c
template_path: templates
cloud_name: Amazon - Virginia
static_dir: static
persistent_data_version: 3
cloudman_home: /mnt/cm
static_style_dir: static/style
bucket_default: cloudman-dev
custom_instance_type: c4.large
s3_host: s3.amazonaws.com
use_lint: false
s3_conn_path: /
static_enabled: True
worker_initial_count: ], instance_type='c4.4xlarge', placement='us-east-1c', subnet_id='subnet-8bd502a1', ebs_optimized='True')
2016-11-07 17:48:19,335 DEBUG ec2:96 Getting instance object: Instance:i-0373174cfb6427406
2016-11-07 17:49:56,763 ERROR ec2:530 boto server error when starting an instance: BotoServerError: 500 Internal Server Error
<?xml version="1.0" encoding="UTF-8"?>
<Response><Errors><Error><Code>InsufficientInstanceCapacity</Code><Message>We currently do not have sufficient c4.4xlarge capacity in the Availability Zone you requested (us-east-1c). Our system will be working on provisioning additional capacity. You can currently get c4.4xlarge capacity by not specifying an Availability Zone in your request or choosing us-east-1d, us-east-1e.</Message></Error></Errors><RequestID>547bba21-1c98-4143-a2cd-f701f787ab0d</RequestID></Response>
2016-11-07 17:49:57,260 DEBUG master:2859 S&S: AS..OK; ClouderaManager..Unstarted; Cloudgene..Unstarted; Galaxy..OK; GalaxyReports..Shut down; Migration..Completed; Nginx..OK; NodeJSProxy..OK; PSS..Completed; Postgres..OK; ProFTPd..OK; Pulsar..Unstarted; Slurmctld..OK; Slurmd..OK; Supervisor..OK; galaxy FS..OK; galaxyIndices FS..OK; transient_nfs FS..OK;
2016-11-07 17:50:02,285 DEBUG autoscale:151 Checking if cluster too SMALL: minute:50,idle:0,total workers:0,avail workers:0,min:0,max:2
2016-11-07 17:50:02,306 DEBUG autoscale:179 Checking if slow job turnover: queued jobs: 1, avg runtime: 0
2016-11-07 17:50:02,306 DEBUG autoscale:107 Autoscaling UP: 1 instance(s)
2016-11-07 17:50:02,306 DEBUG master:1311 Adding 1 c4.4xlarge instance(s)
2016-11-07 17:50:02,307 INFO ec2:439 Adding 1 on-demand instance(s)
2016-11-07 17:50:02,331 DEBUG ec2:96 Getting instance object: Instance:i-0373174cfb6427406
2016-11-07 17:50:02,332 DEBUG ec2:472 Starting instance(s) in VPC with the following command : ec2_conn.run_instances( image_id='ami-3be8cd2c', min_count='1', max_count='1', key_name='cloudman_key_pair', security_group_ids=['sg-8b3b99f3'], user_data(with sensitive info filtered out)=[static_images_dir: static/images
cluster_templates: [{'filesystem_templates': [{'archive_url': 'http://s3.amazonaws.com/cloudman/fs-archives/galaxyFS-20161101.tar.gz', 'type': u'volume', 'name': 'galaxy', 'roles': 'galaxyTools,galaxyData', 'size': u'200'}, {'mount_point': '/cvmfs/data.galaxyproject.org', 'type': 'cvmfs', 'name': 'galaxyIndices', 'roles': 'galaxyIndices'}], 'name': 'Galaxy'}, {'filesystem_templates': [{'name': 'galaxy'}], 'name': 'Data'}]
master_hostname_alt: ip-172-31-61-132
storage_type: volume
iops:
filesystem_templates: []
is_secure: True
cluster_storage_type: volume
s3_port: None
log_level: DEBUG
static_cache_time: 360
master_public_ip: 54.161.121.241
cluster_type: Galaxy
initial_cluster_type: Galaxy
static_scripts_dir: static/scripts
debug: true
master_ip: 172.31.61.132
cluster_name: cm-16.11-dev
machine_image_id: ami-3be8cd2c
role: worker
bucket_cluster: cm-085dd2d743c466efbf1af5854b35dca5
boot_script_path: /opt/cloudman/boot
master_hostname: ip-172-31-61-132.ec2.internal
ec2_conn_path: /
region_name: us-east-1
region_endpoint: ec2.amazonaws.com
ec2_port: None
static_favicon_dir: static/favicon.ico
deployment_version: 2
storage_size: 200
use_translogger: False
boot_script_name: cm_boot.py
services: [{'name': 'Postgres', 'roles': ['Postgres']}, {'name': 'ProFTPd', 'roles': ['ProFTPd']}, {'name': 'Slurmd', 'roles': ['Slurmd']}, {'name': 'Nginx', 'roles': ['Nginx']}, {'name': 'Supervisor', 'roles': ['Supervisor']}, {'name': 'Slurmctld', 'roles': ['Slurmctld', 'Job manager']}, {'name': 'NodeJSProxy', 'roles': ['NodeJSProxy']}, {'home': '/mnt/galaxy/galaxy-app', 'name': 'Galaxy', 'roles': ['Galaxy']}]
cloud_type: ec2
custom_image_id:
cloudman_file_name: cm.tar.gz
access_key: AKIAJ5CNUF5FSQP3UMVQ
global_conf: {'__file__': '/mnt/cm/cm_wsgi.ini', 'here': '/mnt/cm'}
filesystems: [{'kind': 'cvmfs', 'mount_point': '/cvmfs/data.galaxyproject.org', 'name': 'galaxyIndices', 'roles': ['galaxyIndices']}, {'kind': 'volume', 'mount_point': '/mnt/galaxy', 'name': 'galaxy', 'roles': ['galaxyTools', 'galaxyData'], 'ids': [u'vol-04b4ba5293e434aa4']}]
placement: us-east-1c
template_path: templates
cloud_name: Amazon - Virginia
static_dir: static
persistent_data_version: 3
cloudman_home: /mnt/cm
static_style_dir: static/style
bucket_default: cloudman-dev
custom_instance_type: c4.large
s3_host: s3.amazonaws.com
use_lint: false
s3_conn_path: /
static_enabled: True
worker_initial_count: ], instance_type='c4.4xlarge', placement='us-east-1c', subnet_id='subnet-8bd502a1', ebs_optimized='True')
2016-11-07 17:50:02,332 DEBUG ec2:96 Getting instance object: Instance:i-0373174cfb6427406
2016-11-07 17:50:06,326 DEBUG ec2:389 Adding tag 'clusterName:cm-16.11-dev' to resource 'i-03e88015316f0016a'
2016-11-07 17:50:06,409 DEBUG ec2:389 Adding tag 'role:worker' to resource 'i-03e88015316f0016a'
2016-11-07 17:50:06,501 DEBUG ec2:389 Adding tag 'Name:Worker: cm-16.11-dev' to resource 'i-03e88015316f0016a'
2016-11-07 17:50:06,605 DEBUG ec2:522 Adding Instance Instance:i-03e88015316f0016a
2016-11-07 17:50:06,605 DEBUG ec2:536 Started 1 instance(s)
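For anyone who wants to pick this failure out of the logs programmatically, here is a minimal sketch that parses an EC2 XML error body like the one above. The helper names `parse_ec2_error` and `suggested_zones` are made up for illustration and are not part of CloudMan or boto. The zone regex is an assumption based on the wording of this particular message.

```python
import re
import xml.etree.ElementTree as ET


def parse_ec2_error(body):
    """Return (code, message) from an EC2 XML error body, or (None, None)."""
    if isinstance(body, str):
        # Parse as bytes so the encoding declaration in the XML prolog is honored.
        body = body.encode("utf-8")
    root = ET.fromstring(body)
    error = root.find("./Errors/Error")
    if error is None:
        return None, None
    return error.findtext("Code"), error.findtext("Message")


def suggested_zones(message, exclude=None):
    """List zone names (e.g. 'us-east-1d') mentioned in the message text.

    Assumption: zones look like '<2 letters>-<word>-<digit><letter>'.
    """
    zones = set(re.findall(r"\b[a-z]{2}-[a-z]+-\d[a-z]\b", message or ""))
    zones.discard(exclude)
    return sorted(zones)
```

Called on the response in the log, `parse_ec2_error` gives the `InsufficientInstanceCapacity` code, and `suggested_zones(message, exclude="us-east-1c")` picks out the zones the message offers as alternatives.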
I've just started a brand new cluster using the Dev 11/01 flavor and requested to add a c4.4xlarge worker node using the CM console. However, this failed with the following error message:

Also, the requested instance appears to be stuck as a blue square in the CM console while the "Remove worker nodes" button is disabled. It would be great to either update the status in the UI or implement automatic retries.
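To make the "automatic retries" request concrete, here is a rough sketch of a bounded retry loop around a launch call. It is not CloudMan's actual code. `CapacityError`, `launch_with_retries`, and the `launch` callable are all hypothetical names used only for illustration, and the sketch assumes a capacity failure surfaces as an exception.

```python
import time


class CapacityError(Exception):
    """Hypothetical stand-in for an InsufficientInstanceCapacity failure."""


def launch_with_retries(launch, max_attempts=3, delay=0.0):
    """Call `launch()` until it succeeds or `max_attempts` is reached.

    `launch` is a hypothetical zero-argument callable that returns an
    instance ID or raises CapacityError. Returns (instance_id, attempt).
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return launch(), attempt
        except CapacityError:
            if attempt == max_attempts:
                raise  # Give up and surface the last failure to the caller.
            time.sleep(delay)  # Wait before asking for capacity again.
```

With a launcher that fails once and then succeeds, this returns the instance ID together with an attempt count of 2. A cap on attempts keeps a persistent capacity shortage from looping forever, and the final failure is re-raised so the UI could show an error state instead of a stuck node.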