aws / aws-parallelcluster

AWS ParallelCluster is an AWS supported Open Source cluster management tool to deploy and manage HPC clusters in the AWS cloud.
https://github.com/aws/aws-parallelcluster
Apache License 2.0

Node not added to slurm scheduler after instance launch #1413

Closed: aerogt3 closed this issue 4 years ago

aerogt3 commented 4 years ago

Environment:

[aws]
aws_region_name = eu-central-1

[cluster default]
base_os = ubuntu1604
cluster_type = spot
master_instance_type = c5n.2xlarge
compute_instance_type = c5n.18xlarge
compute_root_volume_size = 17
enable_efa = compute
initial_queue_size = 0
key_name = ub1804
max_queue_size = 20
master_root_volume_size = 25
placement = compute
placement_group = DYNAMIC
post_install = s3://ber***/pc_post_install.sh
extra_json = { "cluster" : { "cfn_scheduler_slots" : "cores" } }
s3_read_write_resource = arn:aws:s3:::berd
scheduler = slurm
shared_dir = data
vpc_settings = public

[vpc public]
vpc_id = vpc-8cf8b5e7
master_subnet_id = subnet-fb330f90
vpc_security_group_id = sg-****

[global]
cluster_template = default
update_check = true
sanity_check = true

[aliases]
ssh = ssh {CFN_USER}@{MASTER_IP} {ARGS}

Bug description and how to reproduce: Firstly, most of the standard job submission commands (qstat, qsub, etc.) don't work out of the box because of a missing package, libswitch-perl. It would be useful for this package to be included in the AMI; for now my post_install script installs it anyway.
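For reference, a minimal sketch of that post_install step on ubuntu1604 would look something like the following (an illustration only, not the actual contents of the pc_post_install.sh referenced above):

#!/bin/bash
# Hypothetical post_install sketch: install the Perl dependency that the
# qsub/qstat wrapper scripts need on Ubuntu 16.04.
set -e
apt-get update -y
apt-get install -y libswitch-perl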

After launching the cluster, my post_install script runs and completes successfully, so I am able to run Slurm commands and submit a test job. An EC2 instance spins up, but it is never identified by Slurm as an available node, so nothing ever runs. Running scontrol show node, Slurm reports "No nodes in the system".

scontrol show job 2 reveals:

JobState=PENDING Reason=Nodes_required_for_job_are_DOWN,_DRAINED_or_reserved_for_jobs_in_higher_priority_partitions Dependency=(null)
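For completeness, this is roughly the sequence of commands run on the master node to reach that state (the exact test job does not matter; any trivial sbatch submission shows the same behaviour):

# Submit a trivial test job and inspect the scheduler state
sbatch --wrap "hostname"   # becomes job 2 in the logs below
sinfo                      # partition shows no available nodes
scontrol show node         # "No nodes in the system"
scontrol show job 2        # stays PENDING with the Reason above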

Additional context, from /var/log/sqswatcher:

2019-11-04 21:53:08,510 INFO [sqswatcher:main] sqswatcher startup
2019-11-04 21:53:08,510 INFO [sqswatcher:_get_config] Reading /etc/sqswatcher.cfg
2019-11-04 21:53:08,511 INFO [sqswatcher:_get_config] Configured parameters: region=eu-central-1 scheduler=slurm sqsqueue=parallelcluster-berd-cluster-SQS-19IHWLZ0O8WG6 table_name=parallelcluster-berd-cluster-DynamoDBTable-IFR3A411AD28 cluster_user=ubuntu proxy=NONE stack_name=parallelcluster-berd-cluster
2019-11-04 21:53:08,864 INFO [utils:get_asg_name] ASG parallelcluster-berd-cluster-ComputeFleet-1AMLKX26HVVG1 found for the stack parallelcluster-berd-cluster
2019-11-04 21:53:08,868 INFO [sqswatcher:_poll_queue] Refreshing cluster properties
2019-11-04 21:53:08,943 INFO [utils:get_asg_settings] min/desired/max 0/0/20
2019-11-04 21:53:09,056 INFO [utils:_get_vcpus_by_instance_type] Instance c5n.18xlarge has 72 vcpus.
2019-11-04 21:53:09,056 INFO [utils:_read_cfnconfig] Reading /opt/parallelcluster/cfnconfig
2019-11-04 21:53:09,057 INFO [utils:get_instance_properties] Instance c5n.18xlarge will use number of cores as slots based on configuration.
2019-11-04 21:53:09,057 INFO [utils:get_instance_properties] Number of slots computed for instance c5n.18xlarge: 36
2019-11-04 21:53:09,057 INFO [sqswatcher:_retrieve_all_sqs_messages] Retrieving messages from SQS queue
2019-11-04 21:53:11,101 INFO [sqswatcher:_retrieve_all_sqs_messages] Retrieved 1 messages from SQS queue
2019-11-04 21:53:11,101 INFO [sqswatcher:_parse_sqs_messages] Unsupported event type autoscaling:TEST_NOTIFICATION. Discarding message.
2019-11-04 21:53:11,101 WARNING [sqswatcher:_parse_sqs_messages] Discarding message sqs.Message(queue_url='https://eu-central-1.queue.amazonaws.com/855884801609/parallelcluster-berd-cluster-SQS-19IHWLZ0O8WG6', receipt_handle='AQEBUR79eEFIdOwp1SHUTcSYPxqdH3o0HdBfuFL38SAIe1VMy9VENIAvT6Bx88hgKNd6XBAPcczneCZOKPGTHq6mcRx+Q3pzKfeze4eL5ZbLn0QOOtLtbcbCb+NudXIZlnwJ6tkJq0h+W3+aVh5EUMMJjBpCuD0sckfyI18SSFMXaO9xJLPzyHhzgGP46DV44yt0DPJGpuQeE2F9455WSNvEatmmuSy/cr2zn9FtZoeLSgK6NXiYIwxbxYkAKK/mZMh2F3rrMCGKECKo/b658LfXcKLcCX26H0zNEFatsz58JfIRPxeP6lx15qWvrQ+kR98ImhynLU65O6BVMLTUF2jYTzINcGOliAgWYv3AXtrZVmm7G4xSiXZqq3BBO7/nwEvyHJVhJZzXfgZCb0ITRpmlThTEYlh1dcu/hO+ez2w5CWfdy1+MtKql7aiSXsKltEHq')
2019-11-04 21:53:11,128 INFO [slurm:_restart_master_node] Restarting slurm on master node
2019-11-04 21:53:11,594 INFO [slurm:_reconfigure_nodes] Reconfiguring slurm
2019-11-04 21:53:41,824 INFO [sqswatcher:_retrieve_all_sqs_messages] Retrieving messages from SQS queue
2019-11-04 21:53:43,833 INFO [sqswatcher:_retrieve_all_sqs_messages] Retrieved 0 messages from SQS queue
2019-11-04 21:54:13,834 INFO [sqswatcher:_retrieve_all_sqs_messages] Retrieving messages from SQS queue
2019-11-04 21:54:15,845 INFO [sqswatcher:_retrieve_all_sqs_messages] Retrieved 0 messages from SQS queue
2019-11-04 21:54:45,871 INFO [sqswatcher:_retrieve_all_sqs_messages] Retrieving messages from SQS queue
2019-11-04 21:54:47,879 INFO [sqswatcher:_retrieve_all_sqs_messages] Retrieved 0 messages from SQS queue
2019-11-04 21:55:17,907 INFO [sqswatcher:_retrieve_all_sqs_messages] Retrieving messages from SQS queue
2019-11-04 21:55:19,915 INFO [sqswatcher:_retrieve_all_sqs_messages] Retrieved 0 messages from SQS queue
2019-11-04 21:55:49,943 INFO [sqswatcher:_retrieve_all_sqs_messages] Retrieving messages from SQS queue
2019-11-04 21:55:51,952 INFO [sqswatcher:_retrieve_all_sqs_messages] Retrieved 0 messages from SQS queue
2019-11-04 21:56:21,979 INFO [sqswatcher:_poll_queue] Refreshing cluster properties
2019-11-04 21:56:22,066 INFO [utils:get_asg_settings] min/desired/max 0/1/20
2019-11-04 21:56:22,123 INFO [sqswatcher:_retrieve_all_sqs_messages] Retrieving messages from SQS queue
2019-11-04 21:56:24,132 INFO [sqswatcher:_retrieve_all_sqs_messages] Retrieved 0 messages from SQS queue
2019-11-04 21:56:54,162 INFO [sqswatcher:_retrieve_all_sqs_messages] Retrieving messages from SQS queue
2019-11-04 21:56:56,172 INFO [sqswatcher:_retrieve_all_sqs_messages] Retrieved 0 messages from SQS queue
2019-11-04 21:57:26,198 INFO [sqswatcher:_retrieve_all_sqs_messages] Retrieving messages from SQS queue
2019-11-04 21:57:28,207 INFO [sqswatcher:_retrieve_all_sqs_messages] Retrieved 0 messages from SQS queue
2019-11-04 21:57:58,237 INFO [sqswatcher:_retrieve_all_sqs_messages] Retrieving messages from SQS queue
2019-11-04 21:58:00,267 INFO [sqswatcher:_retrieve_all_sqs_messages] Retrieved 0 messages from SQS queue
2019-11-04 21:58:30,297 INFO [sqswatcher:_retrieve_all_sqs_messages] Retrieving messages from SQS queue
2019-11-04 21:58:32,306 INFO [sqswatcher:_retrieve_all_sqs_messages] Retrieved 0 messages from SQS queue
2019-11-04 21:59:02,334 INFO [sqswatcher:_retrieve_all_sqs_messages] Retrieving messages from SQS queue
2019-11-04 21:59:04,342 INFO [sqswatcher:_retrieve_all_sqs_messages] Retrieved 0 messages from SQS queue
2019-11-04 21:59:34,370 INFO [sqswatcher:_poll_queue] Refreshing cluster properties
2019-11-04 21:59:34,442 INFO [utils:get_asg_settings] min/desired/max 0/1/20
2019-11-04 21:59:34,497 INFO [sqswatcher:_retrieve_all_sqs_messages] Retrieving messages from SQS queue
2019-11-04 21:59:36,505 INFO [sqswatcher:_retrieve_all_sqs_messages] Retrieved 0 messages from SQS queue
2019-11-04 22:00:06,534 INFO [sqswatcher:_retrieve_all_sqs_messages] Retrieving messages from SQS queue
2019-11-04 22:00:08,543 INFO [sqswatcher:_retrieve_all_sqs_messages] Retrieved 0 messages from SQS queue

and from /var/log/slurmctld.log:

[2019-11-04T21:49:51.595] error: chdir(/var/log): Permission denied
[2019-11-04T21:49:51.595] error: Configured MailProg is invalid
[2019-11-04T21:49:51.595] slurmctld version 18.08.6-2 started on cluster parallelcluster
[2019-11-04T21:49:52.446] No memory enforcing mechanism configured.
[2019-11-04T21:49:52.446] layouts: no layout to initialize
[2019-11-04T21:49:52.853] error: ################################################
[2019-11-04T21:49:52.854] error: ### SEVERE SECURITY VULERABILTY ###
[2019-11-04T21:49:52.854] error: ### StateSaveLocation DIRECTORY IS WORLD WRITABLE ###
[2019-11-04T21:49:52.854] error: ### CORRECT FILE PERMISSIONS ###
[2019-11-04T21:49:52.854] error: ################################################
[2019-11-04T21:49:52.854] layouts: loading entities/relations information
[2019-11-04T21:49:52.854] error: Could not open node state file /tmp/node_state: No such file or directory
[2019-11-04T21:49:52.854] error: NOTE: Trying backup state save file. Information may be lost!
[2019-11-04T21:49:52.854] No node state file (/tmp/node_state.old) to recover
[2019-11-04T21:49:52.854] error: Could not open job state file /tmp/job_state: No such file or directory
[2019-11-04T21:49:52.854] error: NOTE: Trying backup state save file. Jobs may be lost!
[2019-11-04T21:49:52.854] No job state file (/tmp/job_state.old) to recover
[2019-11-04T21:49:52.854] cons_res: select_p_node_init
[2019-11-04T21:49:52.854] cons_res: preparing for 1 partitions
[2019-11-04T21:49:52.854] error: Could not open reservation state file /tmp/resv_state: No such file or directory
[2019-11-04T21:49:52.854] error: NOTE: Trying backup state save file. Reservations may be lost
[2019-11-04T21:49:52.854] No reservation state file (/tmp/resv_state.old) to recover
[2019-11-04T21:49:52.854] error: Could not open trigger state file /tmp/trigger_state: No such file or directory
[2019-11-04T21:49:52.854] error: NOTE: Trying backup state save file. Triggers may be lost!
[2019-11-04T21:49:52.854] No trigger state file (/tmp/trigger_state.old) to recover
[2019-11-04T21:49:52.854] _preserve_plugins: backup_controller not specified
[2019-11-04T21:49:52.854] Reinitializing job accounting state
[2019-11-04T21:49:52.854] cons_res: select_p_reconfigure
[2019-11-04T21:49:52.854] cons_res: select_p_node_init
[2019-11-04T21:49:52.854] cons_res: preparing for 1 partitions
[2019-11-04T21:49:52.854] Running as primary controller
[2019-11-04T21:49:52.965] No parameter for mcs plugin, default values set
[2019-11-04T21:49:52.965] mcs: MCSParameters = (null). ondemand set.
[2019-11-04T21:50:53.093] SchedulerParameters=default_queue_depth=100,max_rpc_cnt=0,max_sched_time=2,partition_job_depth=0,sched_max_job_start=0,sched_min_interval=2
[2019-11-04T21:53:11.135] Terminate signal (SIGINT or SIGTERM) received
[2019-11-04T21:53:11.183] Saving all slurm state
[2019-11-04T21:53:11.183] error: Could not open job state file /tmp/job_state: No such file or directory
[2019-11-04T21:53:11.183] error: NOTE: Trying backup state save file. Jobs may be lost!
[2019-11-04T21:53:11.183] No job state file (/tmp/job_state.old) found
[2019-11-04T21:53:11.532] layouts: all layouts are now unloaded.
[2019-11-04T21:53:11.549] error: chdir(/var/log): Permission denied
[2019-11-04T21:53:11.549] error: Configured MailProg is invalid
[2019-11-04T21:53:11.549] slurmctld version 18.08.6-2 started on cluster parallelcluster
[2019-11-04T21:53:11.587] No memory enforcing mechanism configured.
[2019-11-04T21:53:11.587] layouts: no layout to initialize
[2019-11-04T21:53:11.675] error: ################################################
[2019-11-04T21:53:11.675] error: ### SEVERE SECURITY VULERABILTY ###
[2019-11-04T21:53:11.675] error: ### StateSaveLocation DIRECTORY IS WORLD WRITABLE ###
[2019-11-04T21:53:11.675] error: ### CORRECT FILE PERMISSIONS ###
[2019-11-04T21:53:11.675] error: ################################################
[2019-11-04T21:53:11.675] layouts: loading entities/relations information
[2019-11-04T21:53:11.675] Recovered state of 20 nodes
[2019-11-04T21:53:11.675] Recovered information about 0 jobs
[2019-11-04T21:53:11.675] cons_res: select_p_node_init
[2019-11-04T21:53:11.675] cons_res: preparing for 1 partitions
[2019-11-04T21:53:11.675] Recovered state of 0 reservations
[2019-11-04T21:53:11.675] _preserve_plugins: backup_controller not specified
[2019-11-04T21:53:11.675] cons_res: select_p_reconfigure
[2019-11-04T21:53:11.675] cons_res: select_p_node_init
[2019-11-04T21:53:11.675] cons_res: preparing for 1 partitions
[2019-11-04T21:53:11.675] Running as primary controller
[2019-11-04T21:53:11.675] No parameter for mcs plugin, default values set
[2019-11-04T21:53:11.675] mcs: MCSParameters = (null). ondemand set.
[2019-11-04T21:53:11.785] Processing RPC: REQUEST_RECONFIGURE from uid=0
[2019-11-04T21:53:11.785] No memory enforcing mechanism configured.
[2019-11-04T21:53:11.785] layouts: no layout to initialize
[2019-11-04T21:53:11.786] error: ################################################
[2019-11-04T21:53:11.786] error: ### SEVERE SECURITY VULERABILTY ###
[2019-11-04T21:53:11.786] error: ### StateSaveLocation DIRECTORY IS WORLD WRITABLE ###
[2019-11-04T21:53:11.786] error: ### CORRECT FILE PERMISSIONS ###
[2019-11-04T21:53:11.786] error: ################################################
[2019-11-04T21:53:11.786] restoring original state of nodes
[2019-11-04T21:53:11.786] cons_res: select_p_node_init
[2019-11-04T21:53:11.786] cons_res: preparing for 1 partitions
[2019-11-04T21:53:11.786] _preserve_plugins: backup_controller not specified
[2019-11-04T21:53:11.786] cons_res: select_p_reconfigure
[2019-11-04T21:53:11.786] cons_res: select_p_node_init
[2019-11-04T21:53:11.786] cons_res: preparing for 1 partitions
[2019-11-04T21:53:11.786] No parameter for mcs plugin, default values set
[2019-11-04T21:53:11.786] mcs: MCSParameters = (null). ondemand set.
[2019-11-04T21:53:11.786] _slurm_rpc_reconfigure_controller: completed usec=1177
[2019-11-04T21:53:14.677] SchedulerParameters=default_queue_depth=100,max_rpc_cnt=0,max_sched_time=2,partition_job_depth=0,sched_max_job_start=0,sched_min_interval=2
[2019-11-04T21:53:27.748] _slurm_rpc_submit_batch_job: JobId=2 InitPrio=4294901759 usec=623
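As an aside, the StateSaveLocation warnings above come from slurmctld keeping its state files directly under /tmp (see the /tmp/node_state paths). A quick way to check this on the master node is sketched below, assuming the stock ParallelCluster slurm.conf; the suggested state directory path is hypothetical:

# Where does slurmctld keep its state, and who can write there?
scontrol show config | grep -i StateSaveLocation
ls -ld /tmp
# /tmp is world-writable by design, which is what triggers the warning;
# a dedicated directory owned by the slurm user (hypothetical path) would
# avoid it, e.g.:
# mkdir -p /var/spool/slurm.state && chown slurm: /var/spool/slurm.state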

aerogt3 commented 4 years ago

I get the same issue when launching a cluster with SGE as the scheduler.

Below are the last lines of /var/log/syslog from the compute node instance, which may be of relevance:

Nov 4 22:57:28 ip-172-31-28-55 amazon-ssm-agent.amazon-ssm-agent[3600]: 2019-11-04 22:57:28 INFO Backing off health check to every 600 seconds for 1800 seconds.
Nov 4 22:57:28 ip-172-31-28-55 amazon-ssm-agent.amazon-ssm-agent[3600]: 2019-11-04 22:57:28 ERROR Health ping failed with error - AccessDeniedException: User: arn:aws:sts::855884801609:assumed-role/parallelcluster-berd-cluster-RootRole-U6DBOTWX5Y0I/i-013ba01481fbaf283 is not authorized to perform: ssm:UpdateInstanceInformation on resource: arn:aws:ec2:eu-central-1:855884801609:instance/i-013ba01481fbaf283
Nov 4 22:57:28 ip-172-31-28-55 amazon-ssm-agent.amazon-ssm-agent[3600]: #011status code: 400, request id: 1eea9791-4e0b-4f84-8dd2-3c01fcea5fd8

ErikLacharite commented 4 years ago

maybe try "additional_sg" instead of "vpc_security_group_id"
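i.e. something like this in the [vpc public] section, keeping the same security group ID (an untested sketch of the suggested change):

[vpc public]
vpc_id = vpc-8cf8b5e7
master_subnet_id = subnet-fb330f90
additional_sg = sg-****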

aerogt3 commented 4 years ago

I think it has to do with disk usage: the compute node's root volume seemed to fill up during the copy operations, and that, I believe, is why the node was never added.
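A quick way to check that theory would be to watch the root volume on a compute instance while the post_install copies are running, for example (commands are a suggestion, not output captured from this cluster):

# On the compute node: is the 17 GB root volume full?
df -h /
# Where did the space go? (paths are guesses)
sudo du -sh /tmp /var/log /home 2>/dev/null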