Open wbadyx opened 3 months ago
In queue_mapping.yml, is terminate_when_idle set under the job-shared section?
The example job script above seems to indicate that you're using MPI. If so, the recommendation would be to use the normal queue for these jobs. The job-shared queue is meant for jobs that can share CPU slots on the same instance.
Yes, queue_mapping is set as follows:
job-shared:
  queues: ["job-shared"]
  # Uncomment to limit the number of concurrent running jobs
  # max_running_jobs: 50
  # Queue ACLs: https://awslabs.github.io/scale-out-computing-on-aws/tutorials/manage-queue-acls/
  allowed_users: [] # empty list = all users can submit jobs
  excluded_users: [] # empty list = no restriction, ["*"] = only allowed_users can submit jobs
  # Queue mode (can be either fifo or fairshare)
  # queue_mode: "fifo"
  # Instance type restrictions: https://awslabs.github.io/scale-out-computing-on-aws/security/manage-queue-instance-types/
  allowed_instance_types: [] # Empty list = all EC2 instance types allowed. You can restrict by instance type (e.g. ["c5.4xlarge"]) or instance family (e.g. ["c5"])
  excluded_instance_types: [] # Empty list = no EC2 instance types prohibited. You can restrict by instance type (e.g. ["c5.4xlarge"]) or instance family (e.g. ["c5"])
  # List of parameters users cannot override: https://awslabs.github.io/scale-out-computing-on-aws/security/manage-queue-restricted-parameters/
  restricted_parameters: []
  # Default job parameters: https://awslabs.github.io/scale-out-computing-on-aws/tutorials/integration-ec2-job-parameters/
  # Scaling mode (can be either single_job or multiple_jobs): single_job runs a single job per EC2 instance, multiple_jobs allows running multiple jobs on the same EC2 instance
  scaling_mode: "multiple_jobs" # Allowed values: single_job, multiple_jobs
  instance_type: "c6i.large+c6i.xlarge+c6i.2xlarge" # Required
  # instance_ami: "ami-0cf43e890af9e3351" # If you want to enforce a default AMI, make sure it matches the value of base_os
  # base_os: "amazonlinux2" # To enforce a specific operating system for your HPC nodes
  # Terminate when idle: the default duration (in minutes) after which compute instances are terminated once detected as free (no jobs running) for that many consecutive minutes
  terminate_when_idle: 3 # Required when scaling_mode is set to multiple_jobs
  ht_support: "true"
  placement_group: "false"
  root_size: "10"
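Since terminate_when_idle is required whenever scaling_mode is "multiple_jobs", a quick grep-based check can catch a missing key before jobs sit in the queue. This is a minimal sketch, not an official SOCA validator; the inlined YAML is a trimmed, hypothetical copy of the job-shared section, and the consistency rule is taken from the config comment above.

```shell
#!/bin/bash
# Sanity-check sketch: a queue section using scaling_mode "multiple_jobs"
# should also define terminate_when_idle.
cfg=$(mktemp)
cat > "$cfg" <<'EOF'
job-shared:
  queues: ["job-shared"]
  scaling_mode: "multiple_jobs"
  terminate_when_idle: 3
EOF

# Flag the section if multiple_jobs is set but terminate_when_idle is absent.
if grep -q 'scaling_mode: "multiple_jobs"' "$cfg" && ! grep -q 'terminate_when_idle:' "$cfg"; then
  result="missing terminate_when_idle for multiple_jobs"
else
  result="queue_mapping looks consistent"
fi
echo "$result"
rm -f "$cfg"
```

Running the same two greps against your real queue_mapping.yml (instead of the temp copy) gives the same check in place.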
Even after changing the script, it still doesn't run. Here is shell.script:
#!/bin/bash
#PBS -N my_job_name
#PBS -V -j oe -o my_job_name.qlog
#PBS -P project_a
#PBS -q job-shared
## END PBS SETTINGS
/bin/sh script.sh > my_output.log
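Before resubmitting, it can help to lint the job script locally: confirm which queue it targets and that the payload script it calls actually exists. A small sketch follows; the file names come from this thread, but the checks themselves are assumptions, not SOCA tooling, and the heredoc recreates a hypothetical copy of shell.script so the example is self-contained.

```shell
#!/bin/bash
# Recreate a hypothetical copy of shell.script; on a real cluster,
# run the checks below against the existing file instead.
cat > shell.script <<'EOF'
#!/bin/bash
#PBS -N my_job_name
#PBS -V -j oe -o my_job_name.qlog
#PBS -P project_a
#PBS -q job-shared
## END PBS SETTINGS
/bin/sh script.sh > my_output.log
EOF

# Which queue will the job land in?
queue=$(awk '/^#PBS -q /{print $3}' shell.script)
echo "queue: $queue"

# Does the payload script exist next to the job script?
[ -f script.sh ] || echo "warning: script.sh not found"
rm -f shell.script
```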
Does it work if you change the queue name in shell.script to normal?
#PBS -q normal
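For a quick test you can also flip the queue with sed rather than editing the file by hand. A sketch, assuming the shell.script and queue names from this thread (the heredoc is a trimmed, hypothetical copy so the example runs on its own):

```shell
#!/bin/bash
# Trimmed, hypothetical copy of shell.script for a self-contained example.
cat > shell.script <<'EOF'
#!/bin/bash
#PBS -q job-shared
/bin/sh script.sh > my_output.log
EOF

# Point a copy of the job at the normal queue, leaving the original untouched.
sed 's/^#PBS -q job-shared/#PBS -q normal/' shell.script > shell_normal.script
new_queue=$(awk '/^#PBS -q /{print $3}' shell_normal.script)
echo "new queue: $new_queue"
```

After the switch, submitting shell_normal.script with qsub tests the normal queue without losing the job-shared version.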
Yes, but the normal queue uses the "single_job" scaling_mode.
I was able to reproduce this issue in SOCA 2.7.5.
The fix is to:
When I submit jobs to the job-shared queue, their status always remains QUEUED, without any errors. When I change the job-shared queue's scaling_mode to single_job, it works. The difference appears to lie in scaling_mode: "multiple_jobs". Do you know what might be causing this issue? The shell.script:
job-shared log:
Thank you for your help.