awslabs / scale-out-computing-on-aws

Scale-Out Computing on AWS is a solution that helps customers deploy and operate a multiuser environment for computationally intensive workflows.
https://awslabs.github.io/scale-out-computing-on-aws-documentation/
Apache License 2.0

Queue Status Remaining QUEUED Due to Multiple_Jobs Scaling Mode #136

Open wbadyx opened 3 months ago

wbadyx commented 3 months ago

When I submit jobs to the job-shared queue, the queue status always remains QUEUED without any errors. When I change the job-shared queue's scaling_mode to single_job, it works, so the difference appears to lie in "scaling mode: multiple_jobs". Do you know what might be causing this issue? The shell.script:

#!/bin/bash
#PBS -N my_job_name
#PBS -V -j oe -o my_job_name.qlog
#PBS -P project_a
#PBS -q job-shared
#PBS -l nodes=1
## END PBS SETTINGS
cd $PBS_O_WORKDIR
cat $PBS_NODEFILE | sort | uniq > mpi_nodes
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/apps/openmpi/5.0.3/lib/
export PATH=$PATH:/apps/openmpi/5.0.3/bin/
/apps/openmpi/5.0.3/bin/mpirun --hostfile mpi_nodes -np 1 script.sh > my_output.log

job-shared log:

[2024-07-26 06:01:08,436] [208] [INFO] [Queue provisioning: fifo, scaling mode: multiple_jobs]
[2024-07-26 06:02:07,161] [208] [INFO] [Queue provisioning: fifo, scaling mode: multiple_jobs]
[2024-07-26 06:03:07,976] [208] [INFO] [Queue provisioning: fifo, scaling mode: multiple_jobs]
[2024-07-26 06:04:08,647] [208] [INFO] [Queue provisioning: fifo, scaling mode: multiple_jobs]
[2024-07-26 06:05:07,928] [208] [INFO] [Queue provisioning: fifo, scaling mode: multiple_jobs]
[2024-07-26 06:06:10,072] [208] [INFO] [Queue provisioning: fifo, scaling mode: multiple_jobs]
[2024-07-26 06:07:07,179] [208] [INFO] [Queue provisioning: fifo, scaling mode: multiple_jobs]
[2024-07-26 06:08:08,224] [208] [INFO] [Queue provisioning: fifo, scaling mode: multiple_jobs]
[2024-07-26 06:09:09,262] [208] [INFO] [Queue provisioning: fifo, scaling mode: single_job]
[2024-07-26 06:09:09,321] [208] [INFO] [================================================================]
[2024-07-26 06:09:09,322] [208] [INFO] [Detected Default Parameters for this queue: {'queues': ['job-shared'], 'scaling_mode': 'single_job', 'instance_type': 'c6i.large+c6i.xlarge+c6i.2xlarge', 'terminate_when_idle': 3, 'ht_support': 'true', 'placement_group': 'false', 'root_size': '10'}]
[2024-07-26 06:09:09,322] [208] [INFO] [Licenses Available: {}]
[2024-07-26 06:09:09,323] [208] [INFO] [Checking if we have enough resources available to run job_5]
[2024-07-26 06:09:09,323] [208] [INFO] [No default value for ncpus. Creating new entry with value: 1]
[2024-07-26 06:09:09,323] [208] [INFO] [No default value for nodect. Creating new entry with value: 1]
[2024-07-26 06:09:09,323] [208] [INFO] [No default value for nodes. Creating new entry with value: 1]
[2024-07-26 06:09:09,323] [208] [INFO] [No default value for place. Creating new entry with value: scatter]
[2024-07-26 06:09:09,323] [208] [INFO] [No default value for select. Creating new entry with value: 1:ncpus=1]
[2024-07-26 06:09:09,323] [208] [INFO] [No default value for compute_node. Creating new entry with value: tbd]
[2024-07-26 06:09:09,378] [208] [INFO] [job_5 can run, doing dry run test with following parameters: c6i.large+c6i.xlarge+c6i.2xlarge *  1]
[2024-07-26 06:09:12,413] [208] [INFO] [5 : compute_node=job5 | stack_id=soca-wba-job-5]
[2024-07-26 06:09:12,413] [208] [INFO] [select variable: 1:ncpus=1:compute_node=job5]
[2024-07-26 06:09:12,493] [208] [INFO] [Checking if we have enough resources available to run job_6]
[2024-07-26 06:09:12,493] [208] [INFO] [No default value for ncpus. Creating new entry with value: 1]
[2024-07-26 06:09:12,493] [208] [INFO] [No default value for nodect. Creating new entry with value: 1]
[2024-07-26 06:09:12,493] [208] [INFO] [No default value for place. Creating new entry with value: pack]
[2024-07-26 06:09:12,493] [208] [INFO] [No default value for select. Creating new entry with value: 1:ncpus=1]
[2024-07-26 06:09:12,493] [208] [INFO] [No default value for compute_node. Creating new entry with value: tbd]
[2024-07-26 06:09:12,509] [208] [INFO] [job_6 can run, doing dry run test with following parameters: c6i.large+c6i.xlarge+c6i.2xlarge *  1]
[2024-07-26 06:09:14,997] [208] [INFO] [6 : compute_node=job6 | stack_id=soca-wba-job-6]
[2024-07-26 06:09:14,997] [208] [INFO] [select variable: 1:ncpus=1:compute_node=job6]
[2024-07-26 06:10:09,511] [208] [INFO] [Queue provisioning: fifo, scaling mode: single_job]
[2024-07-26 06:10:09,565] [208] [INFO] [Checking existing cloudformation soca-wba-job-5]
[2024-07-26 06:10:09,666] [208] [INFO] [5 is queued but CI has been specified and CloudFormation has been created.]
[2024-07-26 06:10:09,666] [208] [INFO] [5 Stack has been created for less than 30 minutes. Let's wait a bit before killing the CI and resetting the compute_node value]
[2024-07-26 06:10:09,666] [208] [INFO] [Skipping 5 as this job already has a valid compute node]
[2024-07-26 06:10:09,666] [208] [INFO] [Checking existing cloudformation soca-wba-job-6]
[2024-07-26 06:10:09,716] [208] [INFO] [6 is queued but CI has been specified and CloudFormation has been created.]
[2024-07-26 06:10:09,717] [208] [INFO] [6 Stack has been created for less than 30 minutes. Let's wait a bit before killing the CI and resetting the compute_node value]
[2024-07-26 06:10:09,717] [208] [INFO] [Skipping 6 as this job already has a valid compute node]
[2024-07-26 06:10:09,717] [208] [INFO] [================================================================]
[2024-07-26 06:10:09,717] [208] [INFO] [Detected Default Parameters for this queue: {'queues': ['job-shared'], 'scaling_mode': 'single_job', 'instance_type': 'c6i.large+c6i.xlarge+c6i.2xlarge', 'terminate_when_idle': 3, 'ht_support': 'true', 'placement_group': 'false', 'root_size': '10'}]
[2024-07-26 06:10:09,717] [208] [INFO] [Licenses Available: {}]
[2024-07-26 06:10:09,717] [208] [INFO] [Skip 5]
[2024-07-26 06:10:09,717] [208] [INFO] [Skip 6]

Thank you for your help.

ahmedelz commented 3 months ago

In queue_mapping.yml, is terminate_when_idle set under the job-shared section?

The example job script above seems to indicate that you're using MPI. If so, the recommendation would be to use the normal queue for these jobs. The job-shared queue is meant for jobs that can share CPU slots on the same instance.
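To illustrate the suggestion, the submission script from the issue would target the normal queue by changing only the queue directive (everything else below is copied from the reporter's script, with the unchanged paths and names taken as-is from it):

```shell
#!/bin/bash
# MPI jobs fit the normal queue (single_job mode), where each job
# gets dedicated instances; job-shared is for jobs that can share
# CPU slots on the same instance.
#PBS -N my_job_name
#PBS -V -j oe -o my_job_name.qlog
#PBS -P project_a
#PBS -q normal
#PBS -l nodes=1
## END PBS SETTINGS
cd $PBS_O_WORKDIR
cat $PBS_NODEFILE | sort | uniq > mpi_nodes
/apps/openmpi/5.0.3/bin/mpirun --hostfile mpi_nodes -np 1 script.sh > my_output.log
```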

wbadyx commented 3 months ago

Yes, queue_mapping is set as follows:

job-shared:
    queues: ["job-shared"]
    # Uncomment to limit the number of concurrent running jobs
    # max_running_jobs: 50
    # Queue ACLs:  https://awslabs.github.io/scale-out-computing-on-aws/tutorials/manage-queue-acls/
    allowed_users: [] # empty list = all users can submit job
    excluded_users: [] # empty list = no restriction, ["*"] = only allowed_users can submit job
    # Queue mode (can be either fifo or fairshare)
    # queue_mode: "fifo"
    # Instance types restrictions: https://awslabs.github.io/scale-out-computing-on-aws/security/manage-queue-instance-types/
    allowed_instance_types: [] # Empty list, all EC2 instances allowed. You can restrict by instance type (Eg: ["c5.4xlarge"]) or instance family (eg: ["c5"])
    excluded_instance_types: [] # Empty list, no EC2 instance types prohibited.  You can restrict by instance type (Eg: ["c5.4xlarge"]) or instance family (eg: ["c5"])
    # List of parameters user can not override: https://awslabs.github.io/scale-out-computing-on-aws/security/manage-queue-restricted-parameters/
    restricted_parameters: []
    # Default job parameters: https://awslabs.github.io/scale-out-computing-on-aws/tutorials/integration-ec2-job-parameters/
    # Scaling mode (can be either single_job, or multiple_jobs): single_job runs a single job per EC2 instance, multiple_jobs allows running multiple jobs on the same EC2 instance
    scaling_mode: "multiple_jobs" # Allowed values: single_job, multiple_jobs
    instance_type: "c6i.large+c6i.xlarge+c6i.2xlarge" # Required
    # instance_ami: "ami-0cf43e890af9e3351" # If you want to enforce a default AMI, make sure it match value of base_os
    # base_os: "amazonlinux2" # To enforce a specific operating system for your HPC nodes
    # Terminate when idle: The value specifies the default duration (in mins) where the compute instances would be terminated after being detected as free (no jobs running) for N consecutive minutes
    terminate_when_idle: 3 # Required when scaling_mode is set to multiple_jobs
    ht_support: "true"
    placement_group: "false"
    root_size: "10"

Even after I simplified the script (removing the MPI invocation), the job still doesn't run. The shell.script:

#!/bin/bash
#PBS -N my_job_name
#PBS -V -j oe -o my_job_name.qlog
#PBS -P project_a
#PBS -q job-shared
## END PBS SETTINGS
/bin/sh script.sh > my_output.log

ahmedelz commented 3 months ago

Does it work if you change the queue name in shell.script to normal? #PBS -q normal

wbadyx commented 3 months ago

Yes, but the normal queue uses the "single_job" scaling_mode.

ahmedelz commented 3 months ago

I was able to reproduce this issue in SOCA 2.7.5.

The fix is to:

  1. Uncomment lines 54-55 in queue_mapping.yml (instance_ami and base_os in the job-shared queue section)
  2. Uncomment lines 792-796 and 1967-1973 in dispatcher.py
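For reference, after step 1 the relevant part of the job-shared section in queue_mapping.yml would look like the sketch below. The AMI ID and base_os values are the placeholders from the config posted earlier in this thread, not verified values; substitute an AMI that matches your own base_os.

```yaml
job-shared:
    queues: ["job-shared"]
    scaling_mode: "multiple_jobs" # Allowed values: single_job, multiple_jobs
    instance_type: "c6i.large+c6i.xlarge+c6i.2xlarge" # Required
    # Step 1 of the fix: these two lines are now uncommented.
    instance_ami: "ami-0cf43e890af9e3351" # Placeholder; must match base_os
    base_os: "amazonlinux2"
    terminate_when_idle: 3 # Required when scaling_mode is set to multiple_jobs
    ht_support: "true"
    placement_group: "false"
    root_size: "10"
```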