aws / aws-parallelcluster

AWS ParallelCluster is an AWS supported Open Source cluster management tool to deploy and manage HPC clusters in the AWS cloud.
https://github.com/aws/aws-parallelcluster
Apache License 2.0

srun: error: Unable to allocate resources: X11 forwarding not available #3033

Closed: rsignell-usgs closed this issue 3 years ago

rsignell-usgs commented 3 years ago

I'm connected to my Ubuntu ParallelCluster (created with pcluster 2.10.3) using

pcluster dcv connect hpc-cluster

and I'd like to run a graphical application on a node:

srun --x11 --nodes=1 --partition=efa-spot --time=2:00:00 --pty bash -i

and I'm getting

srun: error: Unable to allocate resources: X11 forwarding not available

Any ideas?
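
For reference, a minimal check of the Slurm X11 configuration on the head node might look like the sketch below. It assumes a standard Slurm setup with the ParallelCluster install prefix /opt/slurm; srun --x11 uses Slurm's built-in X11 forwarding, which also needs PrologFlags to include X11 in slurm.conf:

# Hedged diagnostic sketch; run on the head node. Paths are illustrative.
scontrol show config | grep -i prologflags      # should include X11 for srun --x11 to work
grep -i prologflags /opt/slurm/etc/slurm.conf   # /opt/slurm is the ParallelCluster Slurm prefix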

Here is my config:

[global]
cluster_template = hpc
update_check = true
sanity_check = true

[aws]
aws_region_name = us-east-2

[aliases]
ssh = ssh {CFN_USER}@{MASTER_IP} {ARGS}

[scaling custom]
scaledown_idletime = 20

[cluster hpc]
key_name = AWS-HPC-Ohio-diqJdkaH
base_os = ubuntu1804
scheduler = slurm
master_instance_type = c5.2xlarge
vpc_settings = public-private
queue_settings = ondemand, spot, efa, gpu, efa-spot
dcv_settings = dcv
post_install = s3://coawst/no-tears-postinstall.sh
post_install_args = "/shared/spack-v0.16.0 v0.16.0 https://notearshpc-quickstart.s3.amazonaws.com/0.2.3/spack /opt/slurm/log sacct.log"
tags = {"QuickStart" : "NoTearsCluster"}
s3_read_resource = arn:aws:s3:::*
s3_read_write_resource = arn:aws:s3:::aws-hpc-ohio-data/*
master_root_volume_size = 50
ebs_settings = myebs
cw_log_settings = cw-logs
additional_iam_policies=arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore,arn:aws:iam::aws:policy/AmazonSSMPatchAssociation
fsx_settings=fsx-mount

[queue ondemand]
compute_resource_settings = od-small, od-medium, od-large
compute_type = ondemand
enable_efa = false
enable_efa_gdr = false
disable_hyperthreading = true
placement_group = DYNAMIC

[compute_resource od-small]
instance_type = c5.2xlarge
min_count = 0
max_count = 16
initial_count = 0

[compute_resource od-medium]
instance_type = c5.9xlarge
min_count = 0
max_count = 16
initial_count = 0

[compute_resource od-large]
instance_type = c5.18xlarge
min_count = 0
max_count = 16
initial_count = 0

[queue spot]
compute_resource_settings = sp-small, sp-medium, sp-large
compute_type = spot
enable_efa = false
enable_efa_gdr = false
disable_hyperthreading = true
placement_group = DYNAMIC

[compute_resource sp-small]
instance_type = c5.2xlarge
min_count = 0
max_count = 16
initial_count = 0
# If you don't specify a value, you're charged the Spot price, capped at the
# On-Demand price.
#spot_price = 0.5

[compute_resource sp-medium]
instance_type = c5.9xlarge
min_count = 0
max_count = 16
initial_count = 0
# If you don't specify a value, you're charged the Spot price, capped at the
# On-Demand price.
#spot_price = 0.5

[compute_resource sp-large]
instance_type = c5.18xlarge
min_count = 0
max_count = 16
initial_count = 0
# If you don't specify a value, you're charged the Spot price, capped at the
# On-Demand price.
#spot_price = 0.5

[queue efa]
compute_resource_settings = efa-large
compute_type = ondemand
enable_efa = true
enable_efa_gdr = false
disable_hyperthreading = true
placement_group = DYNAMIC

[queue efa-spot]
compute_resource_settings = efa-spot-large
compute_type = spot
enable_efa = true
enable_efa_gdr = false
disable_hyperthreading = true
placement_group = DYNAMIC

[compute_resource efa-spot-large]
instance_type = c5n.18xlarge
min_count = 0
max_count = 16
initial_count = 0

[compute_resource efa-large]
instance_type = c5n.18xlarge
min_count = 0
max_count = 16
initial_count = 0

[queue gpu]
compute_resource_settings = gpu-large
compute_type = ondemand
enable_efa = false
enable_efa_gdr = false
disable_hyperthreading = true
placement_group = DYNAMIC

[compute_resource gpu-large]
instance_type = g4dn.12xlarge
min_count = 0
max_count = 16
initial_count = 0

[ebs myebs]
volume_size = 200
shared_dir = /shared

[fsx fsx-mount]
shared_dir = /scratch
fsx_fs_id = fs-06356

[dcv dcv]
enable = master
port = 8443
access_from = 0.0.0.0/0

[cw_log cw-logs]
enable = false

[vpc public-private]
vpc_id = vpc-0eea367
master_subnet_id = subnet-03fd6
compute_subnet_id = subnet-0cbfa8b4
# SG for FSx Lustre
additional_sg = sg-06338a6
use_public_ips = false

with the post-install script:

#!/bin/bash
# Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
# SPDX-License-Identifier: MIT-0
set +e

exec &> >(tee -a "/tmp/post_install.log")

. "/etc/parallelcluster/cfnconfig"

echo "post-install script has $# arguments"
for arg in "$@"
do
    echo "arg: ${arg}"
done

# Enables qstat for slurm
YUM_CMD=$(which yum)
APT_GET_CMD=$(which apt-get)
if [[ ! -z $YUM_CMD ]]; then
    wget https://dl.fedoraproject.org/pub/epel/epel-release-latest-7.noarch.rpm -P /tmp
    yum install -y /tmp/epel-release-latest-7.noarch.rpm

    yum install -y perl-Switch python3 python3-pip links
    getent passwd ec2-user > /dev/null 2>&1
    if [ $? -eq 0 ]; then
        OSUSER=ec2-user
        OSGROUP=ec2-user
    else
        OSUSER=centos
        OSGROUP=centos
    fi

    # Add nvidia-docker if possible
    nvidia-smi -L > /dev/null
    if [ $? -eq 0  ]; then
     distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
     curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.repo | tee /etc/yum.repos.d/nvidia-docker.repo
     yum install -y nvidia-docker2
     groupadd docker
     usermod -aG docker $OSUSER
     systemctl restart docker
    fi
elif [[ ! -z $APT_GET_CMD ]]; then
    apt-get update
    apt-get install -y libswitch-perl python3 python3-pip links
    OSUSER=ubuntu
    OSGROUP=ubuntu

    # Add nvidia-docker if possible
    nvidia-smi -L > /dev/null
    if [ $? -eq 0  ]; then
     distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
     curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | apt-key add - && distribution=$(. /etc/os-release;echo $ID$VERSION_ID) && curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | tee /etc/apt/sources.list.d/nvidia-docker.list && apt-get update && apt-get install -y nvidia-docker2 && pkill -SIGHUP dockerd
     groupadd docker
     usermod -aG docker $OSUSER
     systemctl restart docker
    fi
else
    echo "error can't install package $PACKAGE"
    exit 1;
fi

pip3 install --upgrade awscli boto3

# Override with $2 if set, or use default paths
spack_install_path=${2:-/shared/spack}
spack_tag=${3:-releases/v0.16}
spack_config_uri=${4:-https://notearshpc-quickstart.s3.amazonaws.com/0.2.0/spack}
accounting_log_path=${5:-/opt/slurm/log}
accounting_log_file=${6:-sacct.log}

env > /opt/user_data_env.txt

case "${cfn_node_type}" in
    MasterServer)

        export AWS_DEFAULT_REGION=$(curl -s http://169.254.169.254/latest/meta-data/placement/availability-zone | rev | cut -c 2- | rev)
        aws configure set default.region ${AWS_DEFAULT_REGION}
        aws configure set default.output json

        # Setup spack on master:
        git clone https://github.com/spack/spack -b ${spack_tag} ${spack_install_path}

        # On both: load spack at login
        echo ". ${spack_install_path}/share/spack/setup-env.sh" > /etc/profile.d/spack.sh
        echo ". ${spack_install_path}/share/spack/setup-env.csh" > /etc/profile.d/spack.csh

        mkdir -p ${spack_install_path}/etc/spack
        # V2.0 borrowed "all:" block from https://spack-tutorial.readthedocs.io/en/latest/tutorial_configuration.html

        # Autodetect OPENMPI, INTELMPI, SLURM, LIBFABRIC and GCC versions to inform Spack of available packages.
        # e.g., OPENMPI_VERSION=4.0.3
        export OPENMPI_VERSION=$(. /etc/profile && module avail openmpi 2>&1 | grep openmpi | head -n 1 | cut -d / -f 2)
        # e.g., INTELMPI_VERSION=2019.7.166
        export INTELMPI_VERSION=$(. /etc/profile && module show intelmpi 2>&1 | grep I_MPI_ROOT | sed 's/[[:alpha:]|_|:|\/|(|[:space:]]//g' | awk -F- '{print $1}' )
        # e.g., SLURM_VERSION=19.05.5
        export SLURM_VERSION=$(. /etc/profile && sinfo --version | cut -d' ' -f 2)
        # e.g., LIBFABRIC_VERSION=1.10.0
        # e.g., LIBFABRIC_MODULE=1.10.0amzn1.1
        export LIBFABRIC_MODULE=$(. /etc/profile && module avail libfabric 2>&1 | grep libfabric | head -n 1 )
        export LIBFABRIC_MODULE_VERSION=$(. /etc/profile && module avail libfabric 2>&1 | grep libfabric | head -n 1 |  cut -d / -f 2 )
        export LIBFABRIC_VERSION=${LIBFABRIC_MODULE_VERSION//amzn*}
        # e.g., GCC_VERSION=7.3.5
        export GCC_VERSION=$( gcc -v 2>&1 |tail -n 1| awk '{print $3}' )

        #NOTE: as of parallelcluster v2.8.0, SLURM is built with PMI3

        echo "Pulling Config: ${spack_config_uri}"
        case "${spack_config_uri}" in
            s3://*)
                aws s3 cp ${spack_config_uri}/packages.yaml /tmp/packages.yaml --quiet;
                aws s3 cp ${spack_config_uri}/modules.yaml /tmp/modules.yaml --quiet;
                aws s3 cp ${spack_config_uri}/mirrors.yaml /tmp/mirrors.yaml --quiet;;
            http://*|https://*)
                wget ${spack_config_uri}/packages.yaml -O /tmp/packages.yaml -o /tmp/debug_spack.wget;
                wget ${spack_config_uri}/modules.yaml -O /tmp/modules.yaml -a /tmp/debug_spack.wget;
                wget ${spack_config_uri}/mirrors.yaml -O /tmp/mirrors.yaml -a /tmp/debug_spack.wget;;
            *)
                echo "Unknown/Unsupported spack packages URI"
                ;;
        esac
        envsubst < /tmp/packages.yaml > ${spack_install_path}/etc/spack/packages.yaml
        cat ${spack_install_path}/etc/spack/packages.yaml

        envsubst < /tmp/modules.yaml > ${spack_install_path}/etc/spack/modules.yaml
        cat ${spack_install_path}/etc/spack/modules.yaml

        envsubst < /tmp/mirrors.yaml > ${spack_install_path}/etc/spack/mirrors.yaml
        cat ${spack_install_path}/etc/spack/mirrors.yaml

    echo "OSUSER=${OSUSER}"
    echo "OSGROUP=${OSGROUP}"
    chown -R ${OSUSER}:${OSGROUP} ${spack_install_path}
    chmod -R go+rwX ${spack_install_path}

    #. /etc/profile.d/spack.sh
    su - ${OSUSER} -c ". /etc/profile && curl -o /tmp/amzn2-e4s.pub https://s3.amazonaws.com/spack-mirrors/amzn2-e4s/build_cache/_pgp/7D344E2992071B0AAAE1EDB0E68DE2A80314303D.pub && spack gpg trust /tmp/amzn2-e4s.pub"
    su - ${OSUSER} -c ". /etc/profile && spack install miniconda3"
    su - ${OSUSER} -c ". /etc/profile && module load miniconda3 && conda upgrade conda -y"

    mkdir -p ${accounting_log_path}
    chmod 755 ${accounting_log_path}
    touch ${accounting_log_path}/${accounting_log_file}
    chmod 644 ${accounting_log_path}/${accounting_log_file}
    chown slurm:slurm ${accounting_log_path}/${accounting_log_file}
    cat << EOF > /opt/slurm/etc/enable_sacct.conf

JobAcctGatherType=jobacct_gather/linux
JobAcctGatherFrequency=30
#
#AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageType=accounting_storage/filetxt
#AccountingStorageHost=
#AccountingStorageLoc=
AccountingStorageLoc=${accounting_log_path}/${accounting_log_file}
#AccountingStoragePass=
#AccountingStorageUser=

MinJobAge=172800
EOF
    grep -qxF 'include enable_sacct.conf' /opt/slurm/etc/slurm.conf || echo 'include enable_sacct.conf' >> /opt/slurm/etc/slurm.conf

    systemctl restart slurmctld.service

    ;;
    ComputeFleet)
        # On both: load spack at login
        echo ". ${spack_install_path}/share/spack/setup-env.sh" > /etc/profile.d/spack.sh
        echo ". ${spack_install_path}/share/spack/setup-env.csh" > /etc/profile.d/spack.csh
        sudo sed -i "s/Unattended-Upgrade \"1\"/Unattended-Upgrade \"0\"/g" /etc/apt/apt.conf.d/20auto-upgrades
    ;;
    *)
    ;;
esac
demartinofra commented 3 years ago

Most likely this is due to Slurm not being compiled with X11 support. If that is the issue, we'll have to look into which packages are missing so that X11 support can be enabled when we compile Slurm.

In the meantime, one potential workaround would be for you to recompile Slurm with that support. As a reference, here is how we compile Slurm in ParallelCluster: https://github.com/aws/aws-parallelcluster-cookbook/blob/develop/recipes/slurm_install.rb#L75-L87
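
A rough sketch of what that rebuild could look like is below; the configure invocation, source path, and the PrologFlags step are assumptions based on a generic Slurm source build rather than the exact ParallelCluster recipe (the cookbook link above is the authoritative procedure):

# Hedged sketch only; run on the head node against a Slurm source tree that
# matches the installed version. Paths and flags are illustrative assumptions.
cd slurm-*/
./configure --prefix=/opt/slurm            # /opt/slurm is ParallelCluster's Slurm prefix
make -j "$(nproc)" && sudo make install
# Slurm's built-in X11 forwarding also has to be enabled in slurm.conf:
echo 'PrologFlags=X11' | sudo tee -a /opt/slurm/etc/slurm.conf
sudo systemctl restart slurmctld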

no-response[bot] commented 3 years ago

This issue has been automatically closed because there has been no response to our request for more information from the original author. With only the information that is currently in the issue, we don't have enough information to take action. Please reach out if you have or find the answers we need so that we can investigate further.

rsignell-usgs commented 3 years ago

Thanks @demartinofra !