potential bugs in run_pyscreener_distributed_example.batch #40

Closed likun1212 closed 2 years ago

likun1212 commented 2 years ago

Hi,

I think there are two potential bugs in run_pyscreener_distributed_example.batch (and in run_molpal.batch as well). With these scripts, I found that Ray cannot leverage all the resources on a cluster node.

1) In line 7, "#SBATCH --ntasks-per-node 4": I think "--ntasks-per-node" should always be 1. Quote: "this will be used to guarantee that each Ray worker runtime will obtain the proper resources" (see https://docs.ray.io/en/latest/cluster/slurm.html).

2) In lines 31 and 41, you start the Ray cluster with "--num-cpus $SLURM_CPUS_ON_NODE". However, this only lets Ray use part of the CPUs on a node. For instance, say you ask for 1 node and set "-c 4" and "--ntasks-per-node 4": Ray can then only use 4*4=16 CPUs even though the node has 32, because $SLURM_CPUS_ON_NODE returns 16 instead of 32. This is illustrated in the sketch below.
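To make the arithmetic concrete, here is a minimal sketch of the failure mode (the SBATCH values are just the example numbers from point 2, not the exact lines of the shipped scripts):

#SBATCH --nodes=1
#SBATCH --ntasks-per-node=4    # 4 tasks per node ...
#SBATCH -c 4                   # ... with 4 CPUs each

echo $SLURM_CPUS_ON_NODE       # prints 16 (4 tasks x 4 CPUs), not the 32 CPUs the node has

# Starting Ray with that variable caps the worker at 16 CPUs,
# so the remaining 16 CPUs on the node sit idle:
ray start --head --num-cpus "$SLURM_CPUS_ON_NODE"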

Suggested config:

######################################################################

#!/bin/bash

#SBATCH --partition=???
#SBATCH --job-name=???
#SBATCH -o %x_%j.out
#SBATCH -e %x_%j.err

### This script works for any number of nodes, Ray will find and manage all resources
#SBATCH --nodes=10
#SBATCH --exclusive
### Give all resources to a single Ray task, ray can manage the resources internally
#SBATCH --ntasks-per-node=1

# Load modules or your own conda environment here
# module load pytorch/v1.4.0-gpu
# conda activate ${CONDA_ENV}
source activate pyscreener

# ===== DO NOT CHANGE THINGS HERE UNLESS YOU KNOW WHAT YOU ARE DOING =====
# This script is a modification of the implementation suggested by gregSchwartz18 here:
# https://github.com/ray-project/ray/issues/826#issuecomment-522116599
redis_password=$(uuidgen)
export redis_password

nodes=$(scontrol show hostnames "$SLURM_JOB_NODELIST") # Getting the node names
nodes_array=($nodes)

node_1=${nodes_array[0]}
ip=$(srun --nodes=1 --ntasks=1 -w "$node_1" hostname --ip-address) # making redis-address

# if we detect a space character in the head node IP, we'll
# convert it to an ipv4 address. This step is optional.
if [[ "$ip" == *" "* ]]; then
  IFS=' ' read -ra ADDR <<< "$ip"
  if [[ ${#ADDR[0]} -gt 16 ]]; then
    ip=${ADDR[1]}
  else
    ip=${ADDR[0]}
  fi
  echo "IPV6 address detected. We split the IPV4 address as $ip"
fi

port=6379
ip_head=$ip:$port
export ip_head
echo "IP Head: $ip_head"

echo "STARTING HEAD at $node_1"
srun --nodes=1 --ntasks=1 -w "$node_1" \
  ray start --head --node-ip-address="$ip" --port=$port --redis-password="$redis_password" --block &
sleep 30

worker_num=$((SLURM_JOB_NUM_NODES - 1)) #number of nodes other than the head node
for ((i = 1; i <= worker_num; i++)); do
  node_i=${nodes_array[$i]}
  echo "STARTING WORKER $i at $node_i"
  srun --nodes=1 --ntasks=1 -w "$node_i" ray start --address "$ip_head" --redis-password="$redis_password" --block &
  sleep 5
done

# ===== Call your code below =====
pyscreener --config vina_dock.ini --ncpu 2

########################################
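Fill in the ??? placeholders (partition and job name) and submit the script with sbatch, e.g.:

sbatch run_pyscreener_distributed_example.batch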

davidegraff commented 2 years ago

I don’t believe this claim is right. The launcher portion of the batch script works by

  1. starting one task on each node in the allocation
  2. in that task, starting a Ray worker with all of the CPUs on that node ($SLURM_CPUS_ON_NODE)

By SLURM’s nature, a job step has exclusive access to all the resources on the nodes for which it was scheduled, meaning that if you start a Ray worker within a job step, it will be able to use all the resources of the node on which it’s running. See the --exclusive section of the srun man page:

    This option applies to job and job step allocations, and has two slightly different meanings for each one. When used to initiate a job, the job allocation cannot share nodes with other running jobs (or just other users with the "=user" option or "=mcs" option). If user/mcs are not specified (i.e. the job allocation can not share nodes with other running jobs), the job is allocated all CPUs and GRES on all nodes in the allocation, but is only allocated as much memory as it requested. This is by design to support gang scheduling, because suspended jobs still reside in memory. To request all the memory on a node, use --mem=0. The default shared/exclusive behavior depends on system configuration and the partition's OverSubscribe option takes precedence over the job's option. This option can also be used when initiating more than one job step within an existing resource allocation (default), where you want separate processors to be dedicated to each job step. If sufficient processors are not available to initiate the job step, it will be deferred. This can be thought of as providing a mechanism for resource management to the job within its allocation (--exact implied).

    The exclusive allocation of CPUs applies to job steps by default, but --exact is NOT the default. In other words, the default behavior is this: job steps will not share CPUs, but job steps will be allocated all CPUs available to the job on all nodes allocated to the steps.

There are multiple ways to start a Ray cluster inside a SLURM job, and this is only one of them. If you run this script with varying values of --ntasks-per-node, you will see that your Ray cluster possesses more resources and your screen runs faster.
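For reference, one way to check how many CPUs the cluster actually ended up with is to query Ray from the head node once the workers have joined (a minimal sketch to add after the worker start-up loop; depending on your Ray version you may also need to point it at the cluster address or pass the Redis password):

# Run inside the same allocation, after all workers have been started:
srun --nodes=1 --ntasks=1 -w "$node_1" ray status
# The "Resources" section reports the total CPU count Ray can schedule;
# compare it against (CPUs per node) x (number of nodes).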

likun1212 commented 2 years ago

try this:

#SBATCH -N 1
#SBATCH -p ???
#SBATCH --ntasks-per-node 1
#SBATCH -c 4

echo $SLURM_CPUS_ON_NODE

# Say I have 32 CPUs on that node: this will print 4 instead of 32, since I am asking for only 4 CPUs.

davidegraff commented 2 years ago

You should read up on the SLURM documentation if you’re confused by this. You requested 4, so SLURM gave you 4, even if the node has 32. If you want all 32, then ask for them.
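For example, either of these headers (a minimal sketch, assuming a 32-CPU node) gives the job all 32 CPUs:

# Option A: request all of the node's CPUs explicitly
#SBATCH -N 1
#SBATCH --ntasks-per-node 1
#SBATCH -c 32

# Option B: take the whole node, as the suggested config above already does
#SBATCH -N 1
#SBATCH --ntasks-per-node 1
#SBATCH --exclusive

echo $SLURM_CPUS_ON_NODE   # reports 32 in either case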