Ray provides a simple, universal API for building distributed applications; see the Ray documentation for more details.
The Ray integration with LSF lets users start a Ray cluster on LSF and run DL workloads on that cluster in either batch or interactive mode.
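For context, a Ray application expresses parallel work as remote tasks or actors; the following is a minimal, illustrative sketch of that API (not part of this repo):

# Minimal illustration of Ray's API (illustrative only, not part of this repo):
# define a remote task, launch several copies in parallel, gather the results.
import ray

ray.init()  # with no address given, this starts a local single-node Ray instance

@ray.remote
def square(x):
    # Runs as a task on whichever worker Ray schedules it on.
    return x * x

futures = [square.remote(i) for i in range(4)]
print(ray.get(futures))  # -> [0, 1, 4, 9]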
Set up a conda environment using the sample environment file provided in this repo:
conda env create -f sample_conda_env/sample_ray_env.yml
Activate the environment and install (or upgrade) Ray:
conda activate ray
pip install -U ray
To check that Ray is installed and see which version you have, run:
ray --version
which should print something like:
ray, version 1.4.0
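As an optional extra check, you can confirm the installation from Python itself:

# Optional check: confirm that Ray imports cleanly and report its version.
import ray
print(ray.__version__)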
To run in interactive mode, first request an interactive session from LSF (this example asks for two hosts, one slot per host, with GPUs):
bsub -Is -M 20GB! -n 2 -R "span[ptile=1]" -gpu "num=2" bash
Once inside the interactive session, launch the Ray cluster and run the workload with the launcher script:
./ray_launch_cluster.sh -c "python <full_path_of_sample_workload>/cifar_pytorch_example.py --use-gpu --num_epochs 5 --num-workers 4" -n "ray" -m 20000000000
Where:
-c is the user command to be scaled out under Ray (see the sketch after this list for how such a workload attaches to the cluster)
-n is the conda environment that will be activated before the cluster is spawned
-m is the object store memory size in bytes, as required by Ray
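The command passed with -c is expected to attach to the Ray cluster that the script has just started rather than launch its own; in a typical Ray workload that is done with ray.init(address="auto"). A minimal sketch under that assumption (hypothetical script, not the cifar example itself):

# Hypothetical workload sketch: attach to the already-running Ray cluster
# started by ray_launch_cluster.sh instead of creating a new local one.
import os
import ray

ray.init(address="auto")  # connect to the existing head node on this host
print("Cluster resources:", ray.cluster_resources())  # CPUs/GPUs across all hosts

@ray.remote(num_gpus=1)
def which_gpu():
    # Ray sets CUDA_VISIBLE_DEVICES to the GPU(s) assigned to this task.
    return os.environ.get("CUDA_VISIBLE_DEVICES")

print(ray.get([which_gpu.remote() for _ in range(2)]))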
When the cluster starts, the script reports the head node and the Ray dashboard address, for example:
Starting ray head node on: ccc2-10
The size of object store memory in bytes is: 20000000000
2021-06-07 14:19:11,441 INFO services.py:1269 -- View the Ray dashboard at http://127.0.0.1:3752
The dashboard listens on the head node, so to view it from a local browser, forward the reported port over SSH:
export PORT=3752
export HEAD_NODE=ccc2-10.sl.cloud.ibm.com
ssh -L $PORT:localhost:$PORT -N -f -l <username> $HEAD_NODE
Then open http://127.0.0.1:3752 in your local browser.
To run the same workload in batch (non-interactive) mode, submit the launcher script directly to LSF:
bsub -o std%J.out -e std%J.out -M 20GB! -n 2 -R "span[ptile=1]" -gpu "num=2" ./ray_launch_cluster.sh -c "python <full_path_of_sample_workload>/cifar_pytorch_example.py --use-gpu --num-workers 4 --num_epochs 5" -n "ray" -m 20000000000
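In batch mode the worker nodes may register with the head a few seconds after the user command starts, so a workload can optionally wait for the expected resources before it begins training. A hedged sketch of that guard (hypothetical, not part of the cifar example):

# Hypothetical guard: block until the expected number of GPUs has joined the
# cluster, since batch-launched worker nodes can register a little late.
import time
import ray

ray.init(address="auto")

EXPECTED_GPUS = 4  # assumption: adjust to match the GPUs requested in the bsub line above
while ray.cluster_resources().get("GPU", 0) < EXPECTED_GPUS:
    time.sleep(1)

print("All GPUs registered:", ray.cluster_resources())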