Ray provides a simple, universal API for building distributed applications; see the Ray documentation for more details.
The Ray integration with LSF lets users start a Ray cluster on LSF and run DL workloads on that cluster in either batch or interactive mode.
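For context, a Ray application expresses parallel work as remote tasks or actors; the following is a minimal, illustrative sketch of that API (not part of this repo):

# Minimal illustration of Ray's API (illustrative only, not part of this repo):
# define a remote task, launch several copies in parallel, gather the results.
import ray

ray.init()  # with no address given, this starts a local single-node Ray instance

@ray.remote
def square(x):
    # Runs as a task on whichever worker Ray schedules it on.
    return x * x

futures = [square.remote(i) for i in range(4)]
print(ray.get(futures))  # -> [0, 1, 4, 9]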
Set up a conda environment using the sample environment file provided in this repo:
conda env create -f sample_conda_env/sample_ray_env.yml
Activate the environment and install (or upgrade) Ray:
conda activate ray
pip install -U ray
To check that Ray is installed and see which version you have, run:
ray --version
which should print something like:
ray, version 1.4.0
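As an optional extra check, you can confirm the installation from Python itself:

# Optional check: confirm that Ray imports cleanly and report its version.
import ray
print(ray.__version__)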
To run in interactive mode, first request an interactive session from LSF (this example asks for two hosts, one slot per host, with GPUs):
bsub -Is -M 20GB! -n 2 -R "span[ptile=1]" -gpu "num=2" bash
Once inside the interactive session, launch the Ray cluster and run the workload with the launcher script:
./ray_launch_cluster.sh -c "python <full_path_of_sample_workload>/cifar_pytorch_example.py --use-gpu --num_epochs 5 --num-workers 4" -n "ray" -m 20000000000
Where:
-c is the user command to be scaled out under Ray (see the sketch after this list for how such a workload attaches to the cluster)
-n is the conda environment that will be activated before the cluster is spawned
-m is the object store memory size in bytes, as required by Ray
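The command passed with -c is expected to attach to the Ray cluster that the script has just started rather than launch its own; in a typical Ray workload that is done with ray.init(address="auto"). A minimal sketch under that assumption (hypothetical script, not the cifar example itself):

# Hypothetical workload sketch: attach to the already-running Ray cluster
# started by ray_launch_cluster.sh instead of creating a new local one.
import os
import ray

ray.init(address="auto")  # connect to the existing head node on this host
print("Cluster resources:", ray.cluster_resources())  # CPUs/GPUs across all hosts

@ray.remote(num_gpus=1)
def which_gpu():
    # Ray sets CUDA_VISIBLE_DEVICES to the GPU(s) assigned to this task.
    return os.environ.get("CUDA_VISIBLE_DEVICES")

print(ray.get([which_gpu.remote() for _ in range(2)]))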
When the cluster starts, the script reports the head node and the Ray dashboard address, for example:
Starting ray head node on: ccc2-10
The size of object store memory in bytes is: 20000000000
2021-06-07 14:19:11,441 INFO services.py:1269 -- View the Ray dashboard at http://127.0.0.1:3752
The dashboard listens on the head node, so to view it from a local browser, forward the reported port over SSH:
export PORT=3752
export HEAD_NODE=ccc2-10.sl.cloud.ibm.com
ssh -L $PORT:localhost:$PORT -N -f -l <username> $HEAD_NODE
Then open http://127.0.0.1:3752 in your local browser.
To run the same workload in batch (non-interactive) mode, submit the launcher script directly to LSF:
bsub -o std%J.out -e std%J.out -M 20GB! -n 2 -R "span[ptile=1]" -gpu "num=2" ./ray_launch_cluster.sh -c "python <full_path_of_sample_workload>/cifar_pytorch_example.py --use-gpu --num-workers 4 --num_epochs 5" -n "ray" -m 20000000000
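In batch mode the worker nodes may register with the head a few seconds after the user command starts, so a workload can optionally wait for the expected resources before it begins training. A hedged sketch of that guard (hypothetical, not part of the cifar example):

# Hypothetical guard: block until the expected number of GPUs has joined the
# cluster, since batch-launched worker nodes can register a little late.
import time
import ray

ray.init(address="auto")

EXPECTED_GPUS = 4  # assumption: adjust to match the GPUs requested in the bsub line above
while ray.cluster_resources().get("GPU", 0) < EXPECTED_GPUS:
    time.sleep(1)

print("All GPUs registered:", ray.cluster_resources())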