roclark opened this issue 3 years ago
I've been doing some additional digging into this, and am leaning more towards the Pyxis/Enroot route. With that setup, we can bypass mpirun for our tests while still getting the expected performance. This would also remove our need to install an SSH server plus keys inside of the image, making it easy to scale out. For safekeeping, here's what I have running so far on a SLURM cluster with Pyxis/Enroot.
Update the nccl_multi.sh script by replacing the mpirun ... command with:
export NCCL_IB_HCA=$NCCL_IB_HCAS && \
export NCCL_IB_TC=$NCCL_TC && \
export NCCL_IB_GID_INDEX=$COMPUTE_GID && \
export NCCL_IB_CUDA_SUPPORT=1 && \
/nccl-tests/build/all_reduce_perf -b 8 -e ${NCCL_MAX}G -f 2
Build a new Docker image based on this and create an Enroot image of the new Docker image. Next, from the SLURM controller, run:
srun -N1 --ntasks-per-node=8 --mpi=pmix --exclusive --gres=gpu:8 --container-image /tmp/nvidia+bobber+6.1.4.dev.sqsh --container-mounts /mnt/fs:/mnt/fs_under_test /tests/nccl_multi.sh
This yields nearly identical results to the existing 1-node tests without using MPI or the SSH server.
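For reference, the Enroot squashfs file used in the srun command above can typically be produced from the locally built Docker image with enroot import; the Docker tag below is just my assumption to match that path:
# Import the locally built Docker image into an Enroot squashfs file (tag assumed).
enroot import --output /tmp/nvidia+bobber+6.1.4.dev.sqsh dockerd://nvidia/bobber:6.1.4.dev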
With that being said, this raises the following question/task: should we integrate these srun commands into Bobber so users can do something like bobber run-all-multi ... to run this in a SLURM setup without specifying the extra pieces? @joehandzik, @fredvx, any thoughts?
Updating here as I made a bit of progress since my last comment. I found that this does indeed work for multi-node tests, with negligible differences compared to the existing multi-node implementation. This also gives me reasonable hope that we can do something similar with DALI, and we might be able to get fio/mdtest working as well.
On your point on the non-NGC container, we should be able to do this if the local version of Bobber is different/newer than the latest version in NGC. If it doesn't find something in NGC, it should ideally build a local version if not done already. At least, that's my hope. 😄
I've found that putting the Bobber images on a publicly accessible cloud registry, like NGC, makes this much easier to manage, as we can avoid accessing individual hosts altogether at that point. Assuming the image is available on NGC, the above srun command for NCCL can be updated as follows (note that this is just an example and it is NOT available on NGC at the moment, so this will not work as listed):
srun -N1 --ntasks-per-node=8 --mpi=pmix --exclusive --gres=gpu:8 --container-image nvcr.io/nvidia/bobber:6.1.1 --container-mounts /mnt/fs:/mnt/fs_under_test /tests/nccl_multi.sh
This makes getting these images up on NGC in the future much more attractive to me, so that this is as simple as possible to use.
I created a slurm-support branch which includes these changes, but as a preview, we can do our DALI tests in two stages:
The first stage is to generate the dataset using Imageinary. This will look similar to the top half of the existing dali_multi.sh script. It should run as a single process on a single node so that all clients can access the shared dataset. One note to improve performance: allocate a large number of CPUs for the task since Imageinary is multithreaded, though the proper number should be read from the system resources to account for more flexible client setups (see the launch sketch after the script below).
#!/bin/bash
# SPDX-License-Identifier: MIT
if [ "x$GPUS" = "x" ]; then
GPUS=8
fi
GPUS_ZERO_BASE=$(($GPUS-1))
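# Create the output directory layout for images, TFRecords, and TFRecord indexes at both resolutions.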
mkdir -p /mnt/fs_under_test/imageinary_data/3840x2160/file_read_pipeline_images/images
mkdir -p /mnt/fs_under_test/imageinary_data/800x600/file_read_pipeline_images/images
mkdir -p /mnt/fs_under_test/imageinary_data/3840x2160/tfrecord_pipeline
mkdir -p /mnt/fs_under_test/imageinary_data/800x600/tfrecord_pipeline
mkdir -p /mnt/fs_under_test/imageinary_data/3840x2160/tfrecord_pipeline.idx
mkdir -p /mnt/fs_under_test/imageinary_data/800x600/tfrecord_pipeline.idx
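# Generate 1000 JPEGs per GPU at each resolution, then pack them into per-GPU TFRecord files.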
imagine create-images --path /mnt/fs_under_test/imageinary_data/3840x2160/file_read_pipeline_images/images --name 4k_image_ --width 3840 --height 2160 --count $(($GPUS*1000)) --image_format jpg --size
imagine create-images --path /mnt/fs_under_test/imageinary_data/800x600/file_read_pipeline_images/images --name small_image_ --width 800 --height 600 --count $(($GPUS*1000)) --image_format jpg --size
imagine create-tfrecords --source_path /mnt/fs_under_test/imageinary_data/3840x2160/file_read_pipeline_images/images --dest_path /mnt/fs_under_test/imageinary_data/3840x2160/tfrecord_pipeline --name tfrecord- --img_per_file 1000
imagine create-tfrecords --source_path /mnt/fs_under_test/imageinary_data/800x600/file_read_pipeline_images/images --dest_path /mnt/fs_under_test/imageinary_data/800x600/tfrecord_pipeline --name tfrecord- --img_per_file 1000
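# Build a DALI index for each TFRecord file so the TFRecord pipelines can read them.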
for i in $(seq 0 $GPUS_ZERO_BASE); do /dali/tools/tfrecord2idx /mnt/fs_under_test/imageinary_data/3840x2160/tfrecord_pipeline/tfrecord-$i /mnt/fs_under_test/imageinary_data/3840x2160/tfrecord_pipeline.idx/tfrecord-$i; done
for i in $(seq 0 $GPUS_ZERO_BASE); do /dali/tools/tfrecord2idx /mnt/fs_under_test/imageinary_data/800x600/tfrecord_pipeline/tfrecord-$i /mnt/fs_under_test/imageinary_data/800x600/tfrecord_pipeline.idx/tfrecord-$i; done
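As a rough sketch of launching that first stage (assuming the generation script above is saved in the image as /tests/dali_datagen.sh; the CPU count and image path are also assumptions), a single task on a single node could be given most of the node's cores:
# One task on one node; Imageinary is multithreaded, so give it plenty of cores.
# Ideally the CPU count would be derived from the target node rather than hard-coded.
srun -N1 -n1 --cpus-per-task=64 --exclusive \
    --container-image /tmp/nvidia+bobber+6.1.4.dev.sqsh \
    --container-mounts /mnt/fs:/mnt/fs_under_test \
    /tests/dali_datagen.sh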
Next, the actual DALI tests should be called similarly to the existing code in dali_multi.sh, but replacing mpirun with Enroot/Pyxis. The trick is to have this wait until the dataset generation above has completed. I haven't looked into it yet, but I assume this is possible with an sbatch script or similar (see the sketch after the test script below). This new method would also eliminate the need for the call_dali.sh script, as we can do that directly with Enroot/Pyxis.
#!/bin/bash
# SPDX-License-Identifier: MIT
if [ "x$GPUS" = "x" ]; then
GPUS=8
fi
if [ "x$BATCH_SIZE_SM" = "x" ]; then
BATCH_SIZE_SM=150
fi
if [ "x$BATCH_SIZE_LG" = "x" ]; then
BATCH_SIZE_LG=150
fi
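# Run the DALI file-read and TFRecord pipelines at both image sizes, dropping the page cache between runs.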
python3 /dali/dali/test/python/test_RN50_data_pipeline.py -b $BATCH_SIZE_SM --epochs=11 -g $GPUS --remove_default_pipeline_paths --file_read_pipeline_images /mnt/fs_under_test/imageinary_data/800x600/file_read_pipeline_images
sysctl vm.drop_caches=3
python3 /dali/dali/test/python/test_RN50_data_pipeline.py -b $BATCH_SIZE_LG --epochs=11 -g $GPUS --remove_default_pipeline_paths --file_read_pipeline_images /mnt/fs_under_test/imageinary_data/3840x2160/file_read_pipeline_images
sysctl vm.drop_caches=3
python3 /dali/dali/test/python/test_RN50_data_pipeline.py -b $BATCH_SIZE_SM --epochs=11 -g $GPUS --remove_default_pipeline_paths --tfrecord_pipeline_paths "/mnt/fs_under_test/imageinary_data/800x600/tfrecord_pipeline/tfrecord-*"
sysctl vm.drop_caches=3
python3 /dali/dali/test/python/test_RN50_data_pipeline.py -b $BATCH_SIZE_LG --epochs=11 -g $GPUS --remove_default_pipeline_paths --tfrecord_pipeline_paths "/mnt/fs_under_test/imageinary_data/3840x2160/tfrecord_pipeline/tfrecord-*"
sysctl vm.drop_caches=3
rm -r /mnt/fs_under_test/imageinary_data
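As mentioned above, one way to make the test stage wait for the dataset generation is SLURM job dependencies. The following is my own rough sketch (assuming the scripts above are saved in the image as /tests/dali_datagen.sh and /tests/dali_run.sh, and that two client nodes are under test), wrapping each stage's srun in an sbatch submission so the second job only starts once the first finishes successfully:
#!/bin/bash
IMAGE=/tmp/nvidia+bobber+6.1.4.dev.sqsh
MOUNTS=/mnt/fs:/mnt/fs_under_test

# Stage 1: generate the shared dataset once, with a single task on a single node.
gen_job=$(sbatch --parsable -N1 --exclusive --wrap \
    "srun -n1 --container-image $IMAGE --container-mounts $MOUNTS /tests/dali_datagen.sh")

# Stage 2: run the DALI tests across the clients only after stage 1 completes successfully.
sbatch --dependency=afterok:$gen_job -N2 --ntasks-per-node=1 --gres=gpu:8 --exclusive --wrap \
    "srun --container-image $IMAGE --container-mounts $MOUNTS /tests/dali_run.sh"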
Still need to test this out more fully when I have access to resources again, but it is promising in my first rounds of tests. I just need to figure out the following:
The Bobber containers use SSH keys that are generated during the image build process to communicate for multi-node. This forces the user to build an image on one machine, save the image as a local tarball, then copy and load the saved image on all other machines to ensure they use the same SSH key to communicate seamlessly. Given the size of the Bobber images, this can be a long, tedious process.
I propose a new method which would make it possible to install the image on all machines and dynamically generate an SSH key that can be shared across them. By creating a simple bash script that generates an SSH key at runtime, copies it to all hosts, then copies that key into all of the running Bobber containers, a single command (e.g., bobber sync host1,host2,host3,...) could be run to enable multi-node communication. The rough sketch below is by no means complete, but it demonstrates the basic structure necessary to achieve the desired effect.
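As a minimal sketch of that flow (assuming Docker is the runtime on each host, the running container is named bobber, and the key is staged under /tmp; all of these are assumptions on my part):
#!/bin/bash
# Rough sketch of a "bobber sync" helper: generate one SSH key pair at runtime,
# push it to every host, then copy it into the running Bobber container on each.
set -e

HOSTS=$1                      # comma-separated list, e.g. host1,host2,host3
KEY=/tmp/bobber_id_rsa        # assumed staging path for the generated key
CONTAINER=bobber              # assumed name of the running Bobber container

# Generate a single passphrase-less key pair for all nodes to share.
ssh-keygen -t rsa -N "" -f $KEY

for host in ${HOSTS//,/ }; do
    # Stage the key pair on the host, then copy it into the container and
    # authorize it so every container trusts every other container.
    scp $KEY $KEY.pub $host:/tmp/
    ssh $host "docker cp $KEY $CONTAINER:/root/.ssh/id_rsa && \
               docker cp $KEY.pub $CONTAINER:/root/.ssh/id_rsa.pub && \
               docker exec $CONTAINER bash -c 'cat /root/.ssh/id_rsa.pub >> /root/.ssh/authorized_keys'"
done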