NVIDIA / Bobber

Containerized testing of system components that impact AI workload performance

Update mechanism for synchronizing SSH keys #1

Open · roclark opened this issue 3 years ago

roclark commented 3 years ago

The Bobber containers use SSH keys that are generated during the image build process to communicate during multi-node tests. This forces the user to build an image on one machine, save the image as a local tarball, then copy and load the saved image on all other machines so they all share the same SSH key and can communicate seamlessly. Given the size of the Bobber images, this can be a long, tedious process.
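
For context, the current per-host flow is roughly the following (the image tag and host name here are placeholders for illustration):

# On the build machine: export the freshly built image to a tarball.
docker save -o /tmp/bobber.tar nvidia/bobber:6.1.4.dev
# For every other machine: copy the tarball over and load it.
scp /tmp/bobber.tar host2:/tmp/
ssh host2 docker load -i /tmp/bobber.tar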

I propose a new method which would make it possible to install the image on all machines and dynamically generate an SSH key that can be transferred to them. By creating a simple bash script that creates an SSH key at runtime, copies it to all hosts, then copies that key into each of the running Bobber containers, a single command (i.e. bobber sync host1,host2,host3,...) could be run to support multi-node communication. The following rough pseudo-code is an example:

# Generate the key.
mkdir -p /tmp/bobber
ssh-keygen -t rsa -f /tmp/bobber/id_rsa ...
# Loop over all hosts.
for host in "${hosts[@]}"; do
  # Copy the key pair to the remote host. These commands will ask for the remote password.
  ssh "$host" mkdir -p /tmp/bobber
  scp /tmp/bobber/id_rsa* "$host":/tmp/bobber/
  # Copy the public key on each host into the running Bobber container.
  # This will also ask for the remote password.
  ssh "$host" docker cp /tmp/bobber/id_rsa.pub bobber:/root/.ssh/id_rsa.pub
done

The above example is by no means complete, but demonstrates the basic structure necessary to achieve the desired effect.
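
To the incompleteness point: for the containers to actually trust each other, the shared private key also has to be installed and the public key appended to authorized_keys inside every container. A rough sketch of those remaining steps, assuming the container name (bobber) and paths from the example above:

# Hypothetical follow-up steps inside the loop above, run against each $host.
ssh "$host" docker cp /tmp/bobber/id_rsa bobber:/root/.ssh/id_rsa
# Authorize the shared public key so every node accepts connections from every other node.
ssh "$host" "docker exec bobber bash -c 'cat /root/.ssh/id_rsa.pub >> /root/.ssh/authorized_keys && chmod 600 /root/.ssh/id_rsa'"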

roclark commented 3 years ago

I've been doing some additional digging into this, and am leaning more towards the Pyxis/Enroot route. With that setup, we can bypass mpirun for our tests while still getting the expected performance. This would also remove our need to install an SSH server plus keys inside of the image, making it easy to scale out. For safekeeping, here's what I have running so far on a SLURM cluster with Pyxis/Enroot.

Update the nccl_multi.sh script by replacing the mpirun ... command with:

export NCCL_IB_HCA=$NCCL_IB_HCAS && \
  export NCCL_IB_TC=$NCCL_TC && \
  export NCCL_IB_GID_INDEX=$COMPUTE_GID && \
  export NCCL_IB_CUDA_SUPPORT=1 && \
  /nccl-tests/build/all_reduce_perf -b 8 -e ${NCCL_MAX}G -f 2

Build a new Docker image based on this and create an Enroot image of the new Docker image (one way to do the import is sketched below). Next, from the SLURM controller, run:

srun -N1 --ntasks-per-node=8 --mpi=pmix --exclusive --gres=gpu:8 --container-image /tmp/nvidia+bobber+6.1.4.dev.sqsh --container-mounts /mnt/fs:/mnt/fs_under_test /tests/nccl_multi.sh

This yields nearly identical results to the existing 1-node tests without using MPI or the SSH server.
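
For reference, one way to create the .sqsh file used above from the local Docker image is enroot import; the Docker tag below is inferred from the squashfs filename, so treat it as an assumption:

enroot import --output /tmp/nvidia+bobber+6.1.4.dev.sqsh dockerd://nvidia/bobber:6.1.4.dev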

With that being said, this raises the following questions/tasks:

  1. Does this work for multi-node as well?
  2. Can all of the other Bobber tests be updated to work this way?
  3. Will users still be able to run a single-node test without a SLURM cluster?
  4. Do we want to keep the current implementation of running multi-node Bobber tests?
  5. Should we wrap these srun commands into Bobber so users can do something like bobber run-all-multi ... to run this in a SLURM setup without specifying the extra pieces?
  6. How can we automate the process of importing the Enroot image onto test machines?
  7. Can the image be stored on NGC since it won't rely on SSH daemons and pre-baked keys?

@joehandzik, @fredvx, any thoughts?

joehandzik commented 3 years ago
  1. Excellent question, but I kind of assume no...even when running multi-node via something like MLPerf, the batch command uses srun or mpirun commands. So I don't see how multi-node would work like this in any coordinated fashion. This just works because it's a single-node test.
  2. Probably not, for the reasons in (1).
  3. Probably so - I bet what you're showing works fine without Slurm; it's just a matter of specifying the EVs on the command line.
  4. We need to figure out what looks cleanest. May refactor the commands along the lines of what you did with DALI - a core command that we call with an outer shell.
  5. I think so, personally, with a toggle that lets you say "slurm" or "docker" or "other-future-launch-style".
  6. I was thinking sbatch integration, similar to how MLPerf tests are launched.
  7. I say yes, at least optionally, but it would be nice to support a method to target a non-uploaded container too for dev or prerelease testing.
roclark commented 3 years ago

Updating here as I made a bit of progress since my last comment. I actually found that this does indeed work for multi-node tests, with negligible differences from the existing multi-node implementation. This also gives me reasonable hope that we can do something similar with DALI, and we might be able to get fio/mdtest working as well.
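
For reference, the multi-node invocation is presumably just the earlier single-node srun with a larger node count, e.g. for two nodes (untested beyond my setup):

srun -N2 --ntasks-per-node=8 --mpi=pmix --exclusive --gres=gpu:8 --container-image /tmp/nvidia+bobber+6.1.4.dev.sqsh --container-mounts /mnt/fs:/mnt/fs_under_test /tests/nccl_multi.sh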

On your point about the non-NGC container, we should be able to do this if the local version of Bobber is different from or newer than the latest version in NGC. If it doesn't find anything in NGC, it should ideally build a local version if that hasn't been done already. At least, that's my hope. 😄
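
A rough one-liner for that fallback, assuming the local tag mirrors the (not yet existing) NGC path and VERSION holds the Bobber version being run:

# Prefer the published image; build locally only if the pull fails.
docker pull nvcr.io/nvidia/bobber:${VERSION} || docker build -t nvcr.io/nvidia/bobber:${VERSION} .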

roclark commented 3 years ago

I've found that putting the Bobber images on a publicly accessible cloud registry, like NGC, makes this much easier to manage since we can avoid accessing individual hosts altogether. Assuming the image is available on NGC, the above srun command for NCCL can be updated to the following (note that this is just an example; the image is NOT available on NGC at the moment, so this will not work as listed):

srun -N1 --ntasks-per-node=8 --mpi=pmix --exclusive --gres=gpu:8 --container-image nvcr.io/nvidia/bobber:6.1.1 --container-mounts /mnt/fs:/mnt/fs_under_test /tests/nccl_multi.sh

This makes it much more attractive to get these images up on NGC in the future so that this is as simple as possible to use.

roclark commented 3 years ago

I created a slurm-support branch which includes these changes, but as a preview we can do our DALI tests in two stages:

The first stage is to generate the dataset using Imageinary. This will look similar to the top half of the existing dali_multi.sh script. It should run as a single process on a single node so that all clients can access the shared dataset. One note to improve performance: allocate a large number of CPUs for the task since Imageinary is multithreaded, but the proper count should be read from the system resources to account for more flexible client setups (see the note after the script below).

#!/bin/bash
# SPDX-License-Identifier: MIT
if [ "x$GPUS" = "x" ]; then
    GPUS=8
fi

GPUS_ZERO_BASE=$(($GPUS-1))

mkdir -p /mnt/fs_under_test/imageinary_data/3840x2160/file_read_pipeline_images/images
mkdir -p /mnt/fs_under_test/imageinary_data/800x600/file_read_pipeline_images/images
mkdir -p /mnt/fs_under_test/imageinary_data/3840x2160/tfrecord_pipeline
mkdir -p /mnt/fs_under_test/imageinary_data/800x600/tfrecord_pipeline
mkdir -p /mnt/fs_under_test/imageinary_data/3840x2160/tfrecord_pipeline.idx
mkdir -p /mnt/fs_under_test/imageinary_data/800x600/tfrecord_pipeline.idx

imagine create-images --path /mnt/fs_under_test/imageinary_data/3840x2160/file_read_pipeline_images/images --name 4k_image_ --width 3840 --height 2160 --count $(($GPUS*1000)) --image_format jpg --size
imagine create-images --path /mnt/fs_under_test/imageinary_data/800x600/file_read_pipeline_images/images --name small_image_ --width 800 --height 600 --count $(($GPUS*1000)) --image_format jpg --size

imagine create-tfrecords --source_path /mnt/fs_under_test/imageinary_data/3840x2160/file_read_pipeline_images/images --dest_path /mnt/fs_under_test/imageinary_data/3840x2160/tfrecord_pipeline --name tfrecord- --img_per_file 1000
imagine create-tfrecords --source_path /mnt/fs_under_test/imageinary_data/800x600/file_read_pipeline_images/images --dest_path /mnt/fs_under_test/imageinary_data/800x600/tfrecord_pipeline --name tfrecord- --img_per_file 1000

for i in $(seq 0 $GPUS_ZERO_BASE); do /dali/tools/tfrecord2idx /mnt/fs_under_test/imageinary_data/3840x2160/tfrecord_pipeline/tfrecord-$i /mnt/fs_under_test/imageinary_data/3840x2160/tfrecord_pipeline.idx/tfrecord-$i; done
for i in $(seq 0 $GPUS_ZERO_BASE); do /dali/tools/tfrecord2idx /mnt/fs_under_test/imageinary_data/800x600/tfrecord_pipeline/tfrecord-$i /mnt/fs_under_test/imageinary_data/800x600/tfrecord_pipeline.idx/tfrecord-$i; done
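
On the CPU sizing note above, one option (untested) is to read the count from the SLURM allocation at runtime rather than hard-coding it; how that count is then passed to Imageinary is still an open question (see question 3 further down):

# Hypothetical: use whatever SLURM allocated to this task, or the whole node
# when the variable is unset (e.g. when running outside of SLURM).
CPUS=${SLURM_CPUS_PER_TASK:-$(nproc)}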

Next, the actual DALI tests should be called similarly to the existing code in dali_multi.sh, but with mpirun replaced by Enroot/Pyxis. The trick is to have this wait until the dataset generation above has completed. I haven't looked into it yet, but I assume this is possible with an sbatch script or similar (a rough sketch follows after the script below). This new method would also eliminate the need for the call_dali.sh script since we can do that directly with Enroot/Pyxis.

#!/bin/bash
# SPDX-License-Identifier: MIT

if [ "x$GPUS" = "x" ]; then
    GPUS=8
fi

if [ "x$BATCH_SIZE_SM" = "x" ]; then
    BATCH_SIZE_SM=150
fi

if [ "x$BATCH_SIZE_LG" = "x" ]; then
    BATCH_SIZE_LG=150
fi

python3 /dali/dali/test/python/test_RN50_data_pipeline.py -b $BATCH_SIZE_SM --epochs=11 -g $GPUS --remove_default_pipeline_paths --file_read_pipeline_images /mnt/fs_under_test/imageinary_data/800x600/file_read_pipeline_images
sysctl vm.drop_caches=3

python3 /dali/dali/test/python/test_RN50_data_pipeline.py -b $BATCH_SIZE_LG --epochs=11 -g $GPUS --remove_default_pipeline_paths --file_read_pipeline_images /mnt/fs_under_test/imageinary_data/3840x2160/file_read_pipeline_images
sysctl vm.drop_caches=3

python3 /dali/dali/test/python/test_RN50_data_pipeline.py -b $BATCH_SIZE_SM --epochs=11 -g $GPUS --remove_default_pipeline_paths --tfrecord_pipeline_paths "/mnt/fs_under_test/imageinary_data/800x600/tfrecord_pipeline/tfrecord-*"
sysctl vm.drop_caches=3

python3 /dali/dali/test/python/test_RN50_data_pipeline.py -b $BATCH_SIZE_LG --epochs=11 -g $GPUS --remove_default_pipeline_paths --tfrecord_pipeline_paths "/mnt/fs_under_test/imageinary_data/3840x2160/tfrecord_pipeline/tfrecord-*"
sysctl vm.drop_caches=3

rm -r /mnt/fs_under_test/imageinary_data
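
To the sequencing point above, here's a minimal, untested sbatch sketch; the script paths inside the image are hypothetical, and it relies on the fact that a blocking srun step doesn't return until it finishes, so the DALI step only starts once the dataset exists:

#!/bin/bash
#SBATCH --nodes=2
#SBATCH --exclusive
#SBATCH --gres=gpu:8

IMAGE=/tmp/nvidia+bobber+6.1.4.dev.sqsh
MOUNTS=/mnt/fs:/mnt/fs_under_test

# Stage 1: generate the shared dataset with a single task on a single node.
srun -N1 -n1 --container-image $IMAGE --container-mounts $MOUNTS /tests/dali_datagen.sh

# Stage 2: run the DALI tests on every node; this step only starts after
# stage 1 has returned, so the dataset is guaranteed to be in place.
srun -N2 --ntasks-per-node=1 --container-image $IMAGE --container-mounts $MOUNTS /tests/dali_tests.sh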

Still need to test this out more fully when I have access to resources again, but it is promising in my first rounds of tests. I just need to figure out the following:

  1. How does this look for multi-node tests?
  2. How do the results compare to the existing method?
  3. What is the best path to use multiple CPU cores for the Imageinary test?
  4. How can we run the first stage on a single node/task before running the actual DALI tests?