NVIDIA / Bobber

Containerized testing of system components that impact AI workload performance
MIT License

Add SLURM support for multi-node tests #65

Open roclark opened 3 years ago

roclark commented 3 years ago

To make it easier to run on large clusters, Bobber should be able to run on SLURM clusters that have Pyxis and Enroot installed. This would remove the need for mpirun and for SSH keys and daemons inside the containers, so tests could run without copying images between nodes or synchronizing SSH keys.
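As a rough illustration of the idea (not the code in this PR), a launcher could hand the multi-node run to `srun` and let Pyxis pull and start the container on each node. The sketch below only builds the command line. The helper name `build_srun_command`, the image name, and the test binary path are hypothetical, and `--mpi=pmix` assumes a SLURM build with PMIx support.

```python
import shlex


def build_srun_command(image, nodes, tasks_per_node, test_cmd, mounts=None):
    """Build an srun argument list that runs test_cmd in a container via Pyxis.

    Hypothetical helper for illustration only; not part of Bobber.
    """
    if nodes < 1 or tasks_per_node < 1:
        raise ValueError("nodes and tasks_per_node must be positive")

    cmd = [
        "srun",
        f"--nodes={nodes}",
        f"--ntasks-per-node={tasks_per_node}",
        "--mpi=pmix",  # assumption: SLURM was built with PMIx support
        f"--container-image={image}",  # Pyxis flag; Enroot imports the image
    ]
    if mounts:
        # Pyxis takes a comma-separated list of SRC:DST bind mounts.
        cmd.append("--container-mounts=" + ",".join(mounts))
    cmd.extend(test_cmd)
    return cmd


if __name__ == "__main__":
    cmd = build_srun_command(
        image="example/bobber:latest",  # hypothetical image name
        nodes=2,
        tasks_per_node=8,
        # Hypothetical path to the nccl-tests binary inside the image.
        test_cmd=["all_reduce_perf", "-b", "8", "-e", "128M", "-f", "2", "-g", "1"],
    )
    # Print the command rather than running it, so it can be inspected first.
    print(shlex.join(cmd))
```

Because SLURM starts one task per rank on every allocated node, nothing inside the container needs an SSH daemon or shared keys.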

Closes #1

Signed-off-by: Robert Clark <roclark@nvidia.com>

roclark commented 3 years ago

This is currently a draft based on the ongoing discussion in #1. At this point, the NCCL tests should be fully functional using the Python wheel. As I see it, the following items are still required: