To make it easier to run on large clusters, Bobber should be able to run on SLURM clusters that have Pyxis and Enroot installed. This removes the need for mpirun and for SSH keys and daemons inside the containers, so tests can run without copying images between nodes or synchronizing SSH keys.
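As a rough illustration of the launch model, here is a minimal sketch of how a test could be started through `srun` with Pyxis instead of `mpirun` over SSH. The helper `build_srun_command`, the image name, the mount path, and the `pmix` MPI plugin are all hypothetical placeholders and not part of this PR; only the `srun` options and the Pyxis `--container-image` / `--container-mounts` flags are standard.

```python
import shlex


def build_srun_command(image, test_command, nodes, tasks_per_node=8,
                       mounts=None, mpi_plugin=None):
    """Build an srun argument list that runs test_command inside a container.

    Hypothetical helper for illustration only; it is not Bobber's API.
    """
    cmd = [
        "srun",
        f"--nodes={nodes}",
        f"--ntasks-per-node={tasks_per_node}",
    ]
    if mpi_plugin:
        # Assumption: the cluster's Slurm build provides this MPI plugin.
        cmd.append(f"--mpi={mpi_plugin}")
    # Pyxis pulls and starts the image on every allocated node, so no image
    # copying or in-container SSH daemon is needed.
    cmd.append(f"--container-image={image}")
    if mounts:
        # Pyxis expects a comma-separated list of SRC:DST mount pairs.
        cmd.append("--container-mounts=" + ",".join(mounts))
    return cmd + list(test_command)


if __name__ == "__main__":
    # The image name and mount path below are placeholders.
    command = build_srun_command(
        image="example/bobber:latest",
        test_command=["all_reduce_perf", "-b", "8", "-e", "128M", "-f", "2", "-g", "1"],
        nodes=2,
        mounts=["/mnt/fs_under_test:/mnt/fs_under_test"],
        mpi_plugin="pmix",
    )
    print(shlex.join(command))
```

Returning an argument list rather than a shell string keeps quoting out of the helper; the list can be passed directly to `subprocess.run` on a login node.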
This is currently a draft based on the ongoing discussion in #1. At this point, the NCCL tests should be fully functional using the Python wheel. As I see it, the following items are still required:
- [x] Add DALI tests
- [ ] Add FIO tests
- [x] Add mdtest
- [ ] Document the installation and usage
- [ ] Update the troubleshooting guide with steps to fix common issues
Closes #1
Signed-off-by: Robert Clark <roclark@nvidia.com>