roclark closed 3 years ago
Now properly handling the following scenarios:
$ bobber run-nccl test localhost
Error: Could not communicate with the Docker daemon.
Ensure Docker is running with "systemctl start docker"
$ bobber run-nccl test localhost
Bobber container not running. Launch a container with "bobber cast" prior to running any tests.
$ bobber cast /raid
NVIDIA container runtime not found. Ensure the latest nvidia-docker libraries and NVIDIA drivers are installed.
$ bobber run-nccl test localhost
Bobber container version mismatch.
Kill the running Bobber container with "docker kill bobber" and re-cast a new container with "bobber cast" prior to running any tests.
The Docker module needs extra error handling to help point users in the right direction when common errors pop up, like missing containers, version mismatches, and communication errors.
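A minimal sketch of how these checks could be ordered so the user sees the most fundamental problem first. The `DockerStatus` class, the `diagnose` function, and the `expected_version` parameter are hypothetical names used only for illustration and are not taken from the Bobber code. The messages are the ones shown in the scenarios above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DockerStatus:
    """Hypothetical snapshot of the environment; not part of Bobber."""
    daemon_reachable: bool
    container_running: bool
    container_version: Optional[str]


def diagnose(status: DockerStatus, expected_version: str) -> Optional[str]:
    """Return a hint for the first problem found, or None if all is well."""
    # Check the daemon first: nothing else can be queried without it.
    if not status.daemon_reachable:
        return ('Error: Could not communicate with the Docker daemon.\n'
                'Ensure Docker is running with "systemctl start docker"')
    # A version comparison only makes sense once a container exists.
    if not status.container_running:
        return ('Bobber container not running. Launch a container with '
                '"bobber cast" prior to running any tests.')
    if status.container_version != expected_version:
        return ('Bobber container version mismatch.\n'
                'Kill the running Bobber container with "docker kill bobber" '
                'and re-cast a new container with "bobber cast" prior to '
                'running any tests.')
    return None
```

For example, `diagnose(DockerStatus(True, False, None), "1.0.0")` returns the "container not running" hint, while a status whose version matches returns `None`.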
Additionally, the exit codes need to be updated to non-negative integers in the range 0-127 so that the system reports them correctly. See the Python docs for sys.exit for more info.
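As a rough illustration of the exit-code constraint: on POSIX systems the exit status is truncated to 8 bits, so a call such as `sys.exit(-1)` is seen by the shell as 255 rather than -1. The sketch below keeps every code inside 0-127. The `ExitCode` names and values, `is_portable`, and `fail` are assumptions made for this example and do not reflect the codes Bobber actually uses.

```python
import sys
from enum import IntEnum


class ExitCode(IntEnum):
    """Hypothetical exit codes; names and values are illustrative only."""
    SUCCESS = 0
    DOCKER_COMMUNICATION_ERROR = 10
    CONTAINER_NOT_RUNNING = 11
    NVIDIA_RUNTIME_NOT_FOUND = 12
    CONTAINER_VERSION_MISMATCH = 13


def is_portable(code: int) -> bool:
    """True if the code lies in the 0-127 range that sys.exit documents."""
    return 0 <= code <= 127


def fail(code: ExitCode, message: str) -> None:
    """Print a hint to stderr and exit with a portable status code."""
    if not is_portable(code):
        raise ValueError(f"exit code {int(code)} is outside 0-127")
    print(message, file=sys.stderr)
    sys.exit(int(code))
```

A caller could then write `fail(ExitCode.CONTAINER_NOT_RUNNING, "Bobber container not running.")`, and a shell script wrapping the command could branch on `$?` without worrying about wrapped or negative values.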
Closes #47
Signed-Off-By: Robert Clark <roclark@nvidia.com>