NVIDIA / Bobber

Containerized testing of system components that impact AI workload performance
MIT License

Augment Docker error handling #48

Closed roclark closed 3 years ago

roclark commented 3 years ago

The Docker module needs extra error handling to help point users in the right direction when common errors pop up, like missing containers, version mismatches, and communication errors.

Additionally, the exit codes need to be changed to non-negative integers in the range 0-127 so that the system reports them correctly. See the Python docs for more info.
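
As a rough illustration of the exit-code change, the sketch below keeps every code inside the 0-127 range. The `ExitCode` names and numeric values, along with the `validate_exit_codes` and `exit_with` helpers, are hypothetical and are not taken from the Bobber source.

```python
import sys
from enum import IntEnum


class ExitCode(IntEnum):
    """Hypothetical exit codes; names and values are illustrative only."""
    SUCCESS = 0
    DOCKER_DAEMON_UNREACHABLE = 10
    CONTAINER_NOT_RUNNING = 11
    NVIDIA_RUNTIME_MISSING = 12
    CONTAINER_VERSION_MISMATCH = 13


def validate_exit_codes(codes=ExitCode):
    """Return True if every code lies in the portable 0-127 range."""
    return all(0 <= int(code) <= 127 for code in codes)


def exit_with(code, message=""):
    """Print an optional message to stderr, then exit with the given code."""
    if message:
        print(message, file=sys.stderr)
    sys.exit(int(code))
```

Keeping the values in one enumeration makes the range easy to check in a unit test. A negative value such as -1 is typically reported by a POSIX shell as 255, which is why small positive codes are safer.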

Closes #47

Signed-Off-By: Robert Clark <roclark@nvidia.com>

roclark commented 3 years ago

Now properly handling the following scenarios:

  1. The Docker daemon is not running:
    $ bobber run-nccl test localhost
    Error: Could not communicate with the Docker daemon.
    Ensure Docker is running with "systemctl start docker"
  2. The Bobber container is not running:
    $ bobber run-nccl test localhost
    Bobber container not running. Launch a container with "bobber cast" prior to running any tests.
  3. The NVIDIA runtime could not be added to the container:
    $ bobber cast /raid
    NVIDIA container runtime not found. Ensure the latest nvidia-docker libraries and NVIDIA drivers are installed.
  4. The Bobber container and application have mismatched versions:
    $ bobber run-nccl test localhost
    Bobber container version mismatch.
    Kill the running Bobber container with "docker kill bobber" and re-cast a new container with "bobber cast" prior to running any tests.
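
The four scenarios above could be wired up roughly as follows. This is a minimal sketch, not the code from this PR: the exception classes, the numeric exit codes, and the `guidance_for` and `run_guarded` helpers are all hypothetical, and only the message text is copied from the output shown above.

```python
import sys


# Hypothetical exception types, one per scenario in the list above.
class DaemonUnreachable(Exception):
    pass


class ContainerNotRunning(Exception):
    pass


class NvidiaRuntimeMissing(Exception):
    pass


class VersionMismatch(Exception):
    pass


# Maps each exception type to an (exit code, user guidance) pair.
# The exit codes are assumptions chosen to stay within 0-127.
GUIDANCE = {
    DaemonUnreachable: (
        10,
        'Error: Could not communicate with the Docker daemon.\n'
        'Ensure Docker is running with "systemctl start docker"'),
    ContainerNotRunning: (
        11,
        'Bobber container not running. Launch a container with '
        '"bobber cast" prior to running any tests.'),
    NvidiaRuntimeMissing: (
        12,
        'NVIDIA container runtime not found. Ensure the latest '
        'nvidia-docker libraries and NVIDIA drivers are installed.'),
    VersionMismatch: (
        13,
        'Bobber container version mismatch.\n'
        'Kill the running Bobber container with "docker kill bobber" and '
        're-cast a new container with "bobber cast" prior to running any '
        'tests.'),
}


def guidance_for(exc):
    """Return (exit_code, message) for a known error, or a generic fallback."""
    for exc_type, (code, message) in GUIDANCE.items():
        if isinstance(exc, exc_type):
            return code, message
    return 1, f"Unexpected error: {exc}"


def run_guarded(action):
    """Run a callable; on failure print guidance and return its exit code."""
    try:
        action()
    except Exception as exc:
        code, message = guidance_for(exc)
        print(message, file=sys.stderr)
        return code
    return 0
```

A caller would pass the function that talks to Docker into `run_guarded` and hand the returned integer to `sys.exit`, so each failure produces both a readable hint and a distinct exit status.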