Reuse the NCCL test Dockerfile for the unit tests.
Test example:
Example of successful test execution:
ok - test_01_device_query
ok - test_02_vector_add
ok - test_03_bandwidth
ok - test_04_bus_grind
ok - test_05_dcgm_diagnostics
# Running tests in gpu_unit_tests/tests/test_sysinfo.sh
ok - test_numa_topo_topo
ok - test_nvidia_gpu_count
ok - test_nvidia_gpu_throttled
ok - test_nvidia_gpu_unused
ok - test_nvidia_persistence_status
ok - test_nvidia_smi_topo
Example of failed test execution (when the GPU count doesn't match our config data):
# Running tests in gpu_unit_tests/tests/test_basic.sh
ok - test_01_device_query
ok - test_02_vector_add
ok - test_03_bandwidth
ok - test_04_bus_grind
ok - test_05_dcgm_diagnostics
# Running tests in gpu_unit_tests/tests/test_sysinfo.sh
ok - test_numa_topo_topo
not ok - test_nvidia_gpu_count
# Unexpected gpu count
# test data value diff:
# --- test_sysinfo.sh.data/p3.2xlarge/gpu_count.txt 2024-07-09 01:28:17.000000000 +0000
# +++ /tmp/test_sysinfo.sh.actual-data.4MA/gpu_count.txt 2024-07-09 01:29:37.278476754 +0000
# @@ -1,2 +1,2 @@
# name, index, pci.bus_id
# -Tesla A100-SXM2-16GB, 0, 00000000:00:1E.0
# +Tesla V100-SXM2-16GB, 0, 00000000:00:1E.0
# common.sh:32:_assert_data()
# common.sh:37:assert_data()
# test_sysinfo.sh:39:test_nvidia_gpu_count()
ok - test_nvidia_gpu_throttled
ok - test_nvidia_gpu_unused
ok - test_nvidia_persistence_status
ok - test_nvidia_smi_topo
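The failure report above suggests that the expected data file is diffed against freshly captured output. A minimal sketch of such a comparison follows. The function name `assert_data_sketch` and its arguments are hypothetical and are not taken from `common.sh`; the real helper may behave differently.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not the real common.sh code): compare an expected
# data file with captured output and report a mismatch as TAP comment lines.
assert_data_sketch() {
    local expected_file="$1" actual_file="$2" message="$3"
    local diff_out
    # diff exits 0 when the files match; any non-zero status is a failure.
    if diff_out=$(diff -u "$expected_file" "$actual_file"); then
        return 0
    fi
    echo "# ${message}"
    echo "# test data value diff:"
    # Prefix every diff line with "# " so it reads as a TAP diagnostic.
    printf '%s\n' "$diff_out" | sed 's/^/# /'
    return 1
}
```

A test function could call it as `assert_data_sketch expected/gpu_count.txt "$actual_dir/gpu_count.txt" "Unexpected gpu count"`, where both paths are placeholders.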
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
Issue #, if available:
Description of changes:
Add GPU unit tests. The tests contain the following:
test_sysinfo.sh :: Validate basic system configuration by comparing it with the expected config data.
10_test_basic_cuda.sh :: Execute trivial CUDA binaries and fail if the CUDA subsystem is not healthy. Uses the demo-suite binaries (https://docs.nvidia.com/cuda/demo-suite/index.html) and DCGM Diagnostics (https://docs.nvidia.com/datacenter/dcgm/latest/user-guide/dcgm-diagnostics.html#run-levels-and-tests). If this test suite fails, the CUDA subsystem is not usable at all. This is usually a side effect of system misconfiguration (driver or fabric manager is not loaded).
test_01_device_query
test_02_vector_add
test_03_bandwidth
test_04_bus_grind
test_05_dcgm_diagnostics
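The basic CUDA checks listed above amount to running a command and failing when it exits non-zero. The sketch below shows one way to wrap that pattern with TAP-style output. `run_check` is a hypothetical name, and the `DEMO_SUITE_DIR` location in the commented wiring is an assumption about where the demo-suite binaries are installed; neither is taken from this PR's scripts.

```shell
#!/usr/bin/env bash
# Hypothetical runner: execute one health-check command and report the
# result in TAP style ("ok - name" / "not ok - name").
run_check() {
    local name="$1"
    shift
    local output
    # Capture stdout and stderr; the command's exit status decides pass/fail.
    if output=$("$@" 2>&1); then
        echo "ok - ${name}"
        return 0
    fi
    echo "not ok - ${name}"
    # Show the captured output as diagnostic comment lines, if there is any.
    if [ -n "$output" ]; then
        printf '%s\n' "$output" | sed 's/^/# /'
    fi
    return 1
}

# Example wiring (assumed install path, adjust for the actual image):
# DEMO_SUITE_DIR=/usr/local/cuda/extras/demo_suite
# run_check test_01_device_query "$DEMO_SUITE_DIR/deviceQuery"
# run_check test_02_vector_add   "$DEMO_SUITE_DIR/vectorAdd"
```

Keeping the command output hidden on success and printing it only on failure keeps a passing run as short as the successful example in this description.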