Azure / azurehpc

This repository provides easy automation scripts for building an HPC environment in Azure. It also includes examples for building an end-to-end environment and running some of the key HPC benchmarks and applications.
MIT License

Add test repeats in NCCL allreduce and NCCL allreduce loopback #690

Closed vanzod closed 2 years ago

vanzod commented 2 years ago

On healthy nodes we observed some variability over time that may cause false test failures. This change mitigates the issue by repeating the test multiple times. The test exits successfully at the first occurrence of a bandwidth above the designated threshold. If a run reports a bandwidth below the threshold, the test is repeated up to the configured maximum number of times (default 1). The test fails only if all runs report bandwidths below the designated threshold.
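The repeat behavior described above can be sketched roughly as follows. This is a minimal illustration and not the code from this PR: the function name `run_with_repeats`, its parameters, and the `measure_bandwidth` callable are hypothetical stand-ins for however the actual scripts launch the NCCL allreduce test and parse its bandwidth. Treating a bandwidth exactly equal to the threshold as a pass is also an assumption.

```python
from typing import Callable, List, Tuple


def run_with_repeats(
    measure_bandwidth: Callable[[], float],  # hypothetical: runs one test, returns its bandwidth
    threshold: float,
    max_repeats: int = 1,                    # default of 1 matches the PR description
) -> Tuple[bool, List[float]]:
    """Repeat a bandwidth test until one run reaches the threshold.

    Returns (passed, observed), where observed lists the bandwidth of
    every run that was actually executed.
    """
    if max_repeats < 1:
        raise ValueError("max_repeats must be at least 1")

    observed: List[float] = []
    for _ in range(max_repeats):
        bandwidth = measure_bandwidth()
        observed.append(bandwidth)
        # Assumption: a value equal to the threshold counts as a pass.
        if bandwidth >= threshold:
            return True, observed  # stop at the first passing run

    # Every run fell below the threshold, so the test fails.
    return False, observed


if __name__ == "__main__":
    # Made-up sample values standing in for real measurements.
    samples = iter([180.0, 175.0, 190.0])
    passed, observed = run_with_repeats(lambda: next(samples), threshold=185.0, max_repeats=3)
    print("passed:", passed, "runs:", observed)
```

Stopping at the first passing run keeps the check cheap on healthy nodes, while a node that is consistently slow still fails after the maximum number of repeats.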