This repository provides automation scripts for building an HPC environment in Azure. It also includes examples that build an end-to-end environment and run some of the key HPC benchmarks and applications.
MIT License
Add test repeats in NCCL allreduce and NCCL allreduce loopback #690
On healthy nodes we observed some variability in measured bandwidth over time, which can cause false test failures. This change mitigates the issue by repeating the test multiple times.
The test exits successfully at the first run that reports bandwidth above the designated threshold.
If a run reports bandwidth below the threshold, the test is repeated, up to the configured maximum number of runs (default 1). The test fails only if every run reports bandwidth below the designated threshold.
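
The repeat logic described above can be sketched roughly as follows. This is an illustrative Python sketch, not the code from this PR: the function name `run_with_repeats`, its parameters, and the `measure` callable are hypothetical, and treating a bandwidth exactly equal to the threshold as a pass is an assumption.

```python
from typing import Callable, List, Tuple


def run_with_repeats(
    measure: Callable[[], float],
    threshold: float,
    max_repeats: int = 1,
) -> Tuple[bool, List[float]]:
    """Run a bandwidth measurement up to max_repeats times.

    Hypothetical helper: `measure` stands in for one NCCL allreduce run
    and returns the measured bandwidth. Returns (passed, measurements).
    """
    if max_repeats < 1:
        raise ValueError("max_repeats must be at least 1")

    measurements: List[float] = []
    for _ in range(max_repeats):
        bandwidth = measure()
        measurements.append(bandwidth)
        # Assumption: a value equal to the threshold counts as a pass.
        if bandwidth >= threshold:
            # Stop at the first run that meets the threshold.
            return True, measurements

    # Every run came in below the threshold.
    return False, measurements
```

With the default of one run the behavior matches a single-shot test. Raising `max_repeats` only adds runs when earlier ones fall short, so healthy nodes that pass on the first run pay no extra cost.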