Azure / azurehpc

This repository provides easy automation scripts for building an HPC environment in Azure. It also includes examples that build an end-to-end environment and run some of the key HPC benchmarks and applications.
MIT License

run_nccl_tests_ncv4 (Remove NCCL_GRAPH_FILE) #730

Closed garvct closed 1 year ago

garvct commented 1 year ago

Setting NCCL_GRAPH_FILE causes NCCL to fail on NC96ads_A100_v4 when run on fewer than 4 GPUs. It also causes the NCCL tests to run significantly slower on one NC48ads_A100_v4 (~220 GB/s without NCCL_GRAPH_FILE set, ~58 GB/s with it set).
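
For anyone launching the tests from their own wrapper rather than from this script, the same effect can be sketched as follows: make sure NCCL_GRAPH_FILE is absent from the environment handed to the test process. This is only an illustration, not the change made in this PR; the helper name `env_without_graph_file` and the example file path are hypothetical.

```python
import os


def env_without_graph_file(base_env=None):
    """Return a copy of the environment with NCCL_GRAPH_FILE removed.

    Hypothetical helper for illustration only. With the variable unset,
    NCCL is left to detect the topology itself instead of reading it
    from a graph file.
    """
    env = dict(os.environ if base_env is None else base_env)
    # pop() with a default does nothing if the variable was never set.
    env.pop("NCCL_GRAPH_FILE", None)
    return env


if __name__ == "__main__":
    # Example input; the graph file path here is made up.
    sample = {"NCCL_GRAPH_FILE": "/path/to/graph.xml", "NCCL_DEBUG": "INFO"}
    cleaned = env_without_graph_file(sample)
    print("NCCL_GRAPH_FILE" in cleaned)
    # The cleaned mapping could then be passed as env= to subprocess.run()
    # when starting the NCCL tests.
```

Returning a copy rather than editing `os.environ` in place keeps the change scoped to the launched process.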