Azure / azurehpc

This repository provides easy automation scripts for building an HPC environment in Azure. It also includes examples for building an end-to-end environment and running some of the key HPC benchmarks and applications.
MIT License
124 stars 66 forks

fairseq_moe_docker_slurm (Remove Slurm pinning) #683

Closed. garvct closed this 2 years ago

garvct commented 2 years ago

Removed Slurm pinning. We are using torch.distributed.launch with Slurm (srun), starting 1 process per node. The Slurm pinning is set up for 8 Slurm tasks per node, so it is suboptimal for a single Slurm (srun) process.
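
For readers unfamiliar with this launch pattern, here is a minimal sketch of what "1 srun process per node" might look like. It is not taken from the PR: the node count, the 8 workers per node, the rendezvous host and port, and the `train.py` script name are all hypothetical placeholders. The script only builds and prints the command (a dry run), so it can be inspected without a Slurm cluster.

```shell
#!/usr/bin/env bash
set -euo pipefail

# All values below are illustrative assumptions, not settings from the PR.
NODES=2                  # hypothetical node count
PROCS_PER_NODE=8         # assumption: one worker per GPU, 8 GPUs per node
MASTER_ADDR="node0"      # hypothetical rendezvous host
MASTER_PORT=29500        # hypothetical rendezvous port
TRAIN_SCRIPT="train.py"  # hypothetical training entry point

# Command run once per node. The escaped \${SLURM_NODEID} stays literal here
# so that it is expanded later by the shell that srun starts on each node.
INNER="python -m torch.distributed.launch --nproc_per_node=${PROCS_PER_NODE} --nnodes=${NODES} --node_rank=\${SLURM_NODEID} --master_addr=${MASTER_ADDR} --master_port=${MASTER_PORT} ${TRAIN_SCRIPT}"

# One Slurm task per node; torch.distributed.launch then spawns the workers.
# No Slurm CPU-binding options are passed, since a mask laid out for 8 tasks
# would not match a single launcher task.
LAUNCH_CMD="srun --nodes=${NODES} --ntasks-per-node=1 bash -c '${INNER}'"

# Dry run: print the command instead of executing it.
echo "${LAUNCH_CMD}"
```

In this layout Slurm sees only one task per node, so CPU placement for the individual workers is left to the launcher and the training processes rather than to Slurm's per-task binding.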

yosoyjay commented 2 years ago

Looks good to me. @garvct did you test the impact on performance by chance?