aws-samples / awsome-distributed-training

Collection of best practices, reference architectures, model training examples and utilities to train large models on AWS.
MIT No Attribution
134 stars 57 forks source link

NCCL Tests for AWS ParallelCluster #328

Closed sean-smith closed 1 month ago

sean-smith commented 1 month ago

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

sean-smith commented 1 month ago

No. We are not going to have a nccl tests per single service.

Either we changed to have 2 scripts for slurm: 1\ ami base and 2\ container based or no change at all.

It is way too much to maintain minimalistic variant of nccl script that will ultimately run on slurm.

Ok, I figured you'd say that so I included some sed commands in the NCCL test script so we can use the nccl-tests-deep-learning-ami.sh and just change the path to the all_reduce binary. Ideally I'll just parameterize this.