Closed sean-smith closed 1 month ago
No. We are not going to have a separate NCCL test per service.
Either we change to two scripts for Slurm, (1) AMI-based and (2) container-based, or we make no change at all.
It is way too much to maintain a minimalistic variant of an NCCL script that will ultimately run on Slurm anyway.
Ok, I figured you'd say that, so I included some sed commands in the NCCL test script so we can use nccl-tests-deep-learning-ami.sh and just change the path to the all_reduce binary. Ideally I'll just parameterize this.
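For reference, a minimal sketch of what that parameterization might look like, so one script can serve both layouts without sed. The names ALL_REDUCE_BINARY, DEFAULT_BINARY and resolve_binary are hypothetical and not taken from the existing scripts, and the default path is only a placeholder built from the prefix and binary name mentioned in this PR.

```shell
#!/usr/bin/env bash
# Sketch only: ALL_REDUCE_BINARY, DEFAULT_BINARY and resolve_binary are
# hypothetical names, not part of the existing scripts.

# Print the caller's override if one is given, otherwise the default path.
resolve_binary() {
    local default_path="$1"
    local override="${2:-}"
    if [ -n "$override" ]; then
        printf '%s\n' "$override"
    else
        printf '%s\n' "$default_path"
    fi
}

# Placeholder default: the tests prefix from this PR plus the binary name
# used in this thread; the real layout under that prefix may differ.
DEFAULT_BINARY="/opt/nccl-tests/all_reduce"

# Use the environment override when it is set and non-empty.
BINARY="$(resolve_binary "$DEFAULT_BINARY" "${ALL_REDUCE_BINARY:-}")"
echo "all_reduce binary: $BINARY"
```

With something like this, the AMI-based and container-based cases would differ only in the value exported as ALL_REDUCE_BINARY before the job is submitted.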
This adds an NCCL test script to accompany the workshop: https://catalog.us-east-1.prod.workshops.aws/workshops/6cbbf337-c498-4c6b-ad4f-99d22e03d8dc/en-US/05-nccl-tests
It assumes NCCL is installed under /opt/nccl and the NCCL tests under /opt/nccl-tests. Doing it this way instead of using a Docker image makes it significantly faster for users to get started.
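Since the script depends on those two install locations, a small pre-flight check could make a mismatch obvious before the job is submitted. This is a sketch only: check_dirs is a hypothetical helper and is not part of the script in this PR.

```shell
#!/usr/bin/env bash
# Sketch only: check_dirs is a hypothetical helper, not part of this PR.

# Return 0 if every given directory exists; otherwise report each
# missing directory on stderr and return 1.
check_dirs() {
    local missing=0
    local dir
    for dir in "$@"; do
        if [ ! -d "$dir" ]; then
            echo "missing directory: $dir" >&2
            missing=1
        fi
    done
    return "$missing"
}

# The two paths the PR description says the script assumes.
if check_dirs /opt/nccl /opt/nccl-tests; then
    echo "NCCL install layout looks as expected"
else
    echo "NCCL install layout differs from what the script assumes" >&2
fi
```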
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.