aws-samples / awsome-distributed-training

Collection of best practices, reference architectures, model training examples and utilities to train large models on AWS.
MIT No Attribution
177 stars 74 forks

Nsight #343

Closed: awsankur closed this 3 months ago

awsankur commented 4 months ago

Issue #, if available:

Description of changes:

This PR adds:

  1. Steps to install Nsight
  2. Examples for NCCL Tests, Nemotron-15B, and PyTorch FSDP on a Slurm cluster
  3. Steps to set up Nsight on EKS, plus a PyTorch FSDP training example with Llama2 on EKS
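
For reviewers unfamiliar with Nsight Systems, here is a rough sketch of how a training command might be wrapped with `nsys profile`. It is illustrative only and is not taken from the scripts in this PR: the helper name `build_nsys_command`, the script name `train.py`, and the report name `profile_report` are all hypothetical placeholders. The flags used (`--output`, `--trace`, `--force-overwrite`) are standard `nsys profile` options.

```python
import shlex


def build_nsys_command(train_cmd, output="profile_report", traces=("cuda", "nvtx")):
    """Return an argv list that runs `train_cmd` under `nsys profile`.

    `train_cmd` is the training command as a list of strings.
    `output` and `traces` are hypothetical defaults chosen for illustration.
    """
    return [
        "nsys",
        "profile",
        f"--output={output}",            # base name of the generated report file
        f"--trace={','.join(traces)}",   # APIs to trace, comma separated
        "--force-overwrite=true",        # replace an existing report with the same name
        *train_cmd,
    ]


# Hypothetical training entry point; substitute the real launch command.
cmd = build_nsys_command(["python", "train.py"])
print(shlex.join(cmd))
```

On a Slurm cluster the resulting command would typically be placed after `srun` inside the job script, so that each task is profiled separately; the exact launch lines used in this PR may differ.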

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.