aws-samples / awsome-distributed-training

Collection of best practices, reference architectures, model training examples and utilities to train large models on AWS.
MIT No Attribution
177 stars 74 forks

Add fsdp and smpv2 example for EKS #364

Closed by iankouls-aws 3 months ago

iankouls-aws commented 3 months ago

Issue #, if available:

Description of changes:

The FSDP example launches Fully Sharded Data Parallel (FSDP) distributed training of Llama 2 7B on the Hugging Face C4 dataset. The smpv2 example launches SageMaker ModelParallel distributed training of Llama 2 7B on HyperPod EKS.

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

iankouls-aws commented 3 months ago

@aruncs2005, could you please review the smpv2 example in this PR?

aruncs2005 commented 3 months ago

Looks good to me.

iankouls-aws commented 3 months ago

@shimomut pls review