awslabs / data-on-eks

DoEKS is a tool to build, deploy and scale Data & ML Platforms on Amazon EKS
https://awslabs.github.io/data-on-eks/
Apache License 2.0
550 stars, 180 forks

feat: New Gen AI pattern - Llama2 Distributed Pre-training on Trn1 with RayTrain and KubeRay Operator #536

Closed: vara-bonthu closed this pull request 1 month ago

vara-bonthu commented 1 month ago

…rain and KubeRay

What does this PR do?


- Adds a new pattern for Llama2 Distributed Pre-training on Trn1 with RayTrain and KubeRay Operator.

Motivation

- Provides a solution for distributed pre-training of Llama2 on AWS Trainium (Trn1) instances, using RayTrain and the KubeRay Operator to run the training workflow at scale.


vara-bonthu commented 1 month ago

@5cp, please review the PR. Thanks.