…rain and KubeRay

What does this PR do?
🛑 Please open an issue first to discuss any significant work and flesh out details/direction - we would hate for your time to be wasted. Consult the CONTRIBUTING guide for submitting pull-requests.
- Adds a new pattern for Llama2 Distributed Pre-training on Trn1 with RayTrain and KubeRay Operator.
Motivation
- To provide a robust solution for distributed pre-training of Llama2 using AWS Trainium instances, leveraging the capabilities of RayTrain and KubeRay Operator for efficient and scalable training workflows.
More
- [x] Yes, I have tested the PR using my local account setup (Provide any test evidence report under Additional Notes)
- [ ] Mandatory for new blueprints. Yes, I have added an example to support my blueprint PR
- [ ] Mandatory for new blueprints. Yes, I have updated the website/docs or website/blog section for this feature
- [x] Yes, I ran `pre-commit run -a` with this PR. Link for installing pre-commit locally
For Moderators
Additional Notes