awslabs / data-on-eks

DoEKS is a tool to build, deploy and scale Data & ML Platforms on Amazon EKS
https://awslabs.github.io/data-on-eks/
Apache License 2.0
550 stars 180 forks

feat: Add Ray head Pod high availability with Redis #555

Closed ratnopamc closed 2 weeks ago

ratnopamc commented 2 weeks ago

What does this PR do?

- Adds Ray head Pod high availability by enabling GCS fault tolerance.
- Adds a Terraform module to create an Amazon ElastiCache for Redis cluster in AWS.
- Updates the website doc with a section on Ray head Pod high availability for the Mistral-7b-inf2 blueprint.

🛑 Please open an issue first to discuss any significant work and flesh out details/direction - we would hate for your time to be wasted. Consult the CONTRIBUTING guide for submitting pull requests.

Motivation

By default, the Ray head node is a single point of failure: if it crashes, the Ray worker nodes are restarted. This causes downtime, which is undesirable for RayServe applications. Enabling GCS fault tolerance by connecting the Ray head node to an external Redis cluster provides high availability for the head node and avoids restarting Ray workers when the head node crashes.
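
As a rough illustration of what "connecting the Ray head node to an external Redis cluster" can look like with KubeRay, the sketch below builds a partial RayCluster manifest in Python. It is not taken from this PR's diff. It assumes the KubeRay annotation `ray.io/ft-enabled` and the `RAY_REDIS_ADDRESS` environment variable on the head container. The function name, the cluster name, and the Redis endpoint are hypothetical placeholders. A RayService would carry the same settings under its `rayClusterConfig`.

```python
import json


def ray_cluster_with_gcs_ft(name: str, redis_address: str) -> dict:
    """Build a partial RayCluster manifest with GCS fault tolerance enabled.

    Only the fields relevant to fault tolerance are filled in; a real
    manifest also needs a container image, resources, worker groups, etc.
    """
    return {
        "apiVersion": "ray.io/v1",
        "kind": "RayCluster",
        "metadata": {
            "name": name,
            # Annotation assumed to tell KubeRay to enable GCS fault tolerance.
            "annotations": {"ray.io/ft-enabled": "true"},
        },
        "spec": {
            "headGroupSpec": {
                "rayStartParams": {},
                "template": {
                    "spec": {
                        "containers": [
                            {
                                "name": "ray-head",
                                # Points the GCS at the external Redis endpoint.
                                "env": [
                                    {
                                        "name": "RAY_REDIS_ADDRESS",
                                        "value": redis_address,
                                    }
                                ],
                            }
                        ]
                    }
                },
            }
        },
    }


# Hypothetical names; in practice the address would be the ElastiCache
# endpoint output by the Terraform module.
manifest = ray_cluster_with_gcs_ft("example-cluster", "redis.example.internal:6379")
print(json.dumps(manifest, indent=2))
```

With settings like these, the GCS state is kept in Redis, so a restarted head Pod can recover it instead of forcing the workers to restart.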

Closes #348
