Adds Ray head Pod high availability by enabling GCS fault tolerance
Adds a Terraform module to create an Amazon ElastiCache for Redis cluster in AWS
Updates the website docs with a section on Ray head Pod high availability for the Mistral-7b-inf2 blueprint
Motivation
By default, the Ray head node is a single point of failure. If it crashes, the Ray worker nodes are restarted, which causes downtime and is undesirable for Ray Serve applications. Enabling GCS fault tolerance by connecting the Ray head node to an external Redis cluster provides high availability for the head node and avoids restarting Ray workers when the head node crashes.
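For reviewers who want a concrete picture of the change, here is a minimal sketch of what enabling GCS fault tolerance on a RayCluster manifest can look like. It assumes the KubeRay convention of a `ray.io/ft-enabled` annotation plus a `RAY_REDIS_ADDRESS` environment variable on the head container; the helper name `enable_gcs_ft`, the sample manifest, and the Redis endpoint are hypothetical placeholders, and the manifest actually shipped in the blueprint may differ.

```python
import copy


def enable_gcs_ft(ray_cluster: dict, redis_address: str) -> dict:
    """Return a copy of a RayCluster manifest with GCS fault tolerance enabled.

    Assumption: KubeRay reads the `ray.io/ft-enabled` annotation and the head
    container's `RAY_REDIS_ADDRESS` environment variable.
    """
    patched = copy.deepcopy(ray_cluster)  # leave the caller's manifest untouched

    # Mark the cluster as fault tolerant.
    annotations = patched.setdefault("metadata", {}).setdefault("annotations", {})
    annotations["ray.io/ft-enabled"] = "true"

    # Point the head container at the external Redis endpoint,
    # replacing any previous value of the same variable.
    head = patched["spec"]["headGroupSpec"]["template"]["spec"]["containers"][0]
    env = [e for e in head.get("env", []) if e.get("name") != "RAY_REDIS_ADDRESS"]
    env.append({"name": "RAY_REDIS_ADDRESS", "value": redis_address})
    head["env"] = env
    return patched


# Hypothetical minimal manifest and placeholder Redis endpoint.
cluster = {
    "apiVersion": "ray.io/v1",
    "kind": "RayCluster",
    "metadata": {"name": "example-cluster"},
    "spec": {
        "headGroupSpec": {
            "template": {"spec": {"containers": [{"name": "ray-head"}]}}
        }
    },
}
patched = enable_gcs_ft(cluster, "redis.example.internal:6379")
```

The helper returns a patched copy rather than mutating its input, so the same base manifest can be reused with and without fault tolerance when testing head Pod recovery.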
Closes #348
More
[x] Yes, I have tested the PR using my local account setup (Provide any test evidence report under Additional Notes)
[x] Mandatory for new blueprints. Yes, I have added an example to support my blueprint PR
[x] Mandatory for new blueprints. Yes, I have updated the website/docs or website/blog section for this feature
[x] Yes, I ran pre-commit run -a with this PR
Additional Notes