-
## 🚀 Feature
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/index.html
https://aws.amazon.com/machine-learning/neuron/
### Motivation
https://aws.amazon.com/about-aws/whats-new/2022…
-
## Description
The EKS cluster versions in various blueprints are getting a bit behind. We should upgrade all of our blueprints to the latest version (1.29 at the time of this writing, 1.30 due soon).
Find t…
-
## Description
For the AWS Batch scheduler integration in torchx, support wildcards in the `job_queue` runopt with a simple (or configurable or extendable) queue selection logic. For instance, assu…
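
A minimal sketch of what such a selection step could look like, assuming the list of available queue names has already been fetched elsewhere. The function name `resolve_job_queue` and the `select` hook are hypothetical illustrations, not part of the existing torchx API; the default policy (first match in sorted order) is only one possible choice.

```python
from fnmatch import fnmatchcase
from typing import Callable, Sequence


def resolve_job_queue(
    pattern: str,
    available: Sequence[str],
    select: Callable[[Sequence[str]], str] = lambda matches: matches[0],
) -> str:
    """Resolve a possibly wildcarded job_queue value to one concrete queue name.

    `available` is assumed to be the list of queue names visible to the caller.
    `select` is a hypothetical extension point that picks one queue from the
    sorted list of matches; by default it returns the first one.
    """
    # Case-sensitive shell-style matching: "*" and "?" work as in a glob.
    matches = sorted(q for q in available if fnmatchcase(q, pattern))
    if not matches:
        raise ValueError(f"no job queue matches pattern {pattern!r}")
    return select(matches)
```

A pattern without wildcards still works, since it only matches the queue with exactly that name. A custom policy can be swapped in through `select`, for example `select=lambda matches: matches[-1]`.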
-
Torch Neuron is a PyTorch integration that enables it to run on AWS Trainium and Inferentia accelerator instances. Since these are somewhat cheaper, especially for large models that may be a little too la…
-
Doing hardware-accelerated inference in a serverless environment is a compelling use case.
However, adding straight-up GPU passthrough means that the microVM can't oversubscribe memory, and we need to a…
-
### 🚀 The feature, motivation and pitch
Consider implementing BFloat16 addition/subtraction operations with stochastic rounding, as it is critical for training large models with the BFloat16 optimi…
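
For illustration, here is a rough pure-Python sketch of stochastic rounding from float32 to BFloat16, assuming the sum is first computed in float32. The helper names are hypothetical and this is not how any particular library implements it. It adds uniform random noise to the 16 low bits that BFloat16 discards and then truncates, so a value rounds away from zero with probability proportional to the discarded fraction.

```python
import math
import random
import struct


def _f32_to_bits(x: float) -> int:
    # Reinterpret a float32 value as its 32-bit pattern.
    return struct.unpack("<I", struct.pack("<f", x))[0]


def _bits_to_f32(bits: int) -> float:
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def stochastic_round_to_bf16(x: float, rng: random.Random) -> float:
    """Round a float32-representable value to a BFloat16-representable one.

    Hypothetical helper: BFloat16 keeps the top 16 bits of a float32, so we
    add 16 random bits to the discarded half and truncate.
    """
    if math.isnan(x) or math.isinf(x):
        return x
    bits = _f32_to_bits(x)
    noise = rng.getrandbits(16)
    rounded = (bits + noise) & 0xFFFF0000
    return _bits_to_f32(rounded)


def bf16_add_stochastic(a: float, b: float, rng: random.Random) -> float:
    # Assumes a and b are already BFloat16 values; their sum is formed at
    # higher precision and then stochastically rounded back.
    return stochastic_round_to_bf16(a + b, rng)
```

Unlike round-to-nearest, the expected value of the result equals the unrounded sum, so small updates are not systematically lost when they are below half a BFloat16 step.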
-
## 🐛 Bug
In FSDP backward pass, if we accumulate some callbacks and invoke them later in one batch, then different runs can result in slightly different computation graphs and cause recompilation.
…
-
### Community Note
* Please vote on this issue by adding a 👍 [reaction](https://blog.github.com/2016-03-10-add-reactions-to-pull-requests-issues-and-comments/) to the original issue to help the…
-
## Description
After applying the JupyterHub on EKS blueprint, I am able to access the home screen via port forwarding. I can then select one of the provided options to set up a server:
* Data Enginee…
-
### What happened + What you expected to happen
### What Happened:
When deploying models using RayServe with autoscaling enabled on Amazon EKS, specifically across multiple `inf2` nodes, the syste…