-
Create a detailed blueprint to set up [Kueue with Ray](https://kueue.sigs.k8s.io/docs/tasks/run_rayjobs/) to demonstrate large-scale batch processing (training, batch predictions, etc.). Customers are…
-
Hi,
I'm trying to run `notebooks/text-classification/notebook.ipynb`. Every run recompiles, using a different `/tmp/` path each time, e.g.
`2023-06-27 17:59:23.000182: INFO ||NCC_WRA…
-
## Description
I'm unable to run the trainium-inferentia BERT pretrain model. The following error shows up during the build:
`Traceback (most recent call last):
File "/home/ec2-user/.local/bi…
-
I am trying to use a JSONL file for fine-tuning with Optimum Neuron on a Trainium instance. I get this error:
File "/usr/local/lib/python3.10/site-packages/optimum/neuron/hf_argparser.py", lin…
-
Hello,
We created an [example](https://github.com/philschmid/aws-neuron-samples/blob/main/training/getting-started.ipynb) on how to fine-tune BERT on the `Banking77` dataset, which has 77 labels, …
-
I found a new edge case. If you push to the Trainium cache during training and get a 500 from the Hub while uploading the NEFFs, the training gets stuck. My training is stu…
-
**What's the problem:**
Presently, deploying [Neuron plugins for k8s](https://github.com/aws-neuron/aws-neuron-sdk/tree/v2.11.0/src/k8) requires manual configuration using YAML files, which can b…
-
## 🐛 Bug
Not sure if this is a bug or expected behavior: calling `module.to(xm.xla_device())` creates new parameter tensors instead of modifying the existing tensors in place.
## To Reproduce
Runn…
-
# Issue: Training on AWS Trainium Failing
I'm encountering issues when trying to train certain architectures on AWS Trainium. Initially, we attempted to migrate our code from a BERT model, fully im…
-
### What feature/behavior/change do you want?
If an instance type is only available in a single availability zone in a given region, then randomly select the remaining AZs to meet the minimum A…