-
### System Info
When I use Optimum Neuron on Trainium with `--gradient_accumulation_steps > 1`, training fails.
I then modified line https://github.com/huggingface/transformers/blob/6d1f545665ac6…
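For context, a minimal reproduction sketch (not the reporter's actual script) of a Trainium run with gradient accumulation via Optimum Neuron's Trainer API; the `gpt2` model, dummy dataset, and hyperparameters below are placeholders, since the issue does not include the original configuration:

```python
# Hypothetical reproduction sketch; model, dataset, and hyperparameters are
# placeholders, not the reporter's actual configuration.
from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from optimum.neuron import NeuronTrainer, NeuronTrainingArguments

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # gpt2 has no pad token by default
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Tiny dummy dataset, just enough to exercise the training loop.
encodings = tokenizer(["hello world"] * 64, padding="max_length",
                      truncation=True, max_length=32)
dataset = Dataset.from_dict(dict(encodings))
dataset = dataset.map(lambda ex: {"labels": ex["input_ids"]})

args = NeuronTrainingArguments(
    output_dir="out",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=2,  # > 1 is what triggers the reported failure
    max_steps=4,
)
NeuronTrainer(model=model, args=args, train_dataset=dataset).train()
```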
-
### Checklist
- [x] I've prepended issue tag with type of change: [feature]
- [ ] (If applicable) I've documented below the DLC image/dockerfile this relates to
- [ ] (If applicable) I've documented th…
-
This issue outlines the major items planned for Q2 2024. Note that it doesn't include bug fixes, except for major issues.
> [!NOTE]
> **Bold** means priority.
### Core features
- [x] **Multi-…
-
We need to train NeMo models on specialised XLA hardware (Trainium).
Are you planning to make this framework XLA-compatible?
-
## Description
Please provide a clear and concise description of the issue you are encountering, and a reproduction of your configuration.
In the node group [Terraform file](https://github.com/a…
-
### Describe the bug
If the instance type is Trainium, the Neuron device plugin is wrongly not installed.
### Expected Behavior
If the instance type is Trainium, the Neuron device plugin is installed.
### …
-
I usually install `transformers_neuronx` from git, and since the last commit says that it was updated for SDK release 2.12, I assumed it was the same version available from GitHub. However, running `g…
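One quick way to check which version is actually installed and compare it against the git history; the distribution name `transformers-neuronx` is an assumption, as the snippet only shows the import name:

```python
# Print the installed distribution version to compare against the latest
# git commit; the name "transformers-neuronx" is assumed, not confirmed.
from importlib.metadata import version

print(version("transformers-neuronx"))
```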
-
### 🚀 Traceable Collectives!
Collective APIs (e.g. all_reduce, all_gather, ...) are used in distributed PyTorch programs, but do not compose cleanly with compilers.
Specifically, TorchDynamo a…
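For readers unfamiliar with the APIs in question, here is a minimal, standard `torch.distributed` example of an eager-mode collective (`all_reduce` over the `gloo` backend); it illustrates the existing API the proposal refers to, not the proposed traceable variant:

```python
# Minimal eager-mode collective: each rank contributes a tensor and
# all_reduce sums them across all ranks. Runs on CPU via the gloo backend.
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def worker(rank: int, world_size: int):
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"
    dist.init_process_group("gloo", rank=rank, world_size=world_size)
    t = torch.ones(2) * (rank + 1)
    dist.all_reduce(t, op=dist.ReduceOp.SUM)  # sums tensors across ranks
    print(f"rank {rank}: {t}")  # every rank sees the same reduced result
    dist.destroy_process_group()

if __name__ == "__main__":
    world_size = 2
    mp.spawn(worker, args=(world_size,), nprocs=world_size)
```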
-
Does FlexFlow have the capability to support XLA-based devices (e.g. TPU, Trainium), or is it tied to CUDA?