-
I tried to run a toy example with the DDP strategy on 2 `trn1.32xlarge` instances. To simplify the workload, I launched only 1 worker per instance (2 in total), but still got the following NEFF error…
-
Hi,
when running training of BART-base on a `trn1.2xlarge` using the command
```
torchrun --nproc_per_node=2 run_summarization.py
```
I receive the compilation error below. I am wondering…
-
After merging https://github.com/weaveworks/eksctl/pull/6763 we encountered a few issues and failures in our integration tests.
We need to fix these before release.
```
Tranium test
Summarizing …
```
-
## Description
Please provide a clear and concise description of the issue you are encountering, and a reproduction of your configuration.
If your request is for a new feature, please use the `F…
-
## Description
When attempting to set up a managed node group containing an instance type that supports multiple NICs, such as a p4d.24xlarge, the launch template is set up incorrectly, resulting nodes be…
-
### Tell us about your request
Karpenter is not aware of `aws.amazon.com/neuron` resources on the `trn` instance family.
### Tell us about the problem you're trying to solve. What are you trying to…
-
https://instances.vantage.sh/aws/ec2/trn1.32xlarge
No GPU info is mentioned for these instances.
Other instances do seem to have this info, for example https://instances.vantage.sh/aws/ec2/p4de.24xlarge
-
Hey!
I am trying to follow this guide: https://huggingface.co/docs/optimum-neuron/tutorials/fine_tune_bert and fine-tune BERT on a trn1.2xlarge instance. I set up the datasets as mentioned in the bl…
-
I tried to train my model with torch-neuronx by editing the existing code, but some errors appear at compile time.
One of them occurs when compiling the one-hot function in torch.
When I run my origina…
-
Inferentia and Trainium integration tests are failing with this error:
```
[0] (Integration) Inferentia nodes cluster with inf nodes with --install-neuron-plugin=false when adding an unmanaged nod…