-
### Bug description
I'm working on a Slurm cluster with 8 AMD MI100 GPUs distributed across 2 nodes, with 4 GPUs in each node. I followed the instructions (https://lightning.ai/docs/pytorch/stable/clouds…
-
Hello, I have this problem: `The size of tensor a (128) must match the size of tensor b (0) at non-singleton dimension 1`. How do I solve it?
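For context, this message is a broadcasting failure: two shapes are compared dimension by dimension from the right, and each pair of sizes must either be equal or contain a 1. A size of 0 usually means one tensor is empty along that dimension (for example, the result of an empty slice or an empty batch), but the actual cause in this code is not known. Below is a minimal pure-Python sketch of the compatibility rule. The helper name `find_broadcast_mismatch` is hypothetical and is not part of PyTorch.

```python
def find_broadcast_mismatch(shape_a, shape_b):
    """Return (dim, size_a, size_b) for the rightmost dimension that cannot
    be broadcast, or None if the two shapes are broadcast-compatible.

    Hypothetical helper for illustration only; it mirrors the usual
    broadcasting rule (sizes must be equal, or one of them must be 1).
    """
    ndim = max(len(shape_a), len(shape_b))
    # Left-pad the shorter shape with 1s so dimensions align from the right.
    a = (1,) * (ndim - len(shape_a)) + tuple(shape_a)
    b = (1,) * (ndim - len(shape_b)) + tuple(shape_b)
    # Walk from the trailing dimension towards the leading one.
    for dim in range(ndim - 1, -1, -1):
        if a[dim] != b[dim] and a[dim] != 1 and b[dim] != 1:
            return dim, a[dim], b[dim]
    return None


# Shapes assumed for illustration: a batch of 4 with 128 features
# against a tensor that is empty along dimension 1.
print(find_broadcast_mismatch((4, 128), (4, 0)))
print(find_broadcast_mismatch((4, 128), (4, 1)))
```

In practice, printing `tensor.shape` for both operands just before the failing operation shows which one has the unexpected size 0, and tracing that tensor back to where it was created or sliced typically reveals the cause.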
-
Unit and integration tests currently need to be run with `pytest tests/test_gaudi_configuration.py tests/test_trainer_distributed.py tests/test_trainer.py`. If not, for instance with `pytest tests/`…
-
### 🐛 Describe the bug
I am using an HPU device for testing. The following code gives an incorrect result on PyTorch 2.3.0, while it was correct on PyTorch 2.2.2.
```
import torch
import habana_fram…
-
I do not know how to fix this. Please help.
python3 -m piper_train \
--dataset-dir ~/piper/my-training \
--accelerator 'gpu' \
--devices 1 \
--batch-size 32 \
--validati…
-
### Bug description
Hi, I am using PyTorch Lightning to implement some new optimization strategies using `automatic_optimization=False`. For certain settings, my optimization strategy (using `automa…
-
### Your current environment
```text
The output of `python collect_env.py`
```
### 🐛 Describe the bug
This issue is introduced by the `block_softmax` kernel (part of `flat_pa`, see #169).
For some …
-
Hi, if I understood correctly, `--ckpt-path` is the right way to pass the weights when continuing from the 16GB checkpoints. I tried resuming directly after training the base model for some hours, I onl…
-
This is a more generic requirement for all the container images created here. Many Kubernetes clouds have a [security standard policy](https://kubernetes.io/docs/concepts/security/pod-security-standards)…
-
### System Info
TGI-gaudi 2.0.4 docker image.
Model = meta-llama/Meta-Llama-3.1-70B-Instruct
HW = Gaudi2, 4 cards
python 3.11
langchain 0.2.12
langchain-core 0.2.28…