-
Hi, I am using the Habana® Deep Learning Base AMI (TensorFlow 2.9.1) on an AWS EC2 instance to train an image segmentation model. Training time for the same dataset using a similar number of HPUs/GPUs and almost similar …
-
### System Info
```shell
Optimum habana version: 1.5.0.dev
Docker image: vault.habana.ai/gaudi-docker/1.8.0/ubuntu20.04/habanalabs/pytorch-installer-1.13.1
```
### Information
- [x] The official …
-
Hello,
I have a custom model that I've incorporated BERT into. Is it possible to train this model using a normal training loop?
Example:
```
def training_loop(dataloader, model1):
device …
```
-
### Feature request
Enable HMP for GPT2.
### Motivation
BF16 has better performance than FP32.
### Your contribution
I will submit a PR.
-
### System Info
```shell
optimum-habana 1.5.0
docker version 1.9.0
pytorch version 1.13.1
```
### Information
- [ ] The official example scripts
- [X] My own modified scripts
### Tasks
- [ ] A…
-
### Version
__Karpenter Version:__ v0.22.1
__Kubernetes Version:__ v1.24
### Expected Behavior
Allocate a workload with `memory` limit & request which are suitable for dl1.24xlarge `memory` …
-
Hi, I am wondering how to check if Habana Gaudi hardware exists? Something similar to `nvidia-smi`.
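
A minimal sketch of one way to check from Python, assuming the Habana driver stack installs the `hl-smi` command-line tool on `PATH` (it is the closest counterpart to `nvidia-smi`). The helper names `find_hl_smi` and `gaudi_available` are hypothetical, and treating a zero exit code as "device present" is an assumption, not documented behavior:

```python
import shutil
import subprocess
from typing import Callable, Optional


def find_hl_smi(which: Callable[[str], Optional[str]] = shutil.which) -> Optional[str]:
    """Return the path to the hl-smi binary if it is on PATH, else None."""
    return which("hl-smi")


def gaudi_available(which=shutil.which, run=subprocess.run) -> bool:
    """Best-effort check: hl-smi is installed and runs successfully.

    Assumption: a zero exit code from hl-smi means the driver can see
    at least one Gaudi device. `which` and `run` are injectable so the
    function can be exercised on machines without Habana hardware.
    """
    path = find_hl_smi(which)
    if path is None:
        return False
    try:
        result = run([path], capture_output=True, text=True, timeout=30)
    except (OSError, subprocess.TimeoutExpired):
        # Binary exists but could not be executed, or it hung.
        return False
    return result.returncode == 0


if __name__ == "__main__":
    print("Gaudi detected" if gaudi_available() else "No Gaudi detected")
```

Running `hl-smi` directly in a shell gives the full device listing; the wrapper above is only useful when a script needs a yes/no answer before picking a device.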
-
After updating KubeVirt from version 0.47.1, we can't work at all. Creating VMs gives us, immediately or after about 1 minute, "watchdog: BUG: soft lockup - CPU#75 stuck".
I did a couple of tests: 0.47.1 work…
-
Hi, I followed the instructions from this [post](https://docs.habana.ai/en/latest/AWS_EC2_DL1_and_PyTorch_Quick_Start/AWS_EC2_DL1_and_PyTorch_Quick_Start.html#start-training-a-pytorch-model-on-gaudi). …
-
**What happened**:
Starting with 0.48.1, I sometimes get `watchdog: BUG: soft lockup` directly after VM creation.
I ran a couple of tests: 0.47.1 works well, and the failures start with 0.48.1.
my susp…