-
### 🐛 Describe the bug
Batched GEMM performs poorly for a large batch size (`12*7*120*64*129`) with small matrix sizes (`3x3`, `3x1`):
```python
import torch
import time
points = t…
-
### Go version
go version go1.23.2 darwin/arm64
### Output of `go env` in your module/workspace:
```shell
GO111MODULE='on'
GOARCH='arm64'
GOBIN=''
GOCACHE='/Users/aimuz/Library/Caches/go-…
-
### 🐛 Describe the bug
Under specific inputs, `reflection_pad1d_backward` triggered a crash.
```python
import torch
grad_output = torch.full((2,8,2,2,2,10,9,9,), 9.87654e+09, dtype=torch.float)
…
-
### 🐛 Describe the bug
# Problem
When running compiled FlexAttention in a multi-GPU environment, if the device being used is not the first GPU (i.e., not `cuda` or `cuda:0`, but `cuda:1`, etc.), a…
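The device-index distinction behind this report can be shown without any GPUs. `resolve_index` below is a hypothetical helper (not part of the FlexAttention API): a bare `cuda` device carries no index and falls back to the current device, which defaults to 0, so `cuda` and `cuda:0` normally coincide while `cuda:1` is a genuinely different device.

```python
import torch

def resolve_index(device: str) -> int:
    """Hypothetical helper: which CUDA index a device string refers to."""
    d = torch.device(device)
    # A bare "cuda" has index None and falls back to the current device,
    # which defaults to 0.
    return 0 if d.index is None else d.index

print(resolve_index("cuda"), resolve_index("cuda:0"), resolve_index("cuda:1"))  # 0 0 1
```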
-
### Your current environment
```text
Collecting environment information...
PyTorch version: 2.1.2+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A
…
-
The script to reproduce the bug:
```python
import os
import time
import pickle
import torch
import threading
import torch.distributed as dist
import torch.distributed.distributed_c10d as c10…
-
### Describe the issue
I am testing AMX's performance in BF16 inference. It turns out that under different settings of `DNNL_MAX_CPU_ISA` (`AVX512_CORE_AMX` `AVX512_CORE_BF16` `AVX512_CORE_VNNI`), …
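A minimal sketch of how the ISA cap is applied (the original benchmark is truncated above): oneDNN reads `DNNL_MAX_CPU_ISA` once at library load, so it must be set before `torch` is imported. The BF16 matmul is a stand-in workload; the sizes are assumptions.

```python
import os

# Must be set before importing torch: oneDNN reads it once at load time.
os.environ.setdefault("DNNL_MAX_CPU_ISA", "AVX512_CORE_AMX")  # or AVX512_CORE_BF16 / AVX512_CORE_VNNI

import time
import torch

a = torch.randn(1024, 1024).to(torch.bfloat16)
b = torch.randn(1024, 1024).to(torch.bfloat16)

t0 = time.perf_counter()
c = a @ b  # BF16 GEMM; dispatches to AMX only when the cap allows it
print(f"{os.environ['DNNL_MAX_CPU_ISA']}: {time.perf_counter() - t0:.4f}s")
```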
-
### OpenVINO Version
2023.1.0
### Operating System
Ubuntu 22.04 (LTS)
### Hardware Architecture
x86 (64 bits)
### Target Platform
Architecture: x86_64
CPU op-mode(s): 32-bi…
-
### 🐛 Describe the bug
This script loads a list of tensors and diffs `_foreach_norm` and `[torch.norm(t) for t in ...]`:
```python
import torch
ts = torch.load('list_of_tensors.pt', weights_only=Tru…
-
### Your current environment
```
Collecting environment information...
PyTorch version: 2.1.1+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A
OS: Ub…