-
**What is your question?**
I ported the serial split_k from cutlass 2.x to cutlass 3.x, with slice_k as a parameter of problem_size. Now I want to use cutlass_profiler to test whether I should add a parameter to …
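
For readers unfamiliar with the term, the following is a minimal pure-Python sketch of the idea behind serial split-K, not CUTLASS code. It assumes `slice_k` means the number of slices the K dimension is divided into (the issue does not spell this out), and the function names `matmul_slice` and `gemm_split_k_serial` are hypothetical. Each slice computes a partial product over its own K range, and the partials are accumulated one after another into the same output.

```python
def matmul_slice(a, b, k_begin, k_end):
    """Partial product of a (M x K) and b (K x N) restricted to k in [k_begin, k_end)."""
    m, n = len(a), len(b[0])
    return [
        [sum(a[i][k] * b[k][j] for k in range(k_begin, k_end)) for j in range(n)]
        for i in range(m)
    ]


def gemm_split_k_serial(a, b, slice_k):
    """Accumulate slice_k partial products into one output, one slice at a time.

    slice_k is assumed here to be the number of K slices (hypothetical reading
    of the parameter named in the issue).
    """
    k_total = len(b)
    m, n = len(a), len(b[0])
    c = [[0] * n for _ in range(m)]
    chunk = -(-k_total // slice_k)  # ceiling division: K elements per slice
    for s in range(slice_k):
        k_begin = s * chunk
        k_end = min(k_begin + chunk, k_total)
        partial = matmul_slice(a, b, k_begin, k_end)  # empty range yields zeros
        for i in range(m):
            for j in range(n):
                c[i][j] += partial[i][j]  # serial reduction into the same output
    return c


a = [[1, 2, 3], [4, 5, 6]]
b = [[7, 8], [9, 10], [11, 12]]
result = gemm_split_k_serial(a, b, slice_k=2)
```

With integer inputs the result is the same for any number of slices; with floating-point inputs the accumulation order can change the rounding slightly.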
-
What does the l_r parameter represent in the libxsmm_create_packed_gemm function? [TEST](https://github.com/libxsmm/libxsmm/blob/main/samples/xgemm_packed/gemm_packed_kernel.c) I'm trying to understan…
-
I didn't find Azure CycleCloud for CPU when parsing the mt-gemm results. For example, it should be in the plot here:
https://github.com/converged-computing/performance-study/tree/main/analysis/mt-g…
vsoch updated
3 weeks ago
-
We have achieved good performance (relative to the XeTLA library) for a GEMM kernel (see http://benchmarks.glados.intel.com/d/1pXX4hUSz/microbenchmarks?orgId=1). Now it is time to focus on improving per…
-
When I run the GEMM benchmark on an A770 I get about `0.3 TFLOPs`, while a 1550 gets about `250 TFLOPs`.
Performance table:
![image](https://github.com/user-attachments/assets/366947f8-82ce-4454-83ae-f…
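
As a point of reference for the figures above, GEMM throughput is conventionally derived from the operation count 2·M·N·K (one multiply and one add per inner-product term) divided by the elapsed time. The sketch below shows that arithmetic only; the helper name `gemm_tflops` and the example shape are hypothetical, and the benchmark in the issue may compute its numbers differently.

```python
def gemm_tflops(m, n, k, seconds):
    """Throughput in TFLOP/s for one M x N x K GEMM that took `seconds`."""
    flops = 2 * m * n * k  # one multiply + one add per (i, j, k) triple
    return flops / seconds / 1e12


# Hypothetical example: a 4096 x 4096 x 4096 GEMM that takes one second.
example = gemm_tflops(4096, 4096, 4096, 1.0)
```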
-
### Issue type
Bug
### Have you reproduced the bug with TensorFlow Nightly?
Yes
### Source
source
### TensorFlow version
tf 2.15
### Custom code
Yes
### OS platform and distribution
_No res…
-
### Your current environment
python 3.8
L20*4
vllm 0.5.4
### Model Input Dumps
_No response_
### 🐛 Describe the bug
$python -m vllm.entrypoints.api_server --model='/mntfn/yanyi/Qwen2-…
-
USE_IPEX=0 python gemm_splitk_benchmark.py
```
/home/j…
```
-
### Problem Description
I am investigating the use of the instruction `v_mfma_f32_16x16x16_f16` and the NVIDIA equivalent, warp-level mma (swizzle SRAM memory + ldmatrix registers + mma over registers, for Ampere…
-
# Describe the bug
GEMM kernels with the following configurations hang for specific GEMM shapes.
- Type: `e4m3 x e4m3 -> bf16`
- Tile: `256x32x128`
- Cluster: `2x1x1`
- Kernel Schedule: `KernelT…