-
Experiment with different implementations of Matmul:
- [x] Vanilla Matmul implementation
- [x] Vanilla Matmul with I/O optimized
- [x] GEMM (blocked matrix)
- [x] Threaded GEMM
- [x] GEMM on NEON…
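The first two checklist items can be contrasted in a minimal sketch (plain Python for clarity, not the actual kernels from this repo): the vanilla triple loop streams through `B` column-wise on every output element, while the blocked (GEMM-style) variant iterates over small tiles so a tile of `A` and `B` is reused while it is still hot in cache.

```python
def matmul_vanilla(A, B, n):
    # C[i][j] = sum_k A[i][k] * B[k][j]; A, B, C are n x n row-major lists.
    C = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            s = 0.0
            for k in range(n):
                s += A[i][k] * B[k][j]
            C[i][j] = s
    return C

def matmul_blocked(A, B, n, bs=4):
    # Same result, but computed tile-by-tile: each bs x bs block of A and B
    # is reused across the inner loops, improving cache locality.
    C = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, bs):
        for kk in range(0, n, bs):
            for jj in range(0, n, bs):
                for i in range(ii, min(ii + bs, n)):
                    for k in range(kk, min(kk + bs, n)):
                        a = A[i][k]  # hoist the A element out of the j loop
                        for j in range(jj, min(jj + bs, n)):
                            C[i][j] += a * B[k][j]
    return C
```

The block size `bs` here is a stand-in; real GEMMs pick tile sizes from the target's cache and register geometry.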
-
Hello.
In the previous pull request #4381, the P and Q parameters of [SD]GEMM were increased to make better use of the L2 cache of Neoverse V1, but the complex [CZ]GEMM parameters were left unchanged. …
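The reason the complex parameters need their own tuning is simple arithmetic: a complex element takes twice the bytes of its real counterpart, so the same P×Q panel occupies twice the cache. A small sketch of that footprint calculation (the P, Q, and L2 values below are illustrative assumptions, not the actual per-target parameters):

```python
def panel_bytes(p, q, elem_bytes):
    # Working set of one P x Q panel held resident during the inner GEMM loop.
    return p * q * elem_bytes

# Hypothetical blocking parameters for illustration; the real values are
# set per target in the library's configuration headers.
P, Q = 256, 512
l2_bytes = 1024 * 1024  # assuming a 1 MiB private L2 per core

for name, elem in [("SGEMM", 4), ("DGEMM", 8), ("CGEMM", 8), ("ZGEMM", 16)]:
    frac = panel_bytes(P, Q, elem) / l2_bytes
    print(f"{name}: panel uses {frac:.0%} of L2")
```

With these assumed values the ZGEMM panel would be twice the L2 size, which is why P and Q tuned for real types cannot simply be carried over to [CZ]GEMM.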
-
Hello, we have measured the FP8 GEMM performance using Triton on NVIDIA H100 (500 W, 1980 MHz). We would like to request your help in understanding whether this performance is expected.
Since H100 FP8 o…
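For comparing a measurement like this against peak, the usual bookkeeping is FLOPs = 2·M·N·K (one multiply plus one add per MAC), divided by the measured kernel time. A small helper to that effect (the timing number below is made up for illustration, not taken from the measurement above):

```python
def gemm_tflops(m, n, k, seconds):
    # A dense GEMM performs 2*M*N*K floating-point operations
    # (one multiply and one add per multiply-accumulate).
    return 2 * m * n * k / seconds / 1e12

# Illustrative only: a 4096^3 GEMM finishing in 0.25 ms.
achieved = gemm_tflops(4096, 4096, 4096, 0.25e-3)
print(f"{achieved:.0f} TFLOP/s")  # ~550 TFLOP/s under these assumed numbers
```

Dividing the result by the device's peak FP8 throughput at the measured clock gives the utilization fraction being asked about.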
sryap updated 2 months ago
-
# Summary
When trying to use oneMKL with the portBLAS backend, the current code structure checks for an Intel, AMD, or NVIDIA GPU; if none is found, it raises an unsupported-device error. It is understood that…
-
From the 22 Feb 2024 performance model review of Distilgpt2:
There are several GEMMs that are applied together (this is the tail end of attention):
```
@17 = hip::hip_copy_literal[id=main:@litera…
-
## 🐛 Bug
```
ld: warning: multiple common of .gomp_critical_user_.var
ld: error: duplicate symbol: libxsmm_verbosity
>>> defined at libxsmm_generator.c:31 (/usr/ports/math/dgl/work/dgl-2.2.1/thi…
-
love the package!
`BLAS.gemm!` fails for any `PDMat` arguments unless you pass `a.mat`.
Maybe something like this could be more general:
```Julia
pd_gemm!(tA, tB, alpha, A, B, beta, C) = BLAS.ge…
-
SOTA (cuBLAS, CUTLASS) FP8 GEMM kernels perform poorly in the small-M regime (M = bs*seq_len < 32).
This work will focus on leveraging the performant pieces of the [Marlin](https://github.com/IST-D…
-
CUTLASS is used to build kernels in TensorFlow.
I took a look at cutlass_archive/include/cutlass/matrix.h, and indeed set_slice3x3 is not defined; however, set_slice_3x3 is.
Did not want to submi…
-
### Motivation.
At a high level, we at Neural Magic are writing a custom compiler for Torch Dynamo to define a system within vLLM where we can write graph transformations. The main goal is a separa…