-
When I was running the [code example](https://github.com/user-attachments/files/17388059/sgemm_sm80_tmp.txt) provided by @ccecka in another [issue](https://github.com/NVIDIA/cutlass/issues/1858), I go…
-
**What is your question?**
Hello, I am developing a full-precision attention backward kernel using CUTLASS, and I am stuck on the use of the ldmatrix and mma instructions for fp32.
My Gemm calculation is …
-
Hello,
I have a question about the census transform that came up while reading your code. Why aren't you comparing the intensity of neighboring pixels with the center?
Best regards.
Here is the code i…
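
For reference, this is a minimal sketch of the textbook census transform that the question has in mind, where every neighbor in the window is compared with the center pixel. It is not the code from the repository being asked about. The function name `census_transform`, the `radius` parameter, the row-major bit order, and the choice to leave border pixels at 0 are all assumptions made for illustration.

```python
def census_transform(image, radius=1):
    """Return a census signature per pixel for a 2D list of intensities.

    Each neighbor in the (2*radius+1) x (2*radius+1) window is compared with
    the center pixel; its bit is 1 when the neighbor is darker than the center.
    Border pixels, whose window would leave the image, keep signature 0
    (an assumption of this sketch).
    """
    height = len(image)
    width = len(image[0]) if height else 0
    out = [[0] * width for _ in range(height)]
    for y in range(radius, height - radius):
        for x in range(radius, width - radius):
            center = image[y][x]
            bits = 0
            # Visit neighbors in row-major order (hypothetical bit order).
            for dy in range(-radius, radius + 1):
                for dx in range(-radius, radius + 1):
                    if dy == 0 and dx == 0:
                        continue  # the center is not compared with itself
                    darker = 1 if image[y + dy][x + dx] < center else 0
                    bits = (bits << 1) | darker
            out[y][x] = bits
    return out
```

With a 3x3 window each interior pixel gets an 8-bit signature, and two signatures are typically compared by Hamming distance.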
-
I encountered some problems when using a predicate tensor.
In the tutorials:
https://github.com/NVIDIA/cutlass/blob/main/examples/cute/tutorial/tiled_copy.cu
https://github.com/NVIDIA/cutlass/blob/mai…
-
**What is your question?**
I am learning to use CuTe to build an HGEMM kernel. Tested on an A10 GPU, the CuTe kernel performs well with small problem sizes such as m/n/k = 4096, but I found it's much slower …
-
### Description
I tried to write a test with a kernel for a scalar using Smem/SReg:
```
def test_scalar_exp(self):
    def scalar_exp_kernel(in_smem_ref: Float, out_smem_ref: Float):
        in_sreg = …
-
Hi, I've just created a small project ([link to the project](https://github.com/Yanksi/cute_mma)) by modifying the `sgemm_sm80` example. What I was doing was trying to make use of the tensor cores for…
-
(1) After inner persistent buffers are stored in shared memory, there are still bank conflicts if the persistent buffer is NOT projected to inputs, for two reasons:
```
(a) We are missing a cacheBe…
-
### RT-Thread Version
5.2.0 commit 2f559906d6202c27142237ab4b1d893034a5b7c3
### Hardware Type/Architectures
VEXPRESS_A9
### Develop Toolchain
GCC
### Describe the bug
### Steps to reproduce:
…
-
When profiling transpiled muGraphs, some results are extremely low and are close to kernel launch time. For example, in the gated_mlp example, some muGraphs only consume ~0.004ms in the catalyst clust…