-
Hello!
I'm currently checking flash attention v2 and noticed that when copying from global memory to shared memory, the entire HeadDim (the K dimension in MNK tiling) needs to be copied to shared m…
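For scale, here is a rough sketch of the shared-memory footprint when the full HeadDim slice of each tile is staged at once. The tile sizes (`block_m`, `block_n`), the 2-byte fp16 element size, and the assumption that one Q tile plus one K tile and one V tile are resident together are illustrative assumptions, not values read from the flash attention source.

```python
def smem_bytes(block_m, block_n, head_dim, elem_bytes=2):
    """Bytes of shared memory for one Q tile plus one K and one V tile.

    Hypothetical helper: assumes every tile spans the whole head_dim.
    """
    q_tile = block_m * head_dim * elem_bytes
    kv_tiles = 2 * block_n * head_dim * elem_bytes  # K and V
    return q_tile + kv_tiles


if __name__ == "__main__":
    for head_dim in (64, 128, 256):
        print(head_dim, smem_bytes(128, 64, head_dim))
```

Under these assumptions the footprint grows linearly with HeadDim, which is why staging the full K dimension becomes expensive for large heads.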
-
I'm trying to implement pipelining using `tl.range(..., num_stages=num_pipeline_stages)` for a persistent kernel.
Each SM executes the following operations per iteration:
Memory: There are 7 loads ( 32…
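As a point of reference, the sketch below models in plain Python the issue order that a simple `num_stages`-deep software pipeline would produce: the load for iteration `i + num_stages - 1` is issued before the compute for iteration `i`. This is a simplified model under that assumption, not what the Triton compiler necessarily emits, and `pipelined_schedule` is a hypothetical helper name.

```python
def pipelined_schedule(num_iters, num_stages):
    """Return the ordered list of (op, iteration) events.

    Simplified model: loads run (num_stages - 1) iterations ahead of compute.
    """
    events = []
    prefetch = num_stages - 1
    # Prologue: fill the pipeline with the first loads.
    for i in range(min(prefetch, num_iters)):
        events.append(("load", i))
    # Steady state and epilogue: issue the next load, then compute.
    for i in range(num_iters):
        nxt = i + prefetch
        if nxt < num_iters:
            events.append(("load", nxt))
        events.append(("compute", i))
    return events


if __name__ == "__main__":
    for event in pipelined_schedule(num_iters=4, num_stages=3):
        print(event)
```

With `num_stages=1` the model degenerates to load-then-compute in every iteration, so no load overlaps with an earlier compute.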
-
### Your current environment
The output of `python collect_env.py`
```text
Collecting environment information...
INFO 09-10 05:05:28 importing.py:10] Triton not installed; certain GPU-related …
```
-
**Describe the bug**
```cuda
#include "cute/tensor.hpp"
using namespace cute;

__global__ void kernel(int *gmem) {
  int tid = threadIdx.x;
  gmem[tid * 4 + 0] = tid * 4 + 0;
  gmem[tid * …
```
-
Hi, I saw that Samsung phone messages from 2017 onward use the .smem format (surely a derivative of the previous ones). Could this format be supported as well?
Best regards
7zxkv updated
11 months ago
-
### RT-Thread Version
5.2.0 commit 2f559906d6202c27142237ab4b1d893034a5b7c3
### Hardware Type/Architectures
VEXPRESS_A9
### Develop Toolchain
GCC
### Describe the bug
### Steps to reproduce:
…
-
I see this:
```
enum class SmemSwizzleBits : uint8_t {
  DISABLE = 0,
  B32 = 1,
  B64 = 2,
  B128 = 3,
};
```
And I changed this to 0:
```
// tensor_map
utils::TmaDescriptor tensormap…
```
-
Could you add a performance tab that gives a breakdown of what is using up swap memory (using smem)?
-
**What is your question?**
I am learning to use CuTe to build an HGEMM kernel. Tested on an A10 GPU, the CuTe kernel performs well with small problem sizes such as m/n/k = 4096, but I found it's much slower …
-
As the title says: what does the `tile_to_shape` function do?