-
From what I understand, `smem_delta_a` is used to initialize the value of `delta_a` in shared memory in these lines:
https://github.com/state-spaces/mamba/blob/bc84fb1172e6dea04a7dc402118ed19985349e9…
-
Why do we need this barrier_O? What exactly is it waiting for? For each read of V/K, we have pipeline and producer_commit to control that flow. Also, the placement of barrier_O seems strange—it appear…
-
I've got a file that I'm editing with the `jinja-cpp` formatter.
I import a macro from another file with:
```
{%- from 'macros.jinja' import declare_smem_arrays with context %}
```
The mac…
-
**What is your question?**
Hello!
I am writing an int8 GEMM layer using cute.
I use `MMA_Atom` as my atom MMA, and define my tiled MMA as:
```
using TiledMma = TiledMMA< MMA_Atom_Arch, …
-
Consider the following FIRRTL:
```firrtl
FIRRTL version 4.0.0
circuit Top :
public module Top :
input clock : Clock
input raddr : UInt
input waddr : UInt
input wdata : UInt…
-
I modified the `tiled_copy.cu` example in cute/tutorial to use the following layout
```
auto tensor_shape = cute::Shape{};
auto block_shape = cute::Shape{};
...
Tensor tensor_S = make_tensor(m…
-
```
template
CUTLASS_DEVICE void
mma(Params const& mainloop_params,
MainloopPipeline pipeline_k,
MainloopPipeline pipeline_v,
PipelineState& smem_pipe_read_k…
-
**Description:**
I encountered an issue when using the CuTe library for matrix multiplication. The output result does not match the expected values, and there are unexpected odd numbers like 27 and 3…
-
**Describe the bug**
When running code on a GPU, if you have a block of shared memory and you broadcast a variable to it, the generated CUDA code performs the assignments serially. In my reproducer, the performance and b…
-
Hi, I saw that Samsung phone messages from 2017 onward use the .smem format (surely a derivative of the previous ones). Would it be possible to add support for this format as well?
Best regards