-
From what I understand, `smem_delta_a` is used to initialize the value of `delta_a` in shared memory in these lines:
https://github.com/state-spaces/mamba/blob/bc84fb1172e6dea04a7dc402118ed19985349e9…
-
Why do we need this barrier_O? What exactly is it waiting for? For each read of V/K, we have pipeline and producer_commit to control that flow. Also, the placement of barrier_O seems strange—it appear…
-
I've got a file that I'm editing with the `jinja-cpp` formatter.
I import a macro from another file with:
```
{%- from 'macros.jinja' import declare_smem_arrays with context %}
```
The mac…
-
```
template
CUTLASS_DEVICE void
mma(Params const& mainloop_params,
MainloopPipeline pipeline_k,
MainloopPipeline pipeline_v,
PipelineState& smem_pipe_read_k…
-
Currently, GDB + OpenOCD supports only debugging of RISC-V SoCs. It should not be able to directly access trace components such as the Trace Encoder and SMEM. Furthermore, it cannot parse trace data. I…
-
I modified the `tiled_copy.cu` example in cute/tutorial to use the following layout
```
auto tensor_shape = cute::Shape{};
auto block_shape = cute::Shape{};
...
Tensor tensor_S = make_tensor(m…
-
**Describe the bug**
When running code on a GPU, if you have a block of shared memory and you broadcast a variable to it, the generated CUDA code assigns it serially. In my reproducer, the performance and b…
-
Hi, I saw that Samsung phone messages from 2017 onward use the .smem format (surely a derivative of the previous ones). Is it possible for this format to be supported as well?
Best regards
-
**What is your question?**
I encountered a strange bug.
Firstly, my SMEM is divided into two regions. One part is for the mainloop (reading A and B), and the other part is for the epilogue (writing…
-
The following test fails currently:
```c++
TEST_F(MatmulSchedulerTest, SelfMappingErrorSmemEpilogue1dBias) {
NVFUSER_TEST_CUDA_ARCH_RANGE_GUARD(7, 5, 9, 0);
Fusion fusion_obj;
Fusion* fusion = …