-
### 🚀 The feature, motivation and pitch
Gemma-2 and the new Ministral models use alternating sliding-window and full-attention layers to reduce the size of the KV cache.
The KV cache is a huge inferen…
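A rough back-of-the-envelope sketch of the saving (the layer count, head/dim sizes, context length, and sliding-window length below are illustrative assumptions, not exact model specs):

```python
# KV-cache sizing: all full-attention layers vs. alternating SWA/full layers.
num_layers = 42          # assumed Gemma-2-9B-like depth
num_kv_heads = 8         # assumed GQA key/value heads
head_dim = 256
bytes_per_elem = 2       # fp16 / bf16
seq_len = 32_768
sliding_window = 4_096

def kv_bytes(tokens_cached, layers):
    # 2x for keys + values
    return 2 * layers * num_kv_heads * head_dim * bytes_per_elem * tokens_cached

full_only = kv_bytes(seq_len, num_layers)
alternating = (kv_bytes(seq_len, num_layers // 2)
               + kv_bytes(min(seq_len, sliding_window), num_layers - num_layers // 2))
print(f"all full attention:  {full_only / 2**30:.1f} GiB")   # ~10.5 GiB
print(f"alternating layers:  {alternating / 2**30:.1f} GiB") # ~5.9 GiB
```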
-
Hi Bro, it's me again.
I read your paper again and plan to share your idea, but I'm a little confused.
I find that the LAM module doesn't match the code. The structure of the LAM module in the paper shows the low-freq…
-
* The terminal process "/bin/bash '-c', '/usr/local/cuda-12.4/bin/nvcc -g -G -diag-suppress=177 -lineinfo --std=c++17 -arch=sm_75 '-D CUTE_ARCH_LDSM_SM75_ACTIVATED' -o flash_attention_cutlass_standa…
-
Is there any example code to do this? Should I generate a new BlockMask every time?
Thanks!
------------------------------
Essentially, I have a problem slicing a BlockMask. For example, if we have…
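For context, a minimal FlexAttention sketch of what "generating a new BlockMask every time" could look like (the sizes and the causal `mask_mod` here are assumptions purely for illustration):

```python
import torch
from torch.nn.attention.flex_attention import create_block_mask, flex_attention

B, H, S, D = 1, 8, 1024, 64                      # illustrative sizes
q = torch.randn(B, H, S, D, device="cuda", dtype=torch.float16)
k = torch.randn(B, H, S, D, device="cuda", dtype=torch.float16)
v = torch.randn(B, H, S, D, device="cuda", dtype=torch.float16)

def causal(b, h, q_idx, kv_idx):
    # Placeholder mask_mod; the real mask would encode the case being discussed.
    return q_idx >= kv_idx

# Rebuild the BlockMask whenever Q_LEN / KV_LEN change.
block_mask = create_block_mask(causal, B=None, H=None, Q_LEN=S, KV_LEN=S, device="cuda")
out = flex_attention(q, k, v, block_mask=block_mask)
```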
-
Update 2024/10/21
Hi, after debugging, I found `rank: 7, local_label shape: torch.Size([1, 3086]), locak_label max: 128009, locak_label min: -100, logits_shape: torch.Size([1, 3086, 128256])`. In SF…
-
### System Info
TensorRT Model Optimizer: 0.15.1
TensorRT-LLM version: 0.14.0.dev2024100100
Python version
OS: Ubuntu 22.04
CPU Arch: x86_64
Driver version: 555.42.02
CUDA Version: 12.5
### Who can…
-
As of right now, FlashAttention only supports one-dimensional local attention. I intend to implement up to three-dimensional local attention, where the effective attention mask would be a rectangular cu…
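A minimal sketch of what such a cuboid-shaped local mask could look like (the grid dimensions and per-axis window radii below are illustrative assumptions):

```python
import torch

X, Y, Z = 8, 8, 8                     # tokens laid out on a 3-D grid (assumed sizes)
wx, wy, wz = 2, 2, 2                  # per-axis local-attention radii (assumed)
S = X * Y * Z                         # flattened sequence length

idx = torch.arange(S)
coords = torch.stack((idx // (Y * Z), (idx // Z) % Y, idx % Z), dim=-1)   # (S, 3)

# Query i may attend to key j iff j lies inside the rectangular cuboid
# centred on i with per-axis radii (wx, wy, wz).
delta = (coords[:, None, :] - coords[None, :, :]).abs()                   # (S, S, 3)
mask = (delta <= torch.tensor([wx, wy, wz])).all(dim=-1)                  # (S, S) bool

# The boolean mask can then be fed to a masked attention implementation, e.g.
# torch.nn.functional.scaled_dot_product_attention(q, k, v, attn_mask=mask).
```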
-
### Description
I am trying to fine-tune Gemma 2 on TPU and got the following error:
```
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/jax/_src/compiler.py", l…
-
Please help me solve this issue.
Optimize_text_embed: 0% 0/49 [00:00
-
Hi, first of all, thank you for your work. I have a question:
global_x = Rearrange('b d (x w1) (y w2) -> b x y w1 w2 d', w1=w, w2=w)(x)
global_x = self.grid_attn(global_x)
global_…
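For reference, a small shape walk-through of that `Rearrange` (the batch size, channel count, feature-map size, and window size below are assumptions just for illustration):

```python
import torch
from einops.layers.torch import Rearrange

b, d, H, W, w = 2, 64, 16, 16, 4            # assumed sizes for the example
x = torch.randn(b, d, H, W)

to_windows = Rearrange('b d (x w1) (y w2) -> b x y w1 w2 d', w1=w, w2=w)
global_x = to_windows(x)
print(global_x.shape)                        # torch.Size([2, 4, 4, 4, 4, 64])
# The H x W map is split into an (H/w) x (W/w) arrangement of contiguous w x w
# tiles, with channels moved to the last axis before the attention layer is applied.
```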