-
In our paper we only showed results on causal language models, which use causally masked (decoder) self-attention.
If you'd like to use ALiBi for seq2seq tasks such as translation, speech or T5, o…
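To make the causal case concrete, here is a minimal sketch (plain PyTorch, not the authors' reference code) of the ALiBi bias for causally masked self-attention, assuming the geometric slope schedule from the paper and a head count that is a power of two:

```python
import torch

def alibi_slopes(n_heads: int) -> torch.Tensor:
    # Slopes 2^(-8/n), 2^(-16/n), ..., 2^(-8) for n heads (power-of-two case).
    return torch.tensor([2.0 ** (-8.0 * (h + 1) / n_heads) for h in range(n_heads)])

def causal_alibi_bias(n_heads: int, seq_len: int) -> torch.Tensor:
    slopes = alibi_slopes(n_heads)                    # (H,)
    pos = torch.arange(seq_len)
    dist = pos[None, :] - pos[:, None]                # j - i: <= 0 at/below the diagonal
    bias = slopes[:, None, None] * dist[None, :, :]   # (H, L, L): linear penalty per head
    # Mask out future positions, as in causally masked (decoder) self-attention.
    return bias.masked_fill(dist[None, :, :] > 0, float("-inf"))

# Added to the attention logits before softmax:
scores = torch.randn(2, 4, 16, 16)                    # (batch, heads, L, L)
scores = scores + causal_alibi_bias(n_heads=4, seq_len=16)
```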
-
### Please check that this issue hasn't been reported before.
- [X] I searched previous [Bug Reports](https://github.com/axolotl-ai-cloud/axolotl/labels/bug) and didn't find any similar reports.

###…
-
### 🚀 The feature, motivation and pitch
Thanks for fixing the soft-capping issue of the Gemma 2 models in the last release! I noticed there's still a [comment](https://github.com/vllm-project/vllm/bl…
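For context, Gemma 2's soft-capping is a tanh-based squashing of the logits. A rough sketch (the cap value below is illustrative, not read out of the vLLM source):

```python
import torch

def soft_cap(logits: torch.Tensor, cap: float) -> torch.Tensor:
    # Squashes logits smoothly into (-cap, cap) instead of hard clipping them.
    return cap * torch.tanh(logits / cap)

attn_logits = torch.randn(8, 16, 16) * 100.0
capped = soft_cap(attn_logits, cap=50.0)
```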
-
I want this trainer class to be implemented with unsloth. How can I do that?
```python
class CustomTrainier(Trainer):
    def __init__(self, model, args, train_dataset, eval_dataset, tokenizer, **kwargs)…
```
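A minimal sketch of how this could be wired up, assuming unsloth's `FastLanguageModel` API and a standard `transformers.Trainer` subclass; the checkpoint name, LoRA settings, and datasets below are illustrative placeholders, not taken from the issue:

```python
from unsloth import FastLanguageModel
from transformers import Trainer, TrainingArguments

# Load model and tokenizer through unsloth (checkpoint name is an assumption).
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/llama-3-8b-bnb-4bit",
    max_seq_length=2048,
    load_in_4bit=True,
)
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)

class CustomTrainier(Trainer):  # keeping the snippet's class name
    def __init__(self, model, args, train_dataset, eval_dataset, tokenizer, **kwargs):
        super().__init__(
            model=model,
            args=args,
            train_dataset=train_dataset,
            eval_dataset=eval_dataset,
            tokenizer=tokenizer,
            **kwargs,
        )
    # Custom behaviour (e.g. a compute_loss override) would go here.

# Usage, with datasets assumed to be prepared elsewhere:
# trainer = CustomTrainier(model, TrainingArguments(output_dir="out"),
#                          train_dataset, eval_dataset, tokenizer)
# trainer.train()
```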
-
I think having flash attention in `equinox` should be a critical issue, considering it is already built in natively in torch.
While XLA is supposed to (in theory) do some of the fusion, and possibly …
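For reference, the torch-side counterpart being alluded to is `scaled_dot_product_attention`, which dispatches to a fused FlashAttention-style kernel when one is available (CUDA with fp16/bf16) and otherwise falls back to the math implementation; a small sketch:

```python
import torch
import torch.nn.functional as F

device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32
q, k, v = (torch.randn(2, 8, 128, 64, device=device, dtype=dtype) for _ in range(3))
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)  # fused kernel when eligible
```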
-
Hey,
How can I get token-level contributions for the search query? This seems like one of the strong benefits of ColBERT for highlighting relevant matches, but for some reason, I can't find any implemen…
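A sketch of what such contributions look like in plain torch (this is ColBERT-style MaxSim scoring written out by hand, not the colbert library's API): each query token's contribution is its maximum similarity over the document tokens, which is what you would surface for highlighting.

```python
import torch
import torch.nn.functional as F

def per_token_contributions(q_emb, d_emb):
    # q_emb: (num_query_tokens, dim), d_emb: (num_doc_tokens, dim), both L2-normalized
    sim = q_emb @ d_emb.T                     # (num_query_tokens, num_doc_tokens)
    max_sim, best_doc_token = sim.max(dim=1)  # contribution + which doc token matched
    return max_sim, best_doc_token

q = F.normalize(torch.randn(5, 128), dim=-1)
d = F.normalize(torch.randn(30, 128), dim=-1)
contrib, matched = per_token_contributions(q, d)
print(contrib.sum())  # the usual ColBERT relevance score sums these contributions
```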
-
After printing the llama-3.2-3b model parameters:
```
Model Parameters and their Shapes:
model.embed_tokens.weight: torch.Size([128256, 3072])
model.layers.0.self_attn.q_proj.weight: torch.Size([3072, 3072])
model.layers.0.s…
```
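A minimal sketch of how such a listing can be produced (the checkpoint name is the public meta-llama repo; any causal LM checkpoint works the same way):

```python
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-3B")
print("Model Parameters and their Shapes:")
for name, param in model.named_parameters():
    print(f"{name}: {param.shape}")
```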
-
If a document is open in a plank and a stack, attention jumps from the stack back to the standalone plank.
https://github.com/user-attachments/assets/5f81904d-c0ea-46a3-89db-dd715143fd53
-
### Description
We do not have support for fp32 accumulate in sdpa family kernels. This becomes a problem when the number of chunks gets large and we see diverging PCC from ground truth. For models that …
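To illustrate why accumulate precision matters (plain PyTorch, not the kernels in question): summing many chunk contributions in bf16 drifts from the fp32 ground truth, and the drift grows with the number of chunks.

```python
import torch

torch.manual_seed(0)
chunks = [torch.randn(1024) for _ in range(256)]
ref = torch.stack(chunks).sum(dim=0)              # fp32 ground truth

acc_bf16 = torch.zeros(1024, dtype=torch.bfloat16)
acc_fp32 = torch.zeros(1024)
for c in chunks:
    acc_bf16 = acc_bf16 + c.to(torch.bfloat16)    # low-precision accumulate
    acc_fp32 = acc_fp32 + c                       # fp32 accumulate

print("bf16 accumulate max abs error:", (acc_bf16.float() - ref).abs().max().item())
print("fp32 accumulate max abs error:", (acc_fp32 - ref).abs().max().item())
```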
-
Hey, thanks for the great work. I could be wrong, but I feel like there is a disconnect between what is mentioned in the Based paper and what is used in the Figure 2 config for MQAR eval. In the paper…