-
**Describe the bug**
The new Llama 2 70B uses grouped-query attention (GQA), which causes an issue with `inject_fused_attention`.
When a user attempts to run inference on a Llama 2 70B model with inject_fused_attention=Tr…
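Until fused attention injection supports GQA, the usual workaround is to disable it when loading the model. A minimal sketch, assuming the AutoGPTQ `from_quantized` API and a hypothetical quantized model path:

```python
from auto_gptq import AutoGPTQForCausalLM

# Llama 2 70B uses grouped-query attention (GQA), which the fused
# attention kernel does not yet handle, so skip the injection.
# "TheBloke/Llama-2-70B-GPTQ" is an illustrative model id.
model = AutoGPTQForCausalLM.from_quantized(
    "TheBloke/Llama-2-70B-GPTQ",
    inject_fused_attention=False,  # avoid the GQA incompatibility
    device="cuda:0",
)
```

The trade-off is slightly slower attention, but inference runs instead of erroring out.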
-
I'm running on an Intel Arc 750 with 32 GB of RAM, and there is more than enough disk space. What could be the problem?
```
sudo docker run -d \
--device /dev/dri \
-v /opt/ai/models/huggingface:/root…
-
# Trending repositories for C#
1. [**Jackett / Jackett**](https://github.com/Jackett/Jackett)
__API Support for your favorite torrent trackers__
13 stars today | 11,225 s…
-
I used the latest `tensorrtllm_backend` and `TensorRT-LLM` from the main branch to build the Docker images.
`https://github.com/triton-inference-server/tensorrtllm_backend/tree/main#option-3-build-via-docker`
…
-
## Currently:
I noticed that data import currently accepts only `json` files, as seen in `Import.tsx`.
## My Thoughts For Scaling:
To scale this feature, we could go further and su…
-
We encourage you to join the [MLX Community](https://huggingface.co/mlx-community) on Hugging Face 🤗 and upload new MLX converted models and versions of existing models.
awni updated 7 months ago
-
A couple of issues with the new tensor parallelism implementation!
1) Tensor parallelism doesn't appear to respect the absence of flash attention, even via the `-nfa` flag. It also doesn't document flash att…
-
### Summary of the issue
First of all, thanks for the awesome effort put into this code-evaluation package; I highly appreciate it. However, right now, what I see is that it is integrated with just Hugg…
-
Hello, and thank you for the work you are doing.
Does llama-adapter-v2 support llama2, or does it only work with llama?
I am able to pretrain with the llama2 weights, but the inference results do no…
-
### Your `minimal.lua` config
Same as `minimal.lua`, but with:
``` lua
strategies = { -- Change the adapters as required
chat = { adapter = "ollama" },
inline = { adapter = "ollama" },
…