-
A couple of issues with the new tensor parallelism implementation!
1) Tensor parallelism doesn't appear to respect the absence of flash attention, even when it is disabled via the -nfa flag. It also doesn't document flash att…
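For reference, a minimal sketch (not the project's actual code) of the kind of gating the report implies is missing, assuming a hypothetical `no_flash_attn` flag that the tensor-parallel path would need to honor by falling back to plain scaled-dot-product attention:

```python
import torch
import torch.nn.functional as F

def attention(q, k, v, no_flash_attn: bool = False):
    """Pick an attention path based on a hypothetical no_flash_attn flag.

    The report suggests the tensor-parallel code ignores this flag and
    always takes the fused/flash-attention route.
    """
    if no_flash_attn:
        # Plain math fallback: softmax(QK^T / sqrt(d)) V
        scores = q @ k.transpose(-2, -1) / (q.size(-1) ** 0.5)
        return torch.softmax(scores, dim=-1) @ v
    # Fused kernel path (uses flash attention when available)
    return F.scaled_dot_product_attention(q, k, v)
```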
-
Hi
I am trying to run the API server with tensor parallelism across either 2 or 4 GPUs, using the following command:
```bash
python -m slora.server.api_server --max_total…
```
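One sanity check worth doing before launching across 2 or 4 GPUs is confirming that the process actually sees the devices you intend to use (e.g. after setting `CUDA_VISIBLE_DEVICES`). A minimal sketch:

```python
import os
import torch

# Restrict the server to the GPUs intended for tensor parallelism;
# this must be set before CUDA is initialized in the process.
os.environ.setdefault("CUDA_VISIBLE_DEVICES", "0,1")

n = torch.cuda.device_count()
print(f"visible GPUs: {n}")
for i in range(n):
    print(i, torch.cuda.get_device_name(i))
assert n in (2, 4), "tensor-parallel world size must match visible GPUs"
```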
-
### Prerequisites
- [X] I am running the latest code. Mention the version if possible as well.
- [X] I carefully followed the [README.md](https://github.com/ggerganov/llama.cpp/blob/master/README.…
-
With 2x 3090s, does the recently added tensor parallelism use NVLink in any manner? Thanks!
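For what it's worth, you can check whether NVLink is present and active between the two cards yourself, e.g. with `nvidia-smi topo -m` on the command line, or programmatically via NVML. A small sketch using the `nvidia-ml-py` (`pynvml`) bindings:

```python
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

# Query each potential NVLink lane on GPU 0; an active lane to the peer
# 3090 indicates the bridge is installed and usable.
for link in range(pynvml.NVML_NVLINK_MAX_LINKS):
    try:
        state = pynvml.nvmlDeviceGetNvLinkState(handle, link)
        print(f"link {link}: {'active' if state else 'inactive'}")
    except pynvml.NVMLError:
        break  # no more links on this device

pynvml.nvmlShutdown()
```

Whether the tensor-parallel all-reduces actually travel over NVLink then depends on the communication backend detecting peer-to-peer access (e.g. NCCL, with `NCCL_P2P_DISABLE` unset).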
-
**Is your feature request related to a problem? Please describe.**
I am aware that PyTriton already has an example for using PyTriton with tensorrt_llm. But I noticed that the example only support s…
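For context, a PyTriton deployment is essentially a Python callable bound to the Triton server. A minimal sketch (the inference body and tensor names here are placeholders, not the tensorrt_llm example's actual code):

```python
import numpy as np
from pytriton.decorators import batch
from pytriton.model_config import ModelConfig, Tensor
from pytriton.triton import Triton

@batch
def infer_fn(prompts):
    # Placeholder body: a real deployment would run the tensorrt_llm
    # engine here and return its generations instead of echoing input.
    return {"outputs": prompts}

with Triton() as triton:
    triton.bind(
        model_name="llm",
        infer_func=infer_fn,
        inputs=[Tensor(name="prompts", dtype=np.bytes_, shape=(1,))],
        outputs=[Tensor(name="outputs", dtype=np.bytes_, shape=(1,))],
        config=ModelConfig(max_batch_size=8),
    )
    triton.serve()
```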
-
Hello,
I am encountering an issue related to my understanding of tensor parallelism in the PIM (Processing In Memory) model.
Specifically, I noticed a discrepancy in the Key-Value (KV) cache all…
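As a point of reference for the allocation math: under tensor parallelism the KV heads are sharded across devices, so the per-device cache should shrink by the TP degree. A back-of-the-envelope sketch (the sizes are illustrative, not taken from the PIM model):

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, batch,
                   dtype_bytes=2, tp=1):
    """Per-device KV cache size: 2 tensors (K and V) per layer,
    with the KV heads split evenly across tp devices."""
    return 2 * layers * (kv_heads // tp) * head_dim * seq_len * batch * dtype_bytes

# Illustrative 7B-class config: 32 layers, 32 KV heads, head_dim 128, fp16
full = kv_cache_bytes(32, 32, 128, seq_len=4096, batch=1)
tp4 = kv_cache_bytes(32, 32, 128, seq_len=4096, batch=1, tp=4)
print(f"tp=1: {full / 2**30:.2f} GiB per device")  # 2.00 GiB
print(f"tp=4: {tp4 / 2**30:.2f} GiB per device")   # 0.50 GiB
```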
-
### Your current environment
I have a server with only one NVLink connection, so I need to use pipeline parallelism and tensor parallelism within a single node to improve its performance. I would lik…
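For reference, recent vLLM releases let you combine the two within a node by setting both parallel sizes. A minimal sketch (the model name is just an example):

```python
from vllm import LLM, SamplingParams

# 4 GPUs total: 2-way tensor parallel within each pipeline stage,
# 2 pipeline stages. Requires a vLLM version with offline
# pipeline-parallel support.
llm = LLM(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    tensor_parallel_size=2,
    pipeline_parallel_size=2,
)
out = llm.generate(["Hello"], SamplingParams(max_tokens=16))
print(out[0].outputs[0].text)
```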
-
With the recent advent of large models (take Llama 3.1 405b, for example!), distributed inference support is a must! We currently support naive device mapping, which works by allowing a combination of…
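To make the baseline concrete, naive device mapping pins contiguous blocks of layers to devices and moves activations between them at the block boundaries. A rough sketch of the idea (not the actual implementation):

```python
import torch
import torch.nn as nn

class NaiveDeviceMap(nn.Module):
    """Split a stack of layers across devices; each block runs on its
    own device and activations hop between devices sequentially, with
    no overlap (which is why tensor parallelism is more attractive)."""

    def __init__(self, layers, devices):
        super().__init__()
        per = (len(layers) + len(devices) - 1) // len(devices)
        self.devices = devices
        self.blocks = nn.ModuleList(
            nn.Sequential(*layers[i * per:(i + 1) * per]).to(dev)
            for i, dev in enumerate(devices)
        )

    def forward(self, x):
        for block, dev in zip(self.blocks, self.devices):
            x = block(x.to(dev))
        return x
```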
-
### Bug description
For some reason, the tensor-parallel implementation generates nonsensical outputs:
```
⚡ python-api-tensor-parallel ~/litgpt litgpt generate_tp checkpoints/microsoft/phi-2
…
```
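One classic way tensor-parallel generation turns to gibberish is a missing reduction: each rank holds only a partial matmul result, and the partials are garbage until they are summed. A self-contained sketch of that decomposition (illustrative, not litgpt's code):

```python
import torch

torch.manual_seed(0)
x = torch.randn(1, 8)   # activations
w = torch.randn(8, 4)   # full weight of a linear layer

full = x @ w                           # single-device reference

# Row-parallel split: each "rank" gets half the input features
x0, x1 = x.chunk(2, dim=1)
w0, w1 = w.chunk(2, dim=0)
partial0, partial1 = x0 @ w0, x1 @ w1  # per-rank partial products

# Correct only after the all-reduce (here: an explicit sum)
print(torch.allclose(full, partial0 + partial1, atol=1e-6))  # True
print(torch.allclose(full, partial0, atol=1e-6))             # False: unreduced
```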
-
### 🚀 The feature, motivation and pitch
I don't know if it's feasible or worthwhile to merge [this](https://github.com/IBM/vllm/tree/9855b99502c7537db5ef018129e603650800ac46), as maybe the trees ar…