-
### System Info
Latest TGI Docker image
### Information
- [X] Docker
- [ ] The CLI directly
### Tasks
- [X] An officially supported command
- [ ] My own modifications
### Reproduction
1. Use …
-
We are currently trying to apply torchtitan to MoE models. MoE models require using grouped_gemm (https://github.com/fanshiqing/grouped_gemm). GroupedGemm ops basically follow the same rule as in Column…
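For reference, the grouped GEMM at the heart of an MoE layer computes one independent GEMM per expert over a variable-sized group of tokens; the linked library fuses this into a single op (its `ops.gmm` entry point, if I read the repo correctly). A minimal plain-PyTorch sketch of the semantics, with hypothetical shapes:
```python
import torch

def grouped_gemm_reference(x, w, batch_sizes):
    """Reference semantics of a grouped GEMM for MoE.

    x: (sum(batch_sizes), hidden) tokens already sorted by expert
    w: (num_experts, hidden, ffn) one weight matrix per expert
    batch_sizes: (num_experts,) number of tokens routed to each expert
    """
    outs, start = [], 0
    for e, n in enumerate(batch_sizes.tolist()):
        outs.append(x[start:start + n] @ w[e])  # one GEMM per expert group
        start += n
    return torch.cat(outs, dim=0)

# Hypothetical shapes for illustration.
x = torch.randn(10, 16)
w = torch.randn(3, 16, 32)
sizes = torch.tensor([4, 2, 4])
y = grouped_gemm_reference(x, w, sizes)
print(y.shape)  # torch.Size([10, 32])
```
Since each expert's weight can be sharded along its output (ffn) dimension independently, the op parallelizes like a column-parallel linear under tensor parallelism.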
-
Chu has merged inference code for models quantized by QuIP# into vLLM (https://github.com/chu-tianxiang/vllm-gptq), but currently the inference code only supports tensor_parallel_size=1. The reason is "Ha…
-
Hello,
Thank you for the great work.
I was wondering whether scattermoe supports tensor parallelism?
Thank you!
-
Hi,
I was running Flan-T5 XXL with CTranslate2 and observed completely different results when running with tensor parallelism.
**To convert from HF to CT2:**
```bash
ct2-transformers-converter -…
```
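For comparison, here is a minimal sketch of the loading side with tensor parallelism enabled, assuming the converter wrote the model to `flan-t5-xxl-ct2` and using the `tensor_parallel` flag documented for recent CTranslate2 releases; per those docs, such a script is launched under MPI (e.g. `mpirun -np 2 python run.py`):
```python
import ctranslate2
import transformers

# Hypothetical output dir from ct2-transformers-converter.
translator = ctranslate2.Translator(
    "flan-t5-xxl-ct2",
    device="cuda",
    tensor_parallel=True,  # shard weights across the GPUs of this MPI job
)

tokenizer = transformers.AutoTokenizer.from_pretrained("google/flan-t5-xxl")
tokens = tokenizer.convert_ids_to_tokens(
    tokenizer.encode("Translate to German: How are you?")
)
result = translator.translate_batch([tokens])
print(tokenizer.decode(tokenizer.convert_tokens_to_ids(result[0].hypotheses[0])))
```
Comparing the single-GPU output of this script against the MPI run is one way to narrow down where the divergence appears.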
-
### Feature request
Being able to split models across multiple GPUs, as the vLLM/Aphrodite engines do for LLMs.
### Motivation
It would be extremely helpful to be able to split larger models into multip…
-
### System Info
I am experimenting with TRT-LLM and `flan-t5` models. My simple goal is to build engines with different configurations and tensor parallelism, then review performance. I have a DGX syst…
-
### 🚀 The feature, motivation and pitch
I am trying to run a 70B model on a node with 3x A100-80GB.
2x A100-80GB does not provide enough VRAM to run the model, and when I try to run vLLM with tensor p…
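For context, vLLM shards attention heads across tensor-parallel ranks, so `tensor_parallel_size` must evenly divide the model's attention head count; Llama-2-70B has 64 query heads, which rules out a tensor-parallel size of 3. A minimal sketch (model name assumed) of a configuration that does satisfy the constraint:
```python
from vllm import LLM, SamplingParams

# Llama-2-70B has 64 attention heads; tensor_parallel_size must divide that,
# so 1, 2, 4, or 8 ranks work for pure tensor parallelism, but 3 does not.
llm = LLM(model="meta-llama/Llama-2-70b-hf", tensor_parallel_size=2)
outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```
(On this node a two-way split would still exceed the reported VRAM, per the issue above; the sketch only illustrates the divisibility constraint.)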
-
(Question; not request)
This came up when I worked on https://github.com/NVIDIA/Fuser/pull/2450. FusionExecutor (as well as MultiDeviceExecutor) has to allocate a tensor even when the device is out…
-
### Question Validation
- [X] I have searched both the documentation and Discord for an answer.
### Question
I want to use the semantic splitter from LlamaIndex for document segmentation. Is…
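In case it helps, a minimal sketch of the documented `SemanticSplitterNodeParser` usage (the data path and the choice of embedding model are assumptions):
```python
from llama_index.core import SimpleDirectoryReader
from llama_index.core.node_parser import SemanticSplitterNodeParser
from llama_index.embeddings.openai import OpenAIEmbedding

# Load documents from a local folder (path is hypothetical).
documents = SimpleDirectoryReader("./data").load_data()

# Split where embedding similarity between adjacent sentence groups drops.
splitter = SemanticSplitterNodeParser(
    buffer_size=1,                       # sentences grouped per comparison window
    breakpoint_percentile_threshold=95,  # higher => fewer, larger chunks
    embed_model=OpenAIEmbedding(),
)
nodes = splitter.get_nodes_from_documents(documents)
print(len(nodes), nodes[0].get_content()[:80])
```
The splitter embeds adjacent sentence groups and starts a new chunk where their similarity drops below the chosen percentile, so `breakpoint_percentile_threshold` is the main knob for chunk granularity.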