-
### System Info
- GPU: 4 × 3090 (24 GB)
- TensorRT-LLM version: 0.7.1, built from the source released last week
- TensorRT version: 9.2.0.post12.dev5
- NVIDIA Driver: Driver Version: 535.54.03 CUDA Versio…
-
# +34% higher throughput?
TL;DR: Watching vLLM has been really fascinating! @oleitersdorf and I investigated whether we could further accelerate vLLM by profiling its performance with GPU counters. Curr…
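The write-up is truncated here, but as a rough, hypothetical sketch of the kind of GPU-side profiling it describes, the snippet below wraps a vLLM generation call in `torch.profiler` to list the dominant CUDA kernels. The model name and prompt are placeholders, and PyTorch kernel timings stand in for whatever hardware counters the authors actually collected.

```
# Hypothetical sketch: profile CUDA kernel time around a vLLM generation call.
# Requires vLLM and a CUDA GPU; the model and prompt are placeholders.
from torch.profiler import profile, ProfilerActivity
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")  # small placeholder model
params = SamplingParams(max_tokens=64)

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    llm.generate(["Profiling vLLM with GPU counters"], params)

# Show which CUDA kernels dominate the run.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=20))
```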
-
Hello.
[This](https://github.com/predibase/lorax/blob/309618cdb4cbc1807a6ce837a9f49062896f027b/server/lorax_server/utils/layers.py#L522) check holds when the adapter's rank is at least 8 × num_shards (…
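The parenthetical is cut off, but the stated condition itself is easy to illustrate. Below is a hypothetical helper (not the actual LoRAX code at the linked line) that encodes it: the check passes only when the adapter rank is at least 8 × num_shards, i.e. each tensor-parallel shard gets a rank slice of at least 8.

```
# Hypothetical illustration of the condition described above,
# not the actual LoRAX implementation.
def rank_check_holds(adapter_rank: int, num_shards: int) -> bool:
    """Check passes when every shard gets a rank slice of at least 8."""
    return adapter_rank >= 8 * num_shards

# Example: a rank-16 adapter passes with 2 shards but not with 4.
assert rank_check_holds(16, 2)
assert not rank_check_holds(16, 4)
```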
-
Is there a way to accelerate inference of large models across multiple cores? The current approach is to distribute an operator's work, such as GEMM and GEMV, across multiple cores, or to split the mode…
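The question is cut off, but as a minimal sketch of the first idea mentioned (splitting one operator's work across cores), here is a hypothetical row-partitioned matrix-vector product using a thread pool. It is only an illustration; real runtimes do this inside optimized GEMM/GEMV kernels rather than in Python.

```
# Hypothetical sketch: split a GEMV across CPU cores by partitioning the
# rows of the weight matrix. Illustrative only.
from concurrent.futures import ThreadPoolExecutor
import numpy as np

def parallel_gemv(W: np.ndarray, x: np.ndarray, num_workers: int = 4) -> np.ndarray:
    # Each worker computes the output entries for one contiguous block of rows.
    blocks = np.array_split(np.arange(W.shape[0]), num_workers)
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        partials = pool.map(lambda rows: W[rows] @ x, blocks)
    return np.concatenate(list(partials))

W = np.random.randn(4096, 4096).astype(np.float32)
x = np.random.randn(4096).astype(np.float32)
np.testing.assert_allclose(parallel_gemv(W, x), W @ x, rtol=1e-4, atol=1e-4)
```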
-
Reopening the issue about `gemma-7b` prediction values.
This issue is still not solved: the perplexity values of gemma-2b and gemma-7b are very different, with gemma-7b much worse (near random). Wikitext-v2 token pe…
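The numbers are cut off, but for context, Wikitext-2 token perplexity is typically computed along the lines below. This is a generic sketch with Hugging Face transformers, not the reporter's exact evaluation; the model id, dtype, and window size are assumptions.

```
# Hypothetical sketch of Wikitext-2 token perplexity for a causal LM.
# Not the reporter's exact setup; model id, dtype, and window size are assumptions.
import math
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "google/gemma-2b"  # swap in gemma-7b to compare the two models
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16).to("cuda").eval()

text = "\n\n".join(load_dataset("wikitext", "wikitext-2-raw-v1", split="test")["text"])
ids = tok(text, return_tensors="pt").input_ids.to("cuda")

window = 2048
nll_sum, n_tokens = 0.0, 0
for start in range(0, ids.size(1), window):
    chunk = ids[:, start:start + window]
    if chunk.size(1) < 2:
        break
    with torch.no_grad():
        # labels=chunk makes the model compute the shifted next-token loss.
        loss = model(chunk, labels=chunk).loss
    nll_sum += loss.item() * (chunk.size(1) - 1)
    n_tokens += chunk.size(1) - 1

print(f"{model_id} Wikitext-2 perplexity: {math.exp(nll_sum / n_tokens):.2f}")
```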
-
After looking at the code, neither `moe` nor `dmoe` supports tensor model parallelism.
@tgale96
-
## Goals
Following https://github.com/privacy-scaling-explorations/halo2curves/pull/86,
MSM and FFT have been moved to halo2curves, per the rationale in https://github.com/privacy-scaling-explora…
-
### Describe the issue
Issue: Multi-GPU inference is broken with LLaVA 1.6; the same command works fine with the model liuhaotian/llava-v1.5-13b.
Command:
CUDA_VISIBLE_DEVICES=0,1 python -m llava.se…
-
When I use llama.cpp to run inference with my Smaug-34B model, there is no output when the input prompt is around 150 tokens, but output is normal when it is reduced to about 100 tokens.
-
Hi.
If a tensor is created in the main thread, computing its gradients panics when the operations happen in a different thread.
Example:
```
use std::{thread::sleep, time::Duration};
use burn::{backen…