-
What are some of the intended use cases for the 0.5B model?
There are not many other similarly sized models, nor is there much hype around them, though the general audience seems to love th…
-
### Your current environment
[root@localhost wangjianqiang]# python -m vllm.entrypoints.openai.api_server --model /root/wangjianqiang/deepseek-moe/deepseek-coder-33b-base/ --tensor-parallel-size 8 …
-
Pull this branch of vLLM: https://github.com/fyabc/vllm/tree/add_qwen2_vl_new
Environment: Ubuntu + Python 3.10
Error message:
python -m vllm.entrypoints.openai.api_server --served-model-name Qwen2-VL-7B-Instruct --model /home/…
-
I want to develop some features based on SGLang to improve the performance of SRT.
1. A new ControllerMulti scheduler that can more accurately identify the resource utilization of each instance a…
-
- [ ] [At the Intersection of LLMs and Kernels - Research Roundup](https://charlesfrye.github.io/programming/2023/11/10/llms-systems.html)
# At the Intersection of LLMs and Kernels - Research Roundup…
-
### Your current environment
The output of `python collect_env.py`
```text
PyTorch version: 2.4.0+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A…
-
I am starting this issue to do more thorough benchmarking than the [notebooks](/notebooks) used in the repo.
What should we measure?
1. Time for generation
2. Max GPU VRAM
3. Accuracy
Hardw…
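For the first two metrics above, a minimal sketch of how they could be measured is shown below. The `benchmark` helper and its parameters are hypothetical, not taken from the repo's notebooks; peak GPU VRAM would additionally be read via `torch.cuda.max_memory_allocated()` after `torch.cuda.reset_peak_memory_stats()` when running on CUDA.

```python
import time

def benchmark(fn, *args, warmup=1, iters=3, **kwargs):
    """Hypothetical timing helper: run `fn` a few times and report the best wall-clock time.

    For GPU workloads, call torch.cuda.synchronize() before each timestamp and
    read torch.cuda.max_memory_allocated() afterwards for peak VRAM.
    """
    # Warm-up runs to exclude one-time costs (compilation, cache population).
    for _ in range(warmup):
        fn(*args, **kwargs)
    times = []
    out = None
    for _ in range(iters):
        t0 = time.perf_counter()
        out = fn(*args, **kwargs)
        times.append(time.perf_counter() - t0)
    # Minimum over iterations is a common low-noise latency estimate.
    return out, min(times)
```

Accuracy would be measured separately against a fixed evaluation set, since it depends on the task rather than the runtime.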
-
### Your current environment
```text
Collecting environment information...
PyTorch version: 2.3.0+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A
…
-
### Motivation
KV cache hit rate is probably the biggest performance factor for me, and I recently read:
https://research.character.ai/optimizing-inference/
> To solve this problem, we deve…
-
Thanks for your great work! Could you please add our paper "Parallel Speculative Decoding with Adaptive Draft Length" to your resources? I have attached a link to our blog and codebase below for yo…