-
The Yuan 2.0-M32 model R&D team analyzed the current mainstream quantization schemes in depth, weighing model compression gains against accuracy loss, and ultimately adopted the GPTQ quantization method, with AutoGPTQ as the quantization framework.
Model: Yuan2-M32-HF-I…
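For context, a minimal sketch of GPTQ quantization through AutoGPTQ; the model id, calibration text, and 4-bit/group-size settings below are illustrative assumptions, not the team's actual configuration:
```python
# Minimal AutoGPTQ quantization sketch. The model id and calibration
# text are placeholders; real calibration uses many more samples.
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

pretrained_dir = "IEITYuan/Yuan2-M32-hf"  # hypothetical HF repo id
quantized_dir = "Yuan2-M32-GPTQ"

quantize_config = BaseQuantizeConfig(
    bits=4,          # 4-bit weight quantization
    group_size=128,  # quantize weights in groups of 128 columns
    desc_act=False,  # skip activation-order reordering for faster inference
)

tokenizer = AutoTokenizer.from_pretrained(pretrained_dir, trust_remote_code=True)
model = AutoGPTQForCausalLM.from_pretrained(
    pretrained_dir, quantize_config, trust_remote_code=True
)

# GPTQ estimates per-layer quantization error on a small calibration set.
examples = [tokenizer("Sample calibration sentence for GPTQ.")]
model.quantize(examples)
model.save_quantized(quantized_dir)
```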
-
The numbers provided are in terms of memory usage. It would be nice to also provide numbers in terms of energy consumption. That is, the current numbers show that an LLM inference can cost twice the energy use…
-
/kind feature
**Describe the solution you'd like**
Hoping to add [https://github.com/xorbitsai/inference](https://github.com/xorbitsai/inference) as a KServe Hugging Face LLM serving runtime.
Xor…
-
### System Info
- tensorrtllm_backend built using Dockerfile.trt_llm_backend
- main branch TensorRT-LLM (0.13.0.dev20240813000)
- 8xH100 SXM
- Driver Version: 535.129.03
- CUDA Version: 12.5
…
-
Workgroups are temporary, time-bounded groups; this project should specify the owning SIG and be listed as a subproject of one of the SIGs in the metadata in github.com/kubernetes/community. You can id…
-
## What are the problems?(screenshots or detailed error messages)
Is there a profiling tool available (something profiler-related), or is the only option to use a tool like Nsight and inspect individual operator performance ourselves?
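As a point of comparison, a minimal sketch of per-operator timing with PyTorch's built-in `torch.profiler`, assuming a PyTorch-based workload; the layer and input below are placeholders:
```python
# Minimal torch.profiler sketch for per-operator GPU timing; the layer
# and input stand in for the real workload.
import torch
from torch.profiler import profile, ProfilerActivity

model = torch.nn.Linear(4096, 4096).cuda()
x = torch.randn(8, 4096, device="cuda")

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    model(x)

# Print the ten operators with the highest total CUDA time.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
```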
## What are the types of GPU/CPU you are using?
GPU: A100-80G-SXM4
## What…
-
### Before submitting your bug report
- [ ] I believe this is a bug. I'll try to join the [Continue Discord](https://discord.gg/NWtdYexhMs) for questions
- [ ] I'm not able to find an [open issue](ht…
-
### Your current environment
```text
vLLM server latest, as of July 17th 2024: vllm/vllm-openai:latest
```
### 🐛 Describe the bug
I'm trying to get the log probability of the last token (Y…
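For context, a minimal sketch of requesting per-token log probabilities from a vLLM OpenAI-compatible server; the base URL, model name, and prompt are assumptions, not the reporter's setup:
```python
# Query a vLLM OpenAI-compatible server for logprobs; the endpoint and
# model name are placeholders for the actual deployment.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

completion = client.completions.create(
    model="meta-llama/Llama-2-7b-hf",  # whatever model the server loads
    prompt="The capital of France is",
    max_tokens=1,
    logprobs=5,  # top-5 log probabilities for each generated token
    echo=True,   # also return logprobs for the prompt tokens
)
print(completion.choices[0].logprobs)
```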
-
KServe is a community-driven open-source project aiming to deliver a cloud-native, scalable, extensible serverless ML inference platform. It provides an open standard control and data plane for servi…
-
## Description
vLLM sampling parameters include a [richer set of values](https://github.com/vllm-project/vllm/blob/c9b45adeeb0e5b2f597d1687e0b8f24167602395/vllm/sampling_params.py), among which `lo…