-
```
>>> import mii
>>> mii.pipeline("Qwen/Qwen1.5-14B-Chat", quantization_mode='wf6af16')
Fetching 14 files: 100%|███████████████████████████████████████████████████████████████████████████| 14/14 [00:00
```
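For context, a minimal sketch of how the pipeline loaded above would typically be used with DeepSpeed-MII; the prompt and max_new_tokens values are illustrative assumptions, not from the report:

```python
import mii

# Load with FP6-weight / FP16-activation quantization, as in the transcript.
pipe = mii.pipeline("Qwen/Qwen1.5-14B-Chat", quantization_mode='wf6af16')

# Hypothetical prompt and generation length, for illustration only.
response = pipe(["DeepSpeed is"], max_new_tokens=64)
print(response)
```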
-
# Platform (include target platform as well if cross-compiling):
CentOS 7.6
# GitHub Version:
MNN-2.9.1
If you downloaded the ZIP package directly, please provide the download date and the git version from the archive comment (viewable via `7z l <path to zip>`…
-
Faced OOM on Intel Arc with 6k input / 512 output tokens when serving with vLLM. Models: ChatGLM3-6B and Qwen1.5-32B on 4 Arc GPUs.
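A minimal sketch of vLLM engine settings that are commonly tuned to avoid OOM at long prompts; the context cap and memory fraction are illustrative assumptions, and on Intel Arc this would typically run through the IPEX-LLM port of vLLM, whose Python surface mirrors upstream:

```python
from vllm import LLM, SamplingParams

# Cap context to what the workload needs (6k in + 512 out) and leave
# KV-cache head-room; both values here are illustrative, not prescriptive.
llm = LLM(
    model="Qwen/Qwen1.5-32B",     # model from the report
    tensor_parallel_size=4,       # 4 Arc GPUs, as reported
    max_model_len=6656,           # assumed cap: 6k prompt + 512 output
    gpu_memory_utilization=0.85,  # assumed; lower further if OOM persists
)

outputs = llm.generate(["..."], SamplingParams(max_tokens=512))
```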
-
For example, baichuan-7b-v1 is currently free for a limited time.
```
{
  "models": [
    "qwen-long",
    "qwen-turbo",
    "qwen-plus",
    "qwen-max",
    …
```
-
### Describe the bug
qwen1_5-14b-chat-q8_0.gguf needs about 17 GB of GPU memory; I have two T4s, each with 16 GB of GPU memory.
When launching, it failed with "failed to create llama context".
If I change to q4, which needs about 8…
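Since the q8_0 weights exceed one T4 but fit across both, a minimal sketch of splitting a GGUF model over two GPUs with llama-cpp-python; the split ratios and context size are illustrative assumptions:

```python
from llama_cpp import Llama

# Offload all layers and split tensors across both T4s, so the ~17 GB
# model does not have to fit on a single 16 GB card.
llm = Llama(
    model_path="qwen1_5-14b-chat-q8_0.gguf",  # file from the report
    n_gpu_layers=-1,          # offload every layer to GPU
    tensor_split=[0.5, 0.5],  # assumed even split across the two GPUs
    n_ctx=4096,               # assumed context window
)

print(llm("Hello", max_tokens=32)["choices"][0]["text"])
```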
-
Got only 10 parallel requests on 2 Arc GPUs with a Qwen1.5 model (1024 input / 512 output). Could you please improve the performance?
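For reference, a minimal sketch of how such a parallel-request load is typically driven against an OpenAI-compatible serving endpoint; the URL, model name, and concurrency level are illustrative assumptions:

```python
import asyncio
from openai import AsyncOpenAI

# Assumed endpoint of the serving instance under test.
client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="none")

async def one_request(i: int) -> None:
    # Roughly the reported shape: long prompt in, 512 tokens out.
    resp = await client.completions.create(
        model="Qwen1.5-14B-Chat",        # assumed model name
        prompt="benchmark prompt " * 256,
        max_tokens=512,
    )
    print(i, len(resp.choices[0].text))

async def main() -> None:
    # Fire 10 requests concurrently, matching the reported parallelism.
    await asyncio.gather(*(one_request(i) for i in range(10)))

asyncio.run(main())
```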
-
### System Info
env: NVIDIA-SMI 550.54.15, Driver Version 550.54.15, CUDA Version 12.4, 8× V100 16 GB
docker image: nvcr.io/nvidia/tritonserver:24.02-trtllm-python-py3
### …
-
Command-line bug: the PyTorch-format model is executed by default. When testing qwen1.5-7b or qwen1.5-32b AWQ/GPTQ quantized models, `--model-format awq` or `--model-format gptq` has no effect; the PyTorch-format model is launched by default.
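Assuming this refers to the LMDeploy CLI (where `--model-format` lives), a minimal sketch of pinning the weight format through the Python API instead; the model path is an illustrative assumption:

```python
from lmdeploy import pipeline, TurbomindEngineConfig

# Request the AWQ weight format explicitly rather than relying on the
# CLI flag; the model path here is assumed for illustration.
pipe = pipeline(
    "Qwen/Qwen1.5-7B-Chat-AWQ",
    backend_config=TurbomindEngineConfig(model_format="awq"),
)
print(pipe(["Hello"]))
```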
-
Model: Qwen1.5-14B-Chat-GPTQ-Int4
New xinference version: v0.12.3, container: docker pull xprobe/xinference:v0.12.3
Old xinference version: v0.8.5, container: docker pull xprobe/xinference:v0.8.5
I am running the Docker container versions:
Why does xinference v0.…
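For reproduction context, a minimal sketch of launching the model on a running xinference instance via its Python client; argument names follow the documented client API, but the exact required fields vary between the two versions compared above:

```python
from xinference.client import Client

# Assumed default host/port of the xinference container.
client = Client("http://localhost:9997")

# Launch the GPTQ-Int4 model from the report; values are illustrative.
model_uid = client.launch_model(
    model_name="qwen1.5-chat",
    model_format="gptq",
    model_size_in_billions=14,
    quantization="Int4",
)
print(model_uid)
```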