Qwen2-57B-A14B-Instruct-GPTQ-Int4 inference is extremely slow

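Two 4090D GPUs, CUDA 12.4. Running the official example code, even a very simple inference takes two to three minutes, and GPU utilization stays at about 70% the whole time.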
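To turn "two to three minutes" into a number that can be compared across setups, it may help to report generation throughput in new tokens per second. Below is a minimal sketch of such a measurement; `measure_throughput` and the stand-in `fake_generate` are hypothetical names used only for illustration, not part of the official Qwen example code.

```python
import time
from typing import Callable, Dict, Sequence


def measure_throughput(generate: Callable[[], Sequence[int]], prompt_len: int) -> Dict[str, float]:
    """Time one generation call and report new tokens per second.

    `generate` is any zero-argument callable returning the full output
    token ids (prompt + completion); `prompt_len` is the prompt length.
    """
    start = time.perf_counter()
    output_ids = generate()
    elapsed = time.perf_counter() - start
    new_tokens = len(output_ids) - prompt_len
    rate = new_tokens / elapsed if elapsed > 0 else float("inf")
    return {"elapsed_s": elapsed, "new_tokens": new_tokens, "tokens_per_s": rate}


def fake_generate() -> Sequence[int]:
    # Stand-in for a real model call (assumption): pretend a 10-token
    # prompt produced 20 new tokens after a short delay.
    time.sleep(0.01)
    return list(range(30))


stats = measure_throughput(fake_generate, prompt_len=10)
print(f"{stats['new_tokens']} new tokens in {stats['elapsed_s']:.3f}s "
      f"({stats['tokens_per_s']:.1f} tokens/s)")
```

With a real Transformers model, the callable would presumably wrap something like `model.generate(**inputs, max_new_tokens=512)[0].tolist()`, with `prompt_len` taken from `inputs.input_ids.shape[1]`. Posting the resulting tokens/s figure together with the prompt and `max_new_tokens` makes the slowdown easier to reproduce.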
Package                           Version
--------------------------------- ------------
accelerate                        0.31.0
addict                            2.4.0
aiohttp                           3.9.5
aiosignal                         1.3.1
aliyun-python-sdk-core            2.15.1
aliyun-python-sdk-kms             2.16.3
annotated-types                   0.6.0
anyio                             4.3.0
async-timeout                     4.0.3
attrs                             23.2.0
auto_gptq                         0.7.1
certifi                           2024.2.2
cffi                              1.16.0
charset-normalizer                3.3.2
click                             8.1.7
cloudpickle                       3.0.0
cmake                             3.29.3
coloredlogs                       15.0.1
crcmod                            1.7
cryptography                      42.0.8
datasets                          2.18.0
dill                              0.3.8
diskcache                         5.6.3
distro                            1.9.0
dnspython                         2.6.1
einops                            0.8.0
email_validator                   2.1.1
exceptiongroup                    1.2.1
fastapi                           0.111.0
fastapi-cli                       0.0.3
filelock                          3.14.0
flash-attn                        2.5.9.post1
frozenlist                        1.4.1
fsspec                            2024.2.0
gast                              0.5.4
gekko                             1.1.1
h11                               0.14.0
httpcore                          1.0.5
httptools                         0.6.1
httpx                             0.27.0
huggingface-hub                   0.23.0
humanfriendly                     10.0
idna                              3.7
importlib_metadata                7.1.0
interegular                       0.3.3
Jinja2                            3.1.4
jmespath                          0.10.0
joblib                            1.4.2
jsonschema                        4.22.0
jsonschema-specifications         2023.12.1
lark                              1.1.9
llvmlite                          0.42.0
lm-format-enforcer                0.10.1
markdown-it-py                    3.0.0
MarkupSafe                        2.1.5
mdurl                             0.1.2
modelscope                        1.15.0
mpmath                            1.3.0
msgpack                           1.0.8
multidict                         6.0.5
multiprocess                      0.70.16
nest-asyncio                      1.6.0
networkx                          3.2.1
ninja                             1.11.1.1
numba                             0.59.1
numpy                             1.26.4
nvidia-cublas-cu12                12.1.3.1
nvidia-cuda-cupti-cu12            12.1.105
nvidia-cuda-nvrtc-cu12            12.1.105
nvidia-cuda-runtime-cu12          12.1.105
nvidia-cudnn-cu12                 8.9.2.26
nvidia-cufft-cu12                 11.0.2.54
nvidia-curand-cu12                10.3.2.106
nvidia-cusolver-cu12              11.4.5.107
nvidia-cusparse-cu12              12.1.0.106
nvidia-ml-py                      12.550.52
nvidia-nccl-cu12                  2.20.5
nvidia-nvjitlink-cu12             12.4.127
nvidia-nvtx-cu12                  12.1.105
openai                            1.28.1
optimum                           1.20.0
orjson                            3.10.3
oss2                              2.18.5
outlines                          0.0.43
packaging                         24.0
pandas                            2.2.2
peft                              0.11.1
pillow                            10.3.0
pip                               24.0
platformdirs                      4.2.2
prometheus_client                 0.20.0
prometheus-fastapi-instrumentator 7.0.0
protobuf                          5.26.1
psutil                            5.9.8
py-cpuinfo                        9.0.0
pyairports                        2.1.1
pyarrow                           16.1.0
pyarrow-hotfix                    0.6
pycountry                         24.6.1
pycparser                         2.22
pycryptodome                      3.20.0
pydantic                          2.7.1
pydantic_core                     2.18.2
Pygments                          2.18.0
python-dateutil                   2.9.0.post0
python-dotenv                     1.0.1
python-multipart                  0.0.9
pytz                              2024.1
PyYAML                            6.0.1
ray                               2.21.0
referencing                       0.35.1
regex                             2024.5.10
requests                          2.32.3
rich                              13.7.1
rouge                             1.0.1
rpds-py                           0.18.1
safetensors                       0.4.3
scipy                             1.13.0
sentencepiece                     0.2.0
setuptools                        58.1.0
shellingham                       1.5.4
simplejson                        3.19.2
six                               1.16.0
sniffio                           1.3.1
sortedcontainers                  2.4.0
starlette                         0.37.2
sympy                             1.12
tiktoken                          0.6.0
tokenizers                        0.19.1
tomli                             2.0.1
torch                             2.3.0
tqdm                              4.66.4
transformers                      4.40.2
triton                            2.3.0
typer                             0.12.3
typing_extensions                 4.11.0
tzdata                            2024.1
ujson                             5.9.0
urllib3                           2.2.1
uvicorn                           0.29.0
uvloop                            0.19.0
vllm                              0.5.0
vllm-flash-attn                   2.5.9
vllm_nccl_cu12                    2.18.1.0.4.0
watchfiles                        0.21.0
websockets                        12.0
wheel                             0.43.0
xformers                          0.0.26.post1
xxhash                            3.4.1
yapf                              0.40.2
yarl                              1.9.4
zipp                              3.19.2
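The full list above is long; for a report like this, a shorter summary of the packages most likely to matter for GPTQ inference can be easier to scan. The sketch below prints only a chosen subset. The selection in `RELEVANT` is an assumption about which packages are worth checking first, and `installed_versions` is a hypothetical helper name.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Dict, Iterable, Optional


def installed_versions(names: Iterable[str]) -> Dict[str, Optional[str]]:
    """Map each distribution name to its installed version, or None if absent."""
    result: Dict[str, Optional[str]] = {}
    for name in names:
        try:
            result[name] = version(name)
        except PackageNotFoundError:
            result[name] = None
    return result


# Assumed subset of interest, taken from names in the list above.
RELEVANT = ["torch", "transformers", "auto_gptq", "optimum",
            "accelerate", "flash-attn", "vllm"]

for name, ver in installed_versions(RELEVANT).items():
    print(f"{name:<16}{ver or 'not installed'}")
```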
QwenLM / Qwen2.5

Qwen2-57B-A14B-Instruct-GPTQ-Int4 inference is extremely slow #559