QwenLM / Qwen2

Qwen2 is the large language model series developed by Qwen team, Alibaba Cloud.

Qwen2-57B-A14B-Instruct-GPTQ-Int4 inference is extremely slow #559

Closed: zhengyangyong closed this issue 3 weeks ago

zhengyangyong commented 3 weeks ago

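Dual 4090D GPUs, CUDA 12.4. Running the official code, a very simple inference takes two to three minutes, and GPU utilization stays at around 70% the whole time.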
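A report like this is easier to compare across setups when the slowdown is expressed as tokens per second rather than wall-clock minutes. The following is a minimal sketch of such a measurement. `measure_throughput` and `fake_generate` are hypothetical helpers written for illustration and are not part of the official Qwen2 code; `fake_generate` only sleeps to stand in for a real generation call.

```python
import time


def measure_throughput(generate_fn, *args, **kwargs):
    """Time one call to generate_fn and return (new_tokens, seconds, tokens_per_second).

    generate_fn is assumed to return the number of newly generated tokens.
    """
    start = time.perf_counter()
    new_tokens = generate_fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    tokens_per_second = new_tokens / elapsed if elapsed > 0 else float("inf")
    return new_tokens, elapsed, tokens_per_second


def fake_generate(n_tokens, delay_per_token=0.001):
    """Hypothetical stand-in for a model call: sleeps once per token, returns the count."""
    for _ in range(n_tokens):
        time.sleep(delay_per_token)
    return n_tokens


n, seconds, tps = measure_throughput(fake_generate, 20)
print(f"{n} tokens in {seconds:.3f} s ({tps:.1f} tokens/s)")
```

With a real model, the wrapped function would call the model's generation method and return the length of the output minus the length of the prompt.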

Package                           Version
--------------------------------- ------------
accelerate                        0.31.0
addict                            2.4.0
aiohttp                           3.9.5
aiosignal                         1.3.1
aliyun-python-sdk-core            2.15.1
aliyun-python-sdk-kms             2.16.3
annotated-types                   0.6.0
anyio                             4.3.0
async-timeout                     4.0.3
attrs                             23.2.0
auto_gptq                         0.7.1
certifi                           2024.2.2
cffi                              1.16.0
charset-normalizer                3.3.2
click                             8.1.7
cloudpickle                       3.0.0
cmake                             3.29.3
coloredlogs                       15.0.1
crcmod                            1.7
cryptography                      42.0.8
datasets                          2.18.0
dill                              0.3.8
diskcache                         5.6.3
distro                            1.9.0
dnspython                         2.6.1
einops                            0.8.0
email_validator                   2.1.1
exceptiongroup                    1.2.1
fastapi                           0.111.0
fastapi-cli                       0.0.3
filelock                          3.14.0
flash-attn                        2.5.9.post1
frozenlist                        1.4.1
fsspec                            2024.2.0
gast                              0.5.4
gekko                             1.1.1
h11                               0.14.0
httpcore                          1.0.5
httptools                         0.6.1
httpx                             0.27.0
huggingface-hub                   0.23.0
humanfriendly                     10.0
idna                              3.7
importlib_metadata                7.1.0
interegular                       0.3.3
Jinja2                            3.1.4
jmespath                          0.10.0
joblib                            1.4.2
jsonschema                        4.22.0
jsonschema-specifications         2023.12.1
lark                              1.1.9
llvmlite                          0.42.0
lm-format-enforcer                0.10.1
markdown-it-py                    3.0.0
MarkupSafe                        2.1.5
mdurl                             0.1.2
modelscope                        1.15.0
mpmath                            1.3.0
msgpack                           1.0.8
multidict                         6.0.5
multiprocess                      0.70.16
nest-asyncio                      1.6.0
networkx                          3.2.1
ninja                             1.11.1.1
numba                             0.59.1
numpy                             1.26.4
nvidia-cublas-cu12                12.1.3.1
nvidia-cuda-cupti-cu12            12.1.105
nvidia-cuda-nvrtc-cu12            12.1.105
nvidia-cuda-runtime-cu12          12.1.105
nvidia-cudnn-cu12                 8.9.2.26
nvidia-cufft-cu12                 11.0.2.54
nvidia-curand-cu12                10.3.2.106
nvidia-cusolver-cu12              11.4.5.107
nvidia-cusparse-cu12              12.1.0.106
nvidia-ml-py                      12.550.52
nvidia-nccl-cu12                  2.20.5
nvidia-nvjitlink-cu12             12.4.127
nvidia-nvtx-cu12                  12.1.105
openai                            1.28.1
optimum                           1.20.0
orjson                            3.10.3
oss2                              2.18.5
outlines                          0.0.43
packaging                         24.0
pandas                            2.2.2
peft                              0.11.1
pillow                            10.3.0
pip                               24.0
platformdirs                      4.2.2
prometheus_client                 0.20.0
prometheus-fastapi-instrumentator 7.0.0
protobuf                          5.26.1
psutil                            5.9.8
py-cpuinfo                        9.0.0
pyairports                        2.1.1
pyarrow                           16.1.0
pyarrow-hotfix                    0.6
pycountry                         24.6.1
pycparser                         2.22
pycryptodome                      3.20.0
pydantic                          2.7.1
pydantic_core                     2.18.2
Pygments                          2.18.0
python-dateutil                   2.9.0.post0
python-dotenv                     1.0.1
python-multipart                  0.0.9
pytz                              2024.1
PyYAML                            6.0.1
ray                               2.21.0
referencing                       0.35.1
regex                             2024.5.10
requests                          2.32.3
rich                              13.7.1
rouge                             1.0.1
rpds-py                           0.18.1
safetensors                       0.4.3
scipy                             1.13.0
sentencepiece                     0.2.0
setuptools                        58.1.0
shellingham                       1.5.4
simplejson                        3.19.2
six                               1.16.0
sniffio                           1.3.1
sortedcontainers                  2.4.0
starlette                         0.37.2
sympy                             1.12
tiktoken                          0.6.0
tokenizers                        0.19.1
tomli                             2.0.1
torch                             2.3.0
tqdm                              4.66.4
transformers                      4.40.2
triton                            2.3.0
typer                             0.12.3
typing_extensions                 4.11.0
tzdata                            2024.1
ujson                             5.9.0
urllib3                           2.2.1
uvicorn                           0.29.0
uvloop                            0.19.0
vllm                              0.5.0
vllm-flash-attn                   2.5.9
vllm_nccl_cu12                    2.18.1.0.4.0
watchfiles                        0.21.0
websockets                        12.0
wheel                             0.43.0
xformers                          0.0.26.post1
xxhash                            3.4.1
yapf                              0.40.2
yarl                              1.9.4
zipp                              3.19.2

zhengyangyong commented 3 weeks ago

[image]

jklj077 commented 3 weeks ago

https://github.com/QwenLM/Qwen2/issues/552

In addition, you were running transformers across multiple GPUs, which is even slower.
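One way to confirm that a model is spread across several GPUs is to look at its device placement. The sketch below assumes the model was loaded with a `device_map` (for example `device_map="auto"`), in which case transformers exposes the placement as `model.hf_device_map`; the thread does not show the loading code, so this is an assumption. `summarize_device_map` is a hypothetical helper, and the example map is made up for illustration. With this kind of placement the layers on different GPUs run one after another, so a second GPU adds memory but not speed.

```python
from collections import Counter


def summarize_device_map(device_map):
    """Count how many modules are placed on each device."""
    return dict(Counter(str(device) for device in device_map.values()))


# Hypothetical placement, shaped like `model.hf_device_map` in transformers.
example_device_map = {
    "model.embed_tokens": 0,
    "model.layers.0": 0,
    "model.layers.1": 0,
    "model.layers.2": 1,
    "model.layers.3": 1,
    "model.norm": 1,
    "lm_head": 1,
}

placement = summarize_device_map(example_device_map)
print(placement)
print("spans multiple devices:", len(placement) > 1)
```

On a real model, passing `model.hf_device_map` instead of the example dictionary would show how many modules ended up on each GPU.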