zhengyangyong closed this issue 3 months ago
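Dual 4090D GPUs, CUDA 12.4. Running the official code as given, a very simple inference takes two to three minutes, and GPU utilization stays at about 70% the whole time.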
Package Version
--------------------------------- ------------
accelerate 0.31.0
addict 2.4.0
aiohttp 3.9.5
aiosignal 1.3.1
aliyun-python-sdk-core 2.15.1
aliyun-python-sdk-kms 2.16.3
annotated-types 0.6.0
anyio 4.3.0
async-timeout 4.0.3
attrs 23.2.0
auto_gptq 0.7.1
certifi 2024.2.2
cffi 1.16.0
charset-normalizer 3.3.2
click 8.1.7
cloudpickle 3.0.0
cmake 3.29.3
coloredlogs 15.0.1
crcmod 1.7
cryptography 42.0.8
datasets 2.18.0
dill 0.3.8
diskcache 5.6.3
distro 1.9.0
dnspython 2.6.1
einops 0.8.0
email_validator 2.1.1
exceptiongroup 1.2.1
fastapi 0.111.0
fastapi-cli 0.0.3
filelock 3.14.0
flash-attn 2.5.9.post1
frozenlist 1.4.1
fsspec 2024.2.0
gast 0.5.4
gekko 1.1.1
h11 0.14.0
httpcore 1.0.5
httptools 0.6.1
httpx 0.27.0
huggingface-hub 0.23.0
humanfriendly 10.0
idna 3.7
importlib_metadata 7.1.0
interegular 0.3.3
Jinja2 3.1.4
jmespath 0.10.0
joblib 1.4.2
jsonschema 4.22.0
jsonschema-specifications 2023.12.1
lark 1.1.9
llvmlite 0.42.0
lm-format-enforcer 0.10.1
markdown-it-py 3.0.0
MarkupSafe 2.1.5
mdurl 0.1.2
modelscope 1.15.0
mpmath 1.3.0
msgpack 1.0.8
multidict 6.0.5
multiprocess 0.70.16
nest-asyncio 1.6.0
networkx 3.2.1
ninja 1.11.1.1
numba 0.59.1
numpy 1.26.4
nvidia-cublas-cu12 12.1.3.1
nvidia-cuda-cupti-cu12 12.1.105
nvidia-cuda-nvrtc-cu12 12.1.105
nvidia-cuda-runtime-cu12 12.1.105
nvidia-cudnn-cu12 8.9.2.26
nvidia-cufft-cu12 11.0.2.54
nvidia-curand-cu12 10.3.2.106
nvidia-cusolver-cu12 11.4.5.107
nvidia-cusparse-cu12 12.1.0.106
nvidia-ml-py 12.550.52
nvidia-nccl-cu12 2.20.5
nvidia-nvjitlink-cu12 12.4.127
nvidia-nvtx-cu12 12.1.105
openai 1.28.1
optimum 1.20.0
orjson 3.10.3
oss2 2.18.5
outlines 0.0.43
packaging 24.0
pandas 2.2.2
peft 0.11.1
pillow 10.3.0
pip 24.0
platformdirs 4.2.2
prometheus_client 0.20.0
prometheus-fastapi-instrumentator 7.0.0
protobuf 5.26.1
psutil 5.9.8
py-cpuinfo 9.0.0
pyairports 2.1.1
pyarrow 16.1.0
pyarrow-hotfix 0.6
pycountry 24.6.1
pycparser 2.22
pycryptodome 3.20.0
pydantic 2.7.1
pydantic_core 2.18.2
Pygments 2.18.0
python-dateutil 2.9.0.post0
python-dotenv 1.0.1
python-multipart 0.0.9
pytz 2024.1
PyYAML 6.0.1
ray 2.21.0
referencing 0.35.1
regex 2024.5.10
requests 2.32.3
rich 13.7.1
rouge 1.0.1
rpds-py 0.18.1
safetensors 0.4.3
scipy 1.13.0
sentencepiece 0.2.0
setuptools 58.1.0
shellingham 1.5.4
simplejson 3.19.2
six 1.16.0
sniffio 1.3.1
sortedcontainers 2.4.0
starlette 0.37.2
sympy 1.12
tiktoken 0.6.0
tokenizers 0.19.1
tomli 2.0.1
torch 2.3.0
tqdm 4.66.4
transformers 4.40.2
triton 2.3.0
typer 0.12.3
typing_extensions 4.11.0
tzdata 2024.1
ujson 5.9.0
urllib3 2.2.1
uvicorn 0.29.0
uvloop 0.19.0
vllm 0.5.0
vllm-flash-attn 2.5.9
vllm_nccl_cu12 2.18.1.0.4.0
watchfiles 0.21.0
websockets 12.0
wheel 0.43.0
xformers 0.0.26.post1
xxhash 3.4.1
yapf 0.40.2
yarl 1.9.4
zipp 3.19.2
https://github.com/QwenLM/Qwen2/issues/552
In addition, you were running transformers across multiple GPUs, which would be even slower.
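One way to check whether the multi-GPU split is part of the slowdown is to expose a single GPU to the process and compare throughput between runs. The following is only a sketch, assuming the model fits in the memory of one card; `pin_single_gpu` and `tokens_per_second` are hypothetical helper names, not part of transformers or any Qwen2 code.

```python
import os


def pin_single_gpu(index: int = 0) -> str:
    """Expose only one GPU to this process (hypothetical helper).

    CUDA_VISIBLE_DEVICES must be set before torch, or any other CUDA
    library, initializes CUDA; setting it afterwards has no effect.
    """
    os.environ["CUDA_VISIBLE_DEVICES"] = str(index)
    return os.environ["CUDA_VISIBLE_DEVICES"]


def tokens_per_second(new_tokens: int, seconds: float) -> float:
    """Rough decoding throughput, for comparing one run against another."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return new_tokens / seconds


if __name__ == "__main__":
    # Call this first, before importing torch or transformers.
    visible = pin_single_gpu(0)
    print(f"CUDA_VISIBLE_DEVICES={visible}")
    # Load the model and time generation here as in your own script,
    # then pass the number of generated tokens and the elapsed seconds
    # to tokens_per_second() to compare single-GPU and multi-GPU runs.
```

If the single-GPU run is clearly faster, the split across cards is a likely contributor. If it is not, the cause is more likely elsewhere, for example the issue linked above.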