InternLM / InternLM-XComposer

internlm-xcomposer2-vl-7b-4bit inference is slow #186

wanghanyang123 commented 7 months ago

Running on a T4 card, GPU memory usage is about 10 GB and inference takes roughly 25 seconds per request, a huge gap compared with sharegpt-13b (about 2 seconds per request). Am I doing something wrong? I also noticed warnings about auto_gptq during model loading; the installed auto_gptq version is 0.7.0. The warnings are:

CUDA extension not installed.
CUDA extension not installed.
WARNING - Exllamav2 kernel is not installed, reset disable_exllamav2 to True. This may because you installed auto_gptq using a pre-build wheel on Windows, in which exllama_kernels are not compiled. To use exllama_kernels to further speedup inference, you can re-install auto_gptq from source.
WARNING - CUDA kernels for auto_gptq are not installed, this will result in very slow inference speed. This may because:
1. You disabled CUDA extensions compilation by setting BUILD_CUDA_EXT=0 when install auto_gptq from source.
2. You are using pytorch without CUDA support.
3. CUDA and nvcc are not installed in your device.
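
The warning lists three possible causes, and a quick way to narrow them down is to check the PyTorch build and then try importing auto_gptq's compiled kernels directly. The sketch below is only a diagnostic aid; the extension module names (autogptq_cuda_64, autogptq_cuda_256, exllama_kernels, exllamav2_kernels) are my assumption based on the auto_gptq 0.7.x source layout and may differ between versions.

```python
# Diagnostic sketch: figure out which of the three warning causes applies.
# The extension module names are assumptions based on auto_gptq 0.7.x
# and may differ between versions.
import torch

print("torch version:", torch.__version__)
# None here means a CPU-only PyTorch wheel (cause 2 in the warning).
print("torch built with CUDA:", torch.version.cuda)
print("CUDA available at runtime:", torch.cuda.is_available())

for ext in ("autogptq_cuda_64", "autogptq_cuda_256",
            "exllama_kernels", "exllamav2_kernels"):
    try:
        __import__(ext)
        print(ext, "-> OK")
    except ImportError as err:
        # Missing modules match causes 1/3: the kernels were never compiled.
        print(ext, "-> missing:", err)
```

If the extensions are missing but PyTorch reports a working CUDA build, that points to the wheel itself shipping without compiled kernels, which matches the ~25 s/request CPU-fallback behavior described above.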

If I switch to the auto_gptq-0.4.2+cu117-cp310-cp310-linux_x86_64 wheel instead, the model fails to load with:

ValueError: QuantLinear() does not have a parameter or a buffer named weight.
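
The QuantLinear error with the 0.4.2 wheel looks like a mismatch between the checkpoint's packed weight format and the older kernel code, so downgrading probably isn't the fix; the warning's own suggestion is to rebuild auto_gptq 0.7.0 from source with the CUDA extension enabled (BUILD_CUDA_EXT=1). Below is a rough latency probe to run before and after such a rebuild. It is only a sketch: loading through transformers with trust_remote_code and calling model.chat follows the pattern in the project's README for the vl models, but the exact chat signature for the 4-bit checkpoint may differ by revision.

```python
# Rough latency probe, to compare before/after rebuilding auto_gptq from
# source with BUILD_CUDA_EXT=1. The model.chat call mirrors the README
# example for the vl models; treat its exact signature as an assumption.
import time

import torch
from transformers import AutoModel, AutoTokenizer

path = "internlm/internlm-xcomposer2-vl-7b-4bit"
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True)
model = AutoModel.from_pretrained(
    path, trust_remote_code=True, torch_dtype=torch.float16
).cuda().eval()

query = "Describe this image."  # a text-only probe is enough for timing
with torch.no_grad():
    t0 = time.time()
    response, _ = model.chat(tokenizer, query=query, image=None, history=[])
    print(f"latency: {time.time() - t0:.1f}s")
print(response)
```

With working CUDA kernels, per-request latency on a 7B 4-bit model should land much closer to the ~2 s/request seen with sharegpt-13b than to the 25 s reported here.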