InternLM / InternLM-XComposer

internlm-xcomposer2-vl-7b-4bit inference is slow #186

wanghanyang123 commented 7 months ago

Running on a T4 card, GPU memory usage is about 10 GB and inference takes roughly 25 seconds per request, a huge gap compared with sharegpt-13b (about 2 seconds per request). Am I doing something wrong? I also noticed warnings about auto_gptq during model loading; the installed auto_gptq version is 0.7.0. The warnings are:

CUDA extension not installed.
CUDA extension not installed.
WARNING - Exllamav2 kernel is not installed, reset disable_exllamav2 to True. This may because you installed auto_gptq using a pre-build wheel on Windows, in which exllama_kernels are not compiled. To use exllama_kernels to further speedup inference, you can re-install auto_gptq from source.
WARNING - CUDA kernels for auto_gptq are not installed, this will result in very slow inference speed. This may because:
1. You disabled CUDA extensions compilation by setting BUILD_CUDA_EXT=0 when install auto_gptq from source.
2. You are using pytorch without CUDA support.
3. CUDA and nvcc are not installed in your device.
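
The warning lists three possible causes, and a quick way to narrow them down is to check the PyTorch build and then try importing auto_gptq's compiled kernels directly. The sketch below is only a diagnostic aid; the extension module names (autogptq_cuda_64, autogptq_cuda_256, exllama_kernels, exllamav2_kernels) are my assumption based on the auto_gptq 0.7.x source layout and may differ between versions.

```python
# Diagnostic sketch: figure out which of the three warning causes applies.
# The extension module names are assumptions based on auto_gptq 0.7.x
# and may differ between versions.
import torch

print("torch version:", torch.__version__)
# None here means a CPU-only PyTorch wheel (cause 2 in the warning).
print("torch built with CUDA:", torch.version.cuda)
print("CUDA available at runtime:", torch.cuda.is_available())

for ext in ("autogptq_cuda_64", "autogptq_cuda_256",
            "exllama_kernels", "exllamav2_kernels"):
    try:
        __import__(ext)
        print(ext, "-> OK")
    except ImportError as err:
        # Missing modules match causes 1/3: the kernels were never compiled.
        print(ext, "-> missing:", err)
```

If the extensions are missing but PyTorch reports a working CUDA build, that points to the wheel itself shipping without compiled kernels, which matches the ~25 s/request CPU-fallback behavior described above.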

If I switch to the auto_gptq-0.4.2+cu117-cp310-cp310-linux_x86_64 wheel instead, the model fails to load with:

ValueError: QuantLinear() does not have a parameter or a buffer named weight.
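
The QuantLinear error with the 0.4.2 wheel looks like a mismatch between the checkpoint's packed weight format and the older kernel code, so downgrading probably isn't the fix; the warning's own suggestion is to rebuild auto_gptq 0.7.0 from source with the CUDA extension enabled (BUILD_CUDA_EXT=1). Below is a rough latency probe to run before and after such a rebuild. It is only a sketch: loading through transformers with trust_remote_code and calling model.chat follows the pattern in the project's README for the vl models, but the exact chat signature for the 4-bit checkpoint may differ by revision.

```python
# Rough latency probe, to compare before/after rebuilding auto_gptq from
# source with BUILD_CUDA_EXT=1. The model.chat call mirrors the README
# example for the vl models; treat its exact signature as an assumption.
import time

import torch
from transformers import AutoModel, AutoTokenizer

path = "internlm/internlm-xcomposer2-vl-7b-4bit"
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True)
model = AutoModel.from_pretrained(
    path, trust_remote_code=True, torch_dtype=torch.float16
).cuda().eval()

query = "Describe this image."  # a text-only probe is enough for timing
with torch.no_grad():
    t0 = time.time()
    response, _ = model.chat(tokenizer, query=query, image=None, history=[])
    print(f"latency: {time.time() - t0:.1f}s")
print(response)
```

With working CUDA kernels, per-request latency on a 7B 4-bit model should land much closer to the ~2 s/request seen with sharegpt-13b than to the 25 s reported here.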