[Feature | Bug Fix] <ChatGLM2-6B-int4 CPU deployment error: file quantization_kernels_parallel.so not found>

Is your feature request related to a problem? Please describe.

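Following the README, CPU inference with the quantized ChatGLM2-6B-int4 model fails. The error reports that quantization_kernels_parallel.so cannot be found.

Steps already completed:

Downloaded the model locally and loaded it from the local path
Switched to .float() to run on the CPU
Installed [TDM-GCC](https://jmeubank.github.io/tdm-gcc/) with the OpenMP option checked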

Solutions

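My approach was to get ChatGLM-6b-int4 running first. If ChatGLM-6b-int4 worked, I could use it as a reference and debug step by step until ChatGLM2-6b-int4 ran as well.

It turned out that ChatGLM-6b-int4 did not run either, but there are already some related [issues](https://github.com/THUDM/ChatGLM-6B/issues/166).

With the help of those issues I worked out one thing: the compiled quantization_kernels_parallel.so and quantization_kernels.so are not actually usable. So the error above is not really a missing file; the file exists but cannot be loaded. (The relevant code, which both compiles and loads the CPU kernel, is the CPUKernel class in quantization.py.)

After reading the source carefully and consulting the ChatGLM-6B issues, I found that the fix is quite simple.

For ChatGLM-6b-int4 there is only one cause: the compiled .so file is broken and therefore cannot be loaded. (The evidence: the .so file I later compiled by hand differs noticeably in size from the original one.)

As for why following the tutorial produces a broken file, most likely it is because my machine has more than one gcc installed, including MinGW64 and Cygwin. So locating the TDM-GCC gcc and invoking it by absolute path would probably produce a working .so file. I did not try that, though; instead I used the [gcc recommended by gongjimin](https://github.com/skeeto/w64devkit/releases) in the issue above.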
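To check whether several gcc installations are competing, it can help to list every gcc on PATH in lookup order. The following is a minimal sketch, not part of the ChatGLM code; the helper name find_executables is hypothetical.

```python
import os
from pathlib import Path


def find_executables(name, path=None):
    """Return every file called `name` or `name`.exe on PATH, in lookup order."""
    if path is None:
        path = os.environ.get("PATH", "")
    found = []
    for directory in path.split(os.pathsep):
        if not directory:
            continue
        for candidate in (name, name + ".exe"):
            full = Path(directory) / candidate
            if full.is_file() and str(full) not in found:
                found.append(str(full))
    return found


if __name__ == "__main__":
    # The first entry is the one a bare "gcc" command would normally resolve to.
    for index, exe in enumerate(find_executables("gcc")):
        print(index, exe)
```

If more than one path is printed, the first one is most likely the compiler that the kernel build picks up.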

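So the fix comes down to two steps: compile a correct .so file, and load the correct .so file.

You might ask: why not just compile a correct .so file and drop it into the expected path? That was my first thought too, but unfortunately the program recompiles the .c file, so the working .so gets overwritten and becomes unusable again.

Compiling a correct .so file: as mentioned above, use the GCC from [this project](https://github.com/skeeto/w64devkit/releases). The commands are: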

gcc -fPIC -pthread -fopenmp -std=c99 quantization_kernels.c -shared -o quantization_kernels.so
gcc -fPIC -pthread -fopenmp -std=c99 quantization_kernels_parallel.c -shared -o quantization_kernels_parallel.so
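The same two commands can also be run from Python with an explicit compiler path, which avoids picking up the wrong gcc. This is only a sketch under that assumption: compile_command and build_kernel are hypothetical helper names, and the flags are copied from the commands above.

```python
import subprocess


def compile_command(gcc, source, output):
    """Build the argument list for one of the gcc commands shown above."""
    return [gcc, "-fPIC", "-pthread", "-fopenmp", "-std=c99",
            source, "-shared", "-o", output]


def build_kernel(gcc, source, output):
    """Run the compiler and return (success, stderr) so that failures are visible."""
    try:
        result = subprocess.run(compile_command(gcc, source, output),
                                capture_output=True, text=True)
    except OSError as exc:
        # Raised when the compiler path itself does not exist or cannot be executed.
        return False, str(exc)
    return result.returncode == 0, result.stderr
```

For example, build_kernel(gcc_path, "quantization_kernels.c", "quantization_kernels.so"), where gcc_path is the full path to the gcc you want to use.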

Loading the correct .so file: right after the code that loads the model, add one line that loads the kernel the quantized model needs:

tokenizer = AutoTokenizer.from_pretrained("chatglm2-6b-int4", trust_remote_code=True, revision="v1.0")
model = AutoModel.from_pretrained("chatglm2-6b-int4", trust_remote_code=True, revision="v1.0").float()  #.cuda()
model = model.quantize(bits=4, kernel_file=r"E:\Code\PyCharm\PyCharmProjects\ChatGLM2\chatglm2-6b-int4\quantization_kernels.so")

kernel_file is the path to the .so file you compiled. I have verified that both quantization_kernels_parallel.so and quantization_kernels.so work.
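Before passing a path as kernel_file, it may be worth checking that the file loads at all, since the underlying failure here is a load error rather than a missing file. A minimal sketch using the same ctypes call as quantization.py; the helper name try_load_kernel is hypothetical.

```python
import ctypes


def try_load_kernel(kernel_file):
    """Return (library, None) on success or (None, error message) on failure."""
    try:
        return ctypes.cdll.LoadLibrary(kernel_file), None
    except OSError as exc:
        # Covers both a missing file and a file the loader rejects.
        return None, f"{type(exc).__name__}: {exc}"
```

If the second element of the result is not None, the message usually says whether the file is missing or simply not loadable by this Python interpreter.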

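If your model is ChatGLM-6b-int4, it should run at this point.

ChatGLM2-6b-int4, however, still does not. Why? I was puzzled too: if chatglm runs, why would chatglm2 still fail? So I single-stepped through the part of the model that loads the kernel and finally found the answer: chatglm2 simply removed the code that loads the kernel for the quantized CPU version! Perhaps too few people deploy the quantized model on a CPU.

So I added the kernel-loading code back, following the chatglm code, and then it ran. There is not much to change, just two snippets.

The first is in modeling_chatglm.py, in the last function, quantize. Change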

from .quantization import quantize

if self.quantized:
    logger.info("Already quantized.")
    return self

to (following chatglm):

from .quantization import quantize, load_cpu_kernel

if self.quantized:
    if self.device == torch.device("cpu"):
        logger.info("Already quantized, reloading cpu kernel.")
        load_cpu_kernel(**kwargs)
    else:
        logger.info("Already quantized.")
    # Return in both branches so an already-quantized model is not quantized again.
    return self

self.quantized = True

The second is in quantization.py. After this statement

cpu_kernels = CPUKernel()

add the required load_cpu_kernel function:

cpu_kernels = CPUKernel()

def load_cpu_kernel(**kwargs):
    global cpu_kernels
    cpu_kernels = CPUKernel(**kwargs)

OK, at this point it runs.

Additional context

TL;DR:

Compile a correct .so file: use the gcc from [this project](https://github.com/skeeto/w64devkit/releases) to compile the .c files into .so files.

Load the correct .so file:

tokenizer = AutoTokenizer.from_pretrained("chatglm2-6b-int4", trust_remote_code=True)
model = AutoModel.from_pretrained("chatglm2-6b-int4", trust_remote_code=True).float()
model = model.quantize(bits=4, kernel_file=r"xxx\quantization_kernels.so")  # raw string so the backslash is not treated as an escape

In modeling_chatglm.py, in the function quantize, change

from .quantization import quantize

if self.quantized:
    logger.info("Already quantized.")
    return self

to:

from .quantization import quantize, load_cpu_kernel

if self.quantized:
    if self.device == torch.device("cpu"):
        logger.info("Already quantized, reloading cpu kernel.")
        load_cpu_kernel(**kwargs)
    else:
        logger.info("Already quantized.")
    # Return in both branches so an already-quantized model is not quantized again.
    return self

self.quantized = True

In quantization.py, after this statement

cpu_kernels = CPUKernel()

add the load_cpu_kernel function:

cpu_kernels = CPUKernel()

def load_cpu_kernel(**kwargs):
    global cpu_kernels
    cpu_kernels = CPUKernel(**kwargs)

Tested on a machine with an Intel(R) Core(TM) i5-10200H CPU @ 2.40GHz: inference speed is 0.1 characters/s, and memory usage is 5 GB to 5.4 GB.

Compile parallel cpu kernel gcc -O3 -fPIC -pthread -fopenmp -std=c99 /home/app/.cache/huggingface/modules/transformers_modules/chatgml2-6b-int4/quantization_kernels_parallel.c -shared -o /home/app/.cache/huggingface/modules/transformers_modules/chatgml2-6b-int4/quantization_kernels_parallel.so failed.
Compile cpu kernel gcc -O3 -fPIC -std=c99 /home/app/.cache/huggingface/modules/transformers_modules/chatgml2-6b-int4/quantization_kernels.c -shared -o /home/app/.cache/huggingface/modules/transformers_modules/chatgml2-6b-int4/quantization_kernels.so failed.

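I ran into a similar error, in my case a compilation failure, and the method above did not help.

So I ran the compile command by hand in the terminal: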

gcc -O3 -fPIC -pthread -fopenmp -std=c99 /home/app/.cache/huggingface/modules/transformers_modules/chatgml2-6b-int4/quantization_kernels_parallel.c -shared -o /home/app/.cache/huggingface/modules/transformers_modules/chatgml2-6b-int4/quantization_kernels_parallel.so

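That succeeded, yet the same command run from the Python script produces the error above. That is odd; the only explanation I can think of is a difference between the shell and the Python script's execution environment.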
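One way to narrow down such an environment difference is to ask the Python process itself which gcc it resolves and which PATH it sees, then compare that with the terminal. A rough sketch; describe_tool is a hypothetical helper, and it assumes the tool accepts --version.

```python
import os
import shutil
import subprocess


def describe_tool(name):
    """Report how this Python process resolves `name` and its first --version line."""
    resolved = shutil.which(name)
    info = {"resolved": resolved, "version": None}
    if resolved:
        result = subprocess.run([resolved, "--version"], capture_output=True, text=True)
        text = (result.stdout or result.stderr).strip()
        info["version"] = text.splitlines()[0] if text else None
    return info


if __name__ == "__main__":
    print(describe_tool("gcc"))
    print(os.environ.get("PATH", ""))
```

Running this inside the same script or service that loads the model, and again from the interactive terminal, should show whether the two environments pick up different compilers.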

In quantization.py, the CPUKernel class compiles the kernel with the following logic:

if sys.platform != 'darwin':
    compile_command = "gcc -O3 -fPIC -pthread -fopenmp -std=c99 {} -shared -o {}".format(
        source_code, kernel_file)
else:
    compile_command = "clang -O3 -fPIC -pthread -Xclang -fopenmp -lomp -std=c99 {} -shared -o {}".format(
        source_code, kernel_file)
exit_state = os.system(compile_command)
if not exit_state:
    try:
        kernels = ctypes.cdll.LoadLibrary(kernel_file)
    except:
        logger.warning(
            f"Load parallel cpu kernel failed {kernel_file}: {traceback.format_exc()}")
else:
    logger.warning(f"Compile parallel cpu kernel {compile_command} failed.")

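There, a nonzero exit_state returned by os.system is treated as a failure, but no detailed compiler error is captured. So I tried switching to subprocess: I added a method execute_command to the CPUKernel class to run the gcc command in place of os.system and print more detailed compiler output.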

import subprocess

class CPUKernel:
    # ... [rest of the code unchanged]

    def execute_command(self, command):
        result = subprocess.run(command, shell=True, capture_output=True, text=True)
        if result.returncode != 0:
            logger.warning(f"Command '{command}' failed with error:\n{result.stderr}")
        return result.returncode

    def __init__(self, kernel_file="", source_code=default_cpu_kernel_code_path, compile_parallel_kernel=None,
                 parallel_num=None):
        # ... [rest of the code unchanged]

        if (not kernel_file) or (not os.path.exists(kernel_file)):
            try:
                if os.path.exists(source_code):
                    kernel_file = source_code[:-2] + ".so"

                    if compile_parallel_kernel:
                        if sys.platform != 'darwin':
                            compile_command = "gcc -O3 -fPIC -pthread -fopenmp -std=c99 {} -shared -o {}".format(
                                source_code, kernel_file)
                        else:
                            compile_command = "clang -O3 -fPIC -pthread -Xclang -fopenmp -lomp -std=c99 {} -shared -o {}".format(
                                source_code, kernel_file)
                        exit_state = self.execute_command(compile_command)
                        if not exit_state:
                            try:
                                kernels = ctypes.cdll.LoadLibrary(kernel_file)
                            except:
                                logger.warning(
                                    f"Load parallel cpu kernel failed {kernel_file}: {traceback.format_exc()}")

                        if kernels is None:  # adjust config, use default cpu kernel
                            compile_parallel_kernel = False
                            source_code = default_cpu_kernel_code_path
                            kernel_file = source_code[:-2] + ".so"

                    if kernels is None:
                        compile_command = "gcc -O3 -fPIC -std=c99 {} -shared -o {}".format(source_code, kernel_file)
                        exit_state = self.execute_command(compile_command)
                        if not exit_state:
                            try:
                                kernels = ctypes.cdll.LoadLibrary(kernel_file)
                            except:
                                logger.warning(f"Load cpu kernel {kernel_file} failed: {traceback.format_exc()}")
                else:
                    logger.warning("Kernel source code not found.")
                    return
            except:
                logger.warning(f"Failed to build cpu kernel: {traceback.format_exc()}")
                return
        else:
            try:
                kernels = ctypes.cdll.LoadLibrary(kernel_file)
            except:
                logger.warning(f"Load custom cpu kernel {kernel_file} failed: {traceback.format_exc()}")

        # ... [rest of the code unchanged]

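I was expecting subprocess to print the error, but after this change the error simply disappeared.

A possible reason is that the sub-shell started by os.system has different environment variables from the terminal, while subprocess is somehow able to run gcc. I do not understand the details, but this solved the problem; anyone who runs into something similar may find it useful.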

I tried it, but it still does not work.

Load parallel cpu kernel failed C:\Users\thtfpc\.cache\huggingface\modules\transformers_modules\chatglm2-6b-int4\quantization_kernels_parallel.so: Traceback (most recent call last):
  File "C:\Users\thtfpc/.cache\huggingface\modules\transformers_modules\chatglm2-6b-int4\quantization.py", line 148, in __init__
    kernels = ctypes.cdll.LoadLibrary(kernel_file)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\thtfpc\anaconda3\Lib\ctypes\__init__.py", line 454, in LoadLibrary
    return self._dlltype(name)
           ^^^^^^^^^^^^^^^^^^^
  File "C:\Users\thtfpc\anaconda3\Lib\ctypes\__init__.py", line 376, in __init__
    self._handle = _dlopen(self._name, mode)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^
OSError: [WinError 193] %1 不是有效的 Win32 应用程序。 (%1 is not a valid Win32 application.)

Load cpu kernel C:\Users\thtfpc\.cache\huggingface\modules\transformers_modules\chatglm2-6b-int4\quantization_kernels.so failed: Traceback (most recent call last):
  File "C:\Users\thtfpc/.cache\huggingface\modules\transformers_modules\chatglm2-6b-int4\quantization.py", line 165, in __init__
    kernels = ctypes.cdll.LoadLibrary(kernel_file)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\thtfpc\anaconda3\Lib\ctypes\__init__.py", line 454, in LoadLibrary
    return self._dlltype(name)
           ^^^^^^^^^^^^^^^^^^^
  File "C:\Users\thtfpc\anaconda3\Lib\ctypes\__init__.py", line 376, in __init__
    self._handle = _dlopen(self._name, mode)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^
OSError: [WinError 193] %1 不是有效的 Win32 应用程序。 (%1 is not a valid Win32 application.)

Traceback (most recent call last):
  File "E:\GPT\chatglm2-6b-int4\openai_api.py", line 172, in <module>
    model = model.quantize(bits=4, kernel_file=r"E:\GPT\chatglm2-6b-int4\quantization_kernels.so")
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\thtfpc/.cache\huggingface\modules\transformers_modules\chatglm2-6b-int4\modeling_chatglm.py", line 1209, in quantize
    self.transformer.encoder = quantize(self.transformer.encoder, bits, empty_init=empty_init, device=device,
                               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: quantize() got an unexpected keyword argument 'kernel_file'
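WinError 193 is what Windows reports when the file handed to the loader is not a valid image for the current process. One common cause is an architecture mismatch, for example a 32-bit .so loaded by a 64-bit Python. Whether that is what happened here is an assumption, but it can be checked by reading the PE header of the compiled file. The sketch below uses hypothetical helper names pe_machine and python_bits.

```python
import struct

MACHINE_NAMES = {0x014C: "x86 (32-bit)", 0x8664: "x64 (64-bit)", 0xAA64: "ARM64"}


def pe_machine(data):
    """Return the target architecture of a PE image given its leading bytes, or None if it is not PE."""
    if len(data) < 0x40 or data[:2] != b"MZ":
        return None
    # Offset 0x3C of the DOS header stores the position of the "PE\0\0" signature.
    (pe_offset,) = struct.unpack_from("<I", data, 0x3C)
    if len(data) < pe_offset + 6 or data[pe_offset:pe_offset + 4] != b"PE\0\0":
        return None
    # The 16-bit Machine field follows the signature directly.
    (machine,) = struct.unpack_from("<H", data, pe_offset + 4)
    return MACHINE_NAMES.get(machine, hex(machine))


def python_bits():
    """Pointer size of the running interpreter, in bits."""
    return struct.calcsize("P") * 8
```

To use it, read the first few kilobytes of the .so in binary mode, pass them to pe_machine, and compare the result with python_bits(). A None result means the file is not a Windows PE image at all.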
