TorchMoE / MoE-Infinity

PyTorch library for cost-effective, fast and easy serving of MoE models.
Apache License 2.0

CPU memory problem when using GPTQ quantization #28

Closed · JustQJ closed this 3 months ago

JustQJ commented 3 months ago

Hi, when running the code in the README, I found that CPU memory usage is much higher than expected. After reading the code, I found that the __init__ function of the QuantLinear class in auto_gptq was not being overridden as it should be. Therefore, I added the following code in model_offload.py, which fixed the problem.

from auto_gptq.nn_modules.qlinear.qlinear_cuda import QuantLinear
from auto_gptq.nn_modules.qlinear.qlinear_cuda_old import QuantLinear as QuantLinearOld

def __enter__(self):
    ...  # existing patching of the other module types

    # Save each original __init__ and wrap it with param_init_decorator,
    # the same way the other layer types are already intercepted. Both
    # auto_gptq QuantLinear variants (qlinear_cuda and qlinear_cuda_old)
    # need to be patched.
    QuantLinear._old_init = QuantLinear.__init__
    QuantLinear.__init__ = param_init_decorator(QuantLinear.__init__)
    QuantLinearOld._old_init = QuantLinearOld.__init__
    QuantLinearOld.__init__ = param_init_decorator(QuantLinearOld.__init__)

def __exit__(self, exc_type, exc_value, traceback):
    ...  # existing restoration of the other module types

    # Restore the original __init__ methods when leaving the context.
    QuantLinear.__init__ = QuantLinear._old_init
    QuantLinearOld.__init__ = QuantLinearOld._old_init
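
For context, here is a minimal sketch of what a param_init_decorator along these lines could do. The name comes from model_offload.py, but the body below is an illustration under the assumption that the decorator's job is to stop QuantLinear from eagerly allocating its packed buffers in host memory; it is not MoE-Infinity's actual implementation.

import functools
import torch

def param_init_decorator(orig_init):
    # Hypothetical sketch, not the library's actual code: wrap
    # QuantLinear.__init__ so the large packed buffers it registers
    # (qweight, qzeros, scales, g_idx) end up on the meta device,
    # where they occupy no host RAM until real weights are loaded.
    @functools.wraps(orig_init)
    def wrapped_init(module, *args, **kwargs):
        orig_init(module, *args, **kwargs)
        for name, buf in list(module._buffers.items()):
            if buf is not None:
                module._buffers[name] = torch.empty_like(buf, device="meta")
    return wrapped_init

With the __enter__/__exit__ patching above in place, any QuantLinear constructed inside the context manager would allocate only shape/dtype metadata, which is consistent with the reduced CPU memory usage observed after the fix.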