ModelCloud / GPTQModel

An easy-to-use LLM quantization and inference toolkit based on GPTQ algorithm (weight-only quantization).
Apache License 2.0

[BUG]AttributeError: 'NoneType' object has no attribute 'parameters' #366

Open bf96163 opened 1 month ago

bf96163 commented 1 month ago

Describe the bug

When running model.quantize(examples), I get: AttributeError: 'NoneType' object has no attribute 'parameters'

GPU Info

Output of nvidia-smi:

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.78                 Driver Version: 550.78         CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 3090        Off |   00000000:D6:00.0 Off |                  N/A |
| 30%   32C    P8             26W /  350W |       0MiB /  22000MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
+-----------------------------------------------------------------------------------------+

Software Info

Running nvidia-docker inside Ubuntu 22.04; container image nvidia/cuda:12.1.0-runtime-ubuntu22.04; the host has CUDA 12.5 with driver 555.58.

Output of pip show for the relevant packages:

Name: gptqmodel
Version: 1.0.0+cu1241torch2.4
Summary: A LLM quantization package with user-friendly apis. Based on GPTQ algorithm.
Home-page: https://github.com/ModelCloud/GPTQModel
Author: ModelCloud
Author-email: qubitium@modelcloud.ai
License:
Location: /usr/local/lib/python3.10/dist-packages
Requires: accelerate, auto-round, datasets, gekko, huggingface-hub, intel-extension-for-transformers, lm-eval, ninja, numpy, packaging, protobuf, rouge, safetensors, sentencepiece, threadpoolctl, torch, tqdm, transformers, triton
Required-by:
---
Name: torch
Version: 2.4.0
Summary: Tensors and Dynamic neural networks in Python with strong GPU acceleration
Home-page: https://pytorch.org/
Author: PyTorch Team
Author-email: packages@pytorch.org
License: BSD-3
Location: /usr/local/lib/python3.10/dist-packages
Requires: filelock, fsspec, jinja2, networkx, nvidia-cublas-cu12, nvidia-cuda-cupti-cu12, nvidia-cuda-nvrtc-cu12, nvidia-cuda-runtime-cu12, nvidia-cudnn-cu12, nvidia-cufft-cu12, nvidia-curand-cu12, nvidia-cusolver-cu12, nvidia-cusparse-cu12, nvidia-nccl-cu12, nvidia-nvtx-cu12, sympy, triton, typing-extensions
Required-by: accelerate, auto-round, auto_gptq, gptqmodel, lm_eval, peft
---
Name: transformers
Version: 4.44.0
Summary: State-of-the-art Machine Learning for JAX, PyTorch and TensorFlow
Home-page: https://github.com/huggingface/transformers
Author: The Hugging Face team (past and future) with the help of all our contributors (https://github.com/huggingface/transformers/graphs/contributors)
Author-email: transformers@huggingface.co
License: Apache 2.0 License
Location: /usr/local/lib/python3.10/dist-packages
Requires: filelock, huggingface-hub, numpy, packaging, pyyaml, regex, requests, safetensors, tokenizers, tqdm
Required-by: auto-round, auto_gptq, gptqmodel, intel-extension-for-transformers, lm_eval, peft
---
Name: accelerate
Version: 0.33.0
Summary: Accelerate
Home-page: https://github.com/huggingface/accelerate
Author: The HuggingFace team
Author-email: zach.mueller@huggingface.co
License: Apache
Location: /usr/local/lib/python3.10/dist-packages
Requires: huggingface-hub, numpy, packaging, psutil, pyyaml, safetensors, torch
Required-by: auto-round, auto_gptq, gptqmodel, lm_eval, peft
---
Name: triton
Version: 3.0.0
Summary: A language and compiler for custom Deep Learning operations
Home-page: https://github.com/triton-lang/triton/
Author: Philippe Tillet
Author-email: phil@openai.com
License:
Location: /usr/local/lib/python3.10/dist-packages
Requires: filelock
Required-by: gptqmodel, torch

If you are reporting an inference bug of a post-quantized model, please post the content of config.json and quantize_config.json.

To Reproduce

Run the following script:

from transformers import AutoTokenizer, TextGenerationPipeline
# from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
from gptqmodel import GPTQModel as AutoGPTQForCausalLM
from gptqmodel import QuantizeConfig as BaseQuantizeConfig
import random, json, os, shutil
import loguru

model_in_dir = "/workspace/model/ZhipuAI/chatglm3-6b"
model_out_dir = "/baifan/quant/glm3-6b_gptq"
examples_list = ["./alpaca_en_demo.json","./alpaca_zh_demo.json"]
max_samples = 1000

def copy_folder_without(source,target,without=[".safetensors",".bin"]):

    if not os.path.isdir(target):
        os.mkdir(target)

    for foldername, subfolders, filenames in os.walk(source):
        for filename in filenames:
            file_path = os.path.join(foldername, filename)
            for suffix in without:
                if filename.endswith(suffix):
                    break
            else:
                target_file_path = os.path.join(target, filename)
                shutil.copy2(file_path, target_file_path)

def load_examples(file_path_list:list,max:int=1000)->list :
    examples = []
    for path in file_path_list:
        with open(path,"r",encoding="utf-8") as f:
            templist = json.load(f)
            examples.extend(templist)

    random.shuffle(examples)
    if len(examples)<max:
        return examples 
    else:
        return examples[:max]

if __name__=="__main__":
    copy_folder_without(model_in_dir,model_out_dir)
    tokenizer = AutoTokenizer.from_pretrained(model_in_dir, use_fast=True,trust_remote_code=True)
    quantize_config = BaseQuantizeConfig(
        # format="gptq_v2",
        bits=4,  # quantize model to 4-bit
        group_size=128,  # it is recommended to set the value to 128
        # desc_act=False,  # setting to False can significantly speed up inference, but perplexity may be slightly worse
    )

    # load the un-quantized model; by default, the model will always be loaded into CPU memory
    model = AutoGPTQForCausalLM.from_pretrained(model_in_dir, quantize_config,trust_remote_code=True,device_map="cpu") #,device_map="cpu"

    # quantize the model; the examples should be a list of dicts whose keys can only be "input_ids" and "attention_mask"
    myexamples = load_examples(examples_list,max=max_samples)
    # for i in range(0,len(myexamples),4):
    #     examples = [tokenizer(myexamples[i+j]["instruction"]+myexamples[i+j]["input"]) for j in range(4)]
    # examples = [tokenizer(x["instruction"]+x["input"]) for x in myexamples]
    examples = [tokenizer("世界上最遥远的路程一定是最难走的路,但是每个人都会选择的路。")]
    model.quantize(examples)

    # save quantized model
    # model.save_quantized(model_out_dir)

    # save quantized model using safetensors
    model.save_quantized(model_out_dir, use_safetensors=True)

Expected behavior

model.quantize(examples) runs without error.

Model/Datasets

https://huggingface.co/THUDM/chatglm3-6b

Screenshots

root@bf-llm-xinfer-vllm-pod-1-6d69cc5b69-hxpbq:/baifan/quant# python3 quantization_gptq.py

[HAMI-core Warn(83428:140567479620672:utils.c:183)]: get default cuda from (null)

{'device_map': 'cpu', 'trust_remote_code': True, 'torch_dtype': torch.float16}
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████| 7/7 [00:01<00:00, 4.94it/s]
WARNING - Calibration dataset size should be greater than 256. Current size: 1.
WARNING - The average length of input_ids of calibration_dataset should be greater than 256: actual avg: 19.0.
WARNING - Model config does not have pad token mapped. Please pass in tokenizer to quantize() so GPTQModel can auto-select the best pad token.
Traceback (most recent call last):
  File "/baifan/quant/quantization_gptq.py", line 115, in <module>
    model.quantize(examples)
  File "/usr/local/lib/python3.10/dist-packages/gptqmodel/models/base.py", line 410, in quantize
    self.model(example)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/chatglm3-6b/modeling_chatglm.py", line 937, in forward
    transformer_outputs = self.transformer(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/chatglm3-6b/modeling_chatglm.py", line 830, in forward
    hidden_states, presents, all_hidden_states, all_self_attentions = self.encoder(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/chatglm3-6b/modeling_chatglm.py", line 640, in forward
    layer_ret = layer(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1561, in _call_impl
    args_kwargs_result = hook(self, args, kwargs)  # type: ignore[misc]
  File "/usr/local/lib/python3.10/dist-packages/gptqmodel/models/base.py", line 367, in store_input_hook
    layer_input.append(move_to(inp, data_device))
  File "/usr/local/lib/python3.10/dist-packages/gptqmodel/utils/model.py", line 74, in move_to
    if get_device(obj) != device:
  File "/usr/local/lib/python3.10/dist-packages/gptqmodel/utils/model.py", line 70, in get_device
    return next(obj.parameters()).device
AttributeError: 'NoneType' object has no attribute 'parameters'
[HAMI-core Msg(83428:140567479620672:multiprocess_memory_limit.c:468)]: Calling exit handler 83428
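
Separately from the crash, the three WARNING lines above complain about the single short calibration example: GPTQModel expects a larger calibration set and asks for the tokenizer to be passed to quantize() so it can auto-select a pad token. A minimal sketch of how that could look using the alpaca JSON files already loaded by load_examples(); the tokenizer= keyword name is an assumption inferred from the warning text, and the "instruction"/"input" keys follow the commented-out code in the script above:

# Sketch only: build a larger calibration set and hand the tokenizer to
# quantize(); the tokenizer= keyword name is an assumption based on the warning.
myexamples = load_examples(examples_list, max=max_samples)
examples = [tokenizer(x["instruction"] + x.get("input", "")) for x in myexamples]
model.quantize(examples, tokenizer=tokenizer)

This would address the dataset-size and pad-token warnings, but as the traceback shows, the NoneType failure happens regardless.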

Additional context

Already tried (with no effect this time):

  1. changing the transformers version from 4.44 to 4.43 (for the bugs reported in the glm4 repo)
  2. adding format="gptq_v2" to QuantizeConfig (this previously fixed a model deepcopy failure)
  3. removing desc_act=False from QuantizeConfig
LRL-ModelCloud commented 3 weeks ago

@bf96163 Look at this PR: https://huggingface.co/THUDM/glm-4-9b/discussions/4/files. You can make the same changes to your local modeling_chatglm.py; it should fix your NoneType issue.

bf96163 commented 3 weeks ago

> @bf96163 Look at this PR: https://huggingface.co/THUDM/glm-4-9b/discussions/4/files. You can make the same changes to your local modeling_chatglm.py; it should fix your NoneType issue.

Sorry, I tried the newest version of modeling_chatglm.py and the problem is still there...

File "/root/.cache/huggingface/modules/transformers_modules/glm-4-9b-chat/modeling_chatglm.py", line 998, in forward transformer_outputs = self.transformer( File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl return self._call_impl(*args, **kwargs) File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl return forward_call(*args, **kwargs) File "/root/.cache/huggingface/modules/transformers_modules/glm-4-9b-chat/modeling_chatglm.py", line 896, in forward hidden_states, presents, all_hidden_states, all_self_attentions = self.encoder( File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl return self._call_impl(*args, **kwargs) File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl return forward_call(*args, **kwargs) File "/root/.cache/huggingface/modules/transformers_modules/glm-4-9b-chat/modeling_chatglm.py", line 726, in forward layer_ret = layer( File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl return self._call_impl(*args, **kwargs) File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1561, in _call_impl args_kwargs_result = hook(self, args, kwargs) # type: ignore[misc] File "/usr/local/lib/python3.10/dist-packages/gptqmodel/models/base.py", line 367, in store_input_hook layer_input.append(move_to(inp, data_device)) File "/usr/local/lib/python3.10/dist-packages/gptqmodel/utils/model.py", line 74, in move_to if get_device(obj) != device: File "/usr/local/lib/python3.10/dist-packages/gptqmodel/utils/model.py", line 70, in get_device return next(obj.parameters()).device AttributeError: 'NoneType' object has no attribute 'parameters'

JACKYLUO1991 commented 1 week ago

@bf96163 Have you solved it?