import os
# Visible GPU (set before importing torch)
os.environ["CUDA_VISIBLE_DEVICES"] = "0"
import torch
from PIL import Image
import json
import pickle
from tqdm import tqdm
from modelscope import AutoModelForCausalLM, AutoTokenizer
# from transformers import AutoModelForCausalLM, AutoTokenizer
# from accelerate import init_empty_weights, load_checkpoint_and_dispatch, infer_auto_device_map
MODEL_PATH = "/data3/lisibo/.cache/modelscope/hub/ZhipuAI/cogvlm2-llama3-chinese-chat-19B-int4"
# MODEL_PATH= "ZhipuAI/cogvlm2-llama3-chinese-chat-19B-int4"
DEVICE = 'cuda' if torch.cuda.is_available() else 'cpu'
TORCH_TYPE = torch.bfloat16 if torch.cuda.is_available() and torch.cuda.get_device_capability()[0] >= 8 else torch.float16
print("TORCH_TYPE:", TORCH_TYPE)
tokenizer = AutoTokenizer.from_pretrained(
    MODEL_PATH,
    trust_remote_code=True
)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_PATH,
    torch_dtype=TORCH_TYPE,
    trust_remote_code=True,
    low_cpu_mem_usage=True,
).eval()
The code above is the same as the example on the int4 model card on Hugging Face.
I am using it to load the 4-bit checkpoint directly. I expected that the model would not need to be quantized at load time, so loading should be faster. However, an error occurs while loading the model. The logs are below.
2024-09-03 20:35:38,687 - modelscope - INFO - PyTorch version 2.3.0 Found.
2024-09-03 20:35:38,689 - modelscope - INFO - Loading ast index from /data3/lisibo/.cache/modelscope/ast_indexer
2024-09-03 20:35:38,725 - modelscope - INFO - Loading done! Current index file version is 1.14.0, with md5 3753725eddbea1b58b893b7ccc61de0b and a total number of 976 components indexed
/data3/lisibo/.conda/envs/py310/lib/python3.10/site-packages/transformers/utils/generic.py:260: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
torch.utils._pytree._register_pytree_node(
TORCH_TYPE: torch.bfloat16
Traceback (most recent call last):
File "/data3/lisibo/euluc/CogVLM2/basic_demo/cli_demo_3.py", line 27, in <module>
model = AutoModelForCausalLM.from_pretrained(
File "/data3/lisibo/.conda/envs/py310/lib/python3.10/site-packages/modelscope/utils/hf_util.py", line 113, in from_pretrained
module_obj = module_class.from_pretrained(model_dir, *model_args,
File "/data3/lisibo/.conda/envs/py310/lib/python3.10/site-packages/transformers/models/auto/auto_factory.py", line 511, in from_pretrained
return model_class.from_pretrained(
File "/data3/lisibo/.conda/envs/py310/lib/python3.10/site-packages/modelscope/utils/hf_util.py", line 76, in from_pretrained
return ori_from_pretrained(cls, model_dir, *model_args, **kwargs)
File "/data3/lisibo/.conda/envs/py310/lib/python3.10/site-packages/transformers/modeling_utils.py", line 3091, in from_pretrained
) = cls._load_pretrained_model(
File "/data3/lisibo/.conda/envs/py310/lib/python3.10/site-packages/transformers/modeling_utils.py", line 3471, in _load_pretrained_model
new_error_msgs, offload_index, state_dict_index = _load_state_dict_into_meta_model(
File "/data3/lisibo/.conda/envs/py310/lib/python3.10/site-packages/transformers/modeling_utils.py", line 744, in _load_state_dict_into_meta_model
set_module_quantized_tensor_to_device(
File "/data3/lisibo/.conda/envs/py310/lib/python3.10/site-packages/transformers/utils/bitsandbytes.py", line 116, in set_module_quantized_tensor_to_device
new_value = nn.Parameter(new_value, requires_grad=old_value.requires_grad)
File "/data3/lisibo/.conda/envs/py310/lib/python3.10/site-packages/torch/nn/parameter.py", line 40, in __new__
return torch.Tensor._make_subclass(cls, data, requires_grad)
RuntimeError: Only Tensors of floating point and complex dtype can require gradients
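Since the traceback ends in the bitsandbytes helper of transformers, one thing that may be worth checking is whether the local checkpoint actually declares itself as pre-quantized. The following is a minimal sketch, assuming the checkpoint directory contains a Hugging Face-style `config.json` and that a pre-quantized checkpoint stores its settings under a `quantization_config` key. The helper name `read_quantization_config` is hypothetical and not part of any library.

```python
import json
from pathlib import Path


def read_quantization_config(model_dir):
    """Return the `quantization_config` entry of config.json, or None if absent.

    Assumption: the checkpoint follows the Hugging Face layout, where
    config.json sits directly inside the model directory.
    """
    config_path = Path(model_dir) / "config.json"
    with config_path.open(encoding="utf-8") as f:
        config = json.load(f)
    # A missing key suggests the checkpoint is not marked as pre-quantized.
    return config.get("quantization_config")
```

Calling `read_quantization_config(MODEL_PATH)` with the path from the script above and printing the result shows whether the loader has any quantization settings to work with.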
Expected behavior
I used the same code in May, and it worked.
However, when I recently needed to restart the project, the model failed to load. The environment does not appear to have been modified since 2024.6.9.
System Info
CUDA==12.4, transformers==4.32.0, torch==2.3.0, xformers==0.0.26.post1, triton==2.3.0, Device = 3090/4090
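To confirm whether the environment really is unchanged, the installed versions can be read from the active interpreter rather than from memory. This is a minimal sketch using only the standard library. The package list is an assumption: `bitsandbytes`, `accelerate`, and `modelscope` are not in the list above and are included only because the traceback passes through modelscope and the transformers bitsandbytes integration.

```python
from importlib.metadata import PackageNotFoundError, version

# Distribution names to check. bitsandbytes, accelerate, and modelscope are
# assumptions added here; they are not part of the versions reported above.
PACKAGES = ["torch", "transformers", "modelscope", "bitsandbytes",
            "accelerate", "xformers", "triton"]


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = version(name)
        except PackageNotFoundError:
            versions[name] = None
    return versions


for name, ver in installed_versions(PACKAGES).items():
    print(f"{name}=={ver}" if ver else f"{name}: not installed")
```

Running this inside the `py310` conda environment from the traceback and comparing the output with the versions listed above would show whether any package changed since May.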
Who can help?
No response