Hi, the int4 model was quantized with auto-gptq, so it cannot be loaded with AutoModelForCausalLM.from_pretrained. You can load it like this:
from auto_gptq import AutoGPTQForCausalLM

model = AutoGPTQForCausalLM.from_quantized(
    model_path,
    inject_fused_attention=False,
    inject_fused_mlp=False,
    use_safetensors=True,
    use_cuda_fp16=True,
    disable_exllama=False,
    device_map='auto'  # supports multi-GPU
)
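For completeness, a minimal usage sketch once the model is loaded this way, assuming model is the object returned above and the tokenizer lives in the same model_path (the prompt is just a placeholder):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
inputs = tokenizer("def quick_sort(arr):", return_tensors="pt").to(model.device)  # placeholder prompt
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))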
That is how to load it. We have not tried further fine-tuning with the INT4 model as the base. If you are limited by resources, you can instead fine-tune the unquantized base model with qlora + 4bit by setting the following in the config.json file:
"peft_type": "qlora",
"quantization": "4bit",
I see that Qwen's GPTQ model can be loaded with AutoModelForCausalLM.from_pretrained, so I changed my config accordingly, but it still hangs. In principle Qwen's GPTQ model can be fine-tuned, and CodeFuse-CodeLlama-34B-4bits should work too.
AutoGPTQForCausalLM.from_pretrained internally calls AutoModelForCausalLM.from_pretrained() and, on top of that, reads some GPTQ quantization config. You can follow the AutoGPTQForCausalLM.from_pretrained code there to implement loading for training; for training you need to drop the model.eval() call.
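Following that, a minimal sketch of loading a GPTQ checkpoint for training directly with auto-gptq, assuming an auto-gptq version whose from_quantized exposes a trainable flag (the exllama kernel is typically disabled for training):

from auto_gptq import AutoGPTQForCausalLM

model = AutoGPTQForCausalLM.from_quantized(
    model_path,
    inject_fused_attention=False,
    inject_fused_mlp=False,
    use_safetensors=True,
    disable_exllama=True,   # exllama kernels are generally not usable for training
    trainable=True,         # assumption: this flag exists in the installed auto-gptq version
    device_map="auto",
)
model.train()               # instead of the model.eval() used for inference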
@twelveand0 I switched to AutoModelForCausalLM.from_pretrained, followed by:
layers = find_layers(model)
ignore_layers = ['lm_head', 'model.embed_tokens', 'model.norm']
for name in list(layers.keys()):
    if any([name.startswith(ignore_layer) for ignore_layer in ignore_layers]):
        del layers[name]
make_quant(
    model,
    layers,
    4,
    64,
    use_triton=False,
    disable_exllama=disable_exllama,
    use_cuda_fp16=True,
    desc_act=False,
    trainable=trainable
)
model.tie_weights()
device_map="auto" device=None
if isinstance(device_map, str) and device_map not in ["auto", "balanced", "balanced_low_0", "sequential"]:
raise ValueError(
"If passing a string for device_map
, please choose 'auto', 'balanced', 'balanced_low_0' or "
"'sequential'."
)
max_memory = None
if isinstance(device_map, dict):
    max_memory = None
else:
    if device is None and not device_map and not max_memory:
        device_map = "auto"
    if device is not None:
        device = torch.device(device)
        if not max_memory and not device_map:
            device_map = {"": device.index if device.type == "cuda" else device.type}
    if not isinstance(device_map, dict) and device_map != "sequential":
        max_memory = accelerate.utils.get_balanced_memory(
            model=model,
            max_memory=max_memory,
            no_split_module_classes=['LlamaDecoderLayer'],
            low_zero=(device_map == "balanced_low_0")
        )
    if not isinstance(device_map, dict):
        device_map = accelerate.infer_auto_device_map(
            model,
            max_memory=max_memory,
            no_split_module_classes=['LlamaDecoderLayer']
        )
accelerate.utils.modeling.load_checkpoint_in_model(
    model,
    checkpoint=model_save_name,
    device_map=device_map,
    offload_state_dict=True,
    offload_buffers=True
)
model = simple_dispatch_model(model, device_map)
model_config = model.config.to_dict()
seq_len_keys = ["max_position_embeddings", "seq_length", "n_positions"]
if any([k in model_config for k in seq_len_keys]):
    for key in seq_len_keys:
        if key in model_config:
            model.seqlen = model_config[key]
            break
else:
    model.seqlen = 4096
model = autogptq_post_init(model, use_act_order=False)
model.eval()

Starting from layers = find_layers(model), which of these later steps need to be kept when training? Also, if I change it this way, will LoRA training behave the same as with an ordinary (unquantized) model?
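For reference, once the model is assembled this way, LoRA would typically be attached with peft in the same way as for an unquantized model; a minimal sketch, assuming the installed peft version supports auto-gptq quantized linear layers (module names are illustrative LLaMA-style names):

from peft import LoraConfig, get_peft_model

model.train()  # keep the model in train mode instead of calling model.eval()
peft_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],  # illustrative; adjust to the actual module names
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, peft_config)
model.print_trainable_parameters()  # only the LoRA adapters should show as trainable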
I want to fine-tune CodeFuse-CodeLlama-34B-4bits. I first changed the model loading to AutoModelForCausalLM.from_pretrained, edited config.json to add a quantization_config, and renamed the model weight file, but it still hangs at model = AutoModelForCausalLM.from_pretrained(model_dir, trust_remote_code=True, device_map='auto', resume_download=True). How can I resolve this?