Unable to load the model after pruning #7

Open JCDemon opened 3 months ago

JCDemon commented 3 months ago

I tried to load the model with

    model = AutoModelForCausalLM.from_pretrained(args.model_path, device_map="auto", trust_remote_code=True, low_cpu_mem_usage=True)

but it fails with the following error:

    Traceback (most recent call last):
      File "/home/ubuntu/test_scripts/", line 154, in <module>
        main()
      File "/home/ubuntu/test_scripts/", line 63, in main
        model = AutoModelForCausalLM.from_pretrained(args.model_path, device_map="auto", trust_remote_code=True, low_cpu_mem_usage=True,
      File "/home/ubuntu/miniconda3/envs/xxx/lib/python3.10/site-packages/transformers/models/auto/", line 556, in from_pretrained
        return model_class.from_pretrained(
      File "/home/ubuntu/miniconda3/envs/xxx/lib/python3.10/site-packages/transformers/", line 3502, in from_pretrained
        ) = cls._load_pretrained_model(
      File "/home/ubuntu/miniconda3/envs/xxx/lib/python3.10/site-packages/transformers/", line 3926, in _load_pretrained_model
        new_error_msgs, offload_index, state_dict_index = _load_state_dict_into_meta_model(
      File "/home/ubuntu/miniconda3/envs/xxx/lib/python3.10/site-packages/transformers/", line 805, in _load_state_dict_into_meta_model
        set_module_tensor_to_device(model, param_name, param_device, **set_module_kwargs)
      File "/home/ubuntu/miniconda3/envs/xxx/lib/python3.10/site-packages/accelerate/utils/", line 348, in set_module_tensor_to_device
        raise ValueError(
    ValueError: Trying to set a tensor of shape torch.Size([2048, 2785]) in "weight" (which has shape torch.Size([2048, 5504])), this look incorrect.

I also tried adding the argument ignore_mismatched_sizes=True when loading:

    model = AutoModelForCausalLM.from_pretrained(args.model_path, device_map="auto", trust_remote_code=True, low_cpu_mem_usage=True, ignore_mismatched_sizes=True)

How do you load the model after pruning it? I am eager to try FLAP and look forward to your reply. Thank you.
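The ValueError above reports a checkpoint tensor of shape [2048, 2785] being loaded into a parameter of shape [2048, 5504]. Below is a minimal sketch of how such mismatches could be listed before loading, assuming one shape table built from the unpruned model definition and one from the pruned checkpoint. The function name and the key `example.weight` are hypothetical and not part of transformers or FLAP.

```python
from typing import Dict, List, Tuple

Shape = Tuple[int, ...]


def find_shape_mismatches(
    expected: Dict[str, Shape], found: Dict[str, Shape]
) -> List[Tuple[str, Shape, Shape]]:
    """Return (name, expected_shape, found_shape) for every tensor whose shape differs."""
    mismatches = []
    for name, exp_shape in expected.items():
        got = found.get(name)
        if got is not None and got != exp_shape:
            mismatches.append((name, exp_shape, got))
    return mismatches


# Shapes copied from the error message; the key name is illustrative only.
expected_shapes = {"example.weight": (2048, 5504)}
checkpoint_shapes = {"example.weight": (2048, 2785)}

for name, exp_shape, got in find_shape_mismatches(expected_shapes, checkpoint_shapes):
    print(f"{name}: model expects {exp_shape}, checkpoint has {got}")
```

For a real PyTorch model, a shape table can be collected with `{k: tuple(v.shape) for k, v in model.state_dict().items()}`.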

BenchuYee commented 3 months ago

Hi JCDemon, if you want to load the model, you can use the PyTorch API (torch.load) instead of the Hugging Face API (from_pretrained, save_pretrained). Once you finish pruning the model, save it and then use torch.load to load the pruned model.
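A minimal sketch of that round trip, assuming the whole pruned module is pickled with torch.save and restored with torch.load. The tiny nn.Sequential only stands in for a pruned LLM, and the file name is illustrative. On recent PyTorch versions torch.load defaults to weights_only=True, which refuses to unpickle full modules, so the sketch passes weights_only=False explicitly.

```python
import os
import tempfile

import torch
from torch import nn

# Stand-in for a model whose hidden layer was pruned down to 5 units.
# The layer sizes need not match any config, because the module object
# itself is serialized together with its weights.
pruned_model = nn.Sequential(nn.Linear(8, 5), nn.ReLU(), nn.Linear(5, 8))

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "pruned_model.bin")  # illustrative file name
    # Pickle the whole module: architecture and weights together.
    torch.save(pruned_model, path)
    # weights_only=False lets torch.load rebuild the full module object.
    # Only do this with files you trust, since unpickling can run code.
    loaded_model = torch.load(path, map_location="cpu", weights_only=False)

x = torch.randn(2, 8)
with torch.no_grad():
    same_output = torch.allclose(pruned_model(x), loaded_model(x))
print(loaded_model[0].weight.shape, same_output)
```

Because the pruned layer shapes travel with the pickled object, no config has to agree with them, which is why this route sidesteps the shape check that from_pretrained performs.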

JCDemon commented 3 months ago

I've solved the loading issue, thank you for your response. I was using FLAP to prune a llama-version Qwen model (I converted the Qwen 1.8B model into llama2 format beforehand). I found that I can use save_pretrained to save the model in HF format. I then converted the pruned HF-format llama model back into a Qwen model and found that I couldn't load it using from_pretrained(). Fortunately, I managed to modify the Qwen model structure (by applying a dynamic head_size to each layer), and I can now load the model using from_pretrained().

Currently, I have a problem using the pruned model to generate text. The output seems to be garbled text instead of correct sentences. I think the reason could be that I didn't modify the word embedding step: I checked the input_ids going into the model and they are correct, but once they are converted to inputs_embeds, the embeddings seem totally wrong. Any suggestions about this?


The input_ids here are correct, but the inputs_embeds are wrong. I checked self.wte, and it is an nn.Embedding object. The inputs_embeds printed above has shape torch.Size([1, 12, 2048]), but I think it should be torch.Size([1, 12, 768]). Since I already set the hidden_size of the first layer to 768, I don't know why it still has the original value of 2048.
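For reference, the width of inputs_embeds is fixed by the embedding table itself, not by the sizes of the layers that follow it. A minimal sketch with toy sizes, assuming wte is a plain nn.Embedding as noted above; the vocabulary size and token ids are made up for illustration.

```python
import torch
from torch import nn

vocab_size, embed_dim = 100, 2048  # toy vocabulary; 2048 mirrors the printed shape
wte = nn.Embedding(vocab_size, embed_dim)

input_ids = torch.randint(0, vocab_size, (1, 12))  # batch of 1, 12 tokens
inputs_embeds = wte(input_ids)
print(inputs_embeds.shape)  # last dimension equals wte.embedding_dim

# Configuring a later layer for 768 inputs does not change the lookup result.
# Getting 768-wide embeddings requires a 768-wide table (or a projection).
narrow_wte = nn.Embedding(vocab_size, 768)
narrow_embeds = narrow_wte(input_ids)
print(narrow_embeds.shape)
```

If only attention heads and MLP channels were pruned, the hidden size stays at 2048 and this output width is expected unless the embedding table itself was resized.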

Also, I couldn't use the pruned llama model mentioned above (which is actually a llama-version Qwen model) to generate text. I loaded it with exactly the same torch.load code that you provide on GitHub and ran into the same garbled-text issue as described earlier. BTW, the tokenizer I used is the Qwen tokenizer.

shwu-nyunai commented 3 months ago

Hi @JCDemon, can you share the code that you used to load the model with from_pretrained()?

I am trying to save the model, but it doesn't seem to get saved for me.


JCDemon commented 3 months ago

Sure thing. Here is the code I used to prune the llama-version Qwen model and save it in HF format. The line model.save_pretrained(args.save_model, safe_serialization=True) is what actually made it work.

    import argparse
    import os
    import numpy as np
    import torch
    from transformers import AutoTokenizer, AutoModelForCausalLM
    from models.hf_llama.modeling_llama import LlamaForCausalLM

    from importlib.metadata import version

    from lib.prune import prune_wanda_sp, prune_flap, prune_magnitude_sp, check_sparsity
    from lib.eval import eval_ppl

    print('torch', version('torch'))
    print('transformers', version('transformers'))
    print('accelerate', version('accelerate'))
    print('# of gpus: ', torch.cuda.device_count())


    def get_llm(model, cache_dir="llm_weights"):
        # model = AutoModelForCausalLM.from_pretrained(
        #     model,
        #     torch_dtype=torch.float16,
        #     cache_dir=cache_dir,
        #     low_cpu_mem_usage=True,
        #     device_map="auto"
        # )
        model = LlamaForCausalLM.from_pretrained(
            model,  # the remaining arguments of this call are missing from the post
            # device_map="auto"
        )
        # Re-create the o_proj and down_proj biases as zero-valued parameters.
        for i in range(len(model.model.layers)):
            model.model.layers[i].self_attn.o_proj.bias = torch.nn.Parameter(torch.zeros_like(model.model.layers[i].self_attn.o_proj.bias, device='cpu'))  # or 'cuda'
            model.model.layers[i].mlp.down_proj.bias = torch.nn.Parameter(torch.zeros_like(model.model.layers[i].mlp.down_proj.bias, device='cpu'))  # or 'cuda'

        model.seqlen = 128
        return model


    def main():
        parser = argparse.ArgumentParser()
        parser.add_argument('--model', type=str, help='LLaMA model')  # Huggingface model name
        parser.add_argument('--seed', type=int, default=0, help='Seed for sampling the calibration data.')
        parser.add_argument('--nsamples', type=int, default=2048, help='Number of calibration samples.')
        parser.add_argument('--pruning_ratio', type=float, default=0, help='Pruning ratio.')
        parser.add_argument('--remove_heads', type=int, default=8, help='Remove num_heads')
        parser.add_argument("--metrics", type=str, default="WIFV", choices=["IFV", "WIFV", "WIFN", 'N/A'])
        parser.add_argument("--structure", type=str, default="AL-AM", choices=["UL-UM", "UL-MM", "AL-MM", "AL-AM", 'N/A'])
        parser.add_argument("--prune_method", type=str, default="flap", choices=["flap", "wanda_sp", "mag_sp"])
        parser.add_argument("--cache_dir", default="llm_weights", type=str)
        parser.add_argument('--unstr', action="store_true")
        parser.add_argument('--eval', action="store_true")
        parser.add_argument('--save_model', type=str, default=None, help='Path to save the pruned model.')
        args = parser.parse_args()

        # Setting seeds for reproducibility

        # Build the model and tokenizer
        print(f"loading llm model {args.model}")
        model = get_llm(args.model, args.cache_dir)
        device = torch.device("cuda:0")
        tokenizer = AutoTokenizer.from_pretrained(args.model, use_fast=False, trust_remote_code=True)

        if "30b" in args.model or "65b" in args.model:  # for 30b and 65b we use device_map to load onto multiple A6000 GPUs, thus the processing here.
            device = model.hf_device_map["lm_head"]
        print("use device ", device)

        # Prune the model
        print("pruning starts")
        if args.prune_method == "flap":
            if args.metrics == 'N/A':
                raise ValueError("For FLAP pruning, the metrics parameter must be chosen from ['IFV', 'WIFV', 'WIFN']. 'N/A' is not a valid choice.")
            if args.structure == 'N/A':
                raise ValueError("For FLAP pruning, the compressed model structure parameter must be chosen from ['UL-UM', 'UL-MM', 'AL-MM', 'AL-AM']. 'N/A' is not a valid choice.")
            prune_flap(args, model, tokenizer, device)
        elif args.prune_method == "wanda_sp":
            prune_wanda_sp(args, model, tokenizer, device)
        elif args.prune_method == "mag_sp":
            prune_magnitude_sp(args, model, tokenizer, device)

        # Check the sparsity of the model
        sparsity_ratio = check_sparsity(model)
        print(f"sparsity sanity check {sparsity_ratio:.4f}")
        print(f"model parameter {sum(p.numel() for p in model.parameters()) / 1000 ** 3:.2f}B")

        # Evaluate the model
        if args.eval:
            ppl = eval_ppl(model, tokenizer, device)
            print(f"ppl on wikitext {ppl}")

        # Save the model
        if args.save_model:
            if not os.path.exists(args.save_model):
                os.makedirs(args.save_model)  # assumed; the body of this branch is missing from the post
            #, f'{args.save_model}/')
            #, f'{args.save_model}/pruned_model.bin')
            model.save_pretrained(args.save_model, safe_serialization=True)
            # tokenizer.save_pretrained(args.save_model)


    if __name__ == '__main__':
        main()

JCDemon commented 3 months ago

I think the reason I can use save_pretrained to save my model might be that it's not an official llama2 model; the model I used was converted from Qwen 1.8B. Also, I think the issue you mentioned in #9 doesn't make sense, as the weight_dict should have been modified after pruning, so it's weird that the original model is output. Maybe you can try checking the weights before using save_pretrained?
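One way to run that check, sketched under the assumption that a name-to-shape snapshot is taken before and after pruning (for a PyTorch model, `{k: tuple(v.shape) for k, v in model.state_dict().items()}`). The helper names and the toy tensor names are hypothetical.

```python
from math import prod
from typing import Dict, List, Tuple

Shape = Tuple[int, ...]


def count_params(shapes: Dict[str, Shape]) -> int:
    """Total number of elements across all tensors in a shape snapshot."""
    return sum(prod(shape) for shape in shapes.values())


def changed_tensors(before: Dict[str, Shape], after: Dict[str, Shape]) -> List[str]:
    """Names of tensors that were resized or removed between the two snapshots."""
    return [name for name, shape in before.items() if after.get(name) != shape]


# Toy snapshots; a real run would build these from model.state_dict().
before = {"mlp.up.weight": (5504, 2048), "mlp.down.weight": (2048, 5504)}
after = {"mlp.up.weight": (2785, 2048), "mlp.down.weight": (2048, 2785)}

resized = changed_tensors(before, after)
print(f"{len(resized)} tensors changed, "
      f"{count_params(before)} -> {count_params(after)} parameters")
if not resized:
    print("Nothing changed: save_pretrained would write the original shapes.")
```

This only compares shapes, so pruning that zeroes weights in place without resizing them would need a value comparison instead.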