CASIA-IVA-Lab / FLAP

[AAAI 2024] Fluctuation-based Adaptive Structured Pruning for Large Language Models
https://arxiv.org/abs/2312.11983
Apache License 2.0

Unable to load the model after pruning #7

Open JCDemon opened 3 months ago

JCDemon commented 3 months ago

I tried to load the pruned model with model = AutoModelForCausalLM.from_pretrained(args.model_path, device_map="auto", trust_remote_code=True, low_cpu_mem_usage=True), but it raises the following error:

Traceback (most recent call last):
  File "/home/ubuntu/test_scripts/benchmark_r.py", line 154, in <module>
    main()
  File "/home/ubuntu/test_scripts/benchmark_r.py", line 63, in main
    model = AutoModelForCausalLM.from_pretrained(args.model_path, device_map="auto", trust_remote_code=True, low_cpu_mem_usage=True,
  File "/home/ubuntu/miniconda3/envs/xxx/lib/python3.10/site-packages/transformers/models/auto/auto_factory.py", line 556, in from_pretrained
    return model_class.from_pretrained(
  File "/home/ubuntu/miniconda3/envs/xxx/lib/python3.10/site-packages/transformers/modeling_utils.py", line 3502, in from_pretrained
    ) = cls._load_pretrained_model(
  File "/home/ubuntu/miniconda3/envs/xxx/lib/python3.10/site-packages/transformers/modeling_utils.py", line 3926, in _load_pretrained_model
    new_error_msgs, offload_index, state_dict_index = _load_state_dict_into_meta_model(
  File "/home/ubuntu/miniconda3/envs/xxx/lib/python3.10/site-packages/transformers/modeling_utils.py", line 805, in _load_state_dict_into_meta_model
    set_module_tensor_to_device(model, param_name, param_device, **set_module_kwargs)
  File "/home/ubuntu/miniconda3/envs/xxx/lib/python3.10/site-packages/accelerate/utils/modeling.py", line 348, in set_module_tensor_to_device
    raise ValueError(
ValueError: Trying to set a tensor of shape torch.Size([2048, 2785]) in "weight" (which has shape torch.Size([2048, 5504])), this look incorrect.

I also tried adding ignore_mismatched_sizes=True when loading: model = AutoModelForCausalLM.from_pretrained(args.model_path, device_map="auto", trust_remote_code=True, low_cpu_mem_usage=True, ignore_mismatched_sizes=True)

This also produces an error: Some weights of QWenLMHeadModel were not initialized from the model checkpoint at /data/xxxx and are newly initialized because the shapes did not match:

How do you load the model after pruning it? I am eager to try FLAP and look forward to your reply. Thank you.

BenchuYee commented 3 months ago

Hi JCDemon, if you want to load the model, you can use the PyTorch API (torch.save, torch.load) instead of the Hugging Face API (save_pretrained, from_pretrained). When you finish pruning the model, use torch.save to save it and then torch.load to load the pruned model.
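
A minimal sketch of that save/load flow, assuming the pruned model object is still in memory after the pruning script runs; the file path is a placeholder:

import torch

# Save the whole model object (not just the state_dict), so the pruned,
# per-layer tensor shapes travel with it.
torch.save(model, "pruned_model/pruned_model.pt")

# Load it back directly; no config-based shape checking is involved.
# Note that torch.load unpickles the model class, so the same modeling code
# must be importable in the environment that loads it.
model = torch.load("pruned_model/pruned_model.pt")
model.eval()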

JCDemon commented 3 months ago

Hi JCDemon, if you want to load the model, you can use the PyTorch API (torch.save, torch.load) instead of the Hugging Face API (save_pretrained, from_pretrained). When you finish pruning the model, use torch.save to save it and then torch.load to load the pruned model.

I've solved the loading issue, thank you for your response. I was using FLAP to prune a LLaMA-version Qwen model (I converted the Qwen 1.8B model into LLaMA-2 format beforehand). I found that I can use save_pretrained to save the pruned model in HF format. I then converted the HF-format pruned LLaMA model back into a Qwen model and found I couldn't load it with from_pretrained(). Fortunately, I managed to modify the Qwen structure defined in modeling.py (by applying a dynamic head size to each layer), and after that I could finally load the model with from_pretrained(). Currently, I have a problem using the pruned model to generate text: the output is garbled text instead of correct sentences. I think the reason could be that I didn't modify the word-embedding step (I checked the input_ids running through the model and they are correct, but the resulting inputs_embeds seem totally wrong). Any suggestion about this?


The input_ids here are correct, but the inputs_embeds are wrong. I checked self.wte; it is an nn.Embedding object. The printed inputs_embeds shape is torch.Size([1, 12, 2048]), but I think it should be torch.Size([1, 12, 768]), because I already set the hidden_size of the first layer to 768. I don't know why it is still the original value 2048.

Also, I couldn't use the above-mentioned pruned LLaMA model (which is actually a LLaMA-version Qwen model) to generate text. I loaded it with exactly the same torch.load code you use in generate.py in this repo, and I ran into the same garbled-output issue mentioned earlier. By the way, the tokenizer I used is the Qwen tokenizer.
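
One way to narrow down the 2048-vs-768 mismatch described here is to compare the embedding width with the input width the first decoder layer actually expects. This is only a diagnostic sketch; the module names q_proj and c_attn are assumptions about the LLaMA-style and Qwen-style layouts, not code from this repo:

import torch.nn as nn

def report_embedding_vs_first_layer(model):
    # Width of the token embedding (embed_tokens in LLaMA-style models,
    # transformer.wte in Qwen-style ones).
    for name, module in model.named_modules():
        if isinstance(module, nn.Embedding):
            print(f"{name}: embedding weight {tuple(module.weight.shape)}")  # (vocab_size, hidden_size)
            break
    # in_features of the first attention projection is the hidden size that
    # the first (pruned) decoder layer expects as input.
    for name, module in model.named_modules():
        if isinstance(module, nn.Linear) and ("q_proj" in name or "c_attn" in name):
            print(f"{name}: in_features={module.in_features}, out_features={module.out_features}")
            break

If the two widths disagree, the embedding layer was not updated to match the modified first layer.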

shwu-nyunai commented 3 months ago

Hi @JCDemon, can you share with me the code you used to load the model with from_pretrained()?

I am trying to save with torch.save but it doesn't seem to save the model for me. #9

JCDemon commented 3 months ago

Hi @JCDemon, can you share with me the code you used to load the model with from_pretrained()?

I am trying to save with torch.save but it doesn't seem to save the model for me. #9

Sure thing, here is the code I used to prune the LLaMA-version Qwen model and save it in HF format. The line model.save_pretrained(args.save_model, safe_serialization=True) is what actually made it work.

import argparse
import os
import numpy as np
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from models.hf_llama.modeling_llama import LlamaForCausalLM

from importlib.metadata import version

from lib.prune import prune_wanda_sp, prune_flap, prune_magnitude_sp, check_sparsity
from lib.eval import eval_ppl

print('torch', version('torch'))
print('transformers', version('transformers'))
print('accelerate', version('accelerate'))
print('# of gpus: ', torch.cuda.device_count())


def get_llm(model, cache_dir="llm_weights"):
    # model = AutoModelForCausalLM.from_pretrained(
    #     model,
    #     torch_dtype=torch.float16,
    #     cache_dir=cache_dir,
    #     low_cpu_mem_usage=True,
    #     device_map="auto"
    # )
    model = LlamaForCausalLM.from_pretrained(
        model,
        torch_dtype=torch.float16,
        cache_dir=cache_dir,
        low_cpu_mem_usage=True,
        # device_map="auto"
    )
    print(len(model.model.layers))
    for i in range(len(model.model.layers)):
        model.model.layers[i].self_attn.o_proj.bias = torch.nn.Parameter(torch.zeros_like(model.model.layers[i].self_attn.o_proj.bias, device='cpu'))  # or 'cuda'
        model.model.layers[i].mlp.down_proj.bias = torch.nn.Parameter(torch.zeros_like(model.model.layers[i].mlp.down_proj.bias, device='cpu'))  # or 'cuda'
        torch.nn.init.zeros_(model.model.layers[i].self_attn.o_proj.bias)
        torch.nn.init.zeros_(model.model.layers[i].mlp.down_proj.bias)

    model.seqlen = 128
    return model


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument('--model', type=str, help='LLaMA model')  # Huggingface model name
    parser.add_argument('--seed', type=int, default=0, help='Seed for sampling the calibration data.')
    parser.add_argument('--nsamples', type=int, default=2048, help='Number of calibration samples.')
    parser.add_argument('--pruning_ratio', type=float, default=0, help='Pruning ratio.')
    parser.add_argument('--remove_heads', type=int, default=8, help='Remove num_heads')
    parser.add_argument("--metrics", type=str, default="WIFV", choices=["IFV", "WIFV", "WIFN", 'N/A'])
    parser.add_argument("--structure", type=str, default="AL-AM", choices=["UL-UM", "UL-MM", "AL-MM", "AL-AM", 'N/A'])
    parser.add_argument("--prune_method", type=str, default="flap", choices=["flap", "wanda_sp", "mag_sp"])
    parser.add_argument("--cache_dir", default="llm_weights", type=str)
    parser.add_argument('--unstr', action="store_true")
    parser.add_argument('--eval', action="store_true")
    parser.add_argument('--save_model', type=str, default=None, help='Path to save the pruned model.')
    args = parser.parse_args()

    # Setting seeds for reproducibility
    np.random.seed(args.seed)
    torch.random.manual_seed(args.seed)

    # Build the model and tokenizer
    print(f"loading llm model {args.model}")
    model = get_llm(args.model, args.cache_dir)
    device = torch.device("cuda:0")
    model.to(device)
    model.eval()
    tokenizer = AutoTokenizer.from_pretrained(args.model, use_fast=False, trust_remote_code=True)

    if "30b" in args.model or "65b" in args.model:  # for 30b and 65b we use device_map to load onto multiple A6000 GPUs, thus the processing here.
        device = model.hf_device_map["lm_head"]
    print("use device ", device)

    # Prune the model
    print("pruning starts")
    if args.prune_method == "flap":
        if args.metrics == 'N/A':
            raise ValueError("For FLAP pruning, the metrics parameter must be chosen from ['IFV', 'WIFV', 'WIFN']. 'N/A' is not a valid choice.")
        if args.structure == 'N/A':
            raise ValueError("For FLAP pruning, the compressed model structure parameter must be chosen from ['UL-UM', 'UL-MM', 'AL-MM', 'AL-AM']. 'N/A' is not a valid choice.")
        prune_flap(args, model, tokenizer, device)
    elif args.prune_method == "wanda_sp":
        prune_wanda_sp(args, model, tokenizer, device)
    elif args.prune_method == "mag_sp":
        prune_magnitude_sp(args, model, tokenizer, device)

    # Check the sparsity of the model
    print("*" * 30)
    sparsity_ratio = check_sparsity(model)
    print(f"sparsity sanity check {sparsity_ratio:.4f}")
    print(f"model parameter {sum(p.numel() for p in model.parameters()) / 1000 ** 3:.2f}B")
    print("*" * 30)

    # Evaluate the model
    if args.eval:
        ppl = eval_ppl(model, tokenizer, device)
        print(f"ppl on wikitext {ppl}")

    # Save the model
    if args.save_model:
        if not os.path.exists(args.save_model):
            os.makedirs(args.save_model)
        # torch.save(model, f'{args.save_model}/pruned_model.pt')
        # torch.save(model, f'{args.save_model}/pruned_model.bin')
        model.save_pretrained(args.save_model, safe_serialization=True)
        # tokenizer.save_pretrained(args.save_model)


if __name__ == '__main__':
    main()
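
One way to verify that the pruned, per-layer shapes actually reached disk after save_pretrained is to open the resulting safetensors file and print every tensor shape. This is only a sketch; the directory name is a placeholder, and the file name model.safetensors assumes the checkpoint was small enough not to be sharded:

from safetensors import safe_open

# Inspect the checkpoint written by save_pretrained(..., safe_serialization=True).
# After pruning, the attention/MLP projections should show smaller shapes, and
# with the adaptive AL-AM structure the shapes can differ from layer to layer.
with safe_open("pruned_hf_model/model.safetensors", framework="pt", device="cpu") as f:
    for name in f.keys():
        print(name, tuple(f.get_tensor(name).shape))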

JCDemon commented 3 months ago

Hi @JCDemon, can you share with me the code you used to load the model with from_pretrained()?

I am trying to save with torch.save but it doesn't seem to save the model for me. #9

I think the reason I can use save_pretrained to save my model might be that it's not an official LLaMA-2 model; the model I used was converted from Qwen 1.8B. Also, I think the issue you mentioned in #9 doesn't quite make sense, since the weight dict should already be modified after pruning; it's strange that saving would output the original model. Maybe you can try checking the weights before calling save_pretrained?
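
A quick way to do that check, as a sketch that follows the LLaMA layout used in the script above (the exact attribute names may differ for other model classes):

# Print per-layer projection shapes before calling save_pretrained.
# After FLAP pruning these should be smaller than in the original config,
# and with the adaptive AL-AM structure they can differ across layers.
for i, layer in enumerate(model.model.layers):
    print(
        f"layer {i}: "
        f"o_proj {tuple(layer.self_attn.o_proj.weight.shape)}, "
        f"down_proj {tuple(layer.mlp.down_proj.weight.shape)}"
    )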