AIoT-MLSys-Lab / SVD-LLM

Official Code for "SVD-LLM: Truncation-aware Singular Value Decomposition for Large Language Model Compression"
https://arxiv.org/abs/2403.07378
Apache License 2.0

Incorrect Model Responses after compression #5

Closed aswanthkrishna closed 2 months ago

aswanthkrishna commented 3 months ago

I tried to use the provided scripts to compress LLaMA 2 with a 0.2 compression ratio. The evaluation script reports a perplexity of 7.2 on WikiText-2, but the model's responses are mostly incoherent. I am getting responses like

Instruction: tell me about you==\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\ selecting\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\

whereas the original model gives decent responses.

Is there any modification needed to the inference script or the tokenizer after model compression? Is there an inference script within the repository?

Thanks for your help

tuidan commented 3 months ago

Hi! Thanks for reporting this. Can you show me the configuration of your generation code (e.g. temperature, top_k, top_p) so that I can reproduce your result?

aswanthkrishna commented 3 months ago

The output I reported is without any LoRA fine-tuning, produced with this command:

python SVDLLM.py --model jeffwan/llama-7b-hf --step 1 --ratio 0.2 --whitening_nsamples 256 --dataset wikitext2 --model_seq_len 2048 --save_path ./ --run_low_resource

I have added the script I am using for generation below.

import argparse

import torch
# LoRA utilities; assuming the standard `peft` package (or the repo's vendored copy under utils/peft)
from peft import LoraConfig, get_peft_model, set_peft_model_state_dict
from safetensors.torch import load_file

# model-loading helpers from the SVD-LLM repo; import path assumed to match the repo layout
from utils.model_utils import get_model_from_huggingface, get_model_from_local


def generate_response(prompt, model, model_path, device, max_length=100, temperature=0.1):
    # Load either the original Hugging Face checkpoint or the locally compressed model
    if args.model_path == "original":
        model, tokenizer = get_model_from_huggingface(args.model)
    else:
        model, tokenizer = get_model_from_local(args.model_path)

    if args.lora is not None:
        # Optionally apply and merge LoRA weights fine-tuned on the compressed model
        from utils.peft import PeftModel

        config = LoraConfig(
            r=8,
            lora_alpha=16,
            target_modules=[
                "q_v_proj", "q_u_proj",
                "k_v_proj", "k_u_proj",
                "v_u_proj", "v_v_proj",
                "o_u_proj", "o_v_proj",
                "gate_u_proj", "gate_v_proj",
                "down_u_proj", "down_v_proj",
                "up_u_proj", "up_v_proj",
            ],
            lora_dropout=0,
            bias="none",
            task_type="CAUSAL_LM",
        )
        model = get_peft_model(model, config).to(device)
        # Load the fine-tuned LoRA weights from the safetensors checkpoint
        state_dict = load_file('/SVD-LLM/llama_7b_whitening_0.5_lora.pt/checkpoint-3200/model.safetensors')
        set_peft_model_state_dict(model, state_dict)
        model = PeftModel.from_pretrained(
            model,
            args.lora,
            torch_dtype=torch.float16,
        )
        model = model.merge_and_unload()

    model = model.to(device)
    model.eval()
    if device =='cpu':
        model = model.float() 
    inputs = tokenizer(prompt, return_tensors="pt", max_length=max_length, truncation=True).to(device)
    print(model, tokenizer)
    outputs = model.generate(**inputs, max_length=max_length, temperature=temperature, do_sample=True)

    print(outputs)
    response = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return response

if __name__ == '__main__':

    parser = argparse.ArgumentParser()
    parser.add_argument(
        '--model', type=str, default='jeffwan/llama-7b-hf',
        help='LLaMA model to load, pass `jeffwan/llama-7b-hf`'
    )
    parser.add_argument(
        '--model_path', type=str, default=None,
        help='local compressed model path or whitening information path'
    )
    parser.add_argument(
        '--prompt', type=str, default="",
        help='input the prompt'
    )
    parser.add_argument(
        '--DEV', type=str, default="cuda", 
        help='device'
    )
    parser.add_argument(
        '--lora', type=str, default=None,
        help='path to the trained LoRA adapter (optional)'
    )
    args = parser.parse_args()
    response = generate_response(prompt=args.prompt, model=args.model, model_path=args.model_path, device=args.DEV)
    print(response)
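
For reference, a hypothetical way to invoke this script (the file name generate.py and the compressed-checkpoint path are placeholders, not files from the repo):

python generate.py --model jeffwan/llama-7b-hf --model_path ./compressed_llama_0.2.pt --DEV cuda --prompt "tell me about you"
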
aswanthkrishna commented 2 months ago

@tuidan your help will be much appreciated. I am unable to figure out if I am doing something wrong 🙂

tuidan commented 2 months ago

Hi! Thank you for sharing the code. We strongly recommend setting the temperature to a value near 1 (e.g. 0.97) rather than near 0. A temperature near 0 will lead to generating many repeated words.

This is the code that we used for generation during our tests:

generation_output = model.generate(
    input_ids=input_ids,
    do_sample=True,
    top_k=50,
    max_length=128,
    top_p=0.95,
    temperature=0.97
)

Please feel free to contact me if you have any other questions!
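
Applied to the generate_response script above, a minimal sketch of that change (only the sampling arguments differ from the original model.generate call; everything else stays as in the script):

outputs = model.generate(
    **inputs,
    do_sample=True,
    max_length=max_length,
    top_k=50,
    top_p=0.95,
    temperature=0.97,
)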