Hi! Thanks for your response. Can you show me the configuration of your generation code (e.g. temperature, top_k, top_p) so that I can reproduce your result?
The output I reported is without any LoRA fine-tuning, using this compression script: python SVDLLM.py --model jeffwan/llama-7b-hf --step 1 --ratio 0.2 --whitening_nsamples 256 --dataset wikitext2 --model_seq_len 2048 --save_path ./ --run_low_resource
I have added the script I am using for generation below.
import argparse

import torch
from peft import LoraConfig, PeftModel, get_peft_model, set_peft_model_state_dict
from safetensors.torch import load_file

# get_model_from_huggingface / get_model_from_local are the loader helpers
# defined in the SVD-LLM repository; adjust the import to wherever this
# script lives relative to the repo.

def generate_response(prompt, model, model_path, device, max_length=100, temperature=0.1):
    # `model` is the Hugging Face model name; `args` comes from the parser below.
    if model_path == "original":
        model, tokenizer = get_model_from_huggingface(model)
    else:
        model, tokenizer = get_model_from_local(model_path)
    if args.lora is not None:
        config = LoraConfig(
            r=8,
            lora_alpha=16,
            target_modules=[
                "q_v_proj", "q_u_proj",
                "k_v_proj", "k_u_proj",
                "v_u_proj", "v_v_proj",
                "o_u_proj", "o_v_proj",
                "gate_u_proj", "gate_v_proj",
                "down_u_proj", "down_v_proj",
                "up_u_proj", "up_v_proj",
            ],
            lora_dropout=0,
            bias="none",
            task_type="CAUSAL_LM",
        )
        model = get_peft_model(model, config).to(device)
        state_dict = load_file('/SVD-LLM/llama_7b_whitening_0.5_lora.pt/checkpoint-3200/model.safetensors')
        set_peft_model_state_dict(model, state_dict)
        model = PeftModel.from_pretrained(
            model,
            args.lora,
            torch_dtype=torch.float16,
        )
        model = model.merge_and_unload()
    model = model.to(device)
    model.eval()
    if device == 'cpu':
        model = model.float()
    inputs = tokenizer(prompt, return_tensors="pt", max_length=max_length, truncation=True).to(device)
    # note: generate's max_length caps prompt + generated tokens combined
    outputs = model.generate(**inputs, max_length=max_length, temperature=temperature, do_sample=True)
    response = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return response

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument(
        '--model', type=str, default='jeffwan/llama-7b-hf',
        help='LLaMA model to load, pass `jeffwan/llama-7b-hf`'
    )
    parser.add_argument(
        '--model_path', type=str, default=None,
        help='local compressed model path or whitening information path'
    )
    parser.add_argument(
        '--prompt', type=str, default="",
        help='input the prompt'
    )
    parser.add_argument(
        '--DEV', type=str, default="cuda",
        help='device'
    )
    parser.add_argument(
        '--lora', type=str, default=None,
        help='path to the LoRA adapter (None to skip LoRA loading)'
    )
    args = parser.parse_args()
    response = generate_response(prompt=args.prompt, model=args.model, model_path=args.model_path, device=args.DEV)
    print(response)
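For reference, the script above can be invoked like this (the file name generate.py and the compressed checkpoint path are placeholders, not names from the repo):

python generate.py --model jeffwan/llama-7b-hf --model_path ./compressed_llama_7b.pt --prompt "Tell me about yourself" --DEV cuda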
@tuidan your help will be much appreciated. I am unable to figure out if I am doing something wrong 🙂
Hi! Thank you for sharing the code. We strongly recommend setting the temperature to a value near 1 (e.g. 0.97) rather than near 0. A temperature near 0 will lead to generating many repeated words.
This is the code that we used for generation during our test:
generation_output = model.generate(
    input_ids=input_ids,
    do_sample=True,
    top_k=50,
    max_length=128,
    top_p=0.95,
    temperature=0.97
)
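For completeness, here is a minimal self-contained sketch of how that call fits together with tokenization and decoding. It assumes a stock Hugging Face model on a CUDA device; for a compressed checkpoint, substitute the SVD-LLM loaders:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# jeffwan/llama-7b-hf is used here only as an example model id
tokenizer = AutoTokenizer.from_pretrained("jeffwan/llama-7b-hf")
model = AutoModelForCausalLM.from_pretrained(
    "jeffwan/llama-7b-hf", torch_dtype=torch.float16
).cuda().eval()

input_ids = tokenizer("Tell me about yourself", return_tensors="pt").input_ids.cuda()
with torch.no_grad():
    generation_output = model.generate(
        input_ids=input_ids,
        do_sample=True,      # sample rather than greedy-decode
        top_k=50,            # restrict sampling to the 50 most likely tokens
        top_p=0.95,          # nucleus sampling
        temperature=0.97,    # near 1, as recommended above
        max_length=128,      # prompt + generated tokens combined
    )
print(tokenizer.decode(generation_output[0], skip_special_tokens=True))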
Please feel free to contact me if you have any other questions!
I tried to use the provided scripts to compress LLaMA 2 at a 0.2 compression ratio. The model evaluation script shows a perplexity of 7.2 on wikitext, but the model's responses are mostly incoherent. I am getting responses like
Instruction: tell me about you==\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\ selecting\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\
whereas the original model gives decent responses.
Does the inference script or the tokenizer need any modification after model compression? Is there an inference script within the repository?
Thanks for your help