Open Rahu218 opened 11 months ago
Hi, how do I improve the inference time of my Llama 2 7B model?
I used BitsAndBytesConfig as well, but it does not seem to speed up inference.
Code:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, pipeline
from langchain.llms import HuggingFacePipeline

name = "meta-llama/Llama-2-7b-chat-hf"

tokenizer = AutoTokenizer.from_pretrained(name)
tokenizer.pad_token_id = tokenizer.eos_token_id  # for open-ended generation

# 4-bit NF4 quantization with float16 compute
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
)

# Note: do not also pass load_in_8bit=True here; it conflicts with the 4-bit quantization_config
model = AutoModelForCausalLM.from_pretrained(
    name,
    device_map="auto",  # finds GPU
    quantization_config=bnb_config,
    trust_remote_code=True,
)

generation_pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    num_return_sequences=1,
    do_sample=True,
    eos_token_id=tokenizer.eos_token_id,
    max_length=2000,
    top_k=10,
    top_p=0.9,
    temperature=0.8,
    batch_size=1,
)

llm = HuggingFacePipeline(pipeline=generation_pipe)
```
Check out llama.cpp.
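
For reference, here is a minimal sketch of running the same chat model through llama.cpp's Python bindings (llama-cpp-python), assuming the weights have already been converted to a GGUF file; the model path and filename below are placeholders:

```python
from llama_cpp import Llama  # pip install llama-cpp-python

# Hypothetical path to a GGUF-quantized Llama 2 7B chat model
llm = Llama(
    model_path="./llama-2-7b-chat.Q4_K_M.gguf",
    n_ctx=2048,        # context window
    n_gpu_layers=-1,   # offload all layers to GPU if built with GPU support
)

output = llm(
    "Q: What is the capital of France? A:",
    max_tokens=64,
    temperature=0.8,
    top_p=0.9,
)
print(output["choices"][0]["text"])
```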