Rupesh-rkgit / FineTuning-and-Inference-Llama2

Fine-tuning and inference of the Llama 2 7B model on Colab

Llama 2 7B model Inference time issue #1

Open Rahu218 opened 11 months ago

Rahu218 commented 11 months ago

Hi, how do I improve the inference time of my Llama 2 7B model?

I also used BitsAndBytesConfig, but it does not seem to speed up inference.

Code:

```python
import torch
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    pipeline,
)
from langchain.llms import HuggingFacePipeline

name = "meta-llama/Llama-2-7b-chat-hf"

tokenizer = AutoTokenizer.from_pretrained(name)
tokenizer.pad_token_id = tokenizer.eos_token_id  # for open-ended generation

# 4-bit NF4 quantization with fp16 compute
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(
    name,
    device_map="auto",
    quantization_config=bnb_config,
    trust_remote_code=True,
    # note: load_in_8bit=True was dropped here; it conflicts with the 4-bit config above
)

generation_pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    num_return_sequences=1,
    do_sample=True,
    eos_token_id=tokenizer.eos_token_id,
    # device_map="auto" is not repeated here; the model above is already placed on the GPU
    max_length=2000,
    top_k=10,
    top_p=0.9,
    temperature=0.8,
    batch_size=1,
)

llm = HuggingFacePipeline(pipeline=generation_pipe)
```
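For reference, a pipeline built this way would typically be called as in the sketch below. The prompt text is only illustrative (the original issue does not show the call site), and the LangChain call assumes the older callable `HuggingFacePipeline` interface used above:

```python
# Illustrative prompt; not taken from the original issue.
prompt = "[INST] Explain what 4-bit quantization does to a language model. [/INST]"

# Direct transformers pipeline call
result = generation_pipe(prompt)
print(result[0]["generated_text"])

# Equivalent call through the LangChain wrapper
# (newer LangChain versions may require llm.invoke(prompt) instead)
print(llm(prompt))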

Rupesh-rkgit commented 11 months ago

Check out llama.cpp.
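For context, llama.cpp runs a GGUF-converted copy of the model and is often noticeably faster for single-prompt inference on limited hardware. A minimal sketch using the llama-cpp-python bindings is below; the model path is a placeholder (it assumes the 7B chat model has already been converted to a quantized GGUF file), and the sampling parameters simply mirror the ones from the pipeline in the issue:

```python
# pip install llama-cpp-python
from llama_cpp import Llama

# Placeholder path: assumes a quantized GGUF export of Llama-2-7b-chat exists locally.
llm = Llama(
    model_path="./llama-2-7b-chat.Q4_K_M.gguf",
    n_ctx=2048,        # context window
    n_gpu_layers=-1,   # offload all layers to the GPU if one is available
)

output = llm(
    "[INST] How can I speed up inference for a 7B model? [/INST]",
    max_tokens=256,
    temperature=0.8,
    top_p=0.9,
    top_k=10,
)
print(output["choices"][0]["text"])
```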