The most out-of-the-box setup for Gemma I found is using the transformers library directly.
I followed the Hugging Face quickstart guide after reading the Gemma cookbook for a while without fully understanding it.
The model I intended to run on my computer (a laptop with a GTX 1650) was gemma-2-2b-it from Hugging Face.
I created a subfolder gemma and, inside it, created and activated a Python virtual environment:
python -m venv gemma-env
source gemma-env/bin/activate
Within the environment, I installed the following requirements:
pip install torch
pip install -U transformers
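Before downloading any weights, a quick sanity check (not strictly necessary, just useful on a card with only ~4 GB of VRAM like the GTX 1650) is to confirm that the installed torch build actually sees the GPU:

# Optional sanity check: confirm library versions and GPU visibility.
import torch
import transformers

print(transformers.__version__)
print(torch.cuda.is_available())  # True if the CUDA build of torch sees the GTX 1650
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))
    print(round(torch.cuda.get_device_properties(0).total_memory / 1024**3, 1), "GiB VRAM")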
I started following this guide.
To download the model weights directly through the transformers library, I first had to create a Hugging Face account, acknowledge the Gemma usage policies, and create a token with read access to public repositories (and save the token key).
Afterwards, I logged into the Hugging Face CLI with the copied token:
huggingface-cli login
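As an alternative to the CLI (not what I did here, just for reference), the huggingface_hub library also exposes a login helper, so the token can be set from Python:

# Alternative to `huggingface-cli login`: authenticate from Python.
from huggingface_hub import login

login(token="hf_xxx")  # placeholder; paste the read token created above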
I tried the first sample code, but ran into an out-of-memory error on my GPU:
import torch
from transformers import pipeline
pipe = pipeline(
    "text-generation",
    model="google/gemma-2-2b-it",
    model_kwargs={"torch_dtype": torch.bfloat16},
    device="cuda",
)

messages = [
    {"role": "user", "content": "Who are you? Please, answer in pirate-speak."},
]

outputs = pipe(messages, max_new_tokens=256)
assistant_response = outputs[0]["generated_text"][-1]["content"].strip()
print(assistant_response)
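A possible workaround I did not try, sketched here only for reference, would be loading the model quantized to 4-bit via bitsandbytes so it may fit into the GTX 1650's ~4 GB of VRAM (this assumes bitsandbytes and accelerate are installed and the GPU is supported by bitsandbytes):

# Untested sketch: load gemma-2-2b-it in 4-bit so it may fit into ~4 GB of VRAM.
# Assumes `pip install bitsandbytes accelerate` and a bitsandbytes-compatible GPU.
import torch
from transformers import BitsAndBytesConfig, pipeline

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
pipe = pipeline(
    "text-generation",
    model="google/gemma-2-2b-it",
    model_kwargs={"quantization_config": quant_config},
    device_map="auto",
)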
Then I was able to run inference on the CPU with this simplified snippet:
from transformers import pipeline
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe = pipeline("text-generation", model="google/gemma-2-2b-it")
pipe(messages)
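The pipeline output has the same chat structure as in the first snippet, so the assistant reply can be pulled out the same way, for example:

# Extract and print only the assistant's reply from the pipeline output.
outputs = pipe(messages, max_new_tokens=128)
print(outputs[0]["generated_text"][-1]["content"].strip())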
Finally, I was able to use the GPU with the following code:
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
tokenizer = AutoTokenizer.from_pretrained("google/gemma-2-2b-it")
model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-2-2b-it",
    device_map="auto",
    torch_dtype=torch.bfloat16,
)
input_text = "Which are the most common KPIs in manufacturing industries?"
input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")
outputs = model.generate(**input_ids, max_new_tokens=256)
print(tokenizer.decode(outputs[0]))
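Because gemma-2-2b-it is the instruction-tuned variant, the prompt can also be wrapped in Gemma's chat format with the tokenizer's chat template instead of being passed as raw text; a minimal sketch reusing the tokenizer and model loaded above:

# Sketch: apply the chat template before generating, reusing the objects above.
messages = [
    {"role": "user", "content": "Which are the most common KPIs in manufacturing industries?"},
]
chat_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
outputs = model.generate(chat_ids, max_new_tokens=256)
print(tokenizer.decode(outputs[0][chat_ids.shape[-1]:], skip_special_tokens=True))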
I found that inference with Hugging Face transformers is slower than with Ollama, so I'll probably use Ollama for testing Llama 3.2.
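To put a rough number on that comparison, one option is simply timing a single generation with the model loaded above (wall-clock only, not a rigorous benchmark):

# Rough wall-clock timing of one generation, to compare against Ollama.
# Reuses the tokenizer and model from the previous snippet.
import time

inputs = tokenizer("Which are the most common KPIs in manufacturing industries?",
                   return_tensors="pt").to(model.device)
start = time.perf_counter()
outputs = model.generate(**inputs, max_new_tokens=256)
elapsed = time.perf_counter() - start
new_tokens = outputs.shape[-1] - inputs["input_ids"].shape[-1]
print(f"{new_tokens} new tokens in {elapsed:.1f} s ({new_tokens / elapsed:.1f} tokens/s)")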
Results from the test: [screenshot: "Gemma from Google setup"]