afr2903 / ai-battleworld


Gemma setup #3

Closed afr2903 closed 3 weeks ago

afr2903 commented 7 months ago

Gemma from Google setup

afr2903 commented 3 weeks ago

The most out-of-the-box setup for Gemma I found is to use the transformers library directly.

I followed the Hugging Face quickstart guide after reading Gemma's cookbook for a while without fully understanding it.

afr2903 commented 3 weeks ago

Inference using the Hugging Face Transformers library

The model I intended to run on my computer (a laptop with a GTX 1650) was gemma-2-2b-it from Hugging Face.

Env setup

I created a subfolder gemma, and inside it I created and activated a Python virtual environment:

python -m venv gemma-env
source gemma-env/bin/activate

Within the environment, the following requirements were installed:

pip install torch
pip install -U transformers
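
Optionally, a quick sanity check that the PyTorch install can actually see the GPU (a minimal sketch; the reported memory depends on the card):

import torch

# Check that the CUDA build of PyTorch detects the GPU
print(torch.cuda.is_available())          # True if a usable GPU was found
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))  # e.g. the GTX 1650
    props = torch.cuda.get_device_properties(0)
    print(f"VRAM: {props.total_memory / 1024**3:.1f} GiB")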

Accessing the model via Hugging Face

I started by following this guide. To download the model weights directly through the transformers library, I first had to create a Hugging Face account, acknowledge Gemma's usage policies, and create an access token with read access to public repositories (saving the token key).

Afterwards, I logged into the Hugging Face CLI with the copied token key:

huggingface-cli login
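
Alternatively, the token can be passed programmatically through huggingface_hub instead of the interactive prompt (a sketch; the token string below is a placeholder):

from huggingface_hub import login

# Log in without the interactive CLI prompt; replace with your own token
login(token="hf_xxx")  # placeholder token, not a real key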

Running on a single GPU

I tried the first sample code, but ran into an out-of-memory error on my GPU:

import torch
from transformers import pipeline

pipe = pipeline(
    "text-generation",
    model="google/gemma-2-2b-it",
    model_kwargs={"torch_dtype": torch.bfloat16},
    device="cuda", 
)

messages = [
    {"role": "user", "content": "Who are you? Please, answer in pirate-speak."},
]

outputs = pipe(messages, max_new_tokens=256)
assistant_response = outputs[0]["generated_text"][-1]["content"].strip()
print(assistant_response)
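
A possible workaround for the GPU out-of-memory error (not tested here) would be loading the model quantized to 4 bits with bitsandbytes, which roughly quarters the VRAM needed for the weights. A minimal sketch, assuming pip install bitsandbytes accelerate:

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

# Quantize the weights to 4 bits at load time to fit a small GPU
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
)

tokenizer = AutoTokenizer.from_pretrained("google/gemma-2-2b-it")
model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-2-2b-it",
    quantization_config=quantization_config,
    device_map="auto",
)

inputs = tokenizer("Who are you?", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))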

Then I was able to run inference on the CPU with the simplified snippet:

from transformers import pipeline

messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe = pipeline("text-generation", model="google/gemma-2-2b-it")
pipe(messages)
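
If the pipeline still tries to grab the GPU, the device can be pinned to the CPU explicitly (a sketch of the same call):

from transformers import pipeline

# device=-1 forces the pipeline onto the CPU regardless of available GPUs
pipe = pipeline("text-generation", model="google/gemma-2-2b-it", device=-1)
print(pipe([{"role": "user", "content": "Who are you?"}], max_new_tokens=64))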

Finally, I was able to use the GPU with the following code:

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

tokenizer = AutoTokenizer.from_pretrained("google/gemma-2-2b-it")
model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-2-2b-it",
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

input_text = "Which are the most common KPIs in manufacturing industries?"
input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")

outputs = model.generate(**input_ids, max_new_tokens=256)
print(tokenizer.decode(outputs[0]))
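
Since gemma-2-2b-it is an instruction-tuned model, wrapping the prompt in its chat template should give cleaner answers than feeding raw text. A sketch reusing the tokenizer and model loaded above:

# Format the prompt as a chat turn instead of raw text
messages = [
    {"role": "user", "content": "Which are the most common KPIs in manufacturing industries?"},
]
input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,  # append the marker for the assistant's turn
    return_tensors="pt",
).to(model.device)

outputs = model.generate(input_ids, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))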

Future steps

I found out that Hugging Face Transformers is slower at inference than Ollama, so I'll probably use Ollama for testing Llama 3.2.
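
For reference, an Ollama-based test could be as simple as this (a sketch, assuming Ollama is installed, the model has been pulled with ollama pull llama3.2, the local server is running on its default port, and requests is installed):

import requests

# Query the local Ollama server's generate endpoint (default port 11434)
response = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama3.2",
        "prompt": "Who are you?",
        "stream": False,  # return a single JSON response instead of a stream
    },
)
print(response.json()["response"])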

afr2903 commented 3 weeks ago

Results from the test:

(screenshot of the results attached to the original comment)