sekian opened this issue 4 months ago
To host the LLM model so that it is easily available, LM Studio can be leveraged; everything is handled through the LM Studio application. LM Studio exposes the loaded model through an OpenAI-compatible endpoint at http://localhost:1234/v1/chat/completions. Query it to get your LLM response.
Using cURL:
curl http://localhost:1234/v1/chat/completions -H "Content-Type: application/json" -d '{
  "model": "bartowski/Phi-3.1-mini-4k-instruct-GGUF",
  "messages": [
    { "role": "system", "content": "Always answer in rhymes." },
    { "role": "user", "content": "Introduce yourself." }
  ],
  "temperature": 0.7,
  "max_tokens": -1,
  "stream": false
}'
From Python with the requests library:
import requests

# API endpoint exposed by the LM Studio local server
url = 'http://localhost:1234/v1/chat/completions'
headers = {'Content-Type': 'application/json'}

# Request payload: model, chat messages and sampling options
data_payload = {
    "model": "bartowski/Phi-3.1-mini-4k-instruct-GGUF",
    "messages": [
        {"role": "system", "content": "Always answer in rhymes."},
        {"role": "user", "content": "Introduce yourself."}
    ],
    "temperature": 0.7,
    "max_tokens": -1,
    "stream": False
}

# Send the POST request; json= takes the dict and serializes it for us
response = requests.post(url, headers=headers, json=data_payload)

# Print the response text for debugging or further processing
print(response.text)
From Python with the OpenAI client library:
from openai import OpenAI

# Point to the local server
client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

completion = client.chat.completions.create(
    model="bartowski/Phi-3.1-mini-4k-instruct-GGUF",
    messages=[
        {"role": "system", "content": "Always answer in rhymes."},
        {"role": "user", "content": "Introduce yourself."}
    ],
    temperature=0.7,
)

print(completion.choices[0].message)
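The same client can also stream the answer token by token. This is a minimal sketch, assuming the server honors stream=True (the "stream" flag in the cURL payload above suggests it does):
from openai import OpenAI

# Point to the local LM Studio server
client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

# Same request as above, but streamed
stream = client.chat.completions.create(
    model="bartowski/Phi-3.1-mini-4k-instruct-GGUF",
    messages=[
        {"role": "system", "content": "Always answer in rhymes."},
        {"role": "user", "content": "Introduce yourself."}
    ],
    temperature=0.7,
    stream=True,
)

for chunk in stream:
    # Each chunk carries a delta with the next piece of the answer
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()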
As is to be expected, since LM Studio hosts the server, there is no need for Flask with this option. LM Studio also handles GPU support.
This code uses llama-cpp-python to run the quantized GGUF models locally. The llama-cpp-python repository provides easy instructions for installing it with GPU support. Running the LLM with GPU compute is not required, but rather a nice-to-have.
In my case, I installed CUDA 12.1 and the pre-built wheel for llama-cpp-python, which avoids building from source:
pip install llama-cpp-python \
--extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cu121
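To sanity-check that the installed wheel actually uses the GPU, a quick load with verbose=True prints the backend information. This is a minimal sketch reusing the same GGUF repo and file as below; with a CUDA-enabled build the startup log should mention layers being offloaded to the GPU:
from llama_cpp import Llama

# verbose=True prints llama.cpp backend details at load time
llm = Llama.from_pretrained(
    repo_id="bartowski/Phi-3.1-mini-4k-instruct-GGUF",
    filename="Phi-3.1-mini-4k-instruct-Q8_0_L.gguf",
    n_gpu_layers=-1,  # offload every layer that fits on the GPU
    verbose=True,
)

# Quick smoke test of the chat API
result = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Say hi in one sentence."}]
)
print(result["choices"][0]["message"]["content"])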
I adapted our LocalLLM code to the llama_cpp library. The code below replaces the lllm.py file.
from llama_cpp import Llama


class LocalLLM:
    _instance = None

    def __init__(self, model_path, model_filename):
        # Download (if needed) and load the quantized GGUF model
        self.model = Llama.from_pretrained(
            repo_id=model_path,
            filename=model_filename,
            verbose=False,
            main_gpu=0,
            n_gpu_layers=2048,
            n_ctx=4096,
        )

    @staticmethod
    def get_instance():
        if LocalLLM._instance is None:
            model_path = "bartowski/Phi-3.1-mini-4k-instruct-GGUF"
            model_filename = "Phi-3.1-mini-4k-instruct-Q8_0_L.gguf"
            LocalLLM._instance = LocalLLM(model_path, model_filename)
        return LocalLLM._instance

    def generate_response(self, messages, max_new_tokens=2048):
        # Run the chat completion and append the assistant reply to the history
        response = self.model.create_chat_completion(
            messages=messages, max_tokens=max_new_tokens
        )
        parsedText = response["choices"][0]["message"]["content"]
        messages.append({"role": "assistant", "content": parsedText})
        return messages


# Example usage
if __name__ == "__main__":
    model_path = "bartowski/Phi-3.1-mini-4k-instruct-GGUF"
    model_filename = "Phi-3.1-mini-4k-instruct-Q8_0_L.gguf"
    llm = LocalLLM(model_path, model_filename)

    messages = [
        {"role": "system", "content": "You are a helpful AI assistant"},
        {"role": "user", "content": "Can you provide ways to eat combinations of bananas and dragonfruits? Reply funny"},
    ]
    response = llm.generate_response(messages)
    print(response)
As is to be expected, to run the app from the Flask server we need to install the requirements from the file found in the delivery:
pip install -r requirements.txt
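For reference, this is roughly how lllm.py plugs into Flask. The real entry point is run.py from the delivery; the /chat route, JSON shape and port below are assumptions made purely for illustration:
# Hypothetical sketch of a Flask wrapper around LocalLLM (not the delivered run.py)
from flask import Flask, jsonify, request

from lllm import LocalLLM  # assuming the file above is saved as lllm.py

app = Flask(__name__)


@app.route("/chat", methods=["POST"])
def chat():
    # Expecting {"messages": [{"role": ..., "content": ...}, ...]}
    messages = request.get_json(force=True)["messages"]
    llm = LocalLLM.get_instance()
    return jsonify(llm.generate_response(messages))


if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)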
This is the code in the repository. It uses the torch and transformers libraries. To run it on GPU, we installed CUDA 12.1 and the corresponding torch wheel. This approach may run slower than the other two solutions above if the VRAM requirements are not met. Running the LLM with GPU compute is not required, but rather a nice-to-have.
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
LLM module source code from the Phase 1 delivery
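The Phase 1 module itself is not reproduced here. Purely as an illustration of the torch/transformers approach, a minimal sketch could look like the following; the model id, structure and generation settings are assumptions rather than the delivered code, and a recent transformers version is assumed for chat-style pipeline input:
# Illustrative only -- NOT the Phase 1 delivery code
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

model_id = "microsoft/Phi-3-mini-4k-instruct"  # assumed checkpoint
device = "cuda" if torch.cuda.is_available() else "cpu"

# Older transformers versions may additionally need trust_remote_code=True
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer, device=device)

messages = [
    {"role": "system", "content": "You are a helpful AI assistant"},
    {"role": "user", "content": "Introduce yourself."},
]
output = pipe(messages, max_new_tokens=256, return_full_text=False)
print(output[0]["generated_text"])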
As is to be expected, to run the app from the Flask server we need to install the requirements from the file found in the delivery:
pip install -r requirements.txt
I have dockerized the CPU and GPU versions of the Llama implementation. The Dockerfile can still be optimized and generalized, and it is meant to be run from the geekle_ia_models folder. I encountered many issues along the way; after fixing them I was able to run the AI service dockerized successfully, but it is not yet optimal.
FROM python:3.10.14-bookworm
COPY requirements.txt ./
RUN pip install -r requirements.txt
RUN apt-get update && \
apt-get install --no-install-recommends -y python3-pip python3-dev ffmpeg libsm6 libxext6 gcc g++ musl-dev && \
rm -rf /var/lib/apt/lists/*
RUN ln -s /usr/lib/x86_64-linux-musl/libc.so /lib/libc.musl-x86_64.so.1
RUN pip install huggingface_hub
COPY . ./
ENTRYPOINT ["python", "run.py"]
Then build the image with docker build -t phi3_cpu . and run the container with docker run -it -p 5000:5000 phi3_cpu.
The container will try to download the model on the fly, which can take a lot of time:
Llama.from_pretrained(
...
)
To avoid that, the model can be included in the Docker image and loaded from a local file:
Llama(
model_path="../Phi-3.1-mini-4k-instruct-Q8_0_L.gguf",
...
)
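For example, by adding a line like the following to the Dockerfile (an assumption: it requires the .gguf file to be present in the geekle_ia_models build context) and pointing model_path at the copied location:
# Hypothetical: bake the quantized model into the image at build time
COPY Phi-3.1-mini-4k-instruct-Q8_0_L.gguf ./
Note that the relative path ../Phi-3.1-mini-4k-instruct-Q8_0_L.gguf from the snippet above would point outside the container's working directory, so model_path has to be adjusted to wherever the file is copied.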
I plan to include the Dockerfile with GPU support later, although it is not required to run the AI Service (just a nice-to-have so that it runs faster).
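In the meantime, as a rough sketch only (this is not the planned Dockerfile; the base image tag, package list and wheel index are assumptions), a GPU variant could build on an NVIDIA CUDA base image and reuse the pre-built cu121 wheel:
# Hypothetical GPU variant -- base image tag and package choices are assumptions
FROM nvidia/cuda:12.1.1-runtime-ubuntu22.04

RUN apt-get update && \
    apt-get install --no-install-recommends -y python3 python3-pip ffmpeg libsm6 libxext6 && \
    rm -rf /var/lib/apt/lists/*

COPY requirements.txt ./
RUN pip3 install -r requirements.txt && \
    pip3 install llama-cpp-python \
    --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cu121

COPY . ./
ENTRYPOINT ["python3", "run.py"]
The container would then need GPU access at runtime, e.g. docker run --gpus all -it -p 5000:5000 phi3_gpu, which requires the NVIDIA Container Toolkit on the host.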
To summarize, this issue explains how to host the LLM model locally. For all the solutions listed above, ngrok.com (or any similar tool) can be used to share the local AI server with other people. We have used Phi-3.1-mini-4k-instruct but observed similar results with less resource-intensive versions of the model.