sekian opened this issue 4 months ago
To host the LLM model so that it is easily available, LM Studio can be leveraged; everything is handled through the LM Studio application. LM Studio exposes the loaded model through an OpenAI-compatible endpoint at http://localhost:1234/v1/chat/completions. Query it to get your LLM response.
Using cURL:
curl http://localhost:1234/v1/chat/completions -H "Content-Type: application/json" -d '{
  "model": "bartowski/Phi-3.1-mini-4k-instruct-GGUF",
  "messages": [
    { "role": "system", "content": "Always answer in rhymes." },
    { "role": "user", "content": "Introduce yourself." }
  ],
  "temperature": 0.7,
  "max_tokens": -1,
  "stream": false
}'
From Python with the requests library:
import requests

# API endpoint exposed by the LM Studio local server
url = 'http://localhost:1234/v1/chat/completions'
headers = {'Content-Type': 'application/json'}

# Request payload: model, chat messages and sampling options
data_payload = {
    "model": "bartowski/Phi-3.1-mini-4k-instruct-GGUF",
    "messages": [
        {"role": "system", "content": "Always answer in rhymes."},
        {"role": "user", "content": "Introduce yourself."}
    ],
    "temperature": 0.7,
    "max_tokens": -1,
    "stream": False
}

# Send the POST request; json= takes the dict and serializes it for us
response = requests.post(url, headers=headers, json=data_payload)

# Print the response text for debugging or further processing
print(response.text)
From Python with the OpenAI client library:
from openai import OpenAI

# Point to the local server
client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

completion = client.chat.completions.create(
    model="bartowski/Phi-3.1-mini-4k-instruct-GGUF",
    messages=[
        {"role": "system", "content": "Always answer in rhymes."},
        {"role": "user", "content": "Introduce yourself."}
    ],
    temperature=0.7,
)

print(completion.choices[0].message)
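The same client can also stream the answer token by token. This is a minimal sketch, assuming the server honors stream=True (the "stream" flag in the cURL payload above suggests it does):
from openai import OpenAI

# Point to the local LM Studio server
client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

# Same request as above, but streamed
stream = client.chat.completions.create(
    model="bartowski/Phi-3.1-mini-4k-instruct-GGUF",
    messages=[
        {"role": "system", "content": "Always answer in rhymes."},
        {"role": "user", "content": "Introduce yourself."}
    ],
    temperature=0.7,
    stream=True,
)

for chunk in stream:
    # Each chunk carries a delta with the next piece of the answer
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()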
As is to be expected, since LM Studio hosts the server, there is no need for Flask with this option. LM Studio also handles GPU support.
This code uses llama-cpp-python to run the quantized GGUF models locally. The llama-cpp-python repository provides easy instructions for installing it with GPU support. Running the LLM with GPU compute is not required, but rather a nice-to-have.
In my case, I installed CUDA 12.1 and the pre-built wheel for llama-cpp-python, which avoids building from source:
pip install llama-cpp-python \
--extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cu121
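To sanity-check that the installed wheel actually uses the GPU, a quick load with verbose=True prints the backend information. This is a minimal sketch reusing the same GGUF repo and file as below; with a CUDA-enabled build the startup log should mention layers being offloaded to the GPU:
from llama_cpp import Llama

# verbose=True prints llama.cpp backend details at load time
llm = Llama.from_pretrained(
    repo_id="bartowski/Phi-3.1-mini-4k-instruct-GGUF",
    filename="Phi-3.1-mini-4k-instruct-Q8_0_L.gguf",
    n_gpu_layers=-1,  # offload every layer that fits on the GPU
    verbose=True,
)

# Quick smoke test of the chat API
result = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Say hi in one sentence."}]
)
print(result["choices"][0]["message"]["content"])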
I adapted our LocalLLM code to the llama_cpp library. The code below replaces the lllm.py file.
from llama_cpp import Llama


class LocalLLM:
    _instance = None

    def __init__(self, model_path, model_filename):
        # Download (if needed) and load the quantized GGUF model
        self.model = Llama.from_pretrained(
            repo_id=model_path,
            filename=model_filename,
            verbose=False,
            main_gpu=0,
            n_gpu_layers=2048,
            n_ctx=4096,
        )

    @staticmethod
    def get_instance():
        if LocalLLM._instance is None:
            model_path = "bartowski/Phi-3.1-mini-4k-instruct-GGUF"
            model_filename = "Phi-3.1-mini-4k-instruct-Q8_0_L.gguf"
            LocalLLM._instance = LocalLLM(model_path, model_filename)
        return LocalLLM._instance

    def generate_response(self, messages, max_new_tokens=2048):
        # Run the chat completion and append the assistant reply to the history
        response = self.model.create_chat_completion(
            messages=messages, max_tokens=max_new_tokens
        )
        parsedText = response["choices"][0]["message"]["content"]
        messages.append({"role": "assistant", "content": parsedText})
        return messages


# Example usage
if __name__ == "__main__":
    model_path = "bartowski/Phi-3.1-mini-4k-instruct-GGUF"
    model_filename = "Phi-3.1-mini-4k-instruct-Q8_0_L.gguf"
    llm = LocalLLM(model_path, model_filename)

    messages = [
        {"role": "system", "content": "You are a helpful AI assistant"},
        {"role": "user", "content": "Can you provide ways to eat combinations of bananas and dragonfruits? Reply funny"},
    ]
    response = llm.generate_response(messages)
    print(response)
As is to be expected, to run the app from the Flask server we need to install the requirements from the file found in the delivery:
pip install -r requirements.txt
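For reference, this is roughly how lllm.py plugs into Flask. The real entry point is run.py from the delivery; the /chat route, JSON shape and port below are assumptions made purely for illustration:
# Hypothetical sketch of a Flask wrapper around LocalLLM (not the delivered run.py)
from flask import Flask, jsonify, request

from lllm import LocalLLM  # assuming the file above is saved as lllm.py

app = Flask(__name__)


@app.route("/chat", methods=["POST"])
def chat():
    # Expecting {"messages": [{"role": ..., "content": ...}, ...]}
    messages = request.get_json(force=True)["messages"]
    llm = LocalLLM.get_instance()
    return jsonify(llm.generate_response(messages))


if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)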
This is the code in the repository. It uses the torch and transformers libraries. To run it on GPU, we installed CUDA 12.1 and the corresponding torch wheel. This approach may run slower than the other two solutions above if the VRAM requirements are not met. Running the LLM with GPU compute is not required, but rather a nice-to-have.
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
LLM module source code from the Phase 1 delivery
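The Phase 1 module itself is not reproduced here. Purely as an illustration of the torch/transformers approach, a minimal sketch could look like the following; the model id, structure and generation settings are assumptions rather than the delivered code, and a recent transformers version is assumed for chat-style pipeline input:
# Illustrative only -- NOT the Phase 1 delivery code
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

model_id = "microsoft/Phi-3-mini-4k-instruct"  # assumed checkpoint
device = "cuda" if torch.cuda.is_available() else "cpu"

# Older transformers versions may additionally need trust_remote_code=True
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer, device=device)

messages = [
    {"role": "system", "content": "You are a helpful AI assistant"},
    {"role": "user", "content": "Introduce yourself."},
]
output = pipe(messages, max_new_tokens=256, return_full_text=False)
print(output[0]["generated_text"])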
As is to be expected, to run the app from the Flask server we need to install the requirements from the file found in the delivery:
pip install -r requirements.txt
I have dockerized the CPU and GPU versions of the Llama implementation. The Dockerfile can still be optimized and generalized, and it is meant to be run from the geekle_ia_models folder. I encountered many issues along the way; after fixing them I was able to run the AI service dockerized successfully, but it is not yet optimal.
FROM python:3.10.14-bookworm
COPY requirements.txt ./
RUN pip install -r requirements.txt
RUN apt-get update && \
apt-get install --no-install-recommends -y python3-pip python3-dev ffmpeg libsm6 libxext6 gcc g++ musl-dev && \
rm -rf /var/lib/apt/lists/*
RUN ln -s /usr/lib/x86_64-linux-musl/libc.so /lib/libc.musl-x86_64.so.1
RUN pip install huggingface_hub
COPY . ./
ENTRYPOINT ["python", "run.py"]
Then build the image with docker build -t phi3_cpu . and run the container with docker run -it -p 5000:5000 phi3_cpu.
The container will try to download the model on the fly, which can take a lot of time:
Llama.from_pretrained(
...
)
To avoid that, the model can be included in the Docker image and loaded from a local file:
Llama(
model_path="../Phi-3.1-mini-4k-instruct-Q8_0_L.gguf",
...
)
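For example, by adding a line like the following to the Dockerfile (an assumption: it requires the .gguf file to be present in the geekle_ia_models build context) and pointing model_path at the copied location:
# Hypothetical: bake the quantized model into the image at build time
COPY Phi-3.1-mini-4k-instruct-Q8_0_L.gguf ./
Note that the relative path ../Phi-3.1-mini-4k-instruct-Q8_0_L.gguf from the snippet above would point outside the container's working directory, so model_path has to be adjusted to wherever the file is copied.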
I plan to include the Dockerfile with GPU support later, although it is not required to run the AI Service (just a nice-to-have so that it runs faster).
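In the meantime, as a rough sketch only (this is not the planned Dockerfile; the base image tag, package list and wheel index are assumptions), a GPU variant could build on an NVIDIA CUDA base image and reuse the pre-built cu121 wheel:
# Hypothetical GPU variant -- base image tag and package choices are assumptions
FROM nvidia/cuda:12.1.1-runtime-ubuntu22.04

RUN apt-get update && \
    apt-get install --no-install-recommends -y python3 python3-pip ffmpeg libsm6 libxext6 && \
    rm -rf /var/lib/apt/lists/*

COPY requirements.txt ./
RUN pip3 install -r requirements.txt && \
    pip3 install llama-cpp-python \
    --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cu121

COPY . ./
ENTRYPOINT ["python3", "run.py"]
The container would then need GPU access at runtime, e.g. docker run --gpus all -it -p 5000:5000 phi3_gpu, which requires the NVIDIA Container Toolkit on the host.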
To summarize, this issue explains how to host the LLM model locally. For all the solutions listed above, ngrok.com (or any similar tool) can be used to share the local AI server with other people. We have used Phi-3.1-mini-4k-instruct but observed similar results with less resource-intensive versions of the model.