c0sogi / LLMChat

A full-stack web UI implementation of large language models, such as ChatGPT or LLaMA.
MIT License

How can I switch to a local LLM engine? #43

Open oppokui opened 1 year ago

oppokui commented 1 year ago

In the Chat UI, there is a long list of LLM models. The default one is GPT-3.5 Turbo, which I guess is OpenAI. I configured the OpenAI API key in .env, so it should be used, as the answers are very fast.

When I try to switch it to Llama 7B, it reports:

An error occurred while generating text: Model llama-7b-GGML is currently booting.

I set up another LLM engine, vllm, based on the llama-2-7b-chat model and exposed it on port 3000; it is compatible with the OpenAI API. How can I configure the app to use this new engine?

c0sogi commented 1 year ago

The model name will be facebook/opt-125m for example purposes.

In ./app/models/llms.py, find the LLMModels class. Then try adding this to the class members and reboot.

    my_model = OpenAIModel(
        name="facebook/opt-125m",  # must match the model name served by the backend
        max_total_tokens=4096,
        max_tokens_per_request=4096,
        token_margin=10,
        tokenizer=OpenAITokenizer("gpt-3.5-turbo"),  # tiktoken-based token counting
        api_url="http://localhost:3000/v1/chat/completions",
    )
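
After the reboot, the new entry should appear in the model list in the chat UI under the given name, in the same way the built-in members of LLMModels are surfaced (based on how the existing entries work).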

As for the tokenizer: since tiktoken (OpenAITokenizer) is used, token counting will not be as accurate as with vllm, which uses the llama tokenizer. However, if the vllm server side handles token-limit-exceeded errors gracefully, you should be able to use it without any problems.
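
To see how much the counts can drift, here is a minimal sketch comparing the two tokenizers (it assumes the tiktoken and transformers packages are installed and that you have access to the gated meta-llama repo on Hugging Face; any llama-family tokenizer would do):

    import tiktoken
    from transformers import AutoTokenizer

    text = "Explain the difference between the two tokenizers."

    # tiktoken, as used by OpenAITokenizer
    openai_count = len(tiktoken.encoding_for_model("gpt-3.5-turbo").encode(text))

    # the llama tokenizer actually used by the vllm server
    llama_tok = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")
    llama_count = len(llama_tok.encode(text))

    print(f"tiktoken: {openai_count} tokens, llama: {llama_count} tokens")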

oppokui commented 1 year ago

It works! But I met one problem when uploading a txt file for similarity search.

I use this script to start the vllm engine remotely:

pip install git+https://github.com/vllm-project/vllm.git
pip install fschat
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-2-7b-chat-hf \
  --tensor-parallel-size 4 \
  --trust-remote-code \
  --gpu-memory-utilization 0.5 \
  --dtype half \
  --host 0.0.0.0 \
  --port 3000
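
Before pointing LLMChat at it, a quick sanity check is to hit the endpoint directly. A minimal sketch (assumes the server above is reachable; adjust the host to wherever it runs; the payload follows the OpenAI chat completions format that vllm emulates):

    import requests

    # Ask the vllm server a question through its OpenAI-compatible endpoint.
    resp = requests.post(
        "http://localhost:3000/v1/chat/completions",
        json={
            "model": "meta-llama/Llama-2-7b-chat-hf",
            "messages": [{"role": "user", "content": "who are you?"}],
            "max_tokens": 64,
        },
        timeout=60,
    )
    resp.raise_for_status()
    print(resp.json()["choices"][0]["message"]["content"])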

Then I added this entry to llms.py (the max tokens can't be 4096):

    llama_2_7b_vllm = OpenAIModel(
        name="meta-llama/Llama-2-7b-chat-hf",
        max_total_tokens=2048,
        max_tokens_per_request=2048,
        token_margin=8,
        tokenizer=OpenAITokenizer("gpt-4"),
        api_url="http://ec2-18-211-48-230.compute-1.amazonaws.com:3000/v1/chat/completions",
        api_key=OPENAI_API_KEY,
    )

I can ask "who are you?" to the remote llama engine, and it responds as Llama.

Screenshot from 2023-09-19 17-13-20

Then, when I upload a txt file, it hangs. It works if I use gpt-3.5 or gpt-4. The api container prints logs like:

api_1          | [2023-09-19 09:00:28,729] ApiLogger:CRITICAL - 🦙 Llama.cpp server is running
api_1          | INFO:     ('172.16.0.1', 59542) - "WebSocket /ws/chat/daad0289-fc57-4e88-ada1-82052b94-d334-485d-a975-d386a605efd8" [accepted]
api_1          | INFO:     connection open
api_1          | - DEBUG: Calling command: retry with 0 args and ['buffer'] kwargs
api_1          | - DEBUG: remaining_tokens: 1528
api_1          | - DEBUG: Sending messages: 
api_1          | [
api_1          |   {
api_1          |     "role": "user",
api_1          |     "content": "who are you?"
api_1          |   }
api_1          | ]
api_1          | - DEBUG: Sending functions: None
api_1          | - DEBUG: Sending function_call: None
api_1          | Loading tokenizer:  gpt-4
api_1          | INFO:     172.16.0.1:48092 - "GET /assets/assets/lotties/file-upload.json HTTP/1.1" 200 OK
api_1          | Retrying langchain.embeddings.openai.embed_with_retry.<locals>._embed_with_retry in 4.0 seconds as it raised Timeout: Request timed out: HTTPSConnectionPool(host='api.openai.com', port=443): Max retries exceeded with url: /v1/engines/text-embedding-ada-002/embeddings (Caused by ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x7fde88725850>, 'Connection to api.openai.com timed out. (connect timeout=600)')).

oppokui commented 1 year ago

Oh, I realized the error is related to openai.com access. I can't access it from my local machine; let me retry it on the AWS instance.
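
For reference, the document-upload path embeds text chunks with langchain's OpenAIEmbeddings, which calls api.openai.com regardless of which chat model is selected; that is why chat worked against vllm while the upload timed out. A minimal sketch of that dependency (assumes the langchain 0.0.x API seen in the logs above, with OPENAI_API_KEY set in the environment):

    from langchain.embeddings.openai import OpenAIEmbeddings

    # Uses text-embedding-ada-002 by default; this request goes to
    # api.openai.com even when chat completions are served by local vllm.
    embeddings = OpenAIEmbeddings()
    vector = embeddings.embed_query("connectivity check")
    print(len(vector))  # 1536 dimensions for text-embedding-ada-002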