c0sogi / llama-api

An OpenAI-like LLaMA inference API
MIT License

Long generations don't return data, but the server says 200 OK. Swagger screen just says LOADING forever. #18

Open Dougie777 opened 12 months ago

Dougie777 commented 12 months ago

How to reproduce:

1) Model being used:

wizardlm_70b_q4_gguf = LlamaCppModel(
    model_path="wizardlm-70b-v1.0.Q4_K_M.gguf",  # manual download
    max_total_tokens=4096,
    use_mlock=False,
)

2) From Swagger, run this query against the chat completion endpoint (note: in the original paste the quotes were escaped with backslashes):

{
  "messages": [
    { "role": "system", "content": "You are a helpful assistant." },
    { "role": "user", "content": "The topic is: 'Infant baptism is not biblical'. Give me at least 5 points. Output a table with these 4 columns: 'Point For Title Sentence','Point For Explanation with quotes and examples (min 5 sentences)', 'Point Against Title Sentence','Point Against Explanation with quotes and examples (min 5 sentences)'." }
  ],
  "model": "wizardlm_70b_q4_gguf"
}

3) When the server completes the query, it logs:

llama_print_timings: load time        = 70698.17 ms
llama_print_timings: sample time      = 353.01 ms / 861 runs (0.41 ms per token, 2439.01 tokens per second)
llama_print_timings: prompt eval time = 56156.99 ms / 95 tokens (591.13 ms per token, 1.69 tokens per second)
llama_print_timings: eval time        = 920273.58 ms / 860 runs (1070.09 ms per token, 0.93 tokens per second)
llama_print_timings: total time       = 978060.67 ms
[2023-09-17 15:00:28,909] llama_api.server.pools.llama:INFO - 🦙 [done for wizardlm_70b_q4_gguf]: (elapsed time: 978.1s | tokens: 860 (0.9 tok/s))
INFO: 216.8.141.240:47056 - "POST /v1/chat/completions HTTP/1.1" 200 OK
doug@Ubuntu-2204-jammy-amd64-base:~/llama-api$

4) The Swagger call still shows LOADING indefinitely.

[screenshot: Swagger UI stuck in the loading state]

Dougie777 commented 12 months ago

Running curl locally on the server works:

curl --location 'http://65.108.38.133:8000/v1/chat/completions' \
  --header 'Content-Type: application/json' \
  --header 'Accept: application/json' \
  --data '{
    "messages": [
      { "role": "system", "content": "You are a helpful assistant." },
      { "role": "user", "content": "The topic is: '\''Infant baptism is not biblical'\''. Give me at least 5 points. Output a table with these 4 columns: '\''Point For Title Sentence'\'','\''Point For Explanation with quotes and examples (min 5 sentences)'\'', '\''Point Against Title Sentence'\'','\''Point Against Explanation with quotes and examples (min 5 sentences)'\''." }
    ],
    "model": "wizardlm_70b_q4_gguf"
  }'

Dougie777 commented 12 months ago

curl is also working on my local server, so it's looking more like the problem is on my end. I will delete this issue once I get to the bottom of it.

Dougie777 commented 12 months ago

OK, this is weird. Although the call does work using curl, the same bug occurs from Python using a completely different library (requests). I have duplicated the bug on multiple computers, yet somehow curl on the command line bypasses it. Here is the Python that triggers the same bug:

import requests
import json

url = "http://65.108.38.133:8000/v1/chat/completions"

headers = {
    'Content-Type': 'application/json',
    'Accept': 'application/json'
}

data = {
    "messages": [
        {
            "role": "system",
            "content": "You are a helpful assistant."
        },
        {
            "role": "user",
            "content": "The topic is: 'Infant baptism is not biblical'. Give me at least 5 points. Output a table with these 4 columns: 'Point For Title Sentence','Point For Explanation with quotes and examples (min 5 sentences)', 'Point Against Title Sentence','Point Against Explanation with quotes and examples (min 5 sentences)'."
        }
    ],
    "model": "wizardlm_70b_q4_gguf"
}

# POST the chat completion request; this blocks until the server responds
# (requests sets no timeout by default, so it should wait indefinitely)
response = requests.post(url, headers=headers, data=json.dumps(data))

if response.status_code == 200:
    result = response.json()
    print(result)
else:
    print(f"Error {response.status_code}: {response.text}")

Dougie777 commented 12 months ago

So the only solution that works is a Python script watching a job queue and executing curl commands. The Python script I posted above does not work, but a Python script that runs curl as a subprocess does work (see the sketch below). I am not sure whether there is a problem with llama-api or not. I will leave this open for the time being.
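
For reference, a minimal sketch of the curl-as-subprocess workaround described above. The exact script was not posted, so the payload is abbreviated and the structure is illustrative rather than the original code:

import json
import subprocess

url = "http://65.108.38.133:8000/v1/chat/completions"
payload = {
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "The topic is: 'Infant baptism is not biblical'. Give me at least 5 points."},
    ],
    "model": "wizardlm_70b_q4_gguf",
}

# Shell out to curl instead of using requests; per the report above,
# this completes even for very long generations.
result = subprocess.run(
    [
        "curl", "--silent", "--location", url,
        "--header", "Content-Type: application/json",
        "--header", "Accept: application/json",
        "--data", json.dumps(payload),
    ],
    capture_output=True,
    text=True,
    check=True,
)
print(json.loads(result.stdout))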

c0sogi commented 12 months ago

This is probably due to the very long response time of the 70B model: a timeout might be occurring somewhere, so the Swagger UI and the requests module never receive the response. The main difference between curl and requests is the headers; the latter automatically adds a "Connection": "keep-alive" header. Is there any proxy or load balancer in your network?
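
If the keep-alive header is the culprit, one quick test (a sketch, not part of the original thread) would be to override that header from requests and see whether the behavior changes:

import requests

url = "http://65.108.38.133:8000/v1/chat/completions"
payload = {
    "messages": [{"role": "user", "content": "Say hello."}],
    "model": "wizardlm_70b_q4_gguf",
}

# Override the automatic keep-alive so the request looks more like curl's;
# if an intermediary is dropping long-idle keep-alive connections,
# this may behave differently.
response = requests.post(url, json=payload, headers={"Connection": "close"})
print(response.status_code, response.json())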

I tried to reproduce the situation by adding code to the chat completion endpoint that waits 900 seconds, and found that both Swagger and your code responded normally. I've tested in both Windows 11 and Ubuntu 22.04 LTS local environments.
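
A sketch of that kind of delay test, assuming a FastAPI app like the one llama-api's Swagger UI suggests (the handler name and stub response are illustrative, not the project's actual code):

import asyncio

from fastapi import FastAPI

app = FastAPI()

# Stub endpoint that holds the request open for 900 seconds before
# answering, to check whether the client (Swagger UI, requests, curl)
# still receives the response after a very long wait.
@app.post("/v1/chat/completions")
async def delayed_chat_completion() -> dict:
    await asyncio.sleep(900)
    return {"choices": [{"message": {"role": "assistant", "content": "done"}}]}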