Ai00-X / ai00_server

A localized open-source AI server that is better than ChatGPT.
https://ai00-x.github.io/ai00_server/
MIT License

Sending requests with the openai API library returns HTTP 400 #101

Closed · saber28 closed this issue 5 months ago

saber28 commented 5 months ago

After upgrading to v0.3.25, requests sent with the openai API library return HTTP 400. The same code worked fine on v0.3.20. Code:

from openai import AsyncOpenAI
import asyncio

output_log = '00_Translate_to_Chinese.log'

client = AsyncOpenAI(base_url="http://localhost:65530/api/oai", api_key="JUSTSECRET_KEY")

async def translate(t):
    text = ""
    stream = await client.completions.create(
        model="rwkv5-7b-v2",
        prompt="\nInstruction: Translate the input text into Chinese\n\nInput: " + t + "\n\nResponse:",
        top_p=0.1,
        frequency_penalty=1,
        stop=['\x00', '\n\n', 'User:'],
        stream=True)
    async for chunk in stream:
        try:
            print(chunk.choices[0].delta['content'], end="", flush=True)
            text += chunk.choices[0].delta['content']
        except Exception:
            # Bail out once a chunk no longer carries delta content
            # (presumably the final chunk of the stream).
            break
    print('\n')
    with open(output_log, 'a', encoding='utf-8') as f:
        f.write(text + '\n')

def main():
    # Read lines from stdin, append them to the log, and translate each one.
    while True:
        t = input()
        if t == "":
            continue
        with open(output_log, 'a', encoding='utf-8') as f:
            f.write(t + '\n')
        # print('\n')
        asyncio.run(translate(t))

main()

POST request:

Path:/api/oai/completions

Headers:

Host: 127.0.0.1:65530
Accept-Encoding: gzip, deflate
Connection: keep-alive
Accept: application/json
Content-Type: application/json
User-Agent: AsyncOpenAI/Python 1.17.1
X-Stainless-Lang: python
X-Stainless-Package-Version: 1.17.1
X-Stainless-OS: Windows
X-Stainless-Arch: other:amd64
X-Stainless-Runtime: CPython
X-Stainless-Runtime-Version: 3.12.2
Authorization: Bearer JUSTSECRET_KEY
X-Stainless-Async: async:asyncio
Content-Length: 215

Payload:

{"model": "rwkv5-7b-v2", "prompt": "\\nInstruction: Translate the input text into Chinese\\n\\nInput: Hi\\n\\nResponse:", "frequency_penalty": 1, "stop": ["\\u0000", "\\n\\n", "User:"], "stream": true, "top_p": 0.1}

Response:

<!DOCTYPE html>
<html>
<head>
    <meta charset="utf-8">
    <meta name="viewport" content="width=device-width">
    <title>400: Bad Request</title>
    <style>
    :root {
        --bg-color: #fff;
        --text-color: #222;
    }
    body {
        background: var(--bg-color);
        color: var(--text-color);
        text-align: center;
    }
    pre { text-align: left; padding: 0 1rem; }
    footer{text-align:center;}
    @media (prefers-color-scheme: dark) {
        :root {
            --bg-color: #222;
            --text-color: #ddd;
        }
        a:link { color: red; }
        a:visited { color: #a8aeff; }
        a:hover {color: #a8aeff;}
        a:active {color: #a8aeff;}
    }
    </style>
</head>
<body>
    <div><h1>400: Bad Request</h1><h3>parse http data failed.</h3><pre>There is no more detailed explanation.</pre><hr><footer><a href="https://salvo.rs" target="_blank">salvo</a></footer></div>
</body>
</html>
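
For reference, the failing request can be replayed without the SDK. A minimal sketch using httpx (which the openai client itself depends on); the address, key, and payload are taken from the report above:

# Sketch: replay the failing request directly, bypassing the openai SDK.
import httpx

payload = {
    "model": "rwkv5-7b-v2",
    # v0.3.25 rejects a bare string here; the working request further down
    # wraps the prompt in a list and also supplies "temperature".
    "prompt": "\nInstruction: Translate the input text into Chinese\n\nInput: Hi\n\nResponse:",
    "frequency_penalty": 1,
    "stop": ["\u0000", "\n\n", "User:"],
    "stream": False,  # non-streaming, so the reply is a single JSON body
    "top_p": 0.1,
}

r = httpx.post(
    "http://localhost:65530/api/oai/completions",
    json=payload,
    headers={"Authorization": "Bearer JUSTSECRET_KEY"},
    timeout=60,
)
print(r.status_code)
print(r.text[:300])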

Config file:

[model]
embed_device = "Gpu"                                     # Device to put the embed tensor ("Cpu" or "Gpu").
head_chunk_size = 8192                                   # DO NOT modify this if you don't know what you are doing.
max_batch = 1                                           # The maximum batches that are cached on GPU.
max_runtime_batch = 1                                    # The maximum batches that can be scheduled for inference at the same time.
model_name = "rwkv5-7b-v2.st" # Name of the model.
model_path = "assets/models"                             # Path to the folder containing all models.
quant = 32                                                # Layers to be quantized.
quant_type = "NF4"                                      # Quantization type ("Int8" or "NF4").
state_chunk_size = 4                                     # The chunk size of layers in model state.
stop = ["\n\n"]                                          # Additional stop words in generation.
token_chunk_size = 128                                   # Size of token chunk that is inferred at once. For high end GPUs, this could be 64 or 128 (faster).
turbo = true                                             # Whether to use alternative GEMM kernel to speed-up long prompts.

[tokenizer]
path = "assets/tokenizer/rwkv_vocab_v20230424.json" # Path to the tokenizer.

[bnf]
enable_bytes_cache = true   # Enable the cache that accelerates the expansion of certain short schemas.
start_nonterminal = "start" # The initial nonterminal of the BNF schemas.

[adapter]
Auto = {}

[listen]
acme = false
domain = "local"
ip = "0.0.0.0"   # Use IpV4.
# ip = "::"        # Use IpV6.
force_pass = true
port = 65530
slot = "permisionkey"
tls = false

[[listen.app_keys]] # Allow multiple app keys.
app_id = "JUSTAISERVER"
secret_key = "JUSTSECRET_KEY"

Console output:

2024-04-16T14:44:59.204Z INFO  [ai00_server] reading config assets/configs/Config.toml...
2024-04-16T14:44:59.206Z INFO  [ai00_server::middleware] ModelInfo {
    version: V5,
    num_layer: 32,
    num_emb: 4096,
    num_hidden: 14336,
    num_vocab: 65536,
    num_head: 64,
}
2024-04-16T14:44:59.207Z INFO  [ai00_server::middleware] type: SafeTensors
2024-04-16T14:44:59.279Z INFO  [ai00_server] server started at 0.0.0.0:65530 without tls
2024-04-16T14:44:59.467Z INFO  [ai00_server::middleware] AdapterInfo {
    name: "NVIDIA GeForce RTX 4070 Ti",
    vendor: 4318,
    device: 10114,
    device_type: DiscreteGpu,
    driver: "NVIDIA",
    driver_info: "551.86",
    backend: Vulkan,
}
2024-04-16T14:45:09.367Z INFO  [ai00_server::middleware] model reloaded
cgisky1980 commented 5 months ago

{
    "model": "rwkv5-7b-v2",
    "prompt": ["\\nInstruction: Translate the input text into Chinese\\n\\nInput: Hi\\n\\nResponse:"],
    "temperature":1,
    "frequency_penalty":0.3,
    "penalty_decay":0.9982686325973925,
    "stop": [
        "\\u0000",
        "\\n\\n",
        "User:"
    ],
    "stream": true,
    "top_p": 0.1
}

This request works.

cgisky1980 commented 5 months ago

1. prompt must be an array of strings; 2. temperature must be present.
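
Applying both points to the original script, the create(...) call inside translate() would become something like this (a sketch; the temperature value is taken from the working request above):

    # Corrected call: prompt wrapped in a list, temperature supplied explicitly.
    stream = await client.completions.create(
        model="rwkv5-7b-v2",
        prompt=["\nInstruction: Translate the input text into Chinese\n\nInput: " + t + "\n\nResponse:"],  # list of strings
        temperature=1,        # v0.3.25 expects this field to be present
        top_p=0.1,
        frequency_penalty=1,
        stop=['\x00', '\n\n', 'User:'],
        stream=True)

With those two changes the serialized body matches the working JSON above.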

saber28 commented 5 months ago

The prompt should be a list of strings.