
422 Unprocessable Entity using Neural Chat via OpenAI interface with meta-llama/Llama-2-7b-chat-hf #9

Closed: VincyZhang closed this issue 4 months ago

VincyZhang commented 4 months ago

Is there a specific version of the openai package that is aligned with the OpenAI-compatible interface offered by neuralchat? I am currently testing with openai 1.12.0 but encountering a 422 Unprocessable Entity error.

I saw that meta-llama/Llama-2-7b-chat-hf is a supported model, and it appears to be small enough to fit on my Intel Data Center GPU Flex 170.

I can successfully run this model locally with the code outlined in deploy_chatbot_on_xpu.

However, when I attempt to use the OpenAI interface per the instructions at https://github.com/intel/intel-extension-for-transformers/tree/main/intel_extension_for_transformers/neural_chat, the server logs 422 Unprocessable Entity and the client fails with an error about a missing required value. I assume this stems from a mismatch between the fields the OpenAI client sends and the fields the neural_chat server requires. I have also included the text extracted from the tcpdump below, followed by a sketch for testing this theory.

Following the notebook examples, I have prepared textbot.yaml and server.py as shown below.

Starting the server

$ grep -v "^#" textbot.yaml | grep -v "^$"
host: 0.0.0.0
port: 8000
model_name_or_path: "meta-llama/Llama-2-7b-chat-hf"
device: "xpu"
tasks_list: ['textchat']
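
For reference, the neural_chat README also documents a CLI entry point that should serve the same config; a minimal sketch, assuming the neuralchat_server command from the README is installed on PATH:

$ neuralchat_server start --config_file textbot.yaml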

$ cat server.py
#!/usr/bin/env python

import multiprocessing

from intel_extension_for_transformers.neural_chat import NeuralChatServerExecutor
import nest_asyncio

# Allow the server's event loop to run inside an already-running loop (e.g. Jupyter)
nest_asyncio.apply()

def start_service():
    server_executor = NeuralChatServerExecutor()
    server_executor(config_file="textbot.yaml", log_file="neuralchat.log")

# Run the NeuralChat server in a background process
multiprocessing.Process(target=start_service).start()

$ ./server.py
/home/REDACTED/miniconda3/envs/jupyter2/lib/python3.9/site-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: ''If you don't plan on using image functionality from `torchvision.io`, you can ignore this warning. Otherwise, there might be something wrong with your environment. Did you have `libjpeg` or `libpng` installed before building `torchvision` from source?
  warn(
/home/REDACTED/miniconda3/envs/jupyter2/lib/python3.9/site-packages/pydub/utils.py:170: RuntimeWarning: Couldn't find ffmpeg or avconv - defaulting to ffmpeg, but may not work
  warn("Couldn't find ffmpeg or avconv - defaulting to ffmpeg, but may not work", RuntimeWarning)
Loading config settings from the environment...
2024-02-19 14:11:22.837584: I tensorflow/core/util/port.cc:113] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2024-02-19 14:11:22.841047: I external/local_tsl/tsl/cuda/cudart_stub.cc:31] Could not find cuda drivers on your machine, GPU will not be used.
2024-02-19 14:11:22.887207: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-02-19 14:11:22.887246: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-02-19 14:11:22.888669: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-02-19 14:11:22.896900: I external/local_tsl/tsl/cuda/cudart_stub.cc:31] Could not find cuda drivers on your machine, GPU will not be used.
2024-02-19 14:11:22.897194: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-02-19 14:11:23.782914: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
2024-02-19 14:11:27,327 - datasets - INFO - PyTorch version 2.1.0a0+cxx11.abi available.
2024-02-19 14:11:27,328 - datasets - INFO - TensorFlow version 2.15.0.post1 available.
Loading model meta-llama/Llama-2-7b-chat-hf
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:01<00:00,  1.25it/s]
2024-02-19 14:11:31,912 - root - INFO - Model loaded.
INFO:     Started server process [2913373]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)

Additional logs after starting the TextChatClientExecutor client - successful inference

[2024-02-19 14:32:57,683] [    INFO] - Checking parameters of completion request...
[2024-02-19 14:32:57,683] [    INFO] - Predicting chat completion using prompt 'Tell me about Intel Xeon Scalable Processors.'
[2024-02-19 14:33:07,119] [    INFO] - Chat completion finished.
INFO:     127.0.0.1:60734 - "POST /v1/chat/completions HTTP/1.1" 200 OK
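
For completeness, the successful request above came from a client along these lines; a minimal sketch following the TextChatClientExecutor example in the neural_chat README (the exact parameter names are assumptions from that example and may differ by version):

$ cat textchat-client.py
#!/usr/bin/env python

from intel_extension_for_transformers.neural_chat import TextChatClientExecutor

# Sketch of the documented client pattern; the prompt/server_ip/port parameter
# names are taken from the README example and may vary by release.
executor = TextChatClientExecutor()
result = executor(
    prompt="Tell me about Intel Xeon Scalable Processors.",
    server_ip="127.0.0.1",
    port=8000,
)
print(result.text)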

Additional logs after connecting via OpenAI - failing access

INFO:     127.0.0.1:39368 - "POST /v1/chat/completions HTTP/1.1" 422 Unprocessable Entity

OpenAI client contents

Aside from the shebang and the modified model string, this should be identical to the example on the page linked above.

$ cat openai-client.py
#!/usr/bin/env python

import openai
openai.api_key = "EMPTY"
openai.base_url = 'http://127.0.0.1:8000/v1/'

response = openai.chat.completions.create(
      model="meta-llama/Llama-2-7b-chat-hf",
      messages=[
          {"role": "system", "content": "You are a helpful assistant."},
          {"role": "user", "content": "Tell me about Intel Xeon Scalable Processors."},
      ],
)
print(response.choices[0].message.content)

$ ./openai-client.py
Traceback (most recent call last):
  File "/home/REDACTED/jupyter/./openai-client.py", line 7, in <module>
    response = openai.chat.completions.create(
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/REDACTED/miniconda3/envs/openai/lib/python3.11/site-packages/openai/_utils/_utils.py", line 275, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/REDACTED/miniconda3/envs/openai/lib/python3.11/site-packages/openai/resources/chat/completions.py", line 663, in create
    return self._post(
           ^^^^^^^^^^^
  File "/home/REDACTED/miniconda3/envs/openai/lib/python3.11/site-packages/openai/_base_client.py", line 1200, in post
    return cast(ResponseT, self.request(cast_to, opts, stream=stream, stream_cls=stream_cls))
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/REDACTED/miniconda3/envs/openai/lib/python3.11/site-packages/openai/_base_client.py", line 889, in request
    return self._request(
           ^^^^^^^^^^^^^^
  File "/home/REDACTED/miniconda3/envs/openai/lib/python3.11/site-packages/openai/_base_client.py", line 980, in _request
    raise self._make_status_error_from_response(err.response) from None
openai.UnprocessableEntityError: Error code: 422 - {'detail': [{'loc': ['body', 'prompt'], 'msg': 'field required', 'type': 'value_error.missing'}]}

Text from packet capture of exchange

POST /v1/chat/completions HTTP/1.1
Host: REDACTED:8000
Accept-Encoding: gzip, deflate
Connection: keep-alive
Accept: application/json
Content-Type: application/json
User-Agent: _ModuleClient/Python 1.12.0
X-Stainless-Lang: python
X-Stainless-Package-Version: 1.12.0
X-Stainless-OS: Linux
X-Stainless-Arch: x64
X-Stainless-Runtime: CPython
X-Stainless-Runtime-Version: 3.11.7
Authorization: Bearer EMPTY
X-Stainless-Async: false
Content-Length: 197

{"messages": [{"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": "Tell me about Intel Xeon Scalable Processors."}], "model": "meta-llama/Llama-2-7b-chat-hf"}

HTTP/1.1 422 Unprocessable Entity
date: Mon, 19 Feb 2024 23:02:02 GMT
server: uvicorn
content-length: 90
content-type: application/json

{"detail":[{"loc":["body","prompt"],"msg":"field required","type":"value_error.missing"}]}

Thank you!