bentoml / OpenLLM

Run any open-source LLM, such as Llama 3.1 or Gemma, as an OpenAI-compatible API endpoint in the cloud.
https://bentoml.com
Apache License 2.0

bug: RuntimeError: Exception caught during generation: Response payload is not completed #852

Closed: mfournioux closed this issue 3 months ago

mfournioux commented 9 months ago

Describe the bug

I have launched asynchronous calls to a BentoServer deployed with a vLLM backend on Kubernetes (K8s).

I have loaded CodeLlama 13B in float16.
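
For context, here is a minimal sketch of the kind of concurrent streaming calls I am issuing. The host, port, endpoint path, and payload fields are assumptions for illustration (the traceback's `generate_stream_v1` suggests a `/v1/generate_stream` route), not the exact values from my deployment:

```python
import asyncio
import aiohttp

# Assumed endpoint and payload shape; placeholders, not the deployed values.
URL = "http://localhost:3000/v1/generate_stream"

async def stream_one(session: aiohttp.ClientSession, prompt: str) -> None:
    payload = {"prompt": prompt, "llm_config": {"max_new_tokens": 128}}
    async with session.post(URL, json=payload) as resp:
        # "Response payload is not completed" is aiohttp's ClientPayloadError,
        # raised when a streamed body ends before the client has read it all.
        async for chunk in resp.content.iter_any():
            print(chunk.decode(errors="replace"), end="")

async def main() -> None:
    async with aiohttp.ClientSession() as session:
        # Several streams in flight at once, as in the deployment.
        await asyncio.gather(*(stream_one(session, "def fib(n):") for _ in range(8)))

asyncio.run(main())
```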

An error occurs during the call:

| File "/openllm-python/src/openllm/_service.py", line 28, in generate_stream_v1 │ │ async for it in llm.generate_iterator(llm_model_class(input_dict).model_dump()): │ │ File "/openllm-python/src/openllm/_llm.py", line 125, in generate_iterator │ │ raise RuntimeError(f'Exception caught during generation: {err}') from err │ │ RuntimeError: Exception caught during generation: Response payload is not completed

If I load the same model in float32, the error does not occur.
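
To check whether half precision alone is the trigger, one isolation test is to load the same model into vLLM directly, outside OpenLLM. A sketch, assuming the Hugging Face model id:

```python
# Sketch: run the model straight through vLLM with the same dtype.
# The model id is an assumption; substitute the checkpoint actually deployed.
from vllm import LLM, SamplingParams

llm = LLM(model="codellama/CodeLlama-13b-hf", dtype="float16")
outputs = llm.generate(["def fibonacci(n):"], SamplingParams(max_tokens=64))
print(outputs[0].outputs[0].text)

# If this succeeds but the served endpoint still fails, the problem is more
# likely in the streaming/response path than in float16 inference itself.
```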

Could you please help me understand why this error appears?

Many thanks!

To reproduce

No response

Logs

No response

Environment

Kubernetes (K8s), Python 3.10

System information (Optional)

No response

bojiang commented 3 months ago

Closing for OpenLLM 0.6.