Describe the bug
I have launched asynchronous calls to a BentoServer deployed with a vLLM backend on K8S.
I have loaded codellama 13B in float16.
An error occurs during the call:

  File "/openllm-python/src/openllm/_service.py", line 28, in generate_stream_v1
    async for it in llm.generate_iterator(llm_model_class(input_dict).model_dump()):
  File "/openllm-python/src/openllm/_llm.py", line 125, in generate_iterator
    raise RuntimeError(f'Exception caught during generation: {err}') from err
  RuntimeError: Exception caught during generation: Response payload is not completed

If I load the same model in float32, the error does not occur.
Could you please help me understand why this error appears? A sketch of the kind of async streaming call involved is included below.
Many thanks!
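
Since the "To reproduce" field below is empty, here is a minimal sketch of the kind of asynchronous streaming call described above, in case it helps narrow things down. The endpoint name is inferred from generate_stream_v1 in the traceback; the server URL, port, payload shape, and prompt are assumptions, not the actual deployment details. For context, "Response payload is not completed" is the message aiohttp raises when the connection is closed before the streamed body finishes.

  # Hypothetical reproduction sketch, not the author's actual client code.
  # Assumptions: the BentoServer exposes /v1/generate_stream (inferred from the
  # traceback) and accepts a JSON payload with a "prompt" field.
  import asyncio
  import aiohttp

  SERVER_URL = "http://my-bentoserver:3000/v1/generate_stream"  # assumed address; 3000 is BentoML's default port

  async def call_stream(session: aiohttp.ClientSession, prompt: str) -> None:
      # Stream the response body; an early server-side disconnect here surfaces
      # as aiohttp's "Response payload is not completed".
      async with session.post(SERVER_URL, json={"prompt": prompt}) as resp:
          async for chunk in resp.content.iter_any():
              print(chunk.decode(errors="replace"), end="")

  async def main() -> None:
      async with aiohttp.ClientSession() as session:
          # Several concurrent calls, mirroring the asynchronous usage described above.
          await asyncio.gather(*(call_stream(session, "def fib(n):") for _ in range(4)))

  asyncio.run(main())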
To reproduce
No response
Logs
No response
Environment
K8S, Python 3.10
System information (Optional)
No response