Describe the bug
I have launched asynchronous calls to a BentoServer deployed with a vLLM backend on K8S.
I have loaded codellama 13B in float16.
An error occurs during the call:

  File "/openllm-python/src/openllm/_service.py", line 28, in generate_stream_v1
    async for it in llm.generate_iterator(llm_model_class(input_dict).model_dump()):
  File "/openllm-python/src/openllm/_llm.py", line 125, in generate_iterator
    raise RuntimeError(f'Exception caught during generation: {err}') from err
  RuntimeError: Exception caught during generation: Response payload is not completed

If I load the same model in float32, the error does not occur.
Could you please help me understand why this error appears? A sketch of the kind of async streaming call involved is included below.
Many thanks!
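
Since the "To reproduce" field below is empty, here is a minimal sketch of the kind of asynchronous streaming call described above, in case it helps narrow things down. The endpoint name is inferred from generate_stream_v1 in the traceback; the server URL, port, payload shape, and prompt are assumptions, not the actual deployment details. For context, "Response payload is not completed" is the message aiohttp raises when the connection is closed before the streamed body finishes.

  # Hypothetical reproduction sketch, not the author's actual client code.
  # Assumptions: the BentoServer exposes /v1/generate_stream (inferred from the
  # traceback) and accepts a JSON payload with a "prompt" field.
  import asyncio
  import aiohttp

  SERVER_URL = "http://my-bentoserver:3000/v1/generate_stream"  # assumed address; 3000 is BentoML's default port

  async def call_stream(session: aiohttp.ClientSession, prompt: str) -> None:
      # Stream the response body; an early server-side disconnect here surfaces
      # as aiohttp's "Response payload is not completed".
      async with session.post(SERVER_URL, json={"prompt": prompt}) as resp:
          async for chunk in resp.content.iter_any():
              print(chunk.decode(errors="replace"), end="")

  async def main() -> None:
      async with aiohttp.ClientSession() as session:
          # Several concurrent calls, mirroring the asynchronous usage described above.
          await asyncio.gather(*(call_stream(session, "def fib(n):") for _ in range(4)))

  asyncio.run(main())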
To reproduce
No response
Logs
No response
Environment
K8S, Python 3.10
System information (Optional)
No response