Closed yaronr closed 3 weeks ago
It turns out this happens because max_tokens is capped by n_positions, which defaults to a very small value. Setting OPTION_N_POSITIONS=8192 fixes it. I recommend improving the documentation and choosing more reasonable defaults (30 is far too low). I also wonder why a default max_tokens should be set at all. Finally, the info printout is confusing. I assume no one will ever read this; if you do, and you've reached this line, congratulations, you've reached the end of this comment.
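For anyone hitting the same issue, the fix above can be applied as a container environment variable. This is a minimal sketch, not the reporter's actual command: the device flag, port, and model id are placeholder assumptions; only the image tag and OPTION_N_POSITIONS come from this issue.

```shell
# Sketch: raise the Neuron positions cap so max_tokens is not clamped.
# OPTION_N_POSITIONS and the image tag are from this issue; everything
# else (device, port, model id) is a placeholder assumption.
docker run -it --rm \
  --device /dev/neuron0 \
  -p 8080:8080 \
  -e OPTION_N_POSITIONS=8192 \
  -e OPTION_MODEL_ID=<your-model-id> \
  deepjavalibrary/djl-serving:0.28.0-pytorch-inf2
```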
Description
The curl query below should produce multiple output tokens, but only one is returned. The query is equivalent to the queries generated by llmperf, which work perfectly against vLLM inference endpoints.
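The original query was not captured here, so the following is only a hypothetical reconstruction of an llmperf-style request against a djl-serving endpoint; the URL, prompt, and parameter names are assumptions, not the reporter's actual payload.

```shell
# Hypothetical sketch of the kind of request that triggers the issue.
# Endpoint path, prompt text, and parameters are assumptions.
curl -X POST http://localhost:8080/invocations \
  -H 'Content-Type: application/json' \
  -d '{
        "inputs": "Write a short story about a robot.",
        "parameters": {
          "max_new_tokens": 152,
          "do_sample": true
        }
      }'
```

With the low default n_positions, a request like this returns a single token instead of the requested ~152.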
Expected Behavior
Something on the order of 152 output tokens should be returned.
Error Message
No error message, only this info line (even though the parameter in question was not part of the request): `INFO PyProcess W-109-model-stdout: The following parameters are not supported by neuron with rolling batch: {'frequency_penalty'}`
How to Reproduce?
Steps to reproduce
Running the djl-serving docker image deepjavalibrary/djl-serving:0.28.0-pytorch-inf2 with the following env vars:
What have you tried to solve it?