Red-Caesar opened 9 months ago
As of 01.10 we have the following situation with endpoints and the number of chat completions: table
I found the cause of my problem when I sent the parameter `n` through a request and received:
```
{'object': 'error', 'message': "1 validation error for SamplingParams\n Value error, best_of must be 1 when using greedy sampling.Got 10. [type=value_error, input_value={'n': 10, 'presence_penal...rue, 'logit_bias': None}, input_type=dict]\n For further information visit https://errors.pydantic.dev/2.5/v/value_error", 'type': 'invalid_request_error', 'param': None, 'code': None}
```
The reason was that the temperature I was sending defaulted to 0.0, i.e. greedy sampling, so `n > 1` didn't work. With a non-zero temperature it works.
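To make the constraint concrete, here is a minimal sketch of the two request bodies (the model name is the one from this issue; the prompt and field values are my assumptions, not the actual payloads):

```python
# With temperature=0.0 the server samples greedily, so all n completions
# would be identical; the SamplingParams validation therefore rejects
# n > 1 (internally it requires best_of == 1 for greedy sampling).
rejected = {
    "model": "prod-codellama-7b-instruct-fp16",
    "messages": [{"role": "user", "content": "Hello"}],  # hypothetical prompt
    "n": 10,
    "temperature": 0.0,  # greedy -> triggers the validation error above
}

# The same request with a non-zero temperature is accepted.
accepted = dict(rejected, temperature=0.7)
```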
However, I think the error message in this case is a bit confusing.
Currently, there are two things that confuse me.
Sending requests
Firstly, it's about sending a lot of requests to the server and waiting for correct responses. The tests for these cases look like:
`num_workers` is the number of requests sent. The problem is that when the number of requests is more than 64, we get a 502 (Bad Gateway) error on a lot of endpoints, especially on prod. We can see it in the last row of this table. So the tests, which are based on the same logic, also fail when sending more than 64 requests.

In total: it doesn't seem right that the server can't handle that number of requests.
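The load pattern described above can be sketched roughly like this (the URL, payload, and the injectable `send` parameter are my assumptions for illustration, not the actual test code):

```python
import concurrent.futures
from typing import Callable

import requests  # third-party HTTP client: pip install requests

URL = "https://example.com/v1/chat/completions"  # placeholder endpoint
PAYLOAD = {
    "model": "prod-codellama-7b-instruct-fp16",
    "messages": [{"role": "user", "content": "ping"}],  # hypothetical prompt
}

def send_request(_: int) -> int:
    """Send one chat-completion request and return its HTTP status code."""
    return requests.post(URL, json=PAYLOAD, timeout=60).status_code

def run_load_test(num_workers: int,
                  send: Callable[[int], int] = send_request) -> list[int]:
    """Fire num_workers requests concurrently and collect the status codes."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=num_workers) as pool:
        return list(pool.map(send, range(num_workers)))

# Observed behaviour: with num_workers > 64 many of these come back
# as 502 (Bad Gateway) instead of 200.
```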
Number of chat completions
Another confusing problem is with the number-of-chat-completions parameter, `n`. When I set this parameter via the OpenAI API, I don't have a problem with n > 1000. For example, a test like this (with prod-codellama-7b-instruct-fp16):
It only failed at n = 2500, with a 400 error:
But if I try to send n > 1 via `requests`:
Where `model_data` is just:

I got (with prod-codellama-7b-instruct-fp16):
However, there are more strange things here. Other tests work well at n > 1 (for example, n = 500), but fail at n > 1000 with an error:
In total: it is currently very unclear how `n` is supposed to behave. What limits does it have? And why does it behave differently depending on how the request is made?