Deelvin / lm-evaluation-harness

A framework for few-shot evaluation of autoregressive language models.
MIT License

Unexpected failure on some parameters in tests #29

Open Red-Caesar opened 9 months ago

Red-Caesar commented 9 months ago

Currently, there are two things that confuse me.

Sending requests

First, it's about sending a lot of requests to the server and waiting for correct responses. The tests for these cases look like this:

@pytest.mark.parametrize("num_workers", [64, 128])
def test_send_many_request(num_workers, model_name, token, endpoint):
    message = "Create a short story about a friendship between a cat and a dog."
    request = model_data(model_name, message, max_tokens=300)
    url = endpoint + "/v1/chat/completions"
    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {token}",
    }
    responses_code_set = set()

    with ThreadPoolExecutor(max_workers=num_workers) as executor:
        futures = [
            executor.submit(send_request_get_response, url, request, headers)
            for _ in range(num_workers)
        ]
        for future in concurrent.futures.as_completed(futures):
            responses_code_set.add(future.result().status_code)

    assert responses_code_set == {200}
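For context, send_request_get_response and model_data are helpers defined elsewhere in the test utilities. A minimal sketch of what send_request_get_response is assumed to do (a thin wrapper around requests.post that returns the raw Response object; the timeout value is only illustrative):

import requests


def send_request_get_response(url, request, headers, timeout=120):
    # "request" is the JSON payload built by model_data()
    return requests.post(url, json=request, headers=headers, timeout=timeout)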

num_workers is the number of requests sent concurrently. The problem is that when the number of requests is more than 64, we get error 502 (Bad Gateway) on a lot of endpoints, especially on prod. We can see it in this table, in the last row. So the tests, which are based on the same logic, also fail when sending more than 64 requests.

In short: it doesn't seem right that the server can't handle that number of requests.

Number of chat completions

Another confusing problem is with the n parameter, the number of chat completions. When I work with this parameter via the OpenAI API, I don't have a problem with n > 1000. For example, a test like this (with prod-codellama-7b-instruct-fp16):

@pytest.mark.parametrize("n", [1000, 1500, 2000, 2200, 2300, 2500])
def test_large_number_chat_completions(model_name, n, token, endpoint):
    messages = [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello!"},
    ]
    completion = run_completion(
        model_name, messages, token, endpoint, n=n, return_completion=True
    )
    assert len(completion["choices"]) == n
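For reference, run_completion is another project helper. A rough sketch of what it is assumed to do, using the openai v1 Python client pointed at the endpoint under test (the actual signature and defaults may differ):

from openai import OpenAI


def run_completion(model_name, messages, token, endpoint, n=1, return_completion=False):
    # Point the client at the endpoint under test instead of api.openai.com.
    client = OpenAI(api_key=token, base_url=endpoint + "/v1")
    completion = client.chat.completions.create(
        model=model_name,
        messages=messages,
        n=n,
        # temperature is not set here, so the server-side default applies.
    )
    if return_completion:
        return completion.model_dump()

Note that this path does not send an explicit temperature, which turns out to matter below.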

It only failed when n = 2500, with error 400:

[screenshot: 400 error response]

But if I try to send n > 1 via raw requests:

@pytest.mark.parametrize("num_workers", [2])
@pytest.mark.parametrize("n", [10])
def test_many_request_and_completion(model_name, num_workers, n, token, endpoint):
    message = "Create a short story about a friendship between a cat and a dog."
    request = model_data(model_name, message, max_tokens=300, n=n)
    url = endpoint + "/v1/chat/completions"
    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {token}",
    }
    responses_code_set = set()
    with ThreadPoolExecutor(max_workers=num_workers) as executor:
        futures = [
            executor.submit(send_request_get_response, url, request, headers)
            for _ in range(num_workers)
        ]
        for future in concurrent.futures.as_completed(futures):
            print(future.result().json())
            responses_code_set.add(future.result().status_code)

    assert responses_code_set == {200}

Where model_data is just:

{
    "model": model_name,
    "messages": [
        {
            "role": "user",
            "content": message,
        }
    ],
    "max_tokens": max_tokens,
    "n": n,
    "stream": stream,
    "stop": stop,
    "temperature": temperature,
    "top_p": top_p,
    "presence_penalty": presence_penalty,
    "frequency_penalty": frequency_penalty,
    "return_completion": return_completion,
}

I got (with prod-codellama-7b-instruct-fp16):

{'object': 'error', 'message': "1 validation error for SamplingParams\n  Value error, best_of must be 1 when using greedy sampling.Got 10. [type=value_error, input_value={'n': 10, 'presence_penal...rue, 'logit_bias': None}, input_type=dict]\n    For further information visit https://errors.pydantic.dev/2.5/v/value_error", 'type': 'invalid_request_error', 'param': None, 'code': None}

However, there are more strange things here. Other tests work well with n > 1 (for example, n = 500), but fail at n > 1000 with this error:

rbody = '{"object":"error","message":"The prompt is too long for the given set of engine parameters.","type":"invalid_request_error","param":null,"code":null}'
rcode = 400
resp = {'code': None, 'message': 'The prompt is too long for the given set of engine parameters.', 'object': 'error', 'param': None, ...}

In short: it is currently very unclear how n should behave. What limits does it have? Why does it behave differently depending on how the request is made?

Red-Caesar commented 9 months ago

As of 01.10, the situation with the endpoints and the number of chat completions is as follows: table

Red-Caesar commented 9 months ago

I found the cause of my problem when sending the n parameter through a request and getting:

{'object': 'error', 'message': "1 validation error for SamplingParams\n  Value error, best_of must be 1 when using greedy sampling.Got 10. [type=value_error, input_value={'n': 10, 'presence_penal...rue, 'logit_bias': None}, input_type=dict]\n    For further information visit https://errors.pydantic.dev/2.5/v/value_error", 'type': 'invalid_request_error', 'param': None, 'code': None}

The reason was that the default temperature I was sending was 0.0, which selects greedy sampling, so n > 1 didn't work. It does work with a non-zero temperature.
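To illustrate, a minimal sketch of the failing and working payloads (assuming model_data exposes temperature as a keyword argument, as its body above suggests; 0.7 is just an example value):

# Fails: temperature 0.0 means greedy sampling, and the backend then requires
# best_of (derived from n) to be 1, which contradicts n=10.
bad_request = model_data(model_name, message, max_tokens=300, n=10, temperature=0.0)

# Works: a non-zero temperature enables sampling, so n=10 is accepted.
good_request = model_data(model_name, message, max_tokens=300, n=10, temperature=0.7)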

However, I think the error message in this case is a bit confusing.