Mozilla-Ocho / llamafile

Distribute and run LLMs with a single file.
https://llamafile.ai

Would it be possible to support `n_probs` / `logprobs` in chat completion API? #409

Open cbowdon opened 4 months ago

cbowdon commented 4 months ago

Hi, first of all thank you so much for llamafile. I am very conscious of data privacy and wary of being locked into OpenAI, so llamafile is amazing.

There is a small disparity between the `/completion` endpoint and `/v1/chat/completions`: the latter doesn't seem to support `n_probs`.

Here's an example of `n_probs` having no effect on the chat completion endpoint:

    import httpx

    def chat(prompt):
        res = httpx.post(
            "http://localhost:8080/v1/chat/completions",
            json={
                "model": "LLaMA_CPP",
                "messages": [
                    {"role": "user", "content": prompt}
                ],
                "n_predict": 1,
                "n_probs": 3,  # ignored here, unlike on /completion
            },
            timeout=30
        )
        data = res.json()
        return data

    chat("Say 'true'. Just say 'true'. Do not say anything except 'true'.")

## Output

    {'choices': [{'finish_reason': 'stop',
       'index': 0,
       'message': {'content': 'true', 'role': 'assistant'}}],
     'created': 1715328516,
     'id': 'chatcmpl-1aS707tCzO40Q2gIt1zZEYfg7etlQSMZ',
     'model': 'LLaMA_CPP',
     'object': 'chat.completion',
     'usage': {'completion_tokens': 7, 'prompt_tokens': 46, 'total_tokens': 53}}

Sadly, the OpenAI `logprobs` and `top_logprobs` parameters didn't work either. It looks like that's because they are not mapped here in the OpenAI compatibility function:

https://github.com/Mozilla-Ocho/llamafile/blob/main/llama.cpp/server/oai.h#L20
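
For reference, the request with the OpenAI-style parameters looked roughly like this (a sketch; `logprobs` and `top_logprobs` are the parameter names from the OpenAI chat completions API), and the response likewise came back with no token probabilities:

    import httpx

    # OpenAI-style parameters, which the chat completion endpoint currently ignores.
    res = httpx.post(
        "http://localhost:8080/v1/chat/completions",
        json={
            "model": "LLaMA_CPP",
            "messages": [{"role": "user", "content": "Say 'true'."}],
            "logprobs": True,
            "top_logprobs": 3,
        },
        timeout=30
    )
    print(res.json()["choices"][0])  # no "logprobs" key in the choice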

I'm not brave or competent enough to try and make the change myself, but I think the necessary logic would be:

    // Map the OpenAI-style parameters onto llama.cpp's n_probs.
    bool logprobs = json_value(body, "logprobs", false);
    int top_logprobs = json_value(body, "top_logprobs", 0);

    int n_probs;
    if (top_logprobs > 0) {
        // An explicit number of alternatives was requested.
        n_probs = top_logprobs;
    } else if (logprobs) {
        // logprobs=true on its own means "return the chosen token's logprob".
        n_probs = 1;
    } else {
        n_probs = 0;
    }

    llama_params["n_probs"] = n_probs;

That should emulate the OpenAI behaviour of the `logprobs` and `top_logprobs` parameters as documented for the chat completions API.
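
If the mapping were added, I'd expect the response to carry per-token data in the shape the OpenAI API documents (`choices[0].logprobs.content`, where each entry has `token`, `logprob` and a `top_logprobs` list); presumably the response-formatting side of `oai.h` would also need to translate llama.cpp's token probabilities into that shape. A sketch of what client code could then do, assuming the `chat` helper above were extended to send `logprobs`/`top_logprobs`:

    # Hypothetical consumer code: field names follow the OpenAI chat
    # completions documentation, not anything llamafile returns today.
    data = chat("Say 'true'. Just say 'true'. Do not say anything except 'true'.")
    for entry in data["choices"][0]["logprobs"]["content"]:
        print(entry["token"], entry["logprob"])
        for alt in entry["top_logprobs"]:
            print("   alt:", alt["token"], alt["logprob"])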

Please consider supporting this, as it would be very convenient. Manually calling `/completion` with chat templates is how I'm working around it at the moment.
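
For completeness, the workaround looks roughly like this (a sketch: the chat template shown is a made-up placeholder and is model-specific in practice, and `completion_probabilities` is the field llama.cpp's `/completion` endpoint returns when `n_probs` is set, as far as I can tell):

    import httpx

    def chat_via_completion(prompt):
        # Apply the chat template by hand; this template is a placeholder and
        # must match whatever model the server is running.
        templated = f"<|user|>\n{prompt}\n<|assistant|>\n"
        res = httpx.post(
            "http://localhost:8080/completion",
            json={
                "prompt": templated,
                "n_predict": 1,
                "n_probs": 3,  # honoured here, unlike on /v1/chat/completions
            },
            timeout=30
        )
        data = res.json()
        # Token-level alternatives come back under "completion_probabilities".
        return data["content"], data.get("completion_probabilities")

    print(chat_via_completion("Say 'true'. Just say 'true'. Do not say anything except 'true'."))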