ggerganov / llama.cpp

LLM inference in C/C++

Feature Request: shared tokens in batches with `logits = true` #10295

Open · Lyrcaxis opened this issue 1 week ago

Lyrcaxis commented 1 week ago


Feature Description

When batching, tokens that happen to have the same (pos, id) can be shared across multiple sequences.

However, if the last token of each sequence (the one we'd like logits for) happens to match the last token of other sequences, those tokens currently have to be processed as separate tokens instead of taking advantage of the token grouping feature.

^ Not sure if this is a bug or by design, but if a token with the same (pos, id) requests logits for multiple sequences, llama_get_logits returns only a single logits array, without raising any error, even if the sequences differ everywhere except at that token.

What I would like to propose is to allow returning the correct number of logits for tokens that request them, respecting the number of sequences at that shared position.
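To make the current behavior concrete, here is a minimal sketch against the existing C API (assuming `ctx` is an initialized llama_context whose sequences 0 and 1 already cover positions 0..10, and `tok` is some valid token id; error handling omitted):

```c
// A shared token at the same (pos, id) for sequences 0 and 1, with logits requested.
llama_batch batch = llama_batch_init(/*n_tokens*/ 1, /*embd*/ 0, /*n_seq_max*/ 2);

batch.token[0]     = tok;
batch.pos[0]       = 11;
batch.n_seq_id[0]  = 2;
batch.seq_id[0][0] = 0;
batch.seq_id[0][1] = 1;
batch.logits[0]    = 1;   // logits requested for this entry
batch.n_tokens     = 1;

llama_decode(ctx, batch);

// Only a single [n_vocab] row is available for this entry, even though the token
// belongs to two sequences -- this is the behavior the request is about.
const float * row = llama_get_logits_ith(ctx, 0);

llama_batch_free(batch);
```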

Motivation

Sharing tokens across sequences reduces the computational cost. When batching with llama.cpp, it is currently sometimes more efficient to reprocess all sequences just so that the output tokens can also be grouped together.

That said, it would be really nice if they were grouped from the get-go, even if they had to be processed separately on the initial inference request.

Possible Implementation

It would be harmless if all batches had the same sequence length, but for varying-length sequences it could prove problematic (e.g.: a token `it` at pos 11 of seq0 and seq1, where only seq0 requires logits, would require special handling).

While I don't know much about llama.cpp's internal workings, my initial thought was to somehow handle them as separate tokens for the initial inference call, then store their cache entries in the appropriate places, effectively making them shared.

It would also likely require altering the llama_batch.logits to be a 2D array as well, following llama_batch.seq_id's dims.
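For illustration, a hypothetical layout (field names and shapes are made up here; this is not the current llama_batch) could look like:

```c
// Hypothetical sketch only -- NOT the current llama_batch definition.
// The idea is that `logits` would mirror `seq_id`'s dimensions.
struct llama_batch_proposed {
    int32_t         n_tokens;
    llama_token   * token;
    llama_pos     * pos;
    int32_t       * n_seq_id;
    llama_seq_id ** seq_id;   // seq_id[i][j]: j-th sequence that token i belongs to
    int8_t       ** logits;   // proposed: logits[i][j] pairs with seq_id[i][j]
};
```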

The rest as far as I'm concerned would be magic from you guys 😄

ggerganov commented 1 week ago

Not sure if I followed the request correctly, but in case you are thinking of this case:

# 2 sequences of tokens
aaaaaaabbbbbbbbbbd
aaaaaaaccccccccccd

You cannot process the d tokens together because the prefix is different, even though they are at the same position.

Lyrcaxis commented 1 week ago

Yup, that's the case I'm having in mind. But:

# Case 1: only `A`s share a cell

aaaaaaabbbbAAbbbbbd
nnnnnnnccccAAcccccd

^ Here you can process AA together, right? afaict llama.cpp supports that, even though their prefixes are different. And in:

# Case 2:  all tokens except `G`, `M`, `d` share a cell

nnnnnnnccccGGcccccd
nnnnnnnccccMMcccccd

You can process all tokens together except the GG, MM, and the last ds that require logits.

So that got me thinking that something similar could happen for the tokens we request logits for, maybe with an extra step, so that on the next decode step they'll be in the same cell and be processed faster:

# Proposed Case post-decoding: same as Case 2, but `d`s should also share the same cell

nnnnnnnccccGGcccccdE
nnnnnnnccccMMcccccdF

Are you suggesting Case 1 and Case 2 aren't supposed to work? Because from what I observed they do. [screenshot of debug info] ^ See the bottom-most line of the debug info; the UsedCellCount only increases by 3.

Lyrcaxis commented 1 week ago

To clarify my motivation, here's the debug info when logits need to be requested for the last `.` in each sequence: [screenshot of debug info]

This means that those tokens already occupy separate cells, and, unless I manipulate the cache and reprocess them in a single llama_batch, this will keep stacking up as the sequences grow, not taking full advantage of the speed offered by token grouping.

Hope this clarifies things a little, I'm eager to hear your thoughts! Let me know if you need further clarification or info.

ggerganov commented 1 week ago

The tokens from the different sequences must "see" the same prefix tokens. This is a limitation imposed by the KQ mask, which specifies which tokens in the context each token in the batch attends to (i.e. "sees"). Two tokens from different sequences, even if at the same position, cannot share a cell because they need to "see" different prefixes, and this cannot be expressed in the existing KQ mask. Even if the output logits became a 2D array, the mask would have to become at least 3D I think, and it already becomes unfeasible.
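As a rough illustration of the constraint (conceptual only, this is not llama.cpp's internal mask layout):

```c
// Context cells: [a a a a | b b | c c]  (a = shared prefix, b = seq0-only, c = seq1-only).
// If the two `d` tokens shared a single batch entry, that one entry would need
// two different mask rows at the same time:
static const int mask_d_in_seq0[8] = {1,1,1,1, 1,1, 0,0}; // d must see the a's and b's
static const int mask_d_in_seq1[8] = {1,1,1,1, 0,0, 1,1}; // d must see the a's and c's
// The 2D KQ mask has exactly one row per batch entry, so it cannot express both.
```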

Note that a token can be shared by multiple sequences as long as they have the same prefix up to that point; the sequences are then still allowed to branch out in the future.

Lyrcaxis commented 1 week ago

I can confirm that sharing all tokens except the differing ones among seqs produces correct results with the current 2D mask: [screenshot of debug info] ^ Here I have tokens share cells wherever possible, which is everywhere except occurrences of A, B, C, D, and the last `'`. With that in mind, I don't see why `'` shouldn't be shareable too, even if it HAS to be processed in separate cells initially.

My proposition/request leans more towards post-processing the cache instead of altering the internal masking or engine. So the task could be narrowed down to remapping the cache a little after the standard decoding and before exiting _decode.

ggerganov commented 1 week ago

Aha, I see your point now. The `'` tokens should be shareable in this case, similar to the `said, my name is` tokens. But as you noted, all these tokens need to be processed in separate cells first and only merged if they produce the same token for each sequence.

I am not sure if there is an elegant way to express this - both from API and from implementation PoV. If we figure out one, we can try to support it. But atm it's unclear to me.

Lyrcaxis commented 6 days ago

Glad to hear you find it reasonable enough to consider implementing! I do think it'd be a nice boost to batching speed.

I'm going to type some lengthy stuff out loud here so I can check for errors in my logic and potentially help with brainstorming.


For the API side, I (as a user) would prefer to just pass something like this to an optional helper method:

// Passed to an optional helper method. Would add a slight overhead but help with overall QoL and DX
seqs_to_decode: [
    {
        token_ids:       [0, 1, 6, 8]
        requires_logits: [f, f, f, t]
    },
    {
        token_ids:       [0, 1, 2, 8]
        requires_logits: [f, f, f, t]
    }
] // first, second, and last tokens get automatically grouped

then they'd be grouped like so:

// Schema A (desired)
token_ids: [0, 1, 6, 2, 8],
pos_ids:   [0, 1, 2, 2, 3],
seq_ids: [
    [0, 1],
    [0, 1],
    [0],
    [1],
    [0, 1] // last tokens share the seq
],
requires_logits: [
    [0, 0],
    [0, 0],
    [0],
    [0],
    [1, 1] // both tokens want logits
]

which would be passed to the llama_decode method, expecting all of them to be decoded correctly. Then, when I'd call llama_get_logits, I'd expect a [2 * vocab_size] array as output.

(Note: grouping like this would already work if no grouped tokens needed logits, but there's no current implementation that retrieves multiple logits per cell.)
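For concreteness, retrieval under this hypothetical Schema A behavior might look like the following (this assumes the proposal were implemented; `model`/`ctx` are an initialized model and context):

```c
// Hypothetical: one logits row per (shared token, sequence) pair that asked for logits.
const int     n_vocab = llama_n_vocab(model);
const float * out     = llama_get_logits(ctx);   // [2 * n_vocab] in this example

const float * logits_seq0 = out + 0 * n_vocab;   // logits of token 8 for seq 0
const float * logits_seq1 = out + 1 * n_vocab;   // logits of token 8 for seq 1
```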


All good so far -- except llama.cpp currently doesn't run through the FC (output) layer to get logits on a per-seq basis. So, unless that is feasible, the last token would initially have to be inserted as separate entries anyway, to be processed separately:

// Schema B (internally modified)
token_ids: [0, 1, 6, 2, 8, 8],
pos_ids:   [0, 1, 2, 2, 3, 3],
seq_ids: [
    [0, 1],
    [0, 1],
    [0],
    [1],
    [0], // Last token was separated
    [1]
],
requires_logits: [
    [0, 0],
    [0, 0],
    [0],
    [0],
    [1], // Last token was separated
    [1], // Both request logits
]

and once processed, the state can be remapped as if we had gone with Schema A, which is what follow-up decode calls will resume from.
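For example, the post-decode remap for the duplicated last token (token 8 at pos 3 above) could be expressed with the existing public KV-cache calls roughly like this (a user-level sketch assuming `ctx` is the llama_context used for the decode; an implementation inside _decode would presumably do the equivalent internally):

```c
const llama_pos pos = 3; // position of the duplicated token in Schema B

// Drop seq 1's private copy of that cell ...
llama_kv_cache_seq_rm(ctx, /*seq_id*/ 1, /*p0*/ pos, /*p1*/ pos + 1);

// ... and assign seq 0's cell to seq 1 as well (no new cell is allocated).
llama_kv_cache_seq_cp(ctx, /*seq_id_src*/ 0, /*seq_id_dst*/ 1, /*p0*/ pos, /*p1*/ pos + 1);
```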


This way legacy (current) cache handling would remain supported (e.g. someone would still be allowed to store the KV of such tokens in separate cells), and the caching that is more efficient for follow-up decode calls would be supported as well.

The 2D array for requires_logits seems to be a requirement for this approach, to handle cases like:

seqs_to_decode: [
    {
        token_ids:       [0, 1, 2, 3],
        requires_logits: [f, f, f, f] // no logits
    },
    {
        token_ids:       [0, 1, 2, 3],
        requires_logits: [f, f, f, f] // no logits
    },
    {
        token_ids:       [0, 1, 2, 3],
        requires_logits: [t, t, t, t] // all require logits
    }
] // first and second seqs get grouped, last doesn't

-- so for new tokens in a batch to share a cell in the cache, they'd now need to share each of (token_id, pos_in_seq, requires_logits), instead of the current (token_id, pos_in_seq).
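As a small sketch of that stricter grouping key (hypothetical, names made up):

```c
// Hypothetical: two new batch entries could only be merged into one cell
// if all three of these fields match.
struct cell_share_key {
    llama_token token_id;        // same token
    llama_pos   pos_in_seq;      // same position within the sequence
    bool        requires_logits; // same "wants logits" flag
};
```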


Those are my thoughts. I didn't find any errors in the logic after all!

Let me know your thoughts and/or if/how I can help. I'm not proficient enough with C++ to get hands-on in llama.cpp's internals but I could probably help brainstorm more with high-level user stories and theory/tests!

ggerganov commented 5 days ago

I was thinking about this some more and realized that this is currently supported with the existing API. Going back to the `Oi, I am X, the mighty. As I, X, said, my name is '` example, after you have evaluated the `Oi, I am X, the mighty. As I, X, ` sequences, you can submit the `said, my name is` batch for sequence 0 and then call llama_kv_cache_seq_cp to assign the tokens to the rest of the sequences. This copy operation does not allocate new cells - it simply assigns more sequences to the existing cells.
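A sketch of that flow (assuming all prefixes were evaluated with the same length `n_prefix`, that `suffix`/`n_suffix` hold the tokenized `said, my name is `, and that there are `n_seqs` sequences in total):

```c
// Evaluate the shared suffix once, for sequence 0 only.
llama_batch batch = llama_batch_init(n_suffix, 0, 1);
for (int i = 0; i < n_suffix; ++i) {
    batch.token[i]     = suffix[i];
    batch.pos[i]       = n_prefix + i;        // positions continue after the prefix
    batch.n_seq_id[i]  = 1;
    batch.seq_id[i][0] = 0;
    batch.logits[i]    = (i == n_suffix - 1); // logits only for the last suffix token
}
batch.n_tokens = n_suffix;
llama_decode(ctx, batch);

// Assign the same cells to the remaining sequences: no new cells are allocated,
// the existing cells simply gain the extra sequence ids.
for (llama_seq_id s = 1; s < n_seqs; ++s) {
    llama_kv_cache_seq_cp(ctx, 0, s, n_prefix, n_prefix + n_suffix);
}

llama_batch_free(batch);
```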

Lyrcaxis commented 5 days ago

It's all supported except retrieving logits for multiple sequences that share the same cell. (e.g.: if `logits = true` is set for the shared `'`, only a single [vocab_count] array is returned, instead of [num_seq * vocab_count])