ggerganov / llama.cpp

LLM inference in C/C++
MIT License

Feature Request: Add vocabulary type for token-free models that work on raw bytes #7763

Open uwu-420 opened 1 month ago

uwu-420 commented 1 month ago

Feature Description

I think it would be useful if llama.cpp supported a vocabulary type that doesn't really have tokens but works directly on raw bytes. Something like LLAMA_VOCAB_TYPE_RAW_BYTES would be added to enum llama_vocab_type (rough sketch below), but I don't know what kind of changes that would imply elsewhere. That kind of vocabulary would still require special tokens, of course.
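For concreteness, here's roughly what I have in mind. The existing entries are reproduced from memory and may not match the current header exactly, and the name and numbering of the new value are just placeholders:

```c
// Hypothetical addition to enum llama_vocab_type in llama.h.
enum llama_vocab_type {
    LLAMA_VOCAB_TYPE_NONE      = 0, // models without an explicit vocabulary
    LLAMA_VOCAB_TYPE_SPM       = 1, // SentencePiece-style BPE with byte fallback
    LLAMA_VOCAB_TYPE_BPE       = 2, // GPT-2-style byte-level BPE
    LLAMA_VOCAB_TYPE_WPM       = 3, // WordPiece
    LLAMA_VOCAB_TYPE_UGM       = 4, // Unigram
    // proposed: token ids 0..255 map directly to byte values,
    // special tokens (BOS/EOS/...) are appended after 255
    LLAMA_VOCAB_TYPE_RAW_BYTES = 5,
};
```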

Motivation

There's already some interesting research about making token-free LLMs work:

And I think this is going to become even more relevant in the future. To quote Andrej Karpathy: "I would love nothing more than to be able to feed raw byte sequences into language models".

Possible Implementation

No response

teleprint-me commented 3 weeks ago

I don't have the time or bandwidth to go through this at the moment, but I am curious.

Raw bytes have been considered multiple times in the past, and they increase the computational cost of both training and inference, so how is this any different? Using raw UTF-8 sequences would be nice because it would technically be "vocabulary-free".

However, the rationale for using merged tokens (BPE) is typically that it's a compromise between full words (vocabulary size) and raw characters (UTF-8). Character-based vs. sub-word tokenization isn't a novel question, and these tradeoffs are already well documented.
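To put the sequence-length cost in concrete terms: a byte-level "tokenizer" is essentially a no-op, and the price is paid entirely in sequence length. A minimal sketch (the ~4 bytes per BPE token figure in the comment is a commonly quoted ballpark, not a measurement):

```cpp
// Byte-level "tokenization": every UTF-8 byte becomes one id in 0..255.
#include <cstdint>
#include <cstdio>
#include <string>
#include <vector>

static std::vector<int32_t> tokenize_raw_bytes(const std::string & text) {
    std::vector<int32_t> ids;
    ids.reserve(text.size());
    for (unsigned char b : text) {
        ids.push_back(b); // the byte value is the token id
    }
    return ids;
}

int main() {
    const std::string text = "Tokenization is a compromise between words and characters.";
    const auto ids = tokenize_raw_bytes(text);
    // A BPE vocab would typically cover this with roughly 1/4 as many tokens,
    // and attention cost grows quadratically with sequence length.
    std::printf("%zu bytes -> %zu byte-level ids\n", text.size(), ids.size());
    return 0;
}
```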

uwu-420 commented 3 weeks ago

Thanks for bringing that up. Those are all valid and true points. I'd still like to provide some more context.

Concerning the extra computational effort, the disadvantage could be compensated for by advances in speculative decoding or by using multiple decoding heads (Medusa), which at least improves things on the inference side.
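As a rough back-of-the-envelope for that (all numbers below are assumptions, not measurements):

```cpp
// Why speculative decoding could offset longer byte-level sequences:
// if a cheap draft proposes several bytes per target forward pass and most
// of them are accepted, the number of target passes per unit of text stays
// comparable to a BPE model. All constants here are illustrative guesses.
#include <cstdio>

int main() {
    const double bytes_per_bpe_token = 4.0; // rough average for English text
    const double accept_rate         = 0.8; // fraction of drafted bytes accepted
    const int    draft_len           = 8;   // bytes drafted per target pass

    // expected bytes emitted per target-model forward pass
    const double bytes_per_pass   = 1.0 + accept_rate * draft_len;
    // target passes needed to produce one "BPE token worth" of text
    const double passes_per_token = bytes_per_bpe_token / bytes_per_pass;

    std::printf("target passes per BPE-token-equivalent: %.2f\n", passes_per_token);
    return 0;
}
```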

Imo it feels like we're stuck in a local optimum with tokenization methods like BPE. It's the best we have at the moment, but it's still fundamentally flawed. Think of current LLMs failing at tasks such as reversing words or counting letters: that's mostly down to subword tokens. Brittleness in the face of typos is another issue that comes to mind; the ByT5 paper explicitly shows that byte-level models handle this much better.

teleprint-me commented 3 weeks ago

I agree on the brittleness of current models; those issues are well known. There's PR #7187 for token healing, which handles cases where incomplete tokens cause problems.
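For reference, the general idea behind token healing as I understand it; this is only a sketch of the concept, not the actual implementation in that PR:

```cpp
// Token healing (conceptually): drop the trailing partial token from the
// prompt and only allow continuations that are consistent with the
// characters that were dropped.
#include <cstdint>
#include <string>
#include <vector>

struct candidate {
    int32_t     id;
    std::string text; // detokenized piece for this token id
};

// Keep only candidates consistent with the removed prompt suffix.
static std::vector<candidate> heal_filter(const std::vector<candidate> & cands,
                                          const std::string & removed_suffix) {
    std::vector<candidate> out;
    for (const auto & c : cands) {
        // either the token's text starts with the removed suffix,
        // or it is a prefix of it (the rest gets constrained on later steps)
        const bool token_extends_suffix = c.text.rfind(removed_suffix, 0) == 0;
        const bool token_is_prefix      = removed_suffix.rfind(c.text, 0) == 0;
        if (token_extends_suffix || token_is_prefix) {
            out.push_back(c);
        }
    }
    return out;
}
```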

Even if all of these issues are solved, that doesn't address the larger question of the embedding space for the vocabulary. The vocabulary still needs to be mapped to values: the encoder and decoder translate between the raw input/output and their numerical representations.

In this context, those numerical values would represent language(s).
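To make that concrete for the raw-byte case: the mapping doesn't go away, the embedding table just shrinks to 256 entries plus the special tokens. The sizes below are illustrative assumptions, not numbers from any particular model:

```cpp
// Embedding table size: raw bytes vs. a typical BPE vocabulary.
#include <cstdio>

int main() {
    const long long n_embd      = 4096;            // hidden size (assumed)
    const long long n_special   = 3;               // e.g. BOS/EOS/PAD (assumed)
    const long long vocab_bpe   = 128000;          // typical large BPE vocab (assumed)
    const long long vocab_bytes = 256 + n_special; // raw-byte vocabulary

    std::printf("BPE embedding params:      %lld\n", vocab_bpe   * n_embd);
    std::printf("raw-byte embedding params: %lld\n", vocab_bytes * n_embd);
    return 0;
}
```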

I have skimmed the ByT5 and "Bytes Are All You Need" papers before, though I haven't dug into them as much as I'd like.

Not sure if Medusa is really the answer, although reducing MatMul operations might help.

Reducing dependency on other models (for augmentation or speculation) would be ideal. I'd prefer to simplify components instead of compounding them.

There's always value in exploring any of these avenues, so I don't say this to deter you. Reducing uncertainty about where things actually stand is valuable in itself.

I think it's worth mentioning discussion #7732 here as well, as it's relevant.