because Tokenizer.tokenize() always adds a space in front of the string passed in.
29871 is the space
just to explain what I meant by "space", see here: https://github.com/google/sentencepiece/blob/master/README.md#whitespace-is-treated-as-a-basic-symbol
because Tokenizer.tokenize() always adds a space in front of the string passed in.
I don't think it does that, because I just tested it here. But maybe this happens elsewhere.
For example, tokenizing the prompt "Hello world" and " Hello world" gives the following:
For prompt "Hello world": llama.cpp tokenizer: [10994, 3186] Meta tokenizer: [15043, 3186] For prompt " Hello world": llama.cpp tokenizer: [15043, 3186] Meta tokenizer: [29871, 15043, 3186]
Running the tests I see the Meta tokens now. But I surely need guidance on how to integrate the necessary changes into llama.cpp.
BTW: I'd expect a couple more problems for the other tests.
@viniciusarruda and others: nice find, indeed! Now that I've learned a bit about the tokenizer: couldn't this kind of problem occur not only with LLaMA but also with derived models using different tokenizers (with a different vocabulary, for example)? Isn't the common theme here sentencepiece compatibility? You made a big step with your notebook (which I've already used for testing, thanks!); could you advise us regarding our future testing strategy?
I am curious: the OpenAI repo (How_to_count_tokens_with_tiktoken) says that encodings specify how text is converted into tokens and that different models use different encodings. So I wonder if we are on the right track: do we only need to match the Meta tokenizer?
I'd like to add to this discussion that the real LLaMA tokenizer eats spaces at the beginning, making it very bad for continuous generation. Llamacpp's tokenizer was previously tested to give the same results as the HF tokenizer with our workaround. So for continuous generation the behavior might be better. If this is changed in the code, it would be nice if it were an optional change, since a tokenizer that begins eating spaces breaks these use cases.
A tokenizer that eats spaces is only desirable for applications with a separate output window.
I'd like to add to this discussion that the real LLaMA tokenizer eats spaces at the beginning, making it very bad for continuous generation. Llamacpp's tokenizer was previously tested to give the same results as the HF tokenizer with our workaround.
Do you have a possible test scenario for the proposed PR in mind? I don't know how to check that otherwise.
Yes, and the upstream tokenizer workarounds are not perfect, so I will give you a method to test the ideal scenario. Our UI strips a space at the end of a generation because most models cannot handle it well if it is already there, since their tokens often have spaces in front of them; so we let the model decide whether there should be a space or not.
So imagine a sentence that does not end with a space, such as "This is an example.", in a UI that lets users add to that text. The correct scenario would be the generation " It works very well.", which combined forms "This is an example. It works very well."
The original LLaMA tokenizer, at least on HF, doesn't do this because they trained sentencepiece in a mode where it swallows the first space. So you'd get "It works very well.", resulting in "This is an example.It works very well".
Likewise, you also don't want to always insert the space if you don't have to: "My favorite sport is foot" should auto-complete to "ball and I play it every week", not " ball and I play it every week".
Adhering to that behavior would make it usable for continuous generation use cases, because then the text flows naturally and the tokenizer picks the most appropriate token for the end result, rather than never picking a space or always picking one.
On the Hugging Face side we force the tokenizer's hand: we insert a random token such as a comma at the front of the input, and then, after it's tokenized, we remove the first fake token, which results in proper continuous generation. The behavior we typically observed from Llamacpp is very similar to that workaround. We originally implemented workarounds like this because Llamacpp did a better job, and it has helped us in the past to identify when things were wrong.
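A rough sketch of the dummy-token workaround described above, using the Hugging Face transformers API (the tokenizer path and the helper name are placeholders, not the commenter's actual code):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("path/to/llama-tokenizer")  # assumed local tokenizer path

def tokenize_continuation(text: str) -> list[int]:
    # prepend a throwaway comma so the tokenizer cannot swallow a leading space
    # of the real input, then drop the id of that fake first token again
    ids = tok.encode("," + text, add_special_tokens=False)
    return ids[1:]

As noted, this is not perfect: if the comma happens to be merged with the following characters into a single token, the slice removes more than just the fake token.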
Ad 1.
.\build\bin\Release\main.exe -m models\llama-7B\ggml-model-q4_0.bin -p "This is an example." --temp 0 --color
With main:
This is an example. It's a bit of a mess, but it shows how to use the new features in the latest version of the library.
The first thing you need to do is create a new project and add the library to it. You can find out more about this here.
With PR:
This is an example. It's a bit of a mess, but it shows how to use the new features in the latest version of the library.
The first thing you need to do is create a new project and add the library to it. You can find out more about this here.
[...]
Ad 2.
.\build\bin\Release\main.exe -m models\llama-7B\ggml-model-q4_0.bin -p "My favorite sport is foot" --temp 0 --color
With main:
My favorite sport is foot ball. I like to play it and watch it on TV.
I like to play football, too. It's my favourite sport. [end of text]
With PR:
My favorite sport is foot ball. I like to play it and watch it on TV.
I like to play football, too. It's my favourite sport. [end of text]
I'm comparing the tokenization between the original Meta repo and llama.cpp with LLaMA (I also had the same issue with LLaMA v2).
I believe I understand that the current llama.cpp tokenizer compromises between LLaMA compatibility and the ability to process similar LLMs. I also think we can increase compatibility (slightly?), but that will need some more testing for the other LLMs.
Maybe this warrants further investigation. I just ran
.\build\bin\release\perplexity.exe -m models\llama-7B\ggml-model-q4_0.bin -f datasets\wikitext-2-raw\wiki.test.raw --chunks 3
.\build\bin\release\perplexity.exe -m models\llama-7B\ggml-model-q4_1.bin -f datasets\wikitext-2-raw\wiki.test.raw --chunks 3
.\build\bin\release\perplexity.exe -m models\llama-7B\ggml-model-q5_0.bin -f datasets\wikitext-2-raw\wiki.test.raw --chunks 3
.\build\bin\release\perplexity.exe -m models\llama-7B\ggml-model-q5_1.bin -f datasets\wikitext-2-raw\wiki.test.raw --chunks 3
.\build\bin\release\perplexity.exe -m models\llama-7B\ggml-model-q8_0.bin -f datasets\wikitext-2-raw\wiki.test.raw --chunks 3
Main:
perplexity: calculating perplexity over 3 chunks, batch_size=512
perplexity: 71.03 seconds per pass - ETA 3 minutes
[1]4.4063,[2]4.9216,[3]5.8127,
perplexity: calculating perplexity over 3 chunks, batch_size=512
perplexity: 76.90 seconds per pass - ETA 3 minutes
[1]4.3950,[2]4.8454,[3]5.7376,
perplexity: calculating perplexity over 3 chunks, batch_size=512
perplexity: 102.27 seconds per pass - ETA 5 minutes
[1]4.2354,[2]4.7553,[3]5.6449,
perplexity: calculating perplexity over 3 chunks, batch_size=512
perplexity: 102.33 seconds per pass - ETA 5 minutes
[1]4.2480,[2]4.7375,[3]5.6165,
perplexity: calculating perplexity over 3 chunks, batch_size=512
perplexity: 121.01 seconds per pass - ETA 6 minutes
[1]4.2251,[2]4.7076,[3]5.5812,
With PR:
perplexity: calculating perplexity over 3 chunks, batch_size=512
perplexity: 72.39 seconds per pass - ETA 3 minutes
[1]4.4161,[2]4.9270,[3]5.8170,
perplexity: calculating perplexity over 3 chunks, batch_size=512
perplexity: 77.84 seconds per pass - ETA 3 minutes
[1]4.3436,[2]4.8170,[3]5.7152,
perplexity: calculating perplexity over 3 chunks, batch_size=512
perplexity: 105.67 seconds per pass - ETA 5 minutes
[1]4.2477,[2]4.7622,[3]5.6503,
perplexity: calculating perplexity over 3 chunks, batch_size=512
perplexity: 99.47 seconds per pass - ETA 4 minutes
[1]4.2295,[2]4.7272,[3]5.6083,
perplexity: calculating perplexity over 3 chunks, batch_size=512
perplexity: 120.07 seconds per pass - ETA 6 minutes
[1]4.2187,[2]4.7040,[3]5.5784,
Found this, which seems to fix at least the test cases I have. I don't know why, and I didn't test setting bos to True.
I updated the code to produce the results by adding the b' '.
It might be helpful to understand the behavior of sentencepiece. In this notebook cell ("Load model from byte stream"), it includes a "▁" (U+2581) even before the first word. So, if the reverse mapping requires a space before the token, it will not work.
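To make that "▁" behavior visible, it can help to look at the pieces rather than only the ids; a small sketch, again assuming a local LLaMA tokenizer.model:

import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="tokenizer.model")

print(sp.encode_as_pieces("Hello world"))  # ['▁Hello', '▁world']: note the leading ▁ even on the first word
print(sp.id_to_piece(29871))               # '▁', the bare space piece mentioned earlier in the thread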
@viniciusarruda: tokenizer fixes are in via the gguf branch. Does this work for you?
In the meantime I tried to understand possible invariants for the sentencepiece tokenizer with the following tests:
#define _CRT_SECURE_NO_WARNINGS
#include <cassert>
#include <cstdlib>
#include <iostream>
#include <algorithm>
#include <string>
#include <vector>
#include "sentencepiece_processor.h"
#include "console.h"
#include "utf.h"

// random ASCII test string (letters, digits and whitespace)
std::string random_string(size_t length) {
    auto randchar = []() -> char {
        const char charset[] =
            " \t\n"
            "0123456789"
            "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
            "abcdefghijklmnopqrstuvwxyz";
        const size_t max_index = sizeof(charset) - 1;
        return charset[rand() % max_index];
    };
    std::string str(length, 0);
    std::generate_n(str.begin(), length, randchar);
    return str;
}

int main(int, char**) {
    console_init();

    auto spp    = sentencepiece::SentencePieceProcessor();
    auto result = spp.Load("lama/llama-7B/tokenizer.model");
    if (!result.ok()) {
        std::cerr << "failed to load tokenizer model" << std::endl;
        return 1;
    }

    // invariant 0: encode/decode round trip for every single Unicode codepoint
    // (skipping the surrogate range)
    for (codepoint cp = 0x0000; cp <= 0xffff; ++cp) {
        if (cp < 0xd800 || cp > 0xdfff) {
            auto utf8  = to_utf8(cp);
            auto ids   = spp.EncodeAsIds(utf8);
            auto check = spp.DecodeIds(ids);
            if (utf8 != check)
                std::cout << cp << " >" << utf8 << "< >" << check << "<" << std::endl;
        }
    }
    for (codepoint cp = 0x10000; cp <= 0x10ffff; ++cp) {
        auto utf8  = to_utf8(cp);
        auto ids   = spp.EncodeAsIds(utf8);
        auto check = spp.DecodeIds(ids);
        if (utf8 != check)
            std::cout << cp << " >" << utf8 << "< >" << check << "<" << std::endl;
    }

    // invariant 1: splitting the *text* at an arbitrary position, encoding and
    // decoding both halves, then concatenating should give back the original text
    for (auto i = 0; i < 1024; ++i) {
        auto str = random_string(64);
        for (size_t j = 0; j < str.length(); ++j) {
            auto ids1 = spp.EncodeAsIds(str.substr(0, j));
            auto ids2 = spp.EncodeAsIds(str.substr(j));
            auto str1 = spp.DecodeIds(ids1);
            auto str2 = spp.DecodeIds(ids2);
            if (str1 + str2 != str)
                std::cout << "1: >" << str1 << "< >" << str2 << "< >" << str << "<" << std::endl;
        }
    }

    // invariant 2: splitting the *id sequence* at an arbitrary position and
    // decoding both halves should also give back the original text
    for (auto i = 0; i < 1024; ++i) {
        auto str = random_string(64);
        auto ids = spp.EncodeAsIds(str);
        for (size_t j = 0; j < ids.size(); ++j) {
            auto str1 = spp.DecodeIds(std::vector(ids.begin(), ids.begin() + j));
            auto str2 = spp.DecodeIds(std::vector(ids.begin() + j, ids.end()));
            if (str1 + str2 != str)
                std::cout << "2: >" << str1 << "< >" << str2 << "< >" << str << "<" << std::endl;
        }
    }

    return 0;
}
What would you expect to see?
9601 >▁< > <
2: >tcNSM8fcwo< >nmDL5G 2xEwwnFiX9zbp4E XZlKuKbViNTGRNMcd yI31iwR6LKa6< >tcNSM8fcwo nmDL5G 2xEwwnFiX9zbp4E XZlKuKbViNTGRNMcd yI31iwR6LKa6<
[snip]
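The "2:" failures can also be reproduced with a couple of lines of Python: when a decoded id sequence starts with a "▁..." piece, sentencepiece strips the leading space, so concatenating the decoded halves loses a space. A minimal illustration, assuming a local LLaMA tokenizer.model:

import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="tokenizer.model")

ids   = sp.encode_as_ids("Hello world")  # pieces: ['▁Hello', '▁world']
left  = sp.decode(ids[:1])               # 'Hello'
right = sp.decode(ids[1:])               # 'world': the space in front of '▁world' is dropped
print(left + right == "Hello world")     # False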
When rechecking test-tokenizer-1 I noticed someone decided to let these tests run only for BPE tokenizers, and enabling them for sentencepiece-based ones seems broken.
When I added the BPE tokenizer for Falcon, I disabled test-tokenizer-1, since my understanding was that it is supposed to be used only with BPE tokenizers, and the current implementation of the BPE tokenizer is known not to work. I didn't think it was supposed to work with SPM.
When I added the BPE tokenizer for Falcon, I disabled test-tokenizer-1, since my understanding was that it is supposed to be used only with BPE tokenizers, and the current implementation of the BPE tokenizer is known not to work. I didn't think it was supposed to work with SPM.
A fast-moving project, I see. I'm working on fixes now that I have some kind of test bed. But this looks like a tough one, so don't expect results immediately.
@goerch Thank you for looking into it. Things are indeed moving fast and mistakes can happen.
The top priority is having the SPM tokenizer work 100% correctly. After #2810 I thought we had achieved this, but based on your findings it might not be the case yet. The test-tokenizer-0 test should be expanded with examples where we currently fail. The test-tokenizer-0.py script can be used to generate ground truth.
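For what it's worth, ground truth of that kind can also be generated directly with sentencepiece; this is only a sketch, not the actual test-tokenizer-0.py, and the model path and example strings are placeholders:

import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="tokenizer.model")

# cases where llama.cpp currently disagrees would be added here
cases = ["Hello world", " Hello world", "This is an example."]
for text in cases:
    ids = sp.encode_as_ids(text)
    print(repr(text), ids, repr(sp.decode(ids)))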
@ggerganov: I believe some of the remaining issues could be related to this comment. @howard0su, @klosax: AFAIU nobody has looked into this so far; any reason for it?
The test-tokenizer-0 test should be expanded with examples where we currently fail. The test-tokenizer-0.py script can be used to generate ground truth.
Nice. I'll try this one next to cover the Unicode character set (that should at least demonstrate my 'high byte' (>0xff) problem). Any thoughts on Unicode normalization as mentioned here?
Let's create a PR with a few failing cases and I will try to give it more visibility so we get some help with it. I'm afraid I'm out of my depth here and won't be able to understand how the normalization works and what it takes to implement it, so hopefully people familiar with it will help out.
Nice. I'll try this one next to cover the Unicode character set (that should at least demonstrate my 'high byte' (>0xff) problem). Any thoughts on Unicode normalization as mentioned here?
I was mistaken: testing high byte output with
for tok in range(0, 3 + 256):
print("'" + tokenizer.decode([tok]) + "'")
and
for (int i = 0; i < 3 + 256; ++i) {
std::string str = llama_detokenize_spm(ctx, std::vector<int>(1, i));
fprintf(stdout, "%s\n", str.c_str());
}
shows identical output.
If @viniciusarruda and @ggerganov don't disagree: I think we should close this issue and wait for possible Unicode normalization issues to show up (otherwise I need help to design a test case).
I'm comparing the tokenization between the original Meta repo and llama.cpp with LLaMA (I also had the same issue with LLaMA v2).
For example, tokenizing the prompt "Hello world" and " Hello world" gives the following:
Exploring the tokens and doing the detokenization, I got:
Exploring each token above with the id_to_piece functionality:
But, using the detokenizer on each id individually:
The code used to produce these results can be seen here. Use this file for the Meta tokenizer. The model ggml-model-f16.bin is the 7B LLaMA model after using the convert.py script as mentioned here.