ggerganov / llama.cpp

LLM inference in C/C++

MPT: tokenization crashes #3604

Closed jploski closed 5 months ago

jploski commented 11 months ago

Testing against 5974d617 and https://github.com/ggerganov/llama.cpp/pull/3538

While running

bin/main --mlock -m /mnt/f2fs/mpt/ggml-model-mpt-7b-storywriter-f16-q5_1.gguf -t 1 -ngl 999 -p 'Once upon a time' --temp 0.8 --top_p 0.98 -c 2048 --keep -1 --repeat_penalty 1 -n 1024

It consistently crashes after a few (~hundred) tokens with this backtrace:

Thread 1 "main" received signal SIGABRT, Aborted.
__GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
50  ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
(gdb) bt
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#1  0x00007fffcefae535 in __GI_abort () at abort.c:79
#2  0x00007fffcf376983 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#3  0x00007fffcf37c8c6 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#4  0x00007fffcf37c901 in std::terminate() () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#5  0x00007fffcf37cb34 in __cxa_throw () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#6  0x00007fffcf37886b in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#7  0x0000555555612915 in std::__detail::_Map_base<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, unsigned char>, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, unsigned char> >, std::__detail::_Select1st, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<true, false, true>, true>::at (this=0x555555b2fa20 <unicode_to_bytes_bpe(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)::map>, __k=" ")
    at /usr/include/c++/8/bits/hashtable_policy.h:760
#8  0x000055555560b295 in std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, unsigned char, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, unsigned char> > >::at (this=0x555555b2fa20 <unicode_to_bytes_bpe(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)::map>, __k=" ") at /usr/include/c++/8/bits/unordered_map.h:991
#9  0x00005555555cc65a in unicode_to_bytes_bpe (utf8=" ") at /mnt/seagate/dalai/llama.cpp.bak/unicode.h:460
#10 0x00005555555f8398 in llama_decode_text (text="  ") at /mnt/seagate/dalai/llama.cpp.bak/llama.cpp:9703
#11 0x00005555555f8663 in llama_token_to_piece (model=0x5555a4966c40, token=50276, buf=0x555555e3c6b0 "", length=8) at /mnt/seagate/dalai/llama.cpp.bak/llama.cpp:9746
#12 0x000055555557e0ce in llama_token_to_piece[abi:cxx11](llama_context const*, int) (ctx=0x5555b579e350, token=50276) at /mnt/seagate/dalai/llama.cpp.bak/common/common.cpp:894
#13 0x00005555555b028b in llama_sampling_sample (ctx=0x5555b579e350, ctx_guidance=0x0, ctx_sampling=..., last_tokens=std::vector of length 2048, capacity 2048 = {...}, 
    candidates=std::vector of length 50432, capacity 50432 = {...}, idx=0, seq=0) at /mnt/seagate/dalai/llama.cpp.bak/common/sampling.cpp:151
#14 0x000055555556a2aa in main (argc=22, argv=0x7fffffffdcd8) at /mnt/seagate/dalai/llama.cpp.bak/examples/main/main.cpp:648

(The GGUF/vocab was exported using convert-mpt-hf-to-gguf.py from the aforementioned commit.)

jploski commented 11 months ago

I'm also seeing the same crash for mpt-30b-instruct (no wonder - it uses the same vocab/tokenizer).

jploski commented 11 months ago

@goerch Please see above. With this "obvious" hack/fix it does not crash:

diff --git a/unicode.h b/unicode.h
index aeca879..eaf7134 100644
--- a/unicode.h
+++ b/unicode.h
@@ -452,6 +452,7 @@ static std::unordered_map<std::string, uint8_t> unicode_to_bytes_map_bpe() {
             ++n;
         }
     }
+    map[codepoint_to_utf8(u' ')] = ' ';
     return map;
 }

But I don't understand how the particular hardcoded character ranges in unicode_to_bytes_map_bpe were selected, so I'm not sure whether this is the correct approach.
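
(For reference, those ranges match GPT-2's byte-level BPE table: printable Latin-1 bytes map to themselves, and every other byte - including 0x20, the space - is remapped above U+0100. A minimal Python sketch of that standard scheme, for orientation only, not the llama.cpp code:)

def bytes_to_unicode():
    # printable bytes that map to themselves: '!'..'~', '¡'..'¬', '®'..'ÿ'
    bs = (list(range(ord("!"), ord("~") + 1))
          + list(range(ord("¡"), ord("¬") + 1))
          + list(range(ord("®"), ord("ÿ") + 1)))
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:           # control chars, space, DEL, NBSP, soft hyphen, ...
            bs.append(b)
            cs.append(256 + n)    # remapped above U+0100
            n += 1
    return dict(zip(bs, map(chr, cs)))

# bytes_to_unicode()[0x20] == "Ġ" (U+0120), so the reverse table used by
# unicode_to_bytes_bpe has an entry for "Ġ" but none for a raw " ".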

staviq commented 11 months ago

Testing against 5974d61 and #3538

I'm confused: #3538 didn't touch any of the code from that backtrace, and it isn't even merged yet?

cebtenzzre commented 11 months ago

This is because of the reasons described in #3405. I have not yet identified how to port the changes in that pull request to goerch's new tokenizer code. How do we currently differentiate between normal and added tokens for MPT in the C++ code?

staviq commented 11 months ago

@cebtenzzre

Looking at unicode.h

static uint8_t unicode_to_bytes_bpe(const std::string & utf8) {
    static std::unordered_map<std::string, uint8_t> map = unicode_to_bytes_map_bpe();
    return map.at(utf8);
}

There is no validation before .at()

I think it should be at least something along the lines of:

static uint8_t unicode_to_bytes_bpe(const std::string & utf8) {
    static std::unordered_map<std::string, uint8_t> map = unicode_to_bytes_map_bpe();
    const auto it = map.find(utf8);
    return it != map.end() ? it->second : 0; // fall back instead of throwing
}

cebtenzzre commented 11 months ago

I think it should be at least something along the lines of:

I think that's wrong. This code is assuming that all tokens in the vocabulary are specially encoded so they are valid JSON strings (which makes me wonder why we are now saving them this way in the GGUF files, as GGUF does not have the same restrictions). But added tokens are not specially encoded, so we must not call unicode_to_bytes_bpe on them. All normal tokens are guaranteed to be encoded this way.

staviq commented 11 months ago

@cebtenzzre I didn't mean it as a fix for this issue, just in general: the map doesn't contain all possible strings, so map.at() will throw (and crash) if something not in the map is passed as the argument. It's not that important, and it doesn't seem to be the cause here.

I sort of think I know what might be going on here.

The Unicode codepoint for space is the same as ASCII: 0x20 (32).

unicode_to_bytes_bpe crashes when a space is passed as an argument, because the map has no entry under that key.

When the map is created, space is added by the last for loop, under a key offset by 256:

https://github.com/ggerganov/llama.cpp/blob/1e0e873c373c33989beb6bc64d83cd572ab7fe2b/unicode.h#L434-L456

Except that space comes from codepoint_to_utf8 in

https://github.com/ggerganov/llama.cpp/blob/1e0e873c373c33989beb6bc64d83cd572ab7fe2b/llama.cpp#L9437

Which doesn't use any offsets and returns a normal space for that codepoint:

https://github.com/ggerganov/llama.cpp/blob/1e0e873c373c33989beb6bc64d83cd572ab7fe2b/unicode.h#L225-L229

But tracing this back even further, the " " comes directly from the vocab / token.text, except, as I understand it, a space in a BPE vocab should be "Ġ" (that's codepoint 256+32)?

Tracing it even further back, wouldn't that mean convert-mpt-hf-to-gguf.py imports the tokens (their "text") wrong?

@cebtenzzre, is that what you meant? Am I understanding this correctly?

Even if I'm wrong here, I think this bit of code could really use some, even rudimentary, comments.

cebtenzzre commented 11 months ago

Tracing it even further back, wouldn't that mean convert-mpt-hf-to-gguf.py imports the tokens (their "text") wrong?

Not anymore, because goerch recently moved all of the vocab conversion to llama.cpp. It is no longer the conversion script's responsibility. The different parts of the JSON have different ways of encoding the vocabulary - added tokens are stored as-is, whereas normal tokens need bytes_to_unicode/unicode_to_bytes. My PR changed the conversion script to only decode the normal tokens, but we no longer decode any tokens in the conversion script, so that will not work.

I believe in the map.at(utf8) call, the input comes directly from the vocabulary, so it should only crash if the code is incorrect (which it is in this case, because we should not call this function on added tokens) or if the GGUF file is corrupt/incorrectly converted. I don't think silently generating blank output tokens is a good idea in either of those cases.
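
In other words, the decode side has to branch on the token type. A rough sketch of that rule (illustrative names, not the actual llama.cpp functions):

def token_to_piece(token_text, token_type, unicode_to_bytes):
    # added / user-defined / control tokens keep their literal text
    if token_type in ("USER_DEFINED", "CONTROL"):
        return token_text.encode("utf-8")
    # normal BPE tokens are stored byte-encoded ("Ġ" = space, "Ċ" = newline, ...)
    # and must be mapped back one character at a time
    return bytes(unicode_to_bytes[ch] for ch in token_text)

With the current conversion marking everything NORMAL, the MPT double-space token takes the second branch and the lookup fails, which is the crash in the backtrace above.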

staviq commented 11 months ago

I was testing something else when I noticed the entire vocab for mpt-7b-storywriter is marked as LLAMA_TOKEN_TYPE_NORMAL when it is loaded in llm_load_vocab, so just to be sure, I re-converted (and quantized) again with current master (11dc1091f64b24ca6d643acc6d0051117ba60161) and it's the same.

So if all (BPE) tokens are marked as normal in the .gguf, how can we tell which tokens should get the bytes_to_unicode/unicode_to_bytes treatment ?

In the case of mosaicml/mpt-7b-storywriter, special/added tokens are only listed in tokenizer.json, but during conversion that JSON file seems to be used (in gguf.py) only to get the IDs for special_token_types: tuple[str, ...] = ('bos', 'eos', 'unk', 'sep', 'pad')

I'm not sure if I'm missing something or I got this wrong.

jploski commented 11 months ago

I was testing something else when I noticed the entire vocab for mpt-7b-storywriter is marked as LLAMA_TOKEN_TYPE_NORMAL when it is loaded in llm_load_vocab, so just to be sure, I re-converted (and quantized) again with current master (11dc109) and it's the same.

So if all (BPE) tokens are marked as normal in the .gguf, how can we tell which tokens should get the bytes_to_unicode/unicode_to_bytes treatment ?

In the case of mosaicml/mpt-7b-storywriter, special/added tokens are only listed in tokenizer.json, but during conversion that JSON file seems to be used (in gguf.py) only to get the IDs for special_token_types: tuple[str, ...] = ('bos', 'eos', 'unk', 'sep', 'pad')

I'm not sure if I'm missing something or I got this wrong.

I suspect you're right and the convert scripts (not just for MPT, but for any model with a BPE tokenizer) should be updated to classify tokens (based on https://github.com/ggerganov/llama.cpp/pull/3538#issuecomment-1758219448) as:

- CONTROL if found in added_tokens with special: true in tokenizer.json
- USER_DEFINED if found in added_tokens with special: false in tokenizer.json
- UNUSED if artificially added just to make the vocab match model.vocab_size (like the MPT pad tokens)
- NORMAL otherwise

But as it stands, CONTROL is used for all BPE added_tokens in convert.py (regardless of the "special" flag)... and all the other convert scripts just output all tokens as NORMAL...
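
A rough Python sketch of the classification rule above, against a HuggingFace-style tokenizer.json (the function and the type labels are illustrative, not the converter's actual code):

import json

def classify_tokens(tokenizer_json_path, model_vocab_size):
    with open(tokenizer_json_path, encoding="utf-8") as f:
        tok = json.load(f)
    added = {t["id"]: t for t in tok.get("added_tokens", [])}
    base_ids = set(tok["model"]["vocab"].values())   # the normal BPE vocabulary

    types = {}
    for i in range(model_vocab_size):
        if i in added:
            types[i] = "CONTROL" if added[i]["special"] else "USER_DEFINED"
        elif i in base_ids:
            types[i] = "NORMAL"
        else:
            types[i] = "UNUSED"  # padding entries that only exist to reach model.vocab_size
    return types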

ggerganov commented 11 months ago

Catching up with PRs - did I merge #3538 too soon? Sorry if that's the case. AFAIU, we need to implement the token classification as explained in @jploski's comment above, correct?

jploski commented 11 months ago

Catching up with PRs - did I merge #3538 too soon? Sorry if that's the case. AFAIU, we need to implement the token classification as explained in @jploski's comment above, correct?

I don't think the merge was too soon - at least in the sense that it doesn't make matters worse than they were without it. The bug is most likely in the convert scripts, and those were already merged earlier (I suspect I didn't catch the crash in my prior MPT testing because I was not generating random enough content for the special token to appear in the output).

I think it would make sense if @goerch also had a look at this issue - because convert-mpt was essentially me copying the convert-gptneox script with small adjustments.

staviq commented 11 months ago

Catching up with PRs - did I merge #3538 too soon? Sorry if that's the case. AFAIU, we need to implement the token classification as explained in @jploski's comment above, correct?

I verified this is unrelated (forgot to post that info earlier): it was reproducible on master prior to the #3538 merge.

goerch commented 11 months ago

With #3728 and

build/bin/Release/main.exe" -m models\mpt-7b-storywriter\ggml-model-f16.gguf -ngl 999 -p 'Once upon a time' --temp 0.8 --top_p 0.98 -c 2048 --keep -1 --repeat_penalty 1 -n 1024

I get

system_info: n_threads = 6 / 12 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 |
sampling:
        repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
        top_k = 40, tfs_z = 1.000, top_p = 0.980, typical_p = 1.000, temp = 0.800
        mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
generate: n_ctx = 2048, n_batch = 512, n_predict = 1024, n_keep = 4

Once upon a time there was a girl who went by the name of Della.

**Della** , the girl who had come to live in a place called the Mansion. The girl who got herself into trouble and was now in trouble. The girl who couldn't tell what was real and what wasn't.

She was also the girl who had found a secret.

She couldn't seem to get away from it, even now, in the middle of the night.

Della wasn't afraid of a haunted house. Not really.

Della had lived in haunted houses all her life. The mansion and its secrets were just the latest haunted house Della had found herself in.

Della felt tired. She lay in the bed in the nursery, watching the door. "Hello?" she said. "Anybody home?"

Then she heard a voice.

"I heard you! I heard you!"

"Who's there?" Della whispered, but there was no answer.

She felt a presence.

"Hello?" Della called out. "Are you okay? Am I dreaming?"

"I'm not dreaming," the voice said. It was a ghost's voice.

Della looked around the room, at the toys on the shelves, the pictures on the walls.

In the middle of the room, under the window, a shadow appeared. It was dark and shaped like a woman.

"Who are you?" Della asked, and then she heard the ghost's response.

"I'm your guardian angel," the ghost whispered, and Della realized that she was not alone.

"Guardian angel?" Della asked. "What are you doing here?"

"I'm here to protect you," the ghost said.

Della stood up and looked at the shadow. It stood next to the crib.

"You're my guardian angel?" Della asked. "How old are you?"

"I'm as old as the oldest ghost," the ghost said, "and younger than the newest ghost."

"Okay," Della said. "I guess that makes sense. Does that mean you're a little bit of a ghost?"

"I'm a little bit of a ghost," the ghost said, "and a little bit of a girl. I was a girl once. Now I'm a ghost."

Della looked at the guardian angel's body, then at her face. She was wearing a long white dress with lace at the wrists and at the collar.

Della was a little afraid. "You're a ghost, but your body is not a ghost's body."

"I'm not a ghost," the guardian angel said. "But I'm like a ghost. I'm not alive. I'm not dead. I'm the living dead."

Della frowned. "Then what are you?"

"I'm a ghost with a ghost body," the guardian angel said. "I'm a ghost with a ghost body that looks like a body, but it's not alive."

Della frowned. "Why are you here?"

"I want to help you," the ghost said, "because I love you."

Della frowned. "I don't love you."

"You do," the ghost said. "I can tell. You love me."

"I don't love you," Della said.

"You love me," the ghost said. "I can feel it."

"You're crazy," Della said.

"I'll prove it," the ghost said. "I'll tell you the truth about who you are."

Della frowned. "Why are you telling me the truth?"

"Because you're in so much trouble," the ghost said. "You need to know the truth."

Della looked at the guardian angel's face. "You can't know the truth."

"Yes, I can," the ghost said.

Della looked at the guardian angel's face. "You're the only one who can tell me the truth."

"I can tell you the truth," the ghost said, "because I know everything."

Della's eyes widened. "You do?"

"Yes," the guardian angel said. "I know everything."

Della frowned. "You're lying."

"No," the ghost said. "I'm not a ghost. I'm not a girl. I'm a ghost with a ghost body. I'm a ghost with a ghost body and I know everything."

Della's eyes narrowed. "You mean you don't know everything."

"No," the ghost said, "but I know everything."

Della squinted. "You can't know everything."
llama_print_timings:        load time =    5862.36 ms
llama_print_timings:      sample time =      56.62 ms /  1024 runs   (    0.06 ms per token, 18084.84 tokens per second)
llama_print_timings: prompt eval time =    6125.56 ms /     4 tokens ( 1531.39 ms per token,     0.65 tokens per second)
llama_print_timings:        eval time = 2244789.36 ms /  1023 runs   ( 2194.32 ms per token,     0.46 tokens per second)
llama_print_timings:       total time = 2251565.26 ms
Log end

Not sure if we have a different problem, though.

staviq commented 11 months ago

@goerch With that particular model, you have to let it spin for quite a while sometimes to trigger the problem, and at some point it will produce the double-space token, which causes the crash, because those spaces make it to the detokenizer as the ASCII space 0x20 instead of the expected 0x100+0x20 ("Ġ").

goerch commented 11 months ago

With that particular model, you have to let it spin for quite a while sometimes to trigger the problem

Thanks, the original report states:

It consistently crashes after a few (~hundred) tokens with this backtrace...

and I let it run for the requested 1024 tokens. Not sure how often I should repeat that?

Edit: BTW the patch should make sure that added tokens are USER_DEFINED in the detokenizer.

jploski commented 11 months ago

With that particular model, you have to let it spin for quite a while sometimes to trigger the problem

Thanks, the original report states:

It consistently crashes after a few (~hundred) tokens with this backtrace...

and I let it run for the requested 1024 tokens. Not sure how often I should repeat that?

Edit: BTW the patch should make sure that added tokens are USER_DEFINED in the detokenizer.

I can confirm that the code from #3728 no longer crashes.

If you wish to reproduce the crash, you can do so using llama.cpp code from master (22c69a27), but only if you convert the model without USER_DEFINED, just using NORMAL tokens (i.e. using master's version of the conversion script). Note that code on master does not crash when running with a model converted by #3728.

And interestingly, #3728 does not crash even if only NORMAL tokens are output.

goerch commented 11 months ago

And interestingly, #3728 does not crash even if only NORMAL tokens are output.

Then the following old code in find_bpe_rank (which I didn't understand and therefore didn't touch before)

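        // normal BPE tokens store space and newline byte-encoded as "Ġ" (\u0120) and "Ċ" (\u010A); rewrite the raw characters before the merge-rank lookup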
        replace_all(token_left,  " ",  "\u0120");
        replace_all(token_left,  "\n", "\u010A");
        replace_all(token_right, " ",  "\u0120");
        replace_all(token_right, "\n", "\u010A");

could be the culprit.

And interestingly, https://github.com/ggerganov/llama.cpp/issues/3728 does not crash even if only NORMAL tokens are output.

Not sure if we should revert the changes in convert-mpt-hf-to-gguf.py or if they are in line with @staviq's special token handling (#3538)? Does the tokenizer test for MPT still work for all NORMAL tokens?

goerch commented 11 months ago

Just another smoke test:

perplexity: calculating perplexity over 3 chunks, batch_size=512
perplexity: 961.45 seconds per pass - ETA 48.07 minutes
[1]6.3847,[2]7.5359,[3]8.0378,
Final estimate: PPL = 8.0378 +/- 0.76111

Galunid commented 11 months ago

@goerch I ran into the same issue in #3586. A simple way to reproduce it is to just pass a double space as part of the prompt:

#10 0x00005555555bcc54 in unicode_to_bytes_bpe (utf8=" ") at /home/kris/llama2/unicode.h:460
#11 0x00005555555eff1c in llama_decode_text (text="  ") at llama.cpp:10234
#12 0x00005555555f023f in llama_token_to_piece (model=0x555555c9dc50, token=50276, buf=0x55555599e390 "", 
    length=8) at llama.cpp:10277
#13 0x000055555565016d in llama_token_to_piece[abi:cxx11](llama_context const*, int) (ctx=0x555556c655c0, 
    token=50276) at common/common.cpp:930
#14 0x000055555556bdb8 in LOG_TOKENS_TOSTR_PRETTY<llama_context*, std::vector<int, std::allocator<int> > > (
    ctx=@0x7fffffffbe68: 0x555556c655c0, tokens=std::vector of length 5, capacity 19 = {...}) at common/log.h:597
#15 0x0000555555561867 in main (argc=23, argv=0x7fffffffdb68) at examples/main/main.cpp:247

goerch commented 11 months ago

Does the tokenizer test for MPT still work for all NORMAL tokens?

It doesn't seem so.

goerch commented 11 months ago

I ran into the same issue in #3586

Without looking into StableLM I'd assume you have a base tokenizer plus added tokens from fine-tuning. Our handling of added tokens still looks somewhat fragile (on the conversion and testing side).

staviq commented 11 months ago

@goerch I ran into the same issue in #3586. A simple way to reproduce it is to just pass a double space as part of the prompt:

#10 0x00005555555bcc54 in unicode_to_bytes_bpe (utf8=" ") at /home/kris/llama2/unicode.h:460
#11 0x00005555555eff1c in llama_decode_text (text="  ") at llama.cpp:10234
#12 0x00005555555f023f in llama_token_to_piece (model=0x555555c9dc50, token=50276, buf=0x55555599e390 "", 
    length=8) at llama.cpp:10277
#13 0x000055555565016d in llama_token_to_piece[abi:cxx11](llama_context const*, int) (ctx=0x555556c655c0, 
    token=50276) at common/common.cpp:930
#14 0x000055555556bdb8 in LOG_TOKENS_TOSTR_PRETTY<llama_context*, std::vector<int, std::allocator<int> > > (
    ctx=@0x7fffffffbe68: 0x555556c655c0, tokens=std::vector of length 5, capacity 19 = {...}) at common/log.h:597
#15 0x0000555555561867 in main (argc=23, argv=0x7fffffffdb68) at examples/main/main.cpp:247

Oh, I forgot about that: the tokenized prompt gets detokenized immediately when it's printed to the log. This seems like a nice fast way to reproduce the crash without having to let it generate.

The problem (this particular one) is still with the detokenizer rather than tokenizer.

Edit: Vocab preprocessing adds a single line to the output with special token stats and a verification result; you can use it to quickly check whether a model's vocab is correct:

llama_model_loader: - type q8_0:  130 tensors
llm_load_vocab: mismatch in special tokens definition ( 159/50432 vs 0/50432 ).
llm_load_print_meta: format           = GGUF V2 (latest)

goerch commented 11 months ago

The problem (this particular one) is still with the detokenizer rather than tokenizer.

Or with token classification in the conversion?

Galunid commented 11 months ago

Without looking into StableLM I'd assume you have a base tokenizer plus added tokens from fine-tuning. Our handling of added tokens still looks somewhat fragile (on the conversion and testing side).

That seems to be the case

tokenizer.json (added tokens only):

```json
"added_tokens": [
  { "id": 0, "content": "<|endoftext|>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": false, "special": true },
  { "id": 1, "content": "<|padding|>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": false, "special": true },
  { "id": 50254, "content": " ", "single_word": false, "lstrip": false, "rstrip": false, "normalized": true, "special": false },
  { "id": 50255, "content": " ", "single_word": false, "lstrip": false, "rstrip": false, "normalized": true, "special": false },
  { "id": 50256, "content": " ", "single_word": false, "lstrip": false, "rstrip": false, "normalized": true, "special": false },
  { "id": 50257, "content": " ", "single_word": false, "lstrip": false, "rstrip": false, "normalized": true, "special": false },
  { "id": 50258, "content": " ", "single_word": false, "lstrip": false, "rstrip": false, "normalized": true, "special": false },
  { "id": 50259, "content": " ", "single_word": false, "lstrip": false, "rstrip": false, "normalized": true, "special": false },
  { "id": 50260, "content": " ", "single_word": false, "lstrip": false, "rstrip": false, "normalized": true, "special": false },
  { "id": 50261, "content": " ", "single_word": false, "lstrip": false, "rstrip": false, "normalized": true, "special": false },
  { "id": 50262, "content": " ", "single_word": false, "lstrip": false, "rstrip": false, "normalized": true, "special": false },
  { "id": 50263, "content": " ", "single_word": false, "lstrip": false, "rstrip": false, "normalized": true, "special": false },
  { "id": 50264, "content": " ", "single_word": false, "lstrip": false, "rstrip": false, "normalized": true, "special": false },
  { "id": 50265, "content": " ", "single_word": false, "lstrip": false, "rstrip": false, "normalized": true, "special": false },
  { "id": 50266, "content": " ", "single_word": false, "lstrip": false, "rstrip": false, "normalized": true, "special": false },
  { "id": 50267, "content": " ", "single_word": false, "lstrip": false, "rstrip": false, "normalized": true, "special": false },
  { "id": 50268, "content": " ", "single_word": false, "lstrip": false, "rstrip": false, "normalized": true, "special": false },
  { "id": 50269, "content": " ", "single_word": false, "lstrip": false, "rstrip": false, "normalized": true, "special": false },
  { "id": 50270, "content": " ", "single_word": false, "lstrip": false, "rstrip": false, "normalized": true, "special": false },
  { "id": 50271, "content": " ", "single_word": false, "lstrip": false, "rstrip": false, "normalized": true, "special": false },
  { "id": 50272, "content": " ", "single_word": false, "lstrip": false, "rstrip": false, "normalized": true, "special": false },
  { "id": 50273, "content": " ", "single_word": false, "lstrip": false, "rstrip": false, "normalized": true, "special": false },
  { "id": 50274, "content": " ", "single_word": false, "lstrip": false, "rstrip": false, "normalized": true, "special": false },
  { "id": 50275, "content": " ", "single_word": false, "lstrip": false, "rstrip": false, "normalized": true, "special": false },
  { "id": 50276, "content": " ", "single_word": false, "lstrip": false, "rstrip": false, "normalized": true, "special": false }
],
```

goerch commented 11 months ago

That seems to be the case

From the looks of it #3728 could be applicable.

staviq commented 11 months ago

The problem (this particular one) is still with the detokenizer rather than tokenizer.

Or with token classification in the conversion?

From what I understand, yes, that would be the root cause, and the detokenizer is where it actually becomes a problem.

goerch commented 11 months ago

From what I understand, yes, that would be the root cause, and the detokenizer is where it actually becomes a problem.

I looked at this code and have doubts about whether we should count CONTROL and BYTE tokens as special tokens, for example. No need to change this yet, just a heads up. I'll have to understand which tokens you add to the special tokens cache.

staviq commented 11 months ago

From what I understand, yes, that would be the root cause, and the detokenizer is where it actually becomes a problem.

I looked at this code and have doubts about whether we should count CONTROL and BYTE tokens as special tokens, for example. No need to change this yet, just a heads up. I'll have to understand which tokens you add to the special tokens cache.

That bit of code is a simple sanity check; currently it is only used to print to the log whether the token labels in the vocab match the manually extracted special tokens.

I still consider this approach of manually extracting special tokens from the vocab to be nothing more than a stable workaround and an eventual fallback solution.

The method I used is quite simple:

Iterate over the entire vocab and, for each token, take its string representation.

For that string representation, split it into two substrings (at the 1st byte, 2nd byte, ..., strlen-1 byte) and check whether, for any of those splits, both halves have a matching token in the vocab; if not, mark it as a special token.

That's pretty much it.
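
A small Python sketch of that heuristic, just to illustrate the description above (not the actual llama.cpp implementation; it splits by character rather than by byte for simplicity):

def guess_special_tokens(vocab):
    token_set = set(vocab)
    special = []
    for text in vocab:
        # try every split point; if no split yields two halves that are both
        # themselves tokens, the text cannot come out of ordinary BPE merges
        splittable = any(text[:i] in token_set and text[i:] in token_set
                         for i in range(1, len(text)))
        if not splittable:
            special.append(text)  # note: 1-character tokens have no split points and land here too
    return special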

goerch commented 11 months ago

Are we good here for now? I intend to follow up on the multiple space issue. Should we track it here?

github-actions[bot] commented 5 months ago

This issue was closed because it has been inactive for 14 days since being marked as stale.