QwenLM / Qwen

The official repo of Qwen (通义千问) chat & pretrained large language model proposed by Alibaba Cloud.

[BUG] Qwen's huggingface tokenizer returns `bytes`-typed tokens, not `str`-typed tokens #1151

Closed: silverriver closed this issue 6 months ago

silverriver commented 6 months ago

Is there an existing issue / discussion for this?

Is there an existing answer for this in the FAQ?

Current Behavior

Regular tokens in Qwen's tokenizer are represented as bytes. The Hugging Face tokenizer implemented in Qwen's hf model returns `bytes`-typed tokens: https://huggingface.co/Qwen/Qwen-7B/blob/ef3c5c9c57b252f3149c1408daf4d649ec8b6c85/tokenization_qwen.py#L136

    def get_vocab(self) -> Dict[bytes, int]:
        return self.mergeable_ranks
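
A minimal repro sketch of the bytes-typed keys (assuming network access to the Hub and passing trust_remote_code so the remote QWenTokenizer code is used):

    from transformers import AutoTokenizer

    # Loads the remote QWenTokenizer implementation shipped with the model repo.
    tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-7B", trust_remote_code=True)

    token = next(iter(tokenizer.get_vocab()))
    print(type(token))  # <class 'bytes'>, not <class 'str'>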

However, the Hugging Face tokenizer interface uses `str`-typed tokens: https://github.com/huggingface/transformers/blob/092f1fdaa4224fdd88c616dc9678e6fcb37bfffd/src/transformers/tokenization_utils_base.py#L1666

    def get_vocab(self) -> Dict[str, int]:
        """
        Returns the vocabulary as a dictionary of token to index.

        `tokenizer.get_vocab()[token]` is equivalent to `tokenizer.convert_tokens_to_ids(token)` when `token` is in the
        vocab.

        Returns:
            `Dict[str, int]`: The vocabulary.
        """
        raise NotImplementedError()

Some applications assume `str`-typed tokens by default, for example: https://github.com/outlines-dev/outlines/blob/6484d8c5439fa0744656bcc05794592635f4533c/outlines/integrations/utils.py#L59
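
As a hypothetical illustration (not outlines' actual code), typical str-oriented token cleanup breaks on a bytes-keyed vocab:

    # Hypothetical example: a bytes-keyed vocab, as QWenTokenizer.get_vocab() returns.
    vocab = {b"Hello": 0}

    token = next(iter(vocab))
    try:
        # Typical str-based cleanup, e.g. replacing SentencePiece's "▁" marker:
        token.replace("\u2581", " ")
    except TypeError as err:
        print(err)  # a bytes-like object is required, not 'str'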

Expected Behavior

Use `str`-typed tokens in the hf implementation.
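
One lossless way to do this (a sketch for illustration, not the repo's actual change) is to reuse the GPT-2 byte-to-unicode table, which maps every byte to a distinct printable character; Qwen2Tokenizer later adopted the same encoding:

    from transformers.models.gpt2.tokenization_gpt2 import bytes_to_unicode

    BYTE_ENCODER = bytes_to_unicode()  # dict: byte value (int) -> unique unicode char

    def vocab_bytes_to_str(vocab):
        """Re-key a bytes-keyed vocab with lossless str tokens (hypothetical helper)."""
        return {"".join(BYTE_ENCODER[b] for b in token): rank
                for token, rank in vocab.items()}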

Steps To Reproduce

n/a

Environment

n/a

Anything else?

n/a

jklj077 commented 6 months ago

This is documented in https://github.com/QwenLM/Qwen/blob/main/tokenization_note.md#regular-tokens.

You can use Qwen2Tokenizer instead if you need tokens as `str`. (Please be aware that, due to the tokenization mechanism, those tokens are byte-level encoded; you need to decode them to get the actual string, as is done in the original GPT-2 tokenizer.)
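
For example, a sketch of that decode step (assuming the Qwen1.5 tokenizer files on the Hub; the slow Qwen2Tokenizer exposes the same byte_decoder mapping as the GPT-2 tokenizer):

    from transformers import Qwen2Tokenizer

    tokenizer = Qwen2Tokenizer.from_pretrained("Qwen/Qwen1.5-72B")
    tokens = tokenizer.tokenize("通义千问")
    print(tokens)  # str tokens, but byte-level encoded and not human-readable yet

    # Map each character back to its byte, then decode as UTF-8, mirroring
    # what the original GPT-2 tokenizer does in convert_tokens_to_string.
    raw = bytes(tokenizer.byte_decoder[ch] for ch in "".join(tokens))
    print(raw.decode("utf-8"))  # 通义千问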

silverriver commented 6 months ago

Thank you for your response.

Qwen2Tokenizer seems to work in my case. However, I have a few questions regarding this tokenizer.

  1. Can I use Qwen2Tokenizer to load the tokenizer provided in "https://huggingface.co/Qwen/Qwen-72B"? If not, how can I convert those tiktoken-format tokenizer files into a format Qwen2Tokenizer can read?
  2. Is the tokenizer used in "https://huggingface.co/Qwen/Qwen1.5-72B" the same as the one provided in "https://huggingface.co/Qwen/Qwen-72B"?
  3. Why is the tokenizer named Qwen2 while the models are named Qwen or Qwen1.5? Where is the Qwen2 model family?

Thanks in advance.

jklj077 commented 6 months ago
  1. Unfortunately, it is not possible to use the Qwen2Tokenizer class to load QWenTokenizer files, or vice versa. It is also not recommended to mix Qwen and Qwen2 code. For better compatibility with the transformers ecosystem, we advise you to upgrade to Qwen2.

  2. The vocabularies should be considered the same. The code implementations differ, and some function signatures differ as well (see the sketch after this list for a quick spot-check).

  3. Qwen1.5 is the beta version of Qwen2, as stated in https://github.com/QwenLM/Qwen1.5#introduction.
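
A quick spot-check sketch for point 2 (assuming both repos are reachable; QWenTokenizer needs trust_remote_code):

    from transformers import AutoTokenizer

    qwen = AutoTokenizer.from_pretrained("Qwen/Qwen-72B", trust_remote_code=True)
    qwen15 = AutoTokenizer.from_pretrained("Qwen/Qwen1.5-72B")

    text = "Hello, 通义千问!"
    print(qwen.encode(text))
    print(qwen15.encode(text))  # the ids should match if the vocabularies agree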