Can huggingface/tokenizers be supported?

Liangdi commented 12 months ago

I've recently been using candle and would like to add support for the Yi series. candle uses the https://github.com/huggingface/tokenizers library, which requires a tokenizer.json file. The Yi series does not ship this file, whereas some other models do, for example https://huggingface.co/bert-base-chinese and https://huggingface.co/Salesforce/blip-image-captioning-large. Judging from the transformers documentation, this seems to correspond to the fast tokenizers module: https://huggingface.co/docs/transformers/fast_tokenizers

When I asked about ChatGLM earlier, the candle team replied as linked below. Could the Yi series support this? candle issue: https://github.com/huggingface/candle/issues/1177#issuecomment-1789550037

Below is the convert_slow_tokenizer.py that candle modified to support marian-mt: https://github.com/huggingface/candle/blob/main/candle-examples/examples/marian-mt/convert_slow_tokenizer.py#L1262C32-L1262C32

ZhaoFancy commented 12 months ago

I'm not sure whether this is easy to support; we need to discuss it internally. (The last link is very useful.)

Liangdi commented 11 months ago

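> I'm not sure whether this is easy to support; we need to discuss it internally. (The last link is very useful.)

Looking forward to the support. By the way, is there a WeChat technical discussion group in China?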

ZhaoFancy commented 11 months ago

> Is there a WeChat technical discussion group in China?

Not at the moment. You can vote here: https://github.com/01-ai/Yi/discussions/51

loofahcus commented 11 months ago

> I've recently been using candle and would like to add support for the Yi series. candle uses the https://github.com/huggingface/tokenizers library, which requires a tokenizer.json file. The Yi series does not ship this file, whereas some other models do, for example https://huggingface.co/bert-base-chinese and https://huggingface.co/Salesforce/blip-image-captioning-large. Judging from the transformers documentation, this seems to correspond to the fast tokenizers module: https://huggingface.co/docs/transformers/fast_tokenizers
>
> When I asked about ChatGLM earlier, the candle team replied as linked below. Could the Yi series support this? candle issue: huggingface/candle#1177 (comment)
>
> Some related code in transformers: https://github.com/huggingface/transformers/blob/main/src/transformers/convert_slow_tokenizer.py
>
> Below is the convert_slow_tokenizer.py that candle modified to support marian-mt: https://github.com/huggingface/candle/blob/main/candle-examples/examples/marian-mt/convert_slow_tokenizer.py#L1262C32-L1262C32

I'll look into the tokenizer.json issue. Please give me a moment, thanks.

loofahcus commented 11 months ago

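tokenizer.json @Liangdi Could you help me test whether this works with candle? I tried it briefly, and the test cases I have run so far behave as expected. However, I'm not familiar with candle, so I couldn't test it thoroughly.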

Liangdi commented 11 months ago

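> tokenizer.json @Liangdi Could you help me test whether this works with candle? I tried it briefly, and the test cases I have run so far behave as expected. However, I'm not familiar with candle, so I couldn't test it thoroughly.

@loofahcus I tested it with a variety of Chinese and English inputs, and the results match the Python tokenizer. Great! Could you publish the conversion script? I'll go ahead and try adapting Yi-6B with candle.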
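A parity check like the one described above could be sketched as follows. This is only an illustration, not the script that was actually used: `find_mismatches`, `SAMPLES`, and the two stand-in encoders are hypothetical names. In practice `reference` would wrap the original Python tokenizer's encode call and `candidate` the tokenizer loaded from tokenizer.json.

```python
from typing import Callable, Iterable, List, Tuple

# An encoder maps a string to a list of token ids.
Encoder = Callable[[str], List[int]]


def find_mismatches(
    reference: Encoder, candidate: Encoder, samples: Iterable[str]
) -> List[Tuple[str, List[int], List[int]]]:
    """Return (text, reference_ids, candidate_ids) for every sample that differs."""
    mismatches = []
    for text in samples:
        expected = reference(text)
        actual = candidate(text)
        if expected != actual:
            mismatches.append((text, expected, actual))
    return mismatches


# Hypothetical samples mixing Chinese, English, and whitespace edge cases.
SAMPLES = [
    "你好，世界",
    "Hello world",
    "Hello  world",      # two consecutive spaces
    " leading space",
    "中英 mixed 文本",
    "",
]


# Stand-in encoders for demonstration only (not real tokenizers).
def codepoint_encoder(text: str) -> List[int]:
    return [ord(ch) for ch in text]


def collapsing_encoder(text: str) -> List[int]:
    # Deliberately mishandles whitespace to show what a mismatch looks like.
    return [ord(ch) for ch in " ".join(text.split())]


if __name__ == "__main__":
    for text, expected, actual in find_mismatches(
        codepoint_encoder, collapsing_encoder, SAMPLES
    ):
        print(repr(text), len(expected), len(actual))
```

Including inputs with repeated or leading spaces is useful because whitespace handling is where converted tokenizers tend to diverge.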

ericzhou571 commented 11 months ago

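Could you tell us what changes this converter makes compared with the Llama converter? https://github.com/huggingface/transformers/blob/04ab5605fbb4ef207b10bf2772d88c53fc242e83/src/transformers/convert_slow_tokenizer.py#L1098 Starting from the Llama converter, we changed all the special tokens in the conversion script to Yi's. Chinese characters tokenize correctly, but inputs containing whitespace still produce results that differ from the native Yi tokenizer.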

ericzhou571 commented 11 months ago

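> tokenizer.json @Liangdi Could you help me test whether this works with candle? I tried it briefly, and the test cases I have run so far behave as expected. However, I'm not familiar with candle, so I couldn't test it thoroughly.

One more question: when loading this with the transformers fast tokenizer, which class should be used? Should we use PreTrainedTokenizerFast directly? 🥹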

loofahcus commented 11 months ago

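# Imports needed by the snippet below.
from packaging import version
from tokenizers import AddedToken, Tokenizer, decoders, normalizers
from tokenizers.models import BPE, Unigram
from transformers.convert_slow_tokenizer import SentencePieceExtractor, SpmConverter


class YiConverter(SpmConverter):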
    handle_byte_fallback = True

    def decoder(self, replacement, add_prefix_space):
        return decoders.Sequence(
            [
                decoders.Replace("▁", " "),
                decoders.ByteFallback(),
                decoders.Fuse(),
            ]
        )

    def tokenizer(self, proto):
        model_type = proto.trainer_spec.model_type
        vocab_scores = self.vocab(proto)
        if model_type == 1:
            import tokenizers

            if version.parse(tokenizers.__version__) < version.parse("0.14.0"):
                tokenizer = Tokenizer(Unigram(vocab_scores, 0))
            else:
                tokenizer = Tokenizer(Unigram(vocab_scores, 0, byte_fallback=True))

        elif model_type == 2:
            _, merges = SentencePieceExtractor(self.original_tokenizer.vocab_file).extract(vocab_scores)
            bpe_vocab = {word: i for i, (word, _score) in enumerate(vocab_scores)}
            tokenizer = Tokenizer(
                BPE(bpe_vocab, merges, unk_token=proto.trainer_spec.unk_piece, fuse_unk=True, byte_fallback=True)
            )
            tokenizer.add_special_tokens(
                [
                    AddedToken("<unk>", normalized=False, special=True),
                    AddedToken("<|startoftext|>", normalized=False, special=True),
                    AddedToken("<|endoftext|>", normalized=False, special=True),
                ]
            )
        else:
            raise Exception(
                "You're trying to run a `Unigram` or `BPE` model but your file was trained with a different algorithm"
            )

        return tokenizer

    def normalizer(self, proto):
        return normalizers.Sequence([normalizers.Replace(pattern=" ", content="▁")])

    def pre_tokenizer(self, replacement, add_prefix_space):
        return None

@Liangdi @ericzhou571 For your reference. Thanks.
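To make the whitespace handling in the converter above easier to follow, here is a rough pure-Python sketch of what its normalizer and decoder steps appear to do. It is a simplified illustration, not the tokenizers library's implementation: `normalize` and `decode` are hypothetical helper names, and invalid byte sequences are handled with Python's `errors="replace"`, which may differ from the library's behavior.

```python
import re
from typing import List

# Byte-fallback tokens look like "<0xE4>": one token per raw UTF-8 byte.
_BYTE_TOKEN = re.compile(r"<0x([0-9A-Fa-f]{2})>")


def normalize(text: str) -> str:
    """Mirror normalizers.Replace(" ", "▁"): every space becomes "▁".

    No prefix space is added and nothing is stripped, so runs of spaces
    and leading spaces are preserved one-to-one.
    """
    return text.replace(" ", "▁")


def decode(tokens: List[str]) -> str:
    """Approximate the Replace -> ByteFallback -> Fuse decoder chain."""
    pieces: List[str] = []
    pending = bytearray()  # consecutive byte tokens waiting to be decoded

    def flush() -> None:
        if pending:
            pieces.append(pending.decode("utf-8", errors="replace"))
            pending.clear()

    for token in tokens:
        match = _BYTE_TOKEN.fullmatch(token)
        if match:
            pending.append(int(match.group(1), 16))
        else:
            flush()
            pieces.append(token.replace("▁", " "))
    flush()
    return "".join(pieces)  # the Fuse step: join everything into one string


if __name__ == "__main__":
    print(normalize("a  b"))
    print(decode(["Hello", "▁world", "<0xE4>", "<0xBD>", "<0xA0>"]))
```

Because the normalizer only substitutes spaces and the pre-tokenizer is disabled, whitespace round-trips unchanged in this sketch, which is the property to check when comparing against the native tokenizer.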

loofahcus commented 11 months ago

I will close this issue, feel free to reopen this issue or start a new one if you need any further assistance.

Liangdi commented 11 months ago

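> I will close this issue, feel free to reopen this issue or start a new one if you need any further assistance.

Thanks. We have already started working on candle support. You could submit the corresponding tokenizer.json to the HF and ModelScope repositories so that other developers can use it directly.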

01-ai / Yi

Can huggingface/tokenizers be supported? #24