SkyworkAI / Skywork

Skywork series models are pre-trained on 3.2TB of high-quality multilingual (mainly Chinese and English) and code data. We have open-sourced the models, training data, evaluation data, and evaluation methods.

legacy behaviour of the <SkyworkTokenizer'> This means that tokens that come after special tokens will not be properly handled. #16

Closed ericzhou571 closed 10 months ago

ericzhou571 commented 10 months ago

When loading the tokenizer with transformers.AutoTokenizer, we receive this warning: You are using the legacy behaviour of the <class 'transformers_modules.Skywork.Skywork-13B-base.98a59dec44df3a8fd8fcd4bac07e94db35219eb1.tokenization_skywork.SkyworkTokenizer'> This means that tokens that come after special tokens will not be properly handled.

We have already updated transformers from 4.31.0 to 4.34.0, but we see the same warning in both versions.

ericzhou571 commented 10 months ago

Does anybody else face the same problem?

zhao1iang commented 10 months ago

You can set legacy=False when loading the tokenizer, as shown below, to suppress this warning. In our testing, this made no difference to the results.

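from transformers import AutoTokenizer

# Passing legacy explicitly (here False, the non-legacy handling) avoids the warning.
tokenizer = AutoTokenizer.from_pretrained(
    "skywork-tokenizer-path", legacy=False, use_fast=False
)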
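
If you want to check the "no difference" claim on your own inputs, one option is to encode the same texts with both settings and compare the token ids. The sketch below is only an illustration: `find_encoding_mismatches` and `load_pair` are hypothetical helper names, not part of Skywork or transformers, and passing `trust_remote_code=True` is an assumption based on the tokenizer class being loaded from `transformers_modules`.

```python
from typing import Iterable, List, Tuple


def find_encoding_mismatches(tok_a, tok_b, texts: Iterable[str]) -> List[Tuple[str, list, list]]:
    """Return (text, ids_a, ids_b) for every text the two tokenizers encode differently.

    Both arguments only need an encode(text) method returning a list of ids.
    """
    mismatches = []
    for text in texts:
        ids_a = tok_a.encode(text)
        ids_b = tok_b.encode(text)
        if ids_a != ids_b:
            mismatches.append((text, ids_a, ids_b))
    return mismatches


def load_pair(path: str):
    """Load the same tokenizer twice: once with legacy=True, once with legacy=False.

    Hypothetical helper; requires transformers to be installed.
    """
    # Imported lazily so find_encoding_mismatches works without transformers.
    from transformers import AutoTokenizer

    # Assumption: the Skywork tokenizer ships custom code and needs trust_remote_code.
    kwargs = dict(use_fast=False, trust_remote_code=True)
    legacy_tok = AutoTokenizer.from_pretrained(path, legacy=True, **kwargs)
    new_tok = AutoTokenizer.from_pretrained(path, legacy=False, **kwargs)
    return legacy_tok, new_tok


# Example usage (not run here; the path is the placeholder from the reply above):
# legacy_tok, new_tok = load_pair("skywork-tokenizer-path")
# samples = ["Hello world", "</s>Hello world"]  # include text right after a special token
# print(find_encoding_mismatches(legacy_tok, new_tok, samples))
```

An empty list means both settings produced identical ids for every sample. Texts that place ordinary words directly after a special token are the ones the warning is about, so they are worth including in the samples.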