Question about the Tokenizer

HarderThenHarder / transformers_tasks

⭐️ NLP Algorithms with transformers lib. Supporting Text-Classification, Text-Generation, Information-Extraction, Text-Matching, RLHF, SFT etc.

2.15k stars 380 forks

Open magnificent1208 opened 1 year ago

magnificent1208 commented 1 year ago

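In my custom jsonl file, symbols such as （） are not recognized. As I understand it, this repo follows the BERT token format, so could you explain the specific logic? Thanks.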

HarderThenHarder commented 1 year ago

Hi, if you need to extend the special tokens, you can try the following approach:

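# Tokens to add to the tokenizer's vocabulary (full-width parentheses)
special_tokens = ['（', '）']
# Register them as special tokens so they are kept as whole tokens
tokenizer.add_tokens(special_tokens, special_tokens=True)
# Resize the model's embedding matrix to match the new vocabulary size
model.resize_token_embeddings(len(tokenizer))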
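
On the "specific logic" part of the question, here is a rough sketch of why a missing symbol gets lost and why adding it helps. It assumes a BERT-style WordPiece tokenizer, as the question suggests, and it is a toy, not the transformers implementation: the names `wordpiece`, `tokenize`, and the tiny `vocab` are made up for illustration. As far as I understand, the real library keeps added tokens in a separate table and matches them before WordPiece runs, whereas the toy simply extends the vocabulary set.

```python
# Toy illustration only: greedy longest-match-first WordPiece lookup.
# This is NOT the transformers code; every name here is hypothetical.

UNK = "[UNK]"


def wordpiece(word, vocab):
    """Split one word into sub-tokens; return [UNK] if any piece is missing."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation-piece prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [UNK]  # one missing piece makes the whole word unknown
        pieces.append(match)
        start = end
    return pieces


def tokenize(text, vocab):
    """Treat every character as its own word, which is roughly how
    BERT-style pre-tokenization handles CJK characters and punctuation."""
    tokens = []
    for ch in text:
        tokens.extend(wordpiece(ch, vocab))
    return tokens


# Assumed mini-vocabulary that lacks the full-width parentheses.
vocab = {"测", "试", "play", "##ing", UNK}
before = tokenize("（测试）", vocab)

# Toy stand-in for tokenizer.add_tokens: extend the vocabulary.
vocab |= {"（", "）"}
after = tokenize("（测试）", vocab)

print(before)
print(after)
print(wordpiece("playing", vocab))
```

In the real setup the new ids also need embedding rows, which is what the `resize_token_embeddings` call above is for; those rows start out untrained, so some fine-tuning on data containing the new symbols is usually needed before they carry meaning.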