OpenGVLab / InternVL

[CVPR 2024 Oral] InternVL Family: A Pioneering Open-Source Alternative to GPT-4V. A commercially usable open-source multimodal dialogue model approaching GPT-4V performance.
https://internvl.github.io/
MIT License

Why fast tokenizer is disabled? #301

Open dyang415 opened 1 week ago

dyang415 commented 1 week ago

Hi there, nice work on InternVL! We're really impressed by the new InternVL-V1.5.

One thing we noticed is that the backing language model, internlm/internlm2-chat-20b, ships with a fast tokenizer (https://huggingface.co/internlm/internlm2-chat-20b/blob/main/tokenizer_config.json#L89). However, in InternVL the fast tokenizer was removed (https://huggingface.co/OpenGVLab/InternVL-Chat-V1-5/blob/main/tokenizer_config.json#L162). Is there a specific reason the fast tokenizer isn't enabled?

Weiyun1025 commented 5 days ago

We previously discovered that the tokenization results of the fast tokenizer sometimes differed from those of the slow tokenizer. Since the speed benefit of the fast tokenizer is not significant in our scenario, we decided not to use it, to ensure the correctness of the code.
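For anyone wanting to verify this kind of discrepancy themselves, here is a minimal, hedged sketch (not InternVL's actual code) of a consistency check between two tokenizers. The `fast` and `slow` stand-ins below are hypothetical toy functions for illustration; with `transformers` you would instead load the real pair via `AutoTokenizer.from_pretrained(model_id, use_fast=True)` and `use_fast=False` and compare their `.encode()` outputs on a corpus:

```python
# Hedged sketch: check whether a "fast" (Rust-backed) and a "slow"
# (pure-Python) tokenizer produce identical token IDs on sample inputs.
from typing import Callable, List

def find_disagreements(
    tokenize_fast: Callable[[str], List[int]],
    tokenize_slow: Callable[[str], List[int]],
    samples: List[str],
) -> List[str]:
    """Return the sample strings on which the two tokenizers disagree."""
    return [s for s in samples if tokenize_fast(s) != tokenize_slow(s)]

# Toy stand-in tokenizers (hypothetical, for illustration only):
# the "slow" one strips surrounding whitespace, the "fast" one does not,
# mimicking the kind of subtle mismatch described above.
fast = lambda s: [ord(c) for c in s]
slow = lambda s: [ord(c) for c in s.strip()]

mismatches = find_disagreements(fast, slow, ["hello", " hello "])
# " hello " is flagged because the whitespace handling differs.
```

Running such a check over representative prompts (especially ones with unusual whitespace or special tokens) is a quick way to decide whether enabling the fast tokenizer is safe for a given model.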