QwenLM / Qwen2-VL

Qwen2-VL is the multimodal large language model series developed by Qwen team, Alibaba Cloud.

Confusion about the effect of special tokens on model fine-tuning. #537

Open · Davidwhw opened 1 week ago

Davidwhw commented 1 week ago

What role do special tokens play in LLM training? The tokenizer.json file of Qwen2-VL contains special tokens such as <|box_start|> and <|box_end|>, shown below, which are designated for specific downstream tasks; their functions are also explained in the Qwen2-VL paper.

    {
      "id": 151648,
      "content": "<|box_start|>",
      "single_word": false,
      "lstrip": false,
      "rstrip": false,
      "normalized": false,
      "special": true
    },
    {
      "id": 151649,
      "content": "<|box_end|>",
      "single_word": false,
      "lstrip": false,
      "rstrip": false,
      "normalized": false,
      "special": true
    },
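
For reference, these entries can be checked directly with the Hugging Face tokenizer. Below is a minimal sketch, assuming the public Qwen/Qwen2-VL-7B-Instruct checkpoint and the transformers library:

    from transformers import AutoTokenizer

    # Load the Qwen2-VL tokenizer (assumption: the public 7B-Instruct checkpoint).
    tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-VL-7B-Instruct")

    # Each registered special token maps to exactly one ID, matching tokenizer.json.
    print(tokenizer.convert_tokens_to_ids("<|box_start|>"))  # 151648
    print(tokenizer.convert_tokens_to_ids("<|box_end|>"))    # 151649

    # Because "special" is true, the marker is matched before BPE and never split.
    print(tokenizer.tokenize("<|box_start|>(12,34),(56,78)<|box_end|>"))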

While fine-tuning Qwen2-VL on a downstream task in a domain the model has never encountered before, several questions about the use of special tokens came up:

  1. Are these task-related special tokens designed to efficiently steer a general-purpose model toward a specific task? In other words, does using special tokens mean the model can learn a new downstream task quickly with less training data?
  2. For a downstream task in a domain the model has never encountered, is it necessary to add a special token dedicated to that task to my question-response training set?
  3. If a custom special token needs to be defined for a new task, what is the difference between the following two approaches (see the sketch after this list)?
    • Inserting the designed token <|xxxxx|> directly into the instructions as plain text, without changing the tokenizer settings.
    • Adding the token <|xxxxx|> to the added_tokens section of tokenizer.json, as done in the Qwen2-VL paper.
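
On question 3, the mechanical difference between the two approaches can be illustrated with a short sketch. This assumes the transformers API and the hypothetical task token <|xxxxx|> from the question; it is not the exact procedure used by the Qwen team. A plain-text marker is split into ordinary subword pieces, whereas a registered special token becomes a single new ID whose embedding must be created by resizing the embedding matrix:

    from transformers import AutoTokenizer, Qwen2VLForConditionalGeneration

    tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-VL-7B-Instruct")

    # Approach 1: plain text only. BPE splits the marker into several existing
    # subword pieces, so the model sees an ordinary character sequence rather
    # than one dedicated symbol.
    print(tokenizer.tokenize("<|xxxxx|>"))  # e.g. ['<', '|', 'xxxxx', '|', '>']

    # Approach 2: register the marker as an additional special token.
    tokenizer.add_special_tokens({"additional_special_tokens": ["<|xxxxx|>"]})
    print(tokenizer.tokenize("<|xxxxx|>"))  # ['<|xxxxx|>'] -- one dedicated token

    # The new ID has no pretrained embedding, so the embedding matrix must be
    # resized before fine-tuning; the new row starts from a fresh initialization.
    model = Qwen2VLForConditionalGeneration.from_pretrained("Qwen/Qwen2-VL-7B-Instruct")
    model.resize_token_embeddings(len(tokenizer))

One practical consequence of this difference: the plain-text variant reuses subword embeddings the model already knows, while the registered token gets a brand-new embedding that has to be learned entirely from the fine-tuning data, so it typically needs enough examples for that row to converge.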