InternLM / xtuner

An efficient, flexible and full-featured toolkit for fine-tuning LLM (InternLM2, Llama3, Phi3, Qwen, Mistral, ...)
https://xtuner.readthedocs.io/zh-cn/latest/
Apache License 2.0

How do I add my own vocabulary when training internlm2_20b_qlora_msagent_react_e3_gpu8? #805

Open sxk000 opened 1 month ago

sxk000 commented 1 month ago

First of all, thank you to the Shanghai AI Laboratory and its members for sharing the InternLM models, the code framework, and their technical experience!

How can I add my own vocabulary when training internlm2_20b_qlora_msagent_react_e3_gpu8?

For example, treating breed_name, area_name, etc. each as a single token.

Thanks!

hhaAndroid commented 1 month ago

One approach that requires no code changes is to simply add the following to the config:

ADD_TOKENS_DECODER = {
    "0": {
        "content": "<unk>",
        "lstrip": False,
        "normalized": False,
        "rstrip": False,
        "single_word": False,
        "special": True
    },
    "1": {
        "content": "<s>",
        "lstrip": False,
        "normalized": False,
        "rstrip": False,
        "single_word": False,
        "special": True
    },
    "2": {
        "content": "</s>",
        "lstrip": False,
        "normalized": False,
        "rstrip": False,
        "single_word": False,
        "special": True
    },
    "92538": {
        "content": "<|plugin|>",
        "lstrip": False,
        "normalized": False,
        "rstrip": False,
        "single_word": False,
        "special": True
    },
    "92539": {
        "content": "<|interpreter|>",
        "lstrip": False,
        "normalized": False,
        "rstrip": False,
        "single_word": False,
        "special": True
    },
    "92540": {
        "content": "<|action_end|>",
        "lstrip": False,
        "normalized": False,
        "rstrip": False,
        "single_word": False,
        "special": True
    },
    "92541": {
        "content": "<|action_start|>",
        "lstrip": False,
        "normalized": False,
        "rstrip": False,
        "single_word": False,
        "special": True
    },
    "92542": {
        "content": "<|im_end|>",
        "lstrip": False,
        "normalized": False,
        "rstrip": False,
        "single_word": False,
        "special": True
    },
    "92543": {
        "content": "<|im_start|>",
        "lstrip": False,
        "normalized": False,
        "rstrip": False,
        "single_word": False,
        "special": True
    },
    # add your new tokens here; just make sure the chosen ids are not already in use
    "92535": {
        "content": "breed_name",
        "lstrip": False,
        "normalized": False,
        "rstrip": False,
        "single_word": False,
        "special": True
    },
    "92536": {
        "content": "area_name",
        "lstrip": False,
        "normalized": False,
        "rstrip": False,
        "single_word": False,
        "special": True
    },
}
tokenizer = dict(
    type=AutoTokenizer.from_pretrained,
    pretrained_model_name_or_path=pretrained_model_name_or_path,
    trust_remote_code=True,
    added_tokens_decoder=ADD_TOKENS_DECODER,
    padding_side='right')
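
As a quick sanity check (a sketch, not from the original reply; it assumes the added_tokens_decoder override is actually picked up by from_pretrained), you can confirm that the new ids decode to the intended tokens:

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained(
    pretrained_model_name_or_path,
    trust_remote_code=True,
    added_tokens_decoder=ADD_TOKENS_DECODER)
# expect ['breed_name', 'area_name'] if the override took effect
print(tok.convert_ids_to_tokens([92535, 92536]))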

Note, however, that QLoRA does not train the embedding layer by default, so it is unclear how much this affects performance.
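
If the new token embeddings do need to be trained under (Q)LoRA, one thing worth trying (a sketch of a general peft feature, not something this thread confirms for xtuner; the InternLM2 module names 'tok_embeddings' and 'output' are assumptions) is marking those modules as fully trainable via modules_to_save in the LoRA config:

lora=dict(
    type=LoraConfig,
    r=64,
    lora_alpha=16,
    lora_dropout=0.1,
    # assumed InternLM2 embedding and LM-head module names
    modules_to_save=['tok_embeddings', 'output'],
    task_type='CAUSAL_LM')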

HIT-cwh commented 1 month ago

@sxk000 @hhaAndroid After modifying the tokenizer, you also need to pass the modified tokenizer into the model's configuration in the config, so that when the model is initialized, the parameters of its embedding layer and output layer are automatically expanded to match.

#######################################################################
#                      PART 2  Model & Tokenizer                      #
#######################################################################
model = dict(
+   tokenizer=tokenizer,
   ...)
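
For reference, a fuller sketch of what PART 2 could look like with the tokenizer wired in (this simply expands the diff above and assumes an xtuner version whose SupervisedFinetune accepts a tokenizer argument, together with the usual full fine-tuning fields):

model = dict(
    type=SupervisedFinetune,
    tokenizer=tokenizer,  # lets the model grow its embedding/output layers at init
    llm=dict(
        type=AutoModelForCausalLM.from_pretrained,
        pretrained_model_name_or_path=pretrained_model_name_or_path,
        trust_remote_code=True,
        torch_dtype=torch.float16))
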
KooSung commented 1 month ago

Define the newly added special tokens directly in the config and pass them to both the dataset and the model; the model then needs resize_embedding.

        for special_token in special_tokens:
            if special_token not in tokenizer.get_vocab():
                tokenizer.add_tokens([special_token], special_tokens=True)
        print(f'After adding special tokens, Vocabulary Size: {len(tokenizer)}')
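
A minimal sketch (not part of the snippet above; here `model` is assumed to be the underlying HuggingFace model) of the resize step the comment refers to:

# grow the embedding (and tied output) matrix so the newly added ids map to real parameters
model.resize_token_embeddings(len(tokenizer))
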
sxk000 commented 1 month ago

Define the newly added special tokens directly in the config and pass them to both the dataset and the model; the model then needs resize_embedding.

        for special_token in special_tokens:
            if special_token not in tokenizer.get_vocab():
                tokenizer.add_tokens([special_token], special_tokens=True)
        print(f'After adding special tokens, Vocabulary Size: {len(tokenizer)}')

Thanks! This method works!

sxk000 commented 1 month ago

@sxk000 @hhaAndroid After modifying the tokenizer, you also need to pass the modified tokenizer into the model's configuration in the config, so that when the model is initialized, the parameters of its embedding layer and output layer are automatically expanded to match.

#######################################################################
#                      PART 2  Model & Tokenizer                      #
#######################################################################
model = dict(
+   tokenizer=tokenizer,
   ...)

Thanks for the reply! After adding tokenizer=tokenizer as described, I get an error saying there is no tokenizer argument, as follows:


Traceback (most recent call last):                                                                                                                        
  File "/root/miniconda3/envs/p310xtuner/lib/python3.10/site-packages/xtuner/tools/train.py", line 307, in <module>                                       
    main()                                                                                                                                                
  File "/root/miniconda3/envs/p310xtuner/lib/python3.10/site-packages/xtuner/tools/train.py", line 303, in main                                           
    runner.train()                                                                                                                                        
  File "/root/miniconda3/envs/p310xtuner/lib/python3.10/site-packages/mmengine/runner/_flexible_runner.py", line 1182, in train                           
    self.strategy.prepare(                                                                                                                                
  File "/root/miniconda3/envs/p310xtuner/lib/python3.10/site-packages/mmengine/_strategy/deepspeed.py", line 381, in prepare                              
    model = self.build_model(model)                                                                                                                       
  File "/root/miniconda3/envs/p310xtuner/lib/python3.10/site-packages/mmengine/_strategy/base.py", line 306, in build_model                               
    model = MODELS.build(model)                                                                                                                           
  File "/root/miniconda3/envs/p310xtuner/lib/python3.10/site-packages/mmengine/registry/registry.py", line 570, in build                                  
    return self.build_func(cfg, *args, **kwargs, registry=self)                                                                                           
  File "/root/miniconda3/envs/p310xtuner/lib/python3.10/site-packages/mmengine/registry/build_functions.py", line 232, in build_model_from_cfg            
    return build_from_cfg(cfg, registry, default_args)                                                                                                    
  File "/root/miniconda3/envs/p310xtuner/lib/python3.10/site-packages/mmengine/registry/build_functions.py", line 121, in build_from_cfg                  
    obj = obj_cls(**args)  # type: ignore                                                                                                                 
TypeError: SupervisedFinetune.__init__() got an unexpected keyword argument 'tokenizer'

How should I fix this?
sxk000 commented 1 month ago

Define the newly added special tokens directly in the config and pass them to both the dataset and the model; the model then needs resize_embedding.

        for special_token in special_tokens:
            if special_token not in tokenizer.get_vocab():
                tokenizer.add_tokens([special_token], special_tokens=True)
        print(f'After adding special tokens, Vocabulary Size: {len(tokenizer)}')

A model trained this way gives degraded results.

For example, the training data is: user: Who are you? assistant: I'm the rescuer sent by the Monkey King!

After training, testing produces output like: user: Who are you? assistant: Who are you, who are you.

Without the added vocabulary, the model reproduces the training data correctly!

My steps were: use the code below to expand the original model's vocabulary, save the tokenizer and model, and then run fine-tuning on the saved, vocabulary-expanded model.

from transformers import AutoTokenizer, AutoModel

def new_token():
    pretrained_model_name_or_path = '/apply/model/original/internlm2-chat-20b'
    token_file = '/apply/data/finetune/token.txt'
    # read the new tokens, one per line
    with open(token_file, 'r', encoding='utf8') as f:
        token_list = f.readlines()
    token_list = ''.join(token_list).split('\n')
    print(token_list)
    tokenizer = AutoTokenizer.from_pretrained(pretrained_model_name_or_path, trust_remote_code=True)
    model = AutoModel.from_pretrained(pretrained_model_name_or_path, trust_remote_code=True)
    print('---1', tokenizer)
    # add each new token as a special token if it is not already in the vocab
    for token_one in token_list:
        if token_one not in tokenizer.get_vocab():
            tokenizer.add_tokens([token_one], special_tokens=True)
    # resize the embedding matrix so the new token ids have parameters
    model.resize_token_embeddings(len(tokenizer))
    print('---2', tokenizer)
    tokenizer.save_pretrained(pretrained_model_name_or_path + '-new')
    model.save_pretrained(pretrained_model_name_or_path + '-new')

if __name__ == '__main__':
    new_token()
HIT-cwh commented 1 month ago

@sxk000 @hhaAndroid After modifying the tokenizer, you also need to pass the modified tokenizer into the model's configuration in the config, so that when the model is initialized, the parameters of its embedding layer and output layer are automatically expanded to match.

#######################################################################
#                      PART 2  Model & Tokenizer                      #
#######################################################################
model = dict(
+   tokenizer=tokenizer,
   ...)

Thanks for the reply! After adding tokenizer=tokenizer as described, I get an error saying there is no tokenizer argument, as follows:

Traceback (most recent call last):                                                                                                                        
  File "/root/miniconda3/envs/p310xtuner/lib/python3.10/site-packages/xtuner/tools/train.py", line 307, in <module>                                       
    main()                                                                                                                                                
  File "/root/miniconda3/envs/p310xtuner/lib/python3.10/site-packages/xtuner/tools/train.py", line 303, in main                                           
    runner.train()                                                                                                                                        
  File "/root/miniconda3/envs/p310xtuner/lib/python3.10/site-packages/mmengine/runner/_flexible_runner.py", line 1182, in train                           
    self.strategy.prepare(                                                                                                                                
  File "/root/miniconda3/envs/p310xtuner/lib/python3.10/site-packages/mmengine/_strategy/deepspeed.py", line 381, in prepare                              
    model = self.build_model(model)                                                                                                                       
  File "/root/miniconda3/envs/p310xtuner/lib/python3.10/site-packages/mmengine/_strategy/base.py", line 306, in build_model                               
    model = MODELS.build(model)                                                                                                                           
  File "/root/miniconda3/envs/p310xtuner/lib/python3.10/site-packages/mmengine/registry/registry.py", line 570, in build                                  
    return self.build_func(cfg, *args, **kwargs, registry=self)                                                                                           
  File "/root/miniconda3/envs/p310xtuner/lib/python3.10/site-packages/mmengine/registry/build_functions.py", line 232, in build_model_from_cfg            
    return build_from_cfg(cfg, registry, default_args)                                                                                                    
  File "/root/miniconda3/envs/p310xtuner/lib/python3.10/site-packages/mmengine/registry/build_functions.py", line 121, in build_from_cfg                  
    obj = obj_cls(**args)  # type: ignore                                                                                                                 
TypeError: SupervisedFinetune.__init__() got an unexpected keyword argument 'tokenizer'

How should I fix this?

Which version of xtuner are you using?

HIT-cwh commented 1 month ago

I see you are using the QLoRA algorithm. QLoRA does not support vocabulary expansion, because the embedding layer is not trained during QLoRA training (the newly added parameters stay randomly initialized). If you want to expand the vocabulary, I suggest trying full fine-tuning.
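
A quick way to see this in a given setup (a sketch; the parameter-name filter is an assumption for InternLM2-style models) is to print requires_grad for the embedding and output parameters after the model has been built:

for name, param in model.named_parameters():
    if 'tok_embeddings' in name or name.endswith('output.weight'):
        print(name, param.requires_grad, param.dtype)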

sxk000 commented 4 weeks ago

I see you are using the QLoRA algorithm. QLoRA does not support vocabulary expansion, because the embedding layer is not trained during QLoRA training (the newly added parameters stay randomly initialized). If you want to expand the vocabulary, I suggest trying full fine-tuning.

Thanks for the explanation!

I've been busy with other things recently and didn't reply in time; my apologies!

xtuner 0.1.14

I'm using full-parameter fine-tuning; the config code is as follows:

tokenizer = dict(
    type=AutoTokenizer.from_pretrained,
    pretrained_model_name_or_path=pretrained_model_name_or_path,
    trust_remote_code=True,
    padding_side='right')

model = dict(
    type=SupervisedFinetune,
    llm=dict(
        type=AutoModelForCausalLM.from_pretrained,
        pretrained_model_name_or_path=pretrained_model_name_or_path,
        trust_remote_code=True,
        torch_dtype=torch.float16))