baichuan-inc / Baichuan-7B

A large-scale 7B pretrained language model developed by BaiChuan-Inc.
https://huggingface.co/baichuan-inc/baichuan-7B
Apache License 2.0

[Question] Two small pitfalls: no pad_token && an unneeded buffer gets saved #66

Open zlkqz opened 1 year ago

zlkqz commented 1 year ago

Questions

  1. Padding is needed during training, but pad_token_id is not saved along with the tokenizer:

    >>> tokenizer = AutoTokenizer.from_pretrained("baichuan-inc/baichuan-7B")
    >>> print(tokenizer.pad_token_id, tokenizer.pad_token)
    None None

    If pad_token_id is left unset, this leads to a CUDA out-of-bounds error or a "pad_token not found" error.

  2. As line 96 of https://github.com/baichuan-inc/baichuan-7B/blob/main/models/modeling_baichuan.py shows, inv_freq is registered as a buffer and therefore saved with state_dict(), even though it is never needed at inference time. If you instead save the weights via model.named_parameters(), loading the resulting checkpoint fails with a missing key for model.model.layers.31.self_attn.rotary_emb.inv_freq (see the sketch below).
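
For context, here is a minimal sketch of the buffer behavior in question (a toy module, not the actual Baichuan class): a buffer registered with the default persistent=True lands in state_dict(), while persistent=False keeps it out, which is safe here because the value is recomputed in __init__ anyway.

    import torch
    import torch.nn as nn

    class ToyRotaryEmbedding(nn.Module):
        # Toy stand-in for the rotary-embedding module; `persistent` mirrors
        # the argument passed to register_buffer in modeling_baichuan.py.
        def __init__(self, dim=8, persistent=True):
            super().__init__()
            inv_freq = 1.0 / (10000 ** (torch.arange(0, dim, 2).float() / dim))
            self.register_buffer("inv_freq", inv_freq, persistent=persistent)

    print("inv_freq" in ToyRotaryEmbedding(persistent=True).state_dict())   # True
    print("inv_freq" in ToyRotaryEmbedding(persistent=False).state_dict())  # False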

zlkqz commented 1 year ago
  1. First issue: set pad_token_id to 0 (a quick check follows this list):

    >>> tokenizer = AutoTokenizer.from_pretrained("baichuan-inc/baichuan-7B")
    >>> tokenizer.pad_token_id = 0
  2. Second issue: pass persistent=False to that register_buffer call:

    >>> self.register_buffer("inv_freq", inv_freq, persistent=False)
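
As a quick check that the first fix takes effect (a sketch: trust_remote_code=True is assumed to be required for this repo's custom tokenizer, and assigning pad_token_id relies on the transformers setter that resolves the id back to a token):

    >>> from transformers import AutoTokenizer
    >>> tokenizer = AutoTokenizer.from_pretrained("baichuan-inc/baichuan-7B", trust_remote_code=True)
    >>> tokenizer.pad_token_id = 0
    >>> batch = tokenizer(["hello", "hello world"], padding=True, return_tensors="pt")
    >>> batch["input_ids"]       # the shorter sequence is padded with id 0
    >>> batch["attention_mask"]  # padded positions get attention_mask 0
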
TangoW commented 1 year ago

Alternatively, you can add a pad_token entry to special_tokens_map.json, copying the content of unk_token, i.e.:

"pad_token": {
    "content": "<unk>",
    "lstrip": false,
    "normalized": true,
    "rstrip": false,
    "single_word": false
}
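
If this route is taken, reloading the tokenizer from the edited local copy should report a pad token (a sketch: the local path below is a placeholder, and pad_token_id is expected to match unk_token_id since the entry copies <unk>):

    >>> tokenizer = AutoTokenizer.from_pretrained("./baichuan-7B-local", trust_remote_code=True)
    >>> tokenizer.pad_token
    '<unk>'
    >>> tokenizer.pad_token_id == tokenizer.unk_token_id
    True
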
yqli2420 commented 1 year ago

Even after switching to tokenizer.pad_token_id = 0 for fine-tuning, the CUDA out-of-bounds error still shows up partway through training.
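
When a device-side assert like this persists, one way to narrow it down (a debugging sketch with a hypothetical helper, not a confirmed diagnosis of this report) is to validate each batch against the embedding table before the forward pass, since token ids at or beyond the embedding size are a classic cause of exactly this error. Running with CUDA_LAUNCH_BLOCKING=1, or once on CPU, also turns the vague CUDA error into the actual indexing stack trace.

    def assert_ids_in_range(batch, model):
        # Hypothetical helper: catch out-of-range indices eagerly on the
        # host side instead of as an asynchronous CUDA device-side assert.
        # Call as assert_ids_in_range(batch, model) right before model(**batch).
        vocab_size = model.get_input_embeddings().weight.shape[0]
        ids = batch["input_ids"]
        assert 0 <= ids.min() and ids.max() < vocab_size, (
            f"input_ids out of range: max={ids.max().item()}, vocab_size={vocab_size}"
        )
        if "labels" in batch:
            # -100 is the loss ignore_index; every other label id must
            # also index into the vocabulary
            valid = batch["labels"][batch["labels"] != -100]
            assert valid.numel() == 0 or valid.max() < vocab_size, "labels out of range"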