Tongjilibo / bert4torch

An elegant PyTorch implementation of transformers
https://bert4torch.readthedocs.io/
MIT License

Parameters are not updated during training when using AutoModel instead of build_transformer_model #140

Open ZayIsAllYouNeed opened 1 year ago

ZayIsAllYouNeed commented 1 year ago

After replacing build_transformer_model(config_path, checkpoint_path) with AutoModel.from_pretrained as the backbone, I found that the backbone parameters are not updated during training (even though requires_grad=True), while the additional linear layers are updated normally. Could you give me a hint about where the problem might be?

ZayIsAllYouNeed commented 1 year ago

Also, after converting the pretrained weight names with convert_deberta_v2.py and loading them with build_transformer_model, I found after training that train_model.bert.embeddings.word_embeddings.weight was not updated, while other layers were (e.g. train_model.bert.encoderLayer[0].multiHeadAttention.o.weight).

Tongjilibo commented 1 year ago

I just printed the sum() of the weights every few steps; from the output they do change, just by a smaller amount than the other layers.

2023-07-02 21:48:03 - Start Training
2023-07-02 21:48:03 - Epoch: 1/10
   9/1129 [..............................] - ETA: 8:15 - loss: 0.7241 - accuracy: 0.5417 [embedding]:  -11801.388671875  [o.weight]:  7.179624557495117
  19/1129 [..............................] - ETA: 5:35 - loss: 0.6300 - accuracy: 0.6184 [embedding]:  -11801.509765625  [o.weight]:  7.172887325286865
  29/1129 [..............................] - ETA: 4:48 - loss: 0.6178 - accuracy: 0.6422 [embedding]:  -11801.685546875  [o.weight]:  7.165395736694336
  39/1129 [>.............................] - ETA: 4:28 - loss: 0.5923 - accuracy: 0.6603 [embedding]:  -11801.83203125  [o.weight]:  7.176580429077148
  49/1129 [>.............................] - ETA: 4:15 - loss: 0.5714 - accuracy: 0.6862 [embedding]:  -11801.955078125  [o.weight]:  7.194809436798096
  59/1129 [>.............................] - ETA: 4:03 - loss: 0.5740 - accuracy: 0.6833 [embedding]:  -11801.974609375  [o.weight]:  7.193059921264648
  69/1129 [>.............................] - ETA: 3:57 - loss: 0.5531 - accuracy: 0.7029 [embedding]:  -11801.9990234375  [o.weight]:  7.181048393249512
  79/1129 [=>............................] - ETA: 3:50 - loss: 0.5431 - accuracy: 0.7144 [embedding]:  -11801.96484375  [o.weight]:  7.179488182067871
  89/1129 [=>............................] - ETA: 3:44 - loss: 0.5396 - accuracy: 0.7191 [embedding]:  -11801.923828125  [o.weight]:  7.181138038635254
  99/1129 [=>............................] - ETA: 3:39 - loss: 0.5331 - accuracy: 0.7216 [embedding]:  -11801.8671875  [o.weight]:  7.16298246383667
ZayIsAllYouNeed commented 1 year ago

Yes, the loss does go down, but only part of the model's parameters are updated. When AutoModel.from_pretrained replaces build_transformer_model, the resulting backbone (i.e. self.deberta) is not updated, while the added linear layers are. When using build_transformer_model, deberta.embeddings.word_embeddings.weight is not updated, while the attention layers are.

ZayIsAllYouNeed commented 1 year ago

I'm loading Erlangshen-DeBERTa-v2-97M-Chinese.

ZayIsAllYouNeed commented 1 year ago

May I ask whether you are using build_transformer_model here?

Tongjilibo commented 1 year ago

I just checked this example, and the weights I printed out do change slightly. Could you try it directly with Hugging Face and see what happens there?

Tongjilibo commented 1 year ago

Try modifying it like this and check whether the printed values change:

class Evaluator(Callback):
    """Evaluate and save.
    """
    def __init__(self):
        self.best_val_acc = 0.

    def on_batch_begin(self, global_step, local_step, logs=None):
        # Print a small slice of the word embedding matrix every 50 steps
        # to check whether the backbone weights actually change.
        if (global_step + 1) % 50 == 0:
            print('[embedding]: ', model.bert.embeddings.word_embeddings.weight[:4, :4].detach())
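
For reference, a hedged sketch of how such a callback is wired up in the keras-style fit() used by the bert4torch examples (the dataloader name and epoch count here are placeholders, not from this thread):

# Hypothetical usage sketch: register the callback when calling fit.
evaluator = Evaluator()
model.fit(train_dataloader, epochs=10, steps_per_epoch=None, callbacks=[evaluator])
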
ZayIsAllYouNeed commented 1 year ago

Sorry, I now know why the embedding looked unchanged on my side: I had only been watching the vectors of a few rare tokens, and those tokens never appear in my training corpus, so their vectors were not updated, while the vectors of tokens that do appear in the corpus were updated. Sorry for the confusion~

ZayIsAllYouNeed commented 1 year ago

As for the earlier case where AutoModel.from_pretrained replaces build_transformer_model, I see that the attention layer weights do not change at all between before and after training.

Tongjilibo commented 1 year ago

Right, a token has to appear in the corpus for its row of the embedding weight to be updated.
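
This is standard nn.Embedding behavior: only the rows actually indexed in a batch receive gradients, so tokens that never appear in the corpus keep their pretrained vectors. A self-contained toy sketch (editorial, not from the thread):

import torch
import torch.nn as nn

emb = nn.Embedding(10, 4)           # 10 tokens, dimension 4
ids = torch.tensor([[1, 2, 2, 3]])  # token 7 never appears
emb(ids).sum().backward()

print(emb.weight.grad[1].abs().sum())  # > 0: token 1 was seen, its row gets a gradient
print(emb.weight.grad[7].abs().sum())  # 0: unseen token, its row is never updated
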

ZayIsAllYouNeed commented 1 year ago

After replacing build_transformer_model(config_path, checkpoint_path) with AutoModel.from_pretrained as the backbone, the backbone parameters are not updated during training (requires_grad=True). Could you help me look into this? Below are the loss and parameter values:

bert.encoder.layer[0].attention.output.dense.weight:
tensor([[-0.0097, -0.0309, -0.0151, -0.0192],
[-0.0226, 0.0237, 0.0011, 0.0200],
[ 0.0050, 0.0198, -0.0224, 0.0068],
[ 0.0352, -0.0158, -0.0098, 0.0337]], device='cuda:7')
10/31 [========>.....................] - ETA: 8s - loss: 0.5647 - subject_loss: 0.1725 - object_loss: 0.3922
bert.encoder.layer[0].attention.output.dense.weight:
tensor([[-0.0097, -0.0309, -0.0151, -0.0192],
[-0.0226, 0.0237, 0.0011, 0.0200],
[ 0.0050, 0.0198, -0.0224, 0.0068],
[ 0.0352, -0.0158, -0.0098, 0.0337]], device='cuda:7')
20/31 [==================>...........] - ETA: 4s - loss: 0.4588 - subject_loss: 0.1573 - object_loss: 0.3015
bert.encoder.layer[0].attention.output.dense.weight:
tensor([[-0.0097, -0.0309, -0.0151, -0.0192],
[-0.0226, 0.0237, 0.0011, 0.0200],
[ 0.0050, 0.0198, -0.0224, 0.0068],
[ 0.0352, -0.0158, -0.0098, 0.0337]], device='cuda:7')
30/31 [============================>.] - ETA: 0s - loss: 0.4164 - subject_loss: 0.1510 - object_loss: 0.2654
bert.encoder.layer[0].attention.output.dense.weight:
tensor([[-0.0097, -0.0309, -0.0151, -0.0192],
[-0.0226, 0.0237, 0.0011, 0.0200],
[ 0.0050, 0.0198, -0.0224, 0.0068],
[ 0.0352, -0.0158, -0.0098, 0.0337]], device='cuda:7')
31/31 [==============================] - 11s 366ms/step - loss: 0.4136 - subject_loss: 0.1505 - object_loss: 0.2631

Tongjilibo commented 1 year ago

The loss is decreasing, so some parameters are definitely being updated. You could try recording the weight sum of every parameter tensor and check which layers change and which don't.
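
A minimal sketch of that kind of check (editorial; it assumes model is the already-built torch.nn.Module from this thread, and the 1e-8 tolerance is arbitrary):

# Snapshot the sum of every parameter tensor before training ...
before = {name: p.detach().sum().item() for name, p in model.named_parameters()}

# ... run training for some steps, then compare and report layers that did not move.
for name, p in model.named_parameters():
    delta = abs(p.detach().sum().item() - before[name])
    if delta < 1e-8:
        print(f'unchanged: {name} (requires_grad={p.requires_grad})')
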

Tongjilibo commented 1 year ago

I don't think the framework itself matters here; using bert4torch or the HF trainer shouldn't be what causes this problem.

ZayIsAllYouNeed commented 1 year ago

I'm using the CasRel example code. Only the parameters outside self.bert are updated, such as self.linear1. I then switched the loaded model to a BERT model and found everything updates normally, whereas the earlier deberta v2 does not, and the final results are very poor:

class Model(BaseModel):
    def __init__(self) -> None:
        super().__init__()
        # Original backbone built with bert4torch:
        # self.bert = build_transformer_model(config_path, checkpoint_path, model='deberta_v2')
        # Replaced with a Hugging Face backbone:
        self.bert = AutoModel.from_pretrained("../../data/bert/Erlangshen-DeBERTa-v2-97M-Chinese")
        self.linear1 = nn.Linear(768, 2)
        self.condLayerNorm = LayerNorm(hidden_size=768, conditional_size=768 * 2)
        self.LayerNorm = LayerNorm(hidden_size=768)
        self.linear2 = nn.Linear(768, len(predicate2id) * 2)
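
One way to narrow this down (an editorial sketch, not from the thread, assuming the Model class above and the surrounding training script) is to confirm that the Hugging Face backbone's parameters are registered on the module and actually receive gradients after a backward pass:

# Hypothetical diagnostic, assuming the Model class and a training batch from the script above.
model = Model()

# 1) The backbone must show up in named_parameters(); if it does not,
#    an optimizer built from model.parameters() will never update it.
backbone = [n for n, p in model.named_parameters() if n.startswith('bert.')]
print(f'{len(backbone)} backbone parameter tensors registered')

# 2) After one forward/backward pass (loss.backward()), every trainable
#    backbone parameter should have a non-None gradient with non-zero norm.
# for n, p in model.named_parameters():
#     if n.startswith('bert.') and p.requires_grad:
#         print(n, 'no grad' if p.grad is None else p.grad.abs().sum().item())
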

Below is the printed output when loading BERT, which is normal:

bert.encoder.layer[0].attention.output.dense.weight:
tensor([[ 0.0147, -0.0067, -0.0006, -0.0297],
[ 0.0141, -0.0764, -0.1015, -0.0069],
[-0.0212, 0.0386, -0.0464, -0.0098],
[ 0.0502, 0.0950, -0.0278, -0.0396]], device='cuda:7')
10/31 [========>.....................] - ETA: 15s - loss: 0.6156 - subject_loss: 0.1724 - object_loss: 0.4431
bert.encoder.layer[0].attention.output.dense.weight:
tensor([[ 0.0148, -0.0065, -0.0003, -0.0295],
[ 0.0140, -0.0765, -0.1016, -0.0070],
[-0.0205, 0.0393, -0.0459, -0.0091],
[ 0.0505, 0.0953, -0.0279, -0.0393]], device='cuda:7')
20/31 [==================>...........] - ETA: 5s - loss: 0.4846 - subject_loss: 0.1588 - object_loss: 0.3258
bert.encoder.layer[0].attention.output.dense.weight:
tensor([[ 0.0149, -0.0064, -0.0002, -0.0294],
[ 0.0141, -0.0764, -0.1016, -0.0069],
[-0.0203, 0.0395, -0.0458, -0.0089],
[ 0.0506, 0.0953, -0.0279, -0.0392]], device='cuda:7')
30/31 [============================>.] - ETA: 0s - loss: 0.4405 - subject_loss: 0.1542 - object_loss: 0.2863
bert.encoder.layer[0].attention.output.dense.weight:
tensor([[ 0.0150, -0.0064, -0.0001, -0.0294],
[ 0.0141, -0.0764, -0.1016, -0.0069],
[-0.0202, 0.0397, -0.0458, -0.0088],
[ 0.0506, 0.0953, -0.0279, -0.0392]], device='cuda:7')
31/31 [==============================] - 13s 430ms/step - loss: 0.4373 - subject_loss: 0.1535 - object_loss: 0.2837