Closed HUSTHY closed 1 year ago
While reproducing SBert with paddle 2.3.1.post112 / CUDA 11.2 / a 3090 GPU / paddlenlp 2.3.4 / Linux / Python 3.7, I fix the random seeds with paddle.seed(100), random.seed(100), np.random.seed(100) and also set FLAGS_cudnn_deterministic = True, yet the results still cannot be reproduced: the acc metric and the loss differ on every training run. Comparing against the PyTorch version on the same dataset (paws_x) with the same BERT pretrained weights (bert-wwm-ext-chinese): the torch run is reproducible and reaches acc 0.75 (without fixing seeds, torch results also fluctuate wildly), while the paddle acc lands randomly somewhere between 0.55 and 0.62. Is this an environment/version problem, a Paddle framework problem, or a bug in my code? How can it be fixed?
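For context, the seeding setup described above roughly corresponds to the following sketch (illustrative only, not the actual script from the attachment; FLAGS_cudnn_deterministic is assumed to be set via an environment variable before Paddle initializes its CUDA/cuDNN context):

```python
import os

# Must be set before Paddle creates the GPU context for it to take effect.
os.environ["FLAGS_cudnn_deterministic"] = "True"

import random

import numpy as np
import paddle

# Fix every random source mentioned in the issue.
paddle.seed(100)
random.seed(100)
np.random.seed(100)
```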
BERT contains random factors such as dropout and the DataLoader's shuffle; you can turn those random factors (dropout, etc.) off and see whether that achieves what you want.
I have checked all of those. The cause is that the gradients differ on every backward pass; if I disable the optimizer, the results become reproducible. The official staff said that some Paddle ops are non-deterministic, so the parameter updates from backpropagation differ between runs.
dropout, the DataLoader and the batch_sampler are the few sources of randomness; the rest should be fine. I have run experiments on the optimizer and it should not be a significant problem. Could you provide a code example so we can take a look?
I have uploaded the complete code in the attachment: paddle_first_demo.zip
Hi, sorry for the late reply. You can set the parameters below to 0:
hidden_dropout_prob=0.1,
attention_probs_dropout_prob=0.1,
https://github.com/PaddlePaddle/PaddleNLP/blob/develop/examples/text_matching/simcse/train.py
Please refer to this example for the approach.
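A minimal sketch of that suggestion, assuming the dropout probabilities can be passed as overrides to `from_pretrained` the same way the linked SimCSE example does (model name taken from this issue):

```python
from paddlenlp.transformers import BertModel

# Setting both dropout probabilities to 0 disables dropout in the encoder
# layers and in the attention modules, removing that source of randomness.
model = BertModel.from_pretrained(
    "bert-wwm-ext-chinese",
    hidden_dropout_prob=0.0,
    attention_probs_dropout_prob=0.0,
)
```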
Thanks for the reply. I believe I already tried that operation: setting the dropout parameters in the BERT config file to 0, which is equivalent to disabling dropout both in the BERT layers and in the attention modules. It seemed to have no effect.
What about CPU?
OK, so it seems there is still randomness. Try CPU and see whether the results align on CPU?
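For the CPU check, switching the device before building the model should be enough; a sketch, assuming the rest of the training script stays unchanged:

```python
import paddle

# Run the same script twice on CPU; if the metrics match exactly, the remaining
# non-determinism comes from GPU/cuDNN kernels rather than the data pipeline.
paddle.set_device("cpu")
```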
Though I'm not sure whether that setting actually took effect; let me go confirm it again.
It is confirmed that there is no randomness on CPU.
One more question: if I use Paddle to implement the SimCSE algorithm, the BERT config's hidden_dropout_prob and attention_probs_dropout_prob must not be 0. Is there any way around that?
If dropout is non-zero, then as I understand it the results cannot be pinned down; dropout inherently introduces randomness.
OK, confirmed: it still doesn't work on GPU. The first run gives acc=0.5605 and the second acc=0.5655, so I still cannot reproduce the same result.
Anyway, thanks for your replies.
@HUSTHY
Results of the first run:
2022-08-22 06:47:49,062 train_sentence_bert.py [line:93] INFO args: Namespace(batch_size=64, epochs=1, lr=1e-05, max_len=64, model_out='./output', pretrained='pretrained_models/paddle/bert-wwm-ext-chinese', task_type='classification', train_file='./data/paws_x/translated_train.tsv', val_file='./data/paws_x/dev_2k.tsv')
[2022-08-22 06:47:49,063] [ INFO] - Already cached /root/.paddlenlp/models/bert-wwm-ext-chinese/bert-wwm-ext-chinese-vocab.txt
[2022-08-22 06:47:49,075] [ INFO] - tokenizer config file saved in /root/.paddlenlp/models/bert-wwm-ext-chinese/tokenizer_config.json
[2022-08-22 06:47:49,076] [ INFO] - Special tokens file saved in /root/.paddlenlp/models/bert-wwm-ext-chinese/special_tokens_map.json
[2022-08-22 06:47:49,076] [ INFO] - Already cached /root/.paddlenlp/models/bert-wwm-ext-chinese/bert-wwm-ext-chinese.pdparams
W0822 06:47:49.078204 79030 gpu_context.cc:278] Please NOTE: device: 1, GPU Compute Capability: 7.0, Driver API Version: 10.2, Runtime API Version: 10.2
W0822 06:47:49.083241 79030 gpu_context.cc:306] device: 1, cuDNN Version: 7.6.
tokenization: 1000it [00:01, 923.86it/s]
tokenization: 1000it [00:01, 936.03it/s]
Epoch 0: CosineAnnealingDecay set learning rate to 1e-05.
2022-08-22 06:48:01,389 train_sentence_bert.py [line:113] INFO ***** Running training *****
2022-08-22 06:48:01,390 train_sentence_bert.py [line:114] INFO Num examples = 16
2022-08-22 06:48:01,390 train_sentence_bert.py [line:115] INFO Num Epochs = 1
[evaldation] 16/16 [==============================] 131.3ms/step loss: 0.6809 Epoch 1: CosineAnnealingDecay set learning rate to 9.755282581475769e-06.
2022-08-22 06:48:10,099 train_sentence_bert.py [line:152] INFO save model
[2022-08-22 06:48:12,171] [ INFO] - tokenizer config file saved in ./output/paddle_2022-08-22/tokenizer_config.json
[2022-08-22 06:48:12,172] [ INFO] - Special tokens file saved in ./output/paddle_2022-08-22/special_tokens_map.json
2022-08-22 06:48:12,172 train_sentence_bert.py [line:158] INFO val_acc:0.5590------best_acc:0.5590
Results of the second run:
2022-08-22 06:49:05,761 train_sentence_bert.py [line:93] INFO args: Namespace(batch_size=64, epochs=1, lr=1e-05, max_len=64, model_out='./output', pretrained='pretrained_models/paddle/bert-wwm-ext-chinese', task_type='classification', train_file='./data/paws_x/translated_train.tsv', val_file='./data/paws_x/dev_2k.tsv')
[2022-08-22 06:49:05,761] [ INFO] - Already cached /root/.paddlenlp/models/bert-wwm-ext-chinese/bert-wwm-ext-chinese-vocab.txt
[2022-08-22 06:49:05,774] [ INFO] - tokenizer config file saved in /root/.paddlenlp/models/bert-wwm-ext-chinese/tokenizer_config.json
[2022-08-22 06:49:05,775] [ INFO] - Special tokens file saved in /root/.paddlenlp/models/bert-wwm-ext-chinese/special_tokens_map.json
[2022-08-22 06:49:05,775] [ INFO] - Already cached /root/.paddlenlp/models/bert-wwm-ext-chinese/bert-wwm-ext-chinese.pdparams
W0822 06:49:05.777117 79056 gpu_context.cc:278] Please NOTE: device: 1, GPU Compute Capability: 7.0, Driver API Version: 10.2, Runtime API Version: 10.2
W0822 06:49:05.782191 79056 gpu_context.cc:306] device: 1, cuDNN Version: 7.6.
tokenization: 1000it [00:01, 869.93it/s]
tokenization: 1000it [00:01, 929.98it/s]
Epoch 0: CosineAnnealingDecay set learning rate to 1e-05.
2022-08-22 06:49:18,445 train_sentence_bert.py [line:113] INFO ***** Running training *****
2022-08-22 06:49:18,445 train_sentence_bert.py [line:114] INFO Num examples = 16
2022-08-22 06:49:18,445 train_sentence_bert.py [line:115] INFO Num Epochs = 1
[evaldation] 16/16 [==============================] 131.1ms/step loss: 0.6809 Epoch 1: CosineAnnealingDecay set learning rate to 9.755282581475769e-06.
2022-08-22 06:49:27,084 train_sentence_bert.py [line:152] INFO save model
[2022-08-22 06:49:29,258] [ INFO] - tokenizer config file saved in ./output/paddle_2022-08-22/tokenizer_config.json
[2022-08-22 06:49:29,259] [ INFO] - Special tokens file saved in ./output/paddle_2022-08-22/special_tokens_map.json
2022-08-22 06:49:29,259 train_sentence_bert.py [line:158] INFO val_acc:0.5590------best_acc:0.5590
I can reproduce identical results. My environment is:
This is with the full dataset, not the 1,000-sample subset. First training run:

2022-08-22 12:10:11,619 train_sentence_bert.py [line:91] INFO args: Namespace(batch_size=64, epochs=5, lr=1e-05, max_len=64, model_out='./output', pretrained='pretrained_models/paddle/bert-wwm-ext-chinese', task_type='classification', train_file='./data/paws_x/translated_train.tsv', val_file='./data/paws_x/dev_2k.tsv')
2022-08-22 12:11:28,773 train_sentence_bert.py [line:111] INFO Running training
2022-08-22 12:11:28,774 train_sentence_bert.py [line:112] INFO Num examples = 768
2022-08-22 12:11:28,775 train_sentence_bert.py [line:113] INFO Num Epochs = 5
2022-08-22 12:14:44,734 train_sentence_bert.py [line:150] INFO save model
2022-08-22 12:14:47,777 train_sentence_bert.py [line:156] INFO val_acc:0.5605------best_acc:0.5605
2022-08-22 12:18:00,986 train_sentence_bert.py [line:156] INFO val_acc:0.5605------best_acc:0.5605
2022-08-22 12:21:15,839 train_sentence_bert.py [line:156] INFO val_acc:0.5550------best_acc:0.5605
2022-08-22 12:24:32,991 train_sentence_bert.py [line:156] INFO val_acc:0.5460------best_acc:0.5605
2022-08-22 12:27:50,582 train_sentence_bert.py [line:156] INFO val_acc:0.5445------best_acc:0.5605
Second training run:

2022-08-22 12:45:39,841 train_sentence_bert.py [line:91] INFO args: Namespace(batch_size=64, epochs=5, lr=1e-05, max_len=64, model_out='./output', pretrained='pretrained_models/paddle/bert-wwm-ext-chinese', task_type='classification', train_file='./data/paws_x/translated_train.tsv', val_file='./data/paws_x/dev_2k.tsv')
2022-08-22 12:46:44,956 train_sentence_bert.py [line:111] INFO Running training
2022-08-22 12:46:44,956 train_sentence_bert.py [line:112] INFO Num examples = 768
2022-08-22 12:46:44,957 train_sentence_bert.py [line:113] INFO Num Epochs = 5
2022-08-22 12:49:47,558 train_sentence_bert.py [line:150] INFO save model
2022-08-22 12:49:50,183 train_sentence_bert.py [line:156] INFO val_acc:0.5655------best_acc:0.5655
2022-08-22 12:53:05,770 train_sentence_bert.py [line:156] INFO val_acc:0.5635------best_acc:0.5655
2022-08-22 12:56:29,093 train_sentence_bert.py [line:156] INFO val_acc:0.5525------best_acc:0.5655
2022-08-22 12:59:52,874 train_sentence_bert.py [line:156] INFO val_acc:0.5555------best_acc:0.5655
2022-08-22 13:03:15,798 train_sentence_bert.py [line:156] INFO val_acc:0.5540------best_acc:0.5655
This is the comparison when training on the 1,000-sample subset. First training run:

2022-08-22 14:59:04,602 train_sentence_bert.py [line:91] INFO args: Namespace(batch_size=64, epochs=5, lr=1e-05, max_len=64, model_out='./output', pretrained='pretrained_models/paddle/bert-wwm-ext-chinese', task_type='classification', train_file='./data/paws_x/translated_train.tsv', val_file='./data/paws_x/dev_2k.tsv')
W0822 14:59:04.653079 74201 gpu_resources.cc:61] Please NOTE: device: 1, GPU Compute Capability: 8.6, Driver API Version: 11.2, Runtime API Version: 11.2
W0822 14:59:04.681924 74201 gpu_resources.cc:91] device: 1, cuDNN Version: 8.1.
tokenization: 1000it [00:01, 828.41it/s]
tokenization: 1000it [00:01, 866.82it/s]
Epoch 0: CosineAnnealingDecay set learning rate to 1e-05.
2022-08-22 14:59:13,117 train_sentence_bert.py [line:111] INFO Running training
2022-08-22 14:59:13,117 train_sentence_bert.py [line:112] INFO Num examples = 16
2022-08-22 14:59:13,118 train_sentence_bert.py [line:113] INFO Num Epochs = 5
[evaldation] 16/16 [==============================] 76.7ms/step loss: 0.6914
Epoch 1: CosineAnnealingDecay set learning rate to 9.755282581475769e-06.
2022-08-22 14:59:20,583 train_sentence_bert.py [line:150] INFO save model
[2022-08-22 14:59:23,324] [ INFO] - tokenizer config file saved in ./output/paddle_2022-08-22/tokenizer_config.json
[2022-08-22 14:59:23,324] [ INFO] - Special tokens file saved in ./output/paddle_2022-08-22/special_tokens_map.json
2022-08-22 14:59:23,325 train_sentence_bert.py [line:156] INFO val_acc:0.5800------best_acc:0.5800
[evaldation] 16/16 [==============================] 77.6ms/step loss: 0.6702
Epoch 2: CosineAnnealingDecay set learning rate to 9.045084971874738e-06.
2022-08-22 14:59:27,950 train_sentence_bert.py [line:156] INFO val_acc:0.5790------best_acc:0.5800
[evaldation] 16/16 [==============================] 77.8ms/step loss: 0.6428
Epoch 3: CosineAnnealingDecay set learning rate to 7.938926261462366e-06.
2022-08-22 14:59:32,587 train_sentence_bert.py [line:156] INFO val_acc:0.5640------best_acc:0.5800
[evaldation] 16/16 [==============================] 79.0ms/step loss: 0.6212
Epoch 4: CosineAnnealingDecay set learning rate to 6.545084971874738e-06.
2022-08-22 14:59:37,367 train_sentence_bert.py [line:156] INFO val_acc:0.5510------best_acc:0.5800
[evaldation] 16/16 [==============================] 79.2ms/step loss: 0.6161
Epoch 5: CosineAnnealingDecay set learning rate to 5e-06.
2022-08-22 14:59:42,155 train_sentence_bert.py [line:156] INFO val_acc:0.5540------best_acc:0.5800
Second training run:

2022-08-22 15:00:54,722 train_sentence_bert.py [line:91] INFO args: Namespace(batch_size=64, epochs=5, lr=1e-05, max_len=64, model_out='./output', pretrained='pretrained_models/paddle/bert-wwm-ext-chinese', task_type='classification', train_file='./data/paws_x/translated_train.tsv', val_file='./data/paws_x/dev_2k.tsv')
W0822 15:00:54.733444 75341 gpu_resources.cc:61] Please NOTE: device: 1, GPU Compute Capability: 8.6, Driver API Version: 11.2, Runtime API Version: 11.2
W0822 15:00:54.736241 75341 gpu_resources.cc:91] device: 1, cuDNN Version: 8.1.
tokenization: 1000it [00:01, 945.41it/s]
tokenization: 1000it [00:01, 935.15it/s]
Epoch 0: CosineAnnealingDecay set learning rate to 1e-05.
2022-08-22 15:00:59,736 train_sentence_bert.py [line:111] INFO Running training
2022-08-22 15:00:59,736 train_sentence_bert.py [line:112] INFO Num examples = 16
2022-08-22 15:00:59,736 train_sentence_bert.py [line:113] INFO Num Epochs = 5
[evaldation] 16/16 [==============================] 76.9ms/step loss: 0.6915
Epoch 1: CosineAnnealingDecay set learning rate to 9.755282581475769e-06.
2022-08-22 15:01:06,179 train_sentence_bert.py [line:150] INFO save model
[2022-08-22 15:01:08,908] [ INFO] - tokenizer config file saved in ./output/paddle_2022-08-22/tokenizer_config.json
[2022-08-22 15:01:08,908] [ INFO] - Special tokens file saved in ./output/paddle_2022-08-22/special_tokens_map.json
2022-08-22 15:01:08,909 train_sentence_bert.py [line:156] INFO val_acc:0.5800------best_acc:0.5800
[evaldation] 16/16 [==============================] 75.8ms/step loss: 0.6702
Epoch 2: CosineAnnealingDecay set learning rate to 9.045084971874738e-06.
2022-08-22 15:01:13,462 train_sentence_bert.py [line:156] INFO val_acc:0.5790------best_acc:0.5800
[evaldation] 16/16 [==============================] 81.3ms/step loss: 0.6429
Epoch 3: CosineAnnealingDecay set learning rate to 7.938926261462366e-06.
2022-08-22 15:01:18,329 train_sentence_bert.py [line:156] INFO val_acc:0.5630------best_acc:0.5800
[evaldation] 16/16 [==============================] 77.8ms/step loss: 0.6211
Epoch 4: CosineAnnealingDecay set learning rate to 6.545084971874738e-06.
2022-08-22 15:01:23,064 train_sentence_bert.py [line:156] INFO val_acc:0.5510------best_acc:0.5800
[evaldation] 16/16 [==============================] 80.3ms/step loss: 0.6161
Epoch 5: CosineAnnealingDecay set learning rate to 5e-06.
2022-08-22 15:01:27,962 train_sentence_bert.py [line:156] INFO val_acc:0.5550------best_acc:0.5800
With 1,000 samples the first epoch gives identical results; with the full dataset they differ. My environment is the same as yours.
OK, this problem does exist and I was able to reproduce it. Thanks for the feedback; I will report it to the framework team.
Is it a problem in the framework, or in PaddleNLP's BERT implementation? Someone earlier said it was a gradient backpropagation problem... not sure whether that's it.
It might be an issue with the Transformer implementation. In chapter 8 of this course, with all seeds fixed the results can be fully reproduced, but the final Transformer model still shows a small diff: https://aistudio.baidu.com/aistudio/course/introduce/25793
OK, thanks. Hoping this gets fixed later on.
This issue is stale because it has been open for 60 days with no activity.
May I ask whether this problem was eventually solved?
This issue is stale because it has been open for 60 days with no activity.
This issue was closed because it has been inactive for 14 days since being marked as stale.