PaddlePaddle / models

Officially maintained models supported by PaddlePaddle, covering CV, NLP, Speech, Rec, TS, large models, and more.
Apache License 2.0

BERT training in models exits abnormally (most likely out of GPU memory), reporting 0xC0000409 #3306

Open wang001 opened 4 years ago

wang001 commented 4 years ago

I tried bert-wwm (the model released by Harbin Institute of Technology). With Keras it can run batch_size=6, max_seq=512, but here it can only run batch=1, seq_len=300 (I tried).
Version / environment info:
1) PaddlePaddle version: 1.5.0.post87
2) CPU: i5 9400f
3) GPU: 1080ti, cudatoolkit 8.0 and cudnn 7.1.4 installed via anaconda
4) OS: Win10 Pro, 64-bit, Python 3.6.8

sneaxiy commented 4 years ago

Could you provide the specific error message?

wang001 commented 4 years ago

There is only this one exit error code, probably because it is running on Windows. After reducing batch_size and seq_len it runs fine, so it should be due to insufficient GPU memory.

sneaxiy commented 4 years ago

Which exit error code is it? Could you share it?

wang001 commented 4 years ago

0xC0000409

sneaxiy commented 4 years ago

Are there any other error symptoms? For example, the console output or stack trace at the time of exit?

wang001 commented 4 years ago

No. Stack traces have never really been supported on Windows, right? If you are familiar with NLP, you can just try it with any dataset.

wang001 commented 4 years ago

E:\Anaconda3\envs\paddle\python.exe E:/worksapce/pyWorkSpace/PaddleNLP_0903/run_Sentiment_Analysis.py --kfold 10 --use_cuda true --batch_size 1 --in_tokens false --init_pretraining_params F:/pretrain/chinese_wwm_ext_L-12_paddle --data_path E:/corpus/chinaMobile/sentiment_pair_raw_fold_left512.tsv --vocab_path F:/pretrain/chinese_wwm_ext_L-12_H-768_A-12/vocab.txt --checkpoints F:/pretrain/checkpoints_roberta --save_steps 1000 --weight_decay 0.01 --warmup_proportion 0.0 --validation_steps 100 --epoch 1 --max_seq_len 512 --bert_config_path F:/pretrain/chinese_wwm_ext_L-12_H-768_A-12/bert_config.json --learning_rate 1e-5 --skip_steps 10 --num_iteration_per_drop_scope 10 --verbose true

----------- Configuration Arguments -----------
batch_size: 1
bert_config_path: F:/pretrain/chinese_wwm_ext_L-12_H-768_A-12/bert_config.json
checkpoints: F:/pretrain/checkpoints_roberta
data_path: E:/corpus/chinaMobile/sentiment_pair_raw_fold_left512.tsv
do_lower_case: True
enable_ce: False
epoch: 1
in_tokens: False
init_checkpoint: None
init_pretraining_params: F:/pretrain/chinese_wwm_ext_L-12_paddle
kfold: 10
learning_rate: 1e-05
loss_scaling: 1.0
lr_scheduler: linear_warmup_decay
max_seq_len: 512
num_iteration_per_drop_scope: 10
random_seed: 0
save_steps: 1000
shuffle: True
skip_steps: 10
use_cuda: True
use_fast_executor: False
use_fp16: False
validation_steps: 100
verbose: True
vocab_path: F:/pretrain/chinese_wwm_ext_L-12_H-768_A-12/vocab.txt
warmup_proportion: 0.0
weight_decay: 0.01

attention_probs_dropout_prob: 0.1
directionality: bidi
hidden_act: gelu
hidden_dropout_prob: 0.1
hidden_size: 768
initializer_range: 0.02
intermediate_size: 3072
max_position_embeddings: 512
num_attention_heads: 12
num_hidden_layers: 12
pooler_fc_size: 768
pooler_num_attention_heads: 12
pooler_num_fc_layers: 3
pooler_size_per_head: 128
pooler_type: first_token_transform
type_vocab_size: 2
vocab_size: 21128

Device count: 1
Num train examples: 6619
Max train steps: 6619
Num warmup steps: 0
Theoretical memory usage in training: 6787.632 - 7110.853 MB
Load pretraining parameters from F:/pretrain/chinese_wwm_ext_L-12_paddle.
WARNING:root: You can try our memory optimize feature to save your memory usage:

create a build_strategy variable to set memory optimize option

     build_strategy = compiler.BuildStrategy()
     build_strategy.enable_inplace = True
     build_strategy.memory_optimize = True

     # pass the build_strategy to with_data_parallel API
     compiled_prog = compiler.CompiledProgram(main).with_data_parallel(
         loss_name=loss.name, build_strategy=build_strategy)

 !!! Memory optimize is our experimental feature !!!
     some variables may be removed/reused internal to save memory usage, 
     in order to fetch the right value of the fetch_list, please set the 
     persistable property to true for each variable in fetch_list

     # Sample
     conv1 = fluid.layers.conv2d(data, 4, 5, 1, act=None) 
     # if you need to fetch conv1, then:
     conv1.persistable = True
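
For reference, here is a minimal, self-contained sketch of how that suggested build_strategy can be wired into a fluid program on Paddle 1.5. The toy fully connected network, the variable names, and the random feed data below are illustrative assumptions standing in for the BERT classifier graph built by run_Sentiment_Analysis.py, not the actual script:

     import numpy as np
     import paddle.fluid as fluid
     from paddle.fluid import compiler

     # Toy graph standing in for the BERT classification program (assumption).
     data = fluid.layers.data(name="x", shape=[768], dtype="float32")
     label = fluid.layers.data(name="y", shape=[1], dtype="int64")
     logits = fluid.layers.fc(input=data, size=2)
     loss = fluid.layers.mean(
         fluid.layers.softmax_with_cross_entropy(logits=logits, label=label))
     fluid.optimizer.Adam(learning_rate=1e-5).minimize(loss)

     # Variables in fetch_list must stay persistable, otherwise
     # memory_optimize may reuse their buffers before they are read.
     loss.persistable = True

     build_strategy = compiler.BuildStrategy()
     build_strategy.enable_inplace = True   # reuse op output buffers in place
     build_strategy.memory_optimize = True  # cross-op variable memory reuse

     place = fluid.CUDAPlace(0)
     exe = fluid.Executor(place)
     exe.run(fluid.default_startup_program())

     compiled_prog = compiler.CompiledProgram(
         fluid.default_main_program()).with_data_parallel(
             loss_name=loss.name, build_strategy=build_strategy)

     feed = {"x": np.random.rand(4, 768).astype("float32"),
             "y": np.random.randint(0, 2, size=(4, 1)).astype("int64")}
     print(exe.run(compiled_prog, feed=feed, fetch_list=[loss.name]))

Note that this only reduces memory by reusing variable buffers; whether it frees enough headroom to fit max_seq_len=512 given the ~7 GB estimate above is not guaranteed.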

train pyreader queue size: 50, learning rate: 0.000010
epoch: 0, progress: 54/6619, step: 0, ave loss: 0.7352414727210999, ave acc: 1.0
Check failure stack trace:

Process finished with exit code -1073740791 (0xC0000409)