PaddlePaddle / models

Officially maintained, supported by PaddlePaddle, including CV, NLP, Speech, Rec, TS, big models and so on.
Apache License 2.0

Training a classification model with fp16 hangs #2755

Open mzchtx opened 5 years ago

mzchtx commented 5 years ago

Environment

Problem 1

The indentation at this location in utility.py is wrong: there is one extra space, which raises:

IndentationError: unexpected indent 
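
An error like this can be located before training starts by byte-compiling the file and reporting the offending line. The following is a minimal sketch that relies only on Python's built-in compile(); the helper name find_indentation_error and the sample source are illustrative and are not taken from the repository.

```python
def find_indentation_error(source, filename="<string>"):
    """Return (line_number, message) for the first indentation error, or None."""
    try:
        compile(source, filename, "exec")
    except IndentationError as err:
        return err.lineno, err.msg
    return None


# Illustrative source with one extra leading space on the second line.
sample = "x = 1\n y = 2\n"
print(find_indentation_error(sample))
```

For a file on disk, running python -m py_compile utility.py performs the same check from the command line.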

Problem 2

train.py no longer has an input_dtype argument, but run.sh still passes it in many places.
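
A mismatch of this kind can be caught mechanically by comparing the flags used in a launch script with the flags the training script accepts. The sketch below is one possible check, not part of the repository: the function unknown_flags, the script excerpt, the value given to --input_dtype, and the accepted set are all hypothetical.

```python
import re


def unknown_flags(script_text, accepted):
    """Return the flags used in script_text that are not in the accepted set."""
    used = set(re.findall(r"--([A-Za-z_][A-Za-z0-9_]*)", script_text))
    return sorted(used - set(accepted))


# Hypothetical excerpt of a launch script and a hypothetical set of accepted flags.
script = """python train.py \\
    --model=ResNet50 \\
    --input_dtype=float32 \\
    --fp16=True
"""
accepted = {"model", "fp16", "batch_size"}

print(unknown_flags(script, accepted))
```

In practice the accepted set would be read from the argument parser that train.py builds rather than typed by hand.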

Problem 3

ResNet50_vd + fp16=True + use_label_smoothing=True fails with an error.

The command is:

python train.py \
        --model=ResNet50_vd \
        --batch_size=256 \
        --fp16=True \
        --total_images=1281167 \
        --image_shape=3,224,224 \
        --class_dim=1000 \
        --lr_strategy=cosine_decay \
        --lr=0.1 \
        --num_epochs=200 \
        --with_mem_opt=False \
        --model_save_dir=output/ \
        --l2_decay=7e-5 \
        --use_mixup=True \
        --use_label_smoothing=True \
        --label_smoothing_epsilon=0.1

Error message:

Traceback (most recent call last):                                                                                                                                                                    
  File "train.py", line 655, in <module>                                                                                                                                                              
    main()                                                                                                                                                                                            
  File "train.py", line 651, in main                                                                                                                                                                  
    train(args)                                                                                                                                                                                       
  File "train.py", line 530, in train                                                                                                                                                                 
    loss, lr = train_exe.run(fetch_list=train_fetch_list)                                                                                                                                             
  File "/usr/local/lib/python2.7/site-packages/paddle/fluid/parallel_executor.py", line 280, in run                                                                                                   
    return_numpy=return_numpy)                                                                                                                                                                        
  File "/usr/local/lib/python2.7/site-packages/paddle/fluid/executor.py", line 665, in run                                                                                                            
    return_numpy=return_numpy)                                                                                                                                                                        
  File "/usr/local/lib/python2.7/site-packages/paddle/fluid/executor.py", line 527, in _run_parallel                                                                                                  
    exe.run(fetch_var_names, fetch_var_name)                                                                                                                                                          
paddle.fluid.core_avx.EnforceNotMet: Invoke operator cross_entropy error.                                                                                                                             
Python Callstacks:                                                                                                                                                                                    
  File "/usr/local/lib/python2.7/site-packages/paddle/fluid/framework.py", line 1748, in append_op                                                                                                    
    attrs=kwargs.get("attrs", None))                                                                                                                                                                  
  File "/usr/local/lib/python2.7/site-packages/paddle/fluid/layer_helper.py", line 43, in append_op                                                                                                   
    return self.main_program.current_block().append_op(*args, **kwargs)                                                                                                                               
  File "/usr/local/lib/python2.7/site-packages/paddle/fluid/layers/nn.py", line 1547, in cross_entropy                                                                                                
    "ignore_index": ignore_index})                                                                                                                                                                    
  File "train.py", line 235, in calc_loss                                                                                                                                                             
    loss = fluid.layers.cross_entropy(input=softmax_out, label=smooth_label, soft_label=True)                                                                                                         
  File "train.py", line 275, in net_config                                                                                                                                                            
    loss_a = calc_loss(epsilon,y_a,class_dim,softmax_out,use_label_smoothing)                                                                                                                         
  File "train.py", line 332, in build_program                                                                                                                                                         
    avg_cost = net_config(image=image, y_a=y_a, y_b=y_b, lam=lam, model=model, args=args, label=0, is_train=True)                                                                                     
  File "train.py", line 401, in train                                                                                                                                                                 
    args=args)                                                                                                                                                                                        
  File "train.py", line 651, in main                                                                                                                                                                  
    train(args)                                                                                                                                                                                       
  File "train.py", line 655, in <module>                                                                                                                                                              
    main()                                                                                                                                                                                            
C++ Callstacks: 
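
For reference, the failing call computes a cross entropy against a smoothed (soft) label. The sketch below shows that computation in plain Python, assuming uniform label smoothing and the standard mixup combination of two losses. The function names are illustrative, the arithmetic is ordinary double precision, and it does not reproduce the fp16 failure, whose cause is not established in this thread.

```python
import math


def smooth_labels(label, class_dim, epsilon):
    """Uniform label smoothing of a one-hot label: (1 - eps) * one_hot + eps / K."""
    return [
        (1.0 - epsilon) * (1.0 if k == label else 0.0) + epsilon / class_dim
        for k in range(class_dim)
    ]


def soft_cross_entropy(probs, soft_label):
    """Cross entropy against a soft target: -sum_k q_k * log(p_k)."""
    return -sum(q * math.log(p) for p, q in zip(probs, soft_label))


def mixup_loss(probs, y_a, y_b, lam, class_dim, epsilon):
    """Assumed mixup combination: lam * loss(y_a) + (1 - lam) * loss(y_b)."""
    loss_a = soft_cross_entropy(probs, smooth_labels(y_a, class_dim, epsilon))
    loss_b = soft_cross_entropy(probs, smooth_labels(y_b, class_dim, epsilon))
    return lam * loss_a + (1.0 - lam) * loss_b


# Illustrative softmax output over three classes.
probs = [0.7, 0.2, 0.1]
print(smooth_labels(0, class_dim=3, epsilon=0.1))
print(mixup_loss(probs, y_a=0, y_b=1, lam=0.6, class_dim=3, epsilon=0.1))
```

With epsilon set to 0 and lam set to 1 this reduces to the usual hard-label cross entropy for y_a.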

Problem 4

ResNet50_vd + fp16=True + use_label_smoothing=False hangs.

The command is:

python train.py \
        --model=ResNet50_vd \
        --batch_size=256 \
        --fp16=True \
        --total_images=1281167 \
        --image_shape=3,224,224 \
        --class_dim=1000 \
        --lr_strategy=cosine_decay \
        --lr=0.1 \
        --num_epochs=200 \
        --with_mem_opt=False \
        --model_save_dir=output/ \
        --l2_decay=7e-5 \
        --use_mixup=True \
        --use_label_smoothing=False \
        --label_smoothing_epsilon=0.1
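
When a run hangs like this, a stack dump taken while it is stalled helps show where the Python side is waiting. The sketch below is one way to collect it with the standard-library faulthandler module, which requires Python 3.3 or later; the traceback in this issue comes from Python 2.7, so this assumes a newer interpreter. The wrapper run_with_watchdog and the stand-in slow_step are illustrative names, not code from train.py.

```python
import faulthandler
import tempfile
import time


def run_with_watchdog(fn, timeout_s, log_file):
    """Run fn(); while it is still running, dump all thread stacks to log_file every timeout_s seconds."""
    faulthandler.dump_traceback_later(timeout_s, repeat=True, file=log_file)
    try:
        return fn()
    finally:
        # Stop the watchdog once fn() returns or raises.
        faulthandler.cancel_dump_traceback_later()


def slow_step():
    """Illustrative stand-in for a training step that stalls for a while."""
    time.sleep(0.5)
    return "done"


with tempfile.TemporaryFile(mode="w+") as log:
    result = run_with_watchdog(slow_step, timeout_s=0.1, log_file=log)
    log.seek(0)
    dump = log.read()

print(result)
print(dump)
```

The dump covers Python frames only, so a stall inside native code shows up as the Python call that entered it.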
sandyhouse commented 5 years ago

Problems 1 and 2 have been fixed.

sandyhouse commented 5 years ago

I could not reproduce Problem 3 locally. Could you provide more information? @mzchtx

JiaXiao243 commented 5 years ago

"I could not reproduce Problem 3 locally. Could you provide more information? @mzchtx"

The problem can be reproduced by running:

python train.py --use_label_smoothing true --model=ResNet50_vd --num_epochs=1 --batch_size 128 --fp16 true