PaddlePaddle / PaddleRec

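Recommendation Algorithm: a large-scale recommendation algorithm library covering classic and recent recommender-system algorithms, including LR, Wide&Deep, DSSM, TDM, MIND, Word2Vec, Bert4Rec, DeepWalk, SSR, AITM, DSIN, SIGN, IPREC, GRU4Rec, Youtube_dnn, NCF, GNN, FM, FFM, DeepFM, DCN, DIN, DIEN, DLRM, MMOE, PLE, ESMM, ESCMM, MAML, xDeepFM, DeepFEFM, NFM, AFM, RALM, DMR, GateNet, NAML, DIFM, Deep Crossing, PNN, BST, AutoInt, FGCNN, FLEN, Fibinet, ListWise, DeepRec, ENSFM, TiSAS, and AutoFIS, along with classic recommender-system datasets such as criteo and movielens.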
https://paddlerec.readthedocs.io/
Apache License 2.0

[Usage question] paddlecloud distributed training demo fails with an error #259

Open tjufc opened 3 years ago

tjufc commented 3 years ago

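Problem summary: I submitted a paddlecloud training job following the distributed_train.md tutorial. The demo and configuration are the same as in the tutorial, using the Collective mode configuration for a K8S cluster. The job runs without producing any output, and the logs contain an error.

Task details

Install paddle-rec (run.log shows the installation succeeded)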

# before_hook.sh
pip install paddle-rec==1.8.5.1
pip uninstall -y paddlepaddle
python -m pip install paddlepaddle-gpu==1.8.5.post107 -i https://mirror.baidu.com/pypi/simple

Error message (workerlog)

# /env_run/logs/workerlog.0
...
/opt/_internal/cpython-3.7.0/lib/python3.7/site-packages/sklearn/externals/joblib/externals/cloudpickle/cloudpickle.py:47: DeprecationWarning: the imp module is deprecated in favour of importlib; see the module's documentation for alternative uses
  import imp
TensorRT dynamic library (libnvinfer.so) that Paddle depends on is not configured correctly. (error code is libnvinfer.so: cannot open shared object file: No such file or directory)
  Suggestions:                                                                                                                   
  1. Check if TensorRT is installed correctly and its version is matched with paddlepaddle you installed.                        
  2. Configure TensorRT dynamic library environment variables as follows:                                                        
  - Linux: set LD_LIBRARY_PATH by `export LD_LIBRARY_PATH=...`                                                                   
  - Windows: set PATH by `set PATH=XXX;PaddleRec: Runner collective_cluster Begin                                                
PADDLEREC_CLUSTER_TYPE: K8S                                                                                                      
PaddleRec run on device GPU: 0                                                                                                   
Executor Mode: train                                                                                                             
processor_register begin                                                                                                         
Running CollectiveInstance.                                                                                                      
Running CollectiveNetwork.                                                                                                       
Traceback (most recent call last):
  File "/opt/_internal/cpython-3.7.0/lib/python3.7/site-packages/paddlerec/core/trainer.py", line 255, in run
    self.context_process(self._context)
  File "/opt/_internal/cpython-3.7.0/lib/python3.7/site-packages/paddlerec/core/trainer.py", line 216, in context_process
    self._status_processor[context['status']](context)
  File "/opt/_internal/cpython-3.7.0/lib/python3.7/site-packages/paddlerec/core/trainers/general_trainer.py", line 90, in network
    network_class.build_network(context)
  File "/opt/_internal/cpython-3.7.0/lib/python3.7/site-packages/paddlerec/core/trainers/framework/network.py", line 392, in build_network
    model._data_loader)
  File "/opt/_internal/cpython-3.7.0/lib/python3.7/site-packages/paddlerec/core/trainers/framework/dataset.py", line 67, in get_dataloader
    "", dataset_name, context["config_yaml"], context)
  File "/opt/_internal/cpython-3.7.0/lib/python3.7/site-packages/paddlerec/core/utils/dataloader_instance.py", line 115, in slotdataloader_by_name
    hidden_file_list=[], data_file_list=[], train_data_path=data_path)
TypeError: cannot unpack non-iterable NoneType object                                                                            
Catch Exception:cannot unpack non-iterable NoneType object                                                                       
                                                                                                                                 
--------------------------------                                                                                                 
PaddleRec Error Message Summary:                                                                                                 
--------------------------------                                                                                                 
                                                                                                                                 
Exit PaddleRec. catch exception in precoss status: [network_pass], except: cannot unpack non-iterable NoneType object            
TypeError

run.log output

selected_gpus:range(0, 1)                                                                                                        
use_paddlecloud_flag:True                                                                                                        
node_ips:job-0bb5fab60b346c8c-trainer-0.e22e0c50-d2e3-11e9-b58f-a0369f713724,job-0bb5fab60b346c8c-trainer-1.e22e0c50-d2e3-11e9-b58f-a0369f713724
node_ip:job-0bb5fab60b346c8c-trainer-0.e22e0c50-d2e3-11e9-b58f-a0369f713724
node_rank:0
num_nodes: 2
cluster:job_server:None pods:["rank:0 id:None addr:job-0bb5fab60b346c8c-trainer-0.e22e0c50-d2e3-11e9-b58f-a0369f713724 port:None visible_gpu:[] trainers:['gpu:[0] endpoint:job-0bb5fab60b346c8c-trainer-0.e22e0c50-d2e3-11e9-b58f-a0369f713724:35024 rank:0']", "rank:1 id:None addr:job-0bb5fab60b346c8c-trainer-1.e22e0c50-d2e3-11e9-b58f-a0369f713724 port:None visible_gpu:[] trainers:['gpu:[0] endpoint:job-0bb5fab60b346c8c-trainer-1.e22e0c50-d2e3-11e9-b58f-a0369f713724:35024 rank:1']"] job_stage_flag:None hdfs:None
pod:rank:0 id:None addr:job-0bb5fab60b346c8c-trainer-0.e22e0c50-d2e3-11e9-b58f-a0369f713724 port:None visible_gpu:[] trainers:['gpu:[0] endpoint:job-0bb5fab60b346c8c-trainer-0.e22e0c50-d2e3-11e9-b58f-a0369f713724:35024 rank:0']
~/paddlejob/workspace                                                                                                            
2020-11-11 12:02:00 [INFO] [/root/paddlejob/run.sh: 251] [start_user_end_hook_process] end_hook start ...                        
~/paddlejob/workspace/env_run ~/paddlejob/workspace                                                                              
Run before_hook.sh ...                                                                                                           
~/paddlejob/workspace                                                                                                            
~/paddlejob/workspace ~/paddlejob/workspace                                                                                      
2020-11-11 12:02:00 [INFO] [/root/paddlejob/tools/end_hook.sh: 14] [start_umount_afs] starting umount afs                        
no need umount afs                                                                                                               
2020-11-11 12:02:00 [INFO] [/root/paddlejob/tools/end_hook.sh: 21] [start_umount_afs] finished umount afs                        
2020-11-11 12:02:00 [INFO] [/root/paddlejob/tools/end_hook.sh: 30] [data_clean] data_clear start ...                             
~/paddlejob/workspace                                                                                                            
2020-11-11 12:02:00 [INFO] [/root/paddlejob/run.sh: 554] [taks_allreduce_mode] trainer successed.                                
k8s job finished
gentelyang commented 3 years ago

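Currently, the k8s collective mode only supports single-node multi-GPU training; multi-node multi-GPU training is still under development.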
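
For readers who hit the same workerlog message: `TypeError: cannot unpack non-iterable NoneType object` is the generic Python error raised when the result of a call is `None` but the caller tries to unpack it into several names. The sketch below only illustrates that error class. `split_file_list` is a hypothetical helper invented for this example, not a PaddleRec function, and it is an assumption (not confirmed by the log) that the failing line in `dataloader_instance.py` has this shape.

```python
from typing import List, Optional, Tuple


def split_file_list(data_path: Optional[str]) -> Optional[Tuple[List[str], List[str]]]:
    """Hypothetical helper: returns (hidden_files, data_files).

    It returns None on a code path it does not handle, which is the
    situation that leads to the unpacking error.
    """
    if data_path is None:
        return None  # unsupported case falls through without a result
    return [], [data_path]


def reproduce_error() -> str:
    """Unpack a None result and return the resulting error message."""
    try:
        hidden_files, data_files = split_file_list(None)
    except TypeError as exc:
        return str(exc)
    return ""


def load_files(data_path: Optional[str]) -> List[str]:
    """Safer variant: check for None before unpacking."""
    result = split_file_list(data_path)
    if result is None:
        raise ValueError("file list could not be built for this configuration")
    hidden_files, data_files = result
    return data_files


if __name__ == "__main__":
    print(reproduce_error())
    print(load_files("part-0"))
```

Checking the result before unpacking turns the opaque `TypeError` into an explicit message about the unsupported configuration.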