PaddlePaddle / PaddleRec

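Recommendation Algorithm: a large-scale recommendation algorithm library covering classic and state-of-the-art recommender-system algorithms, including LR, Wide&Deep, DSSM, TDM, MIND, Word2Vec, Bert4Rec, DeepWalk, SSR, AITM, DSIN, SIGN, IPREC, GRU4Rec, Youtube_dnn, NCF, GNN, FM, FFM, DeepFM, DCN, DIN, DIEN, DLRM, MMOE, PLE, ESMM, ESCMM, MAML, xDeepFM, DeepFEFM, NFM, AFM, RALM, DMR, GateNet, NAML, DIFM, Deep Crossing, PNN, BST, AutoInt, FGCNN, FLEN, Fibinet, ListWise, DeepRec, ENSFM, TiSAS, AutoFIS, and others, along with classic recommender-system datasets such as criteo and movielens.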
https://paddlerec.readthedocs.io/
Apache License 2.0
4.31k stars 723 forks

Streaming training demo fails with an error #923

Open arbitraryking opened 1 year ago

arbitraryking commented 1 year ago

I ran the command as described in doc/online_trainer.md:

(py37) E:\PaddleRec\PaddleRec\models\rank\slot_dnn>fleetrun --server_num=1 --worker_num=1 ../../../tools/static_ps_online_trainer.py -m config_online.yaml

Fatal error in launcher: Unable to create process using '"C:\ProgramData\Anaconda3\conda-bld\paddlepaddle-gpu_1676544693779\_h_env\python.exe"  "D:\Anacony37\Scripts\fleetrun.exe" --server_num=1 --worker_num=1 ../../../tools/static_ps_online_trainer.py -m config_online.yaml': ???????????

I checked, and there is no Anaconda3 directory under C:\ProgramData\. I can't find anywhere this Python path can be configured.
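The "Fatal error in launcher" message above has the shape produced by the small `.exe` wrappers that Python packaging tools generate for console scripts on Windows. Such a wrapper usually stores the interpreter path as a `#!` line inside the executable itself, written at build or install time, which would explain why the path does not appear in any config file. The following is a minimal sketch for inspecting that embedded line, assuming `fleetrun.exe` follows this layout. The function name `read_launcher_shebang` is hypothetical, and the byte scan is a heuristic, not an official API.

```python
import re
import sys
from pathlib import Path
from typing import Optional

# Heuristic: a "#!" line that mentions python, embedded in the launcher bytes.
SHEBANG = re.compile(rb"#!([^\r\n]*python[^\r\n]*)", re.IGNORECASE)


def read_launcher_shebang(exe_path: str) -> Optional[str]:
    """Return the interpreter path embedded in a console-script launcher.

    Returns None when no shebang-like line is found in the file.
    """
    data = Path(exe_path).read_bytes()
    matches = SHEBANG.findall(data)
    if not matches:
        return None
    # Use the last match: the shebang sits after the executable stub.
    return matches[-1].decode("utf-8", errors="replace").strip().strip('"')


if __name__ == "__main__" and len(sys.argv) > 1:
    # Pass the launcher path on the command line, e.g. the fleetrun.exe
    # path shown in the error message.
    interpreter = read_launcher_shebang(sys.argv[1])
    print("embedded interpreter:", interpreter)
    if interpreter is not None:
        print("exists on disk:", Path(interpreter).exists())
```

If the embedded path points to a directory that does not exist on the machine, the launcher cannot start Python. Reinstalling the package into the active environment is one commonly used way to regenerate such a launcher with the correct path, though whether that applies here is an assumption.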

wangzhen38 commented 1 year ago

Do you have Paddle installed locally? You could first test whether the single-machine version runs.
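One way to answer the "is Paddle installed locally" question is to list the Paddle distributions visible to the current interpreter. The sketch below uses only the standard library and assumes Python 3.8 or later, since `importlib.metadata` is not in the standard library on 3.7 (the `importlib_metadata` backport offers the same interface). The helper name `find_paddle_distributions` is hypothetical.

```python
import sys
from importlib import metadata


def find_paddle_distributions(prefix: str = "paddlepaddle") -> dict:
    """Map installed distribution names starting with `prefix` to versions."""
    found = {}
    for dist in metadata.distributions():
        # Guard against distributions with missing or broken metadata.
        name = dist.metadata["Name"] if dist.metadata else None
        if name and name.lower().startswith(prefix):
            found[name] = dist.version
    return found


if __name__ == "__main__":
    # Show which interpreter is answering, then every matching distribution.
    print("interpreter:", sys.executable)
    for name, version in sorted(find_paddle_distributions().items()):
        print(name, version)
```

If both a CPU and a GPU distribution are listed with different versions, it may be worth checking which one the `paddle` import actually resolves to, since both are expected to provide the same top-level package.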

arbitraryking commented 1 year ago

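I got the single-machine version of the ranking model dnn to run. I have paddlepaddle 2.1.0 and paddlepaddle-gpu 2.4.2.post116 installed. The single-machine version of slot_dnn fails with this error: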

(py37) E:\PaddleRec\PaddleRec\models\rank\slot_dnn>python -u ../../../tools/static_trainer.py -m config_queuedataset.yaml
2023-05-15 16:32:44,707 - INFO - cpu_num: None
2023-05-15 16:32:44,708 - INFO - **************common.configs**********
2023-05-15 16:32:44,708 - INFO - use_gpu: False, use_xpu: False, use_visual: False, train_batch_size: 2, train_data_dir: data/, epochs: 3, print_interval: 10, model_save_path: output_model_benchdnn_queue
2023-05-15 16:32:44,708 - INFO - **************common.configs**********
2023-05-15 16:32:45,986 - INFO - File list: ['E:\\PaddleRec\\PaddleRec\\models\\rank\\slot_dnn\\data//demo_10']
train file_list: ['E:\\PaddleRec\\PaddleRec\\models\\rank\\slot_dnn\\data//demo_10']
parse ins id: None
utils_path: E:\PaddleRec\PaddleRec\tools\utils\static_ps
abs_train_reader is: E:\PaddleRec\PaddleRec\models\rank\slot_dnn\criteo_reader
pipe_command is: python3.7 queuedataset_reader.py config_queuedataset.yaml E:\PaddleRec\PaddleRec\tools\utils\static_ps
dataset init thread_num: 1
2023-05-15 16:32:45,989 - INFO - Get Train Dataset
dataset get_reader thread_num: 1
2023-05-15 16:32:45,996 - INFO - AUC Reset To Zero: _generated_var_0
2023-05-15 16:32:45,996 - INFO - AUC Reset To Zero: _generated_var_1
2023-05-15 16:32:45,997 - INFO - AUC Reset To Zero: _generated_var_2
2023-05-15 16:32:45,997 - INFO - AUC Reset To Zero: _generated_var_3
2023-05-15 16:32:45,997 - INFO - AUC Reset To Zero: _generated_var_4
device worker program id: 2348362127944
I0515 16:32:46.040287  4596 hogwild_worker.cc:270] worker 0 train cost 0 seconds, batch_num: 0
2023-05-15 16:32:46,048 - INFO - epoch: 0 done, epoch time: 0.05 s
Traceback (most recent call last):
  File "../../../tools/static_trainer.py", line 315, in <module>
    main(args)
  File "../../../tools/static_trainer.py", line 207, in main
    prefix='rec_static')
  File "E:\PaddleRec\PaddleRec\tools\utils\save_load.py", line 61, in save_static_model
    paddle.static.save(program, model_prefix)
  File "D:\Anaconda\envs\py37\lib\site-packages\decorator.py", line 232, in fun
    return caller(func, *(extras + args), **kw)
  File "D:\Anaconda\envs\py37\lib\site-packages\paddle\fluid\wrapped_decorator.py", line 26, in __impl__
    return wrapped_func(*args, **kwargs)
  File "D:\Anaconda\envs\py37\lib\site-packages\paddle\fluid\framework.py", line 558, in __impl__
    return func(*args, **kwargs)
  File "D:\Anaconda\envs\py37\lib\site-packages\paddle\fluid\io.py", line 1876, in save
    param_dict = {p.name: get_tensor(p) for p in parameter_list}
  File "D:\Anaconda\envs\py37\lib\site-packages\paddle\fluid\io.py", line 1876, in <dictcomp>
    param_dict = {p.name: get_tensor(p) for p in parameter_list}
  File "D:\Anaconda\envs\py37\lib\site-packages\paddle\fluid\io.py", line 1872, in get_tensor
    t = global_scope().find_var(var.name).get_tensor()
ValueError: (InvalidArgument) The Variable type must be class phi::DenseTensor, but the type it holds is class phi::SelectedRows.
  [Hint: Expected holder_->Type() == VarTypeTrait<T>::kId, but received holder_->Type():8 != VarTypeTrait<T>::kId:7.] (at ..\paddle/fluid/framework/variable.h:58)

wangzhen38 commented 1 year ago

Let me reproduce it first. Once it is confirmed, I'll fix it promptly.