PaddlePaddle / Paddle

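PArallel Distributed Deep LEarning: Machine Learning Framework from Industrial Practice (core framework of PaddlePaddle (飞桨): high-performance single-machine and distributed training and cross-platform deployment for deep learning and machine learning)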
http://www.paddlepaddle.org/
Apache License 2.0
22.12k stars 5.55k forks

NotImplementedError: (Unimplemented) Place CUDAPlace(0) is not supported. Please check that your paddle compiles with WITH_GPU, WITH_XPU or WITH_ASCEND_CL option or check that your train process set the correct device id if you use Executor. (at /paddle/paddle/fluid/platform/device_context.cc:101) #40904

Open franztao opened 2 years ago

franztao commented 2 years ago



C++ Traceback (most recent call last):

No stack trace in paddle, may be caused by external reasons.


Error Message Summary:

FatalError: Segmentation fault is detected by the operating system. [TimeInfo: Aborted at 1648110033 (unix time) try "date -d @1648110033" if you are using GNU date ] [SignalInfo: SIGSEGV (@0x0) received by PID 17012 (TID 0x7fd3fd2dc240) from PID 0 ]

workerlog.1:
Traceback (most recent call last):
  File "/home/jovyan/work/knowledgegraphcommon/business/table_structure_recognition/tsr_tlr_tablemaster_tgr_paddlleocr_tgl_paddlleocr.py", line 942, in <module>
    tsr_tlr_tablemaster_tgr_paddlleocr_tgl_paddlleocr()
  File "/home/jovyan/work/knowledgegraphcommon/business/table_structure_recognition/tsr_tlr_tablemaster_tgr_paddlleocr_tgl_paddlleocr.py", line 934, in tsr_tlr_tablemaster_tgr_paddlleocr_tgl_paddlleocr
    tsr_tlr_tablemaster_tgr_paddlleocr_tgl_paddlleocr.predict()
  File "/home/jovyan/work/knowledgegraphcommon/business/table_structure_recognition/tsr_tlr_tablemaster_tgr_paddlleocr_tgl_paddlleocr.py", line 430, in predict
    return_table_structurer, elapse = table_structurer(input_im_img)
  File "/root/anaconda3/envs/knowledgegraphcommon-py3.8/lib/python3.8/site-packages/paddleocr/ppstructure/table/predict_structure.py", line 81, in __call__
    self.input_tensor.copy_from_cpu(img)
  File "/root/anaconda3/envs/knowledgegraphcommon-py3.8/lib/python3.8/site-packages/paddle/fluid/inference/wrapper.py", line 35, in tensor_copy_from_cpu
    self.copy_from_cpu_bind(data)
NotImplementedError: (Unimplemented) Place CUDAPlace(0) is not supported. Please check that your paddle compiles with WITH_GPU, WITH_XPU or WITH_ASCEND_CL option or check that your train process set the correct device id if you use Executor. (at /paddle/paddle/fluid/platform/device_context.cc:101)

paddle-bot-old[bot] commented 2 years ago

Hi! We have received your issue; please be patient while waiting for a response. We will arrange for technical staff to answer your question as soon as possible. Please make sure you have provided a clear problem description, reproduction code, environment and version details, and the error message. You can also check the API documentation, the FAQ, existing GitHub Issues, and the AI community for answers. Have a nice day!

franztao commented 2 years ago

`

(knowledgegraphcommon-py3.8) root@knowledgegraphcommon0305-0:~# python3 -m paddle.distributed.launch /home/jovyan/work/knowledgegraphcommon/business/table_structure_recognition/tsr_tlr_tablemaster_tgr_paddlleocr_tgl_paddlleocr.py
----------- Configuration Arguments -----------
backend: auto
elastic_server: None
force: False
gpus: None
heter_devices:
heter_worker_num: None
heter_workers:
host: None
http_port: None
ips: 127.0.0.1
job_id: None
log_dir: log
np: None
nproc_per_node: None
run_mode: None
scale: 0
server_num: None
servers:
training_script: /home/jovyan/work/knowledgegraphcommon/business/table_structure_recognition/tsr_tlr_tablemaster_tgr_paddlleocr_tgl_paddlleocr.py
training_script_args: []
worker_num: None
workers:

WARNING 2022-03-24 08:55:48,340 launch.py:422] Not found distinct arguments and compiled with cuda or xpu. Default use collective mode launch train in GPU mode!
INFO 2022-03-24 08:55:48,341 launch_utils.py:525] Local start 4 processes. First process distributed environment info (Only For Debug):
+=======================================================================================+
|                        Distributed Envs                      Value                    |
+---------------------------------------------------------------------------------------+
|                       PADDLE_TRAINER_ID                        0                      |
|                 PADDLE_CURRENT_ENDPOINT                 127.0.0.1:50503               |
|                     PADDLE_TRAINERS_NUM                        4                      |
|                PADDLE_TRAINER_ENDPOINTS  ... 0.1:54457,127.0.0.1:36431,127.0.0.1:39079|
|                     PADDLE_RANK_IN_NODE                        0                      |
|                 PADDLE_LOCAL_DEVICE_IDS                        0                      |
|                 PADDLE_WORLD_DEVICE_IDS                     0,1,2,3                   |
|                     FLAGS_selected_gpus                        0                      |
|             FLAGS_selected_accelerators                        0                      |
+=======================================================================================+

INFO 2022-03-24 08:55:48,341 launch_utils.py:530] details abouts PADDLE_TRAINER_ENDPOINTS can be found in log/endpoints.log, and detail running logs maybe found in log/workerlog.0
launch proc_id:25788 idx:0
launch proc_id:25793 idx:1
launch proc_id:25799 idx:2
launch proc_id:25805 idx:3
[03-24 08:55:55]-[25788]-[INFO]-[tsr_tlr_tablemaster_tgr_paddlleocr_tgl_paddlleocr.py]-[data_post_process]-[line:375]: param Namespace(benchmark=False, cls_batch_num=6, cls_image_shape='3, 48, 192', cls_model_dir=None, cls_thresh=0.9, cpu_threads=10, crop_res_save_dir='./output', det=True, det_algorithm='DB', det_db_box_thresh=0.6, det_db_score_mode='fast', det_db_thresh=0.3, det_db_unclip_ratio=1.5, det_east_cover_thresh=0.1, det_east_nms_thresh=0.2, det_east_score_thresh=0.8, det_limit_side_len=960, det_limit_type='max', det_model_dir='/root/.paddleocr/2.4.0.3/ocr/det/en/en_ppocr_mobile_v2.0_det_infer', det_pse_box_thresh=0.85, det_pse_box_type='box', det_pse_min_area=16, det_pse_scale=1, det_pse_thresh=0, det_sast_nms_thresh=0.2, det_sast_polygon=False, det_sast_score_thresh=0.5, draw_img_save_dir='./inference_results', drop_score=0.5, e2e_algorithm='PGNet', e2e_char_dict_path='./ppocr/utils/ic15_dict.txt', e2e_limit_side_len=768, e2e_limit_type='max', e2e_model_dir=None, e2e_pgnet_mode='fast', e2e_pgnet_score_thresh=0.5, e2e_pgnet_valid_set='totaltext', enable_mkldnn=False, gpu_mem=500, help='==SUPPRESS==', image_dir=None, ir_optim=True, label_list=['0', '180'], label_map_path='./vqa/labels/labels_ser.txt', lang='en', layout_label_map=None, layout_path_model='lp://PubLayNet/ppyolov2_r50vd_dcn_365e_publaynet/config', max_batch_size=10, max_seq_length=512, max_text_length=25, min_subgraph_size=15, mode='structure', model_name_or_path=None, ocr_version='PP-OCRv2', output='./output', precision='fp32', process_id=0, rec=True, rec_algorithm='CRNN', rec_batch_num=6, rec_char_dict_path='/home/jovyan/code/PaddleOCR/ppocr/utils/en_dict.txt', rec_image_shape='3, 32, 320', rec_model_dir='/root/.paddleocr/2.4.0.3/ocr/rec/en/en_number_mobile_v2.0_rec_infer', save_crop_res=False, save_log_path='./log_output/', show_log=True, structure_version='STRUCTURE', table_char_dict_path='/home/jovyan/code/PaddleOCR/ppocr/utils/dict/table_structure_dict.txt', table_char_type='en', table_max_len=488, table_model_dir='/root/.paddleocr/2.4.0.3/ocr/table/en_ppocr_mobile_v2.0_table_structure_infer', total_process_num=1, type='ocr', use_angle_cls=False, use_dilation=False, use_gpu=True, use_mp=False, use_onnx=False, use_pdserving=False, use_space_char=True, use_tensorrt=False, vis_font_path='./doc/fonts/simfang.ttf', warmup=False)


C++ Traceback (most recent call last):

No stack trace in paddle, may be caused by external reasons.


Error Message Summary:

FatalError: Segmentation fault is detected by the operating system. [TimeInfo: Aborted at 1648112156 (unix time) try "date -d @1648112156" if you are using GNU date ] [SignalInfo: SIGSEGV (@0x0) received by PID 25788 (TID 0x7f1518e27240) from PID 0 ]

INFO 2022-03-24 08:56:01,498 launch_utils.py:341] terminate all the procs
ERROR 2022-03-24 08:56:01,499 launch_utils.py:602] ABORT!!! Out of all 4 trainers, the trainer process with rank=[0, 1, 2, 3] was aborted. Please check its log.
INFO 2022-03-24 08:56:05,501 launch_utils.py:341] terminate all the procs
INFO 2022-03-24 08:56:05,501 launch.py:311] Local processes completed.

`
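The "Distributed Envs" table in the launcher output lists the variables the launcher reports for the first worker process. As a rough sketch (not something posted in the thread), a script could log those variables at startup to confirm which GPU each worker was assigned. The helper names `launch_env_snapshot` and `launched_by_paddle` are hypothetical; only the environment variable names come from the log.

```python
import os

# Variable names copied from the launcher's "Distributed Envs" table.
LAUNCH_ENV_KEYS = (
    "PADDLE_TRAINER_ID",
    "PADDLE_TRAINERS_NUM",
    "PADDLE_CURRENT_ENDPOINT",
    "FLAGS_selected_gpus",
)


def launch_env_snapshot(environ=None):
    """Return the launcher-related variables (None for any that are unset)."""
    environ = os.environ if environ is None else environ
    return {key: environ.get(key) for key in LAUNCH_ENV_KEYS}


def launched_by_paddle(environ=None):
    """Heuristic (an assumption): treat a set PADDLE_TRAINER_ID as 'started by the launcher'."""
    return launch_env_snapshot(environ)["PADDLE_TRAINER_ID"] is not None


if __name__ == "__main__":
    for key, value in launch_env_snapshot().items():
        print(f"{key}={value}")
    print("started by paddle.distributed.launch:", launched_by_paddle())
```

Run with plain `python`, every value should be unset; run through `python3 -m paddle.distributed.launch`, each of the four workers should report its own `FLAGS_selected_gpus`.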

Baibaifan commented 2 years ago

Is the correct GPU version of paddle installed?
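One way to check this, sketched here rather than taken from the thread, is to ask the installed package whether it was built with CUDA. The sketch uses only `paddle.__version__` and `paddle.is_compiled_with_cuda()`; the function name `describe_paddle_build` is hypothetical, and the broad `except` is an assumption made so the report still prints when the import itself fails.

```python
import importlib


def describe_paddle_build():
    """Return a small report about the installed paddle build, or an error note."""
    try:
        paddle = importlib.import_module("paddle")
    except Exception as exc:  # missing package or a broken native library
        return {"installed": False, "error": repr(exc)}

    return {
        "installed": True,
        "version": paddle.__version__,
        # False here would mean a CPU-only build, which is consistent with
        # the "Place CUDAPlace(0) is not supported" message.
        "compiled_with_cuda": paddle.is_compiled_with_cuda(),
    }


if __name__ == "__main__":
    for key, value in describe_paddle_build().items():
        print(f"{key}: {value}")
```

If `compiled_with_cuda` is `False` even though `paddlepaddle-gpu` was installed, a CPU-only `paddlepaddle` package in the same environment may be the one that gets imported. `paddle.utils.run_check()` is the fuller installation self-test.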

Baibaifan commented 2 years ago

Also confirm whether the installed paddle was compiled with distributed support.

franztao commented 2 years ago

How do I check that?

franztao commented 2 years ago

> Is the correct GPU version of paddle installed?

I used the install command from the official website: python -m pip install paddlepaddle-gpu==2.2.2 -i https://mirror.baidu.com/pypi/simple

franztao commented 2 years ago

> You can try https://www.paddlepaddle.org.cn/documentation/docs/zh/api/paddle/distributed/all_reduce_cn.html#all-reduce on its own to see whether simple distributed communication works, to confirm that paddle is installed correctly.

WARNING: Ignoring invalid distribution -addlepaddle (/root/anaconda3/envs/knowledgegraphcommon-py3.8/lib/python3.8/site-packages)
WARNING: Ignoring invalid distribution -addlepaddle (/root/anaconda3/envs/knowledgegraphcommon-py3.8/lib/python3.8/site-packages)

Does this mean the GPU distribution was not installed properly?
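On the pip warning: "Ignoring invalid distribution -addlepaddle" usually points to a leftover directory in site-packages whose name starts with `~` (for example `~addlepaddle`), typically left behind by an interrupted uninstall or upgrade. By itself it does not show whether the GPU build is installed. A minimal sketch for listing such leftovers follows; the helper name `find_leftover_dirs` is hypothetical, and the demo runs against a temporary directory rather than a real environment.

```python
import tempfile
from pathlib import Path


def find_leftover_dirs(site_packages):
    """List entries whose name starts with '~'; pip reports these as '-name'."""
    root = Path(site_packages)
    return sorted(p.name for p in root.iterdir() if p.name.startswith("~"))


if __name__ == "__main__":
    # Demo on a throwaway directory that mimics the situation in the warning.
    with tempfile.TemporaryDirectory() as tmp:
        (Path(tmp) / "~addlepaddle").mkdir()
        (Path(tmp) / "numpy").mkdir()
        print(find_leftover_dirs(tmp))
```

To inspect the real environment, pass the site-packages path shown in the warning. After confirming an entry is a leftover and not a live package, deleting it and reinstalling `paddlepaddle-gpu` is the usual cleanup.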

Baibaifan commented 2 years ago

Most likely the paddle you are using was built without distributed support. You could try compiling paddle yourself with this option turned on: -DWITH_DISTRIBUTE=ON
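Before rebuilding, the all_reduce check suggested earlier in the thread can be run as a small smoke test. The following is a sketch modeled on the documented `paddle.distributed` API (`init_parallel_env`, `get_rank`, `get_world_size`, `all_reduce`); the function names `expected_sum` and `run_all_reduce_check` are hypothetical. It assumes a working GPU build and has not been run against the reporter's environment.

```python
import os


def expected_sum(world_size):
    """Sum of 1..world_size, the value every rank should hold after all_reduce."""
    return world_size * (world_size + 1) / 2


def run_all_reduce_check():
    """Each rank contributes rank+1; all_reduce sums the tensor in place."""
    import paddle
    import paddle.distributed as dist

    dist.init_parallel_env()
    rank = dist.get_rank()
    world_size = dist.get_world_size()

    data = paddle.to_tensor([float(rank + 1)])
    dist.all_reduce(data)  # default reduction is SUM

    result = float(data.numpy()[0])
    print(f"rank {rank}/{world_size}: all_reduce result = {result}")
    return result == expected_sum(world_size)


if __name__ == "__main__":
    # Only attempt communication when started by the launcher.
    if "PADDLE_TRAINER_ID" in os.environ:
        ok = run_all_reduce_check()
        print("all_reduce OK" if ok else "all_reduce returned an unexpected value")
    else:
        print("Start with: python -m paddle.distributed.launch <this_file>.py")
```

If this script fails with the same `CUDAPlace(0) is not supported` message, the problem is more likely the installed build than the OCR script itself.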

franztao commented 2 years ago

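> Most likely the paddle you are using was built without distributed support. You could try compiling paddle yourself with this option turned on: -DWITH_DISTRIBUTE=ON

Is there anywhere to download a prebuilt version? Compiling it myself runs into all sorts of problems, and this is causing yet more issues...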

franztao commented 2 years ago

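> Most likely the paddle you are using was built without distributed support. You could try compiling paddle yourself with this option turned on: -DWITH_DISTRIBUTE=ON

> Is there anywhere to download a prebuilt version? Compiling it myself runs into all sorts of problems, and this is causing yet more issues...

Also, my environment is a single machine with multiple GPUs.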

franztao commented 2 years ago

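> Most likely the paddle you are using was built without distributed support. You could try compiling paddle yourself with this option turned on: -DWITH_DISTRIBUTE=ON

> Is there anywhere to download a prebuilt version? Compiling it myself runs into all sorts of problems, and this is causing yet more issues...

> Also, my environment is a single machine with multiple GPUs.

[image] This is opening up far too many problems...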