PaddlePaddle / PaddleOCR

Awesome multilingual OCR toolkits based on PaddlePaddle (practical ultra lightweight OCR system, support 80+ languages recognition, provide data annotation and synthesis tools, support training and deployment among server, mobile, embedded and IoT devices)
https://paddlepaddle.github.io/PaddleOCR/
Apache License 2.0

Fine-tuning text detection with PP-OCRv3 on a T4 fails to train; the process exits with code -9 #13100

Closed xla145 closed 4 months ago

xla145 commented 4 months ago

Problem Description

I am fine-tuning text detection with PP-OCRv3. Training cannot run on a T4; the process fails and returns code -9.

Runtime Environment

Reproduction Code

I am using the latest official code unmodified. The configuration file is as follows:

Global:
  debug: false
  use_gpu: true
  epoch_num: 500
  log_smooth_window: 20
  print_batch_step: 10
  save_model_dir: ./output/ch_PP-OCR_V3_det/
  save_epoch_step: 50
  eval_batch_step:

Architecture:
  model_type: det
  algorithm: DB
  Transform:
  Backbone:
    name: MobileNetV3
    scale: 0.5
    model_name: large
    disable_se: True
  Neck:
    name: RSEFPN
    out_channels: 96
    shortcut: True
  Head:
    name: DBHead
    k: 50

Loss:
  name: DBLoss
  balance_loss: true
  main_loss_type: DiceLoss
  alpha: 5
  beta: 10
  ohem_ratio: 3
Optimizer:
  name: Adam
  beta1: 0.9
  beta2: 0.999
  lr:
    name: Cosine
    learning_rate: 0.001
    warmup_epoch: 2
  regularizer:
    name: L2
    factor: 5.0e-05
PostProcess:
  name: DBPostProcess
  thresh: 0.3
  box_thresh: 0.6
  max_candidates: 1000
  unclip_ratio: 1.5
Metric:
  name: DetMetric
  main_indicator: hmean
Train:
  dataset:
    name: SimpleDataSet
    data_dir: ../dataset/handwriting_data_det/train/20240605/train/images/
    label_file_list:

Complete Error Message

There is no full error message; the launcher only reports code -9.

Possible solutions

For now, setting num_workers = 0 and batch_size_per_card=1 lets training run normally for a few epochs, but it is inefficient.
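The workaround above can be tried without editing the YAML file by passing overrides through the `-o` option that already appears in the launch command. The following is only a minimal sketch: the key names `Train.loader.num_workers` and `Train.loader.batch_size_per_card` are assumptions, because the `Train.loader` section is cut off in the posted config, and `build_override_args` is a hypothetical helper, not part of PaddleOCR.

```python
import shlex


def build_override_args(overrides):
    """Turn {"A.b": 1} into ["-o", "A.b=1"] (hypothetical helper)."""
    if not overrides:
        return []
    return ["-o"] + [f"{key}={value}" for key, value in overrides.items()]


# Base command taken from the launch log in this thread.
base_cmd = [
    "python3",
    "tools/train.py",
    "-c",
    "configs/det/ch_PP-OCRv3/ch_PP-OCRv3_det_student.yml",
]

# Assumed key paths: the loader section is not visible in the posted config.
overrides = {
    "Global.pretrained_model": "pretrain_models/ch_PP-OCRv3_det_distill_train/student.pdparams",
    "Train.loader.num_workers": 0,
    "Train.loader.batch_size_per_card": 1,
}

cmd = base_cmd + build_override_args(overrides)
print(shlex.join(cmd))  # copy the printed line into a shell to run it
```

Raising the two values step by step from this starting point is one way to find the largest setting that still trains, rather than staying at the slowest configuration.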

Appendix

Hardware configuration

image

xla145 commented 4 months ago

Additional error output:

LAUNCH INFO 2024-06-17 17:01:27,787 Pod failed
LAUNCH ERROR 2024-06-17 17:01:27,788 Container failed !!!
Container rank 3 status failed
cmd ['/data/miniconda/envs/py38_ocr/bin/python3', '-u', 'tools/train.py', '-c', 'configs/det/ch_PP-OCRv3/ch_PP-OCRv3_det_student.yml', '-o', 'Global.pretrained_model=pretrain_models/ch_PP-OCRv3_det_distill_train/student.pdparams']
code -9
log log/workerlog.3
env {'NV_LIBCUBLAS_VERSION': '11.3.1.68-1', 'NVIDIA_VISIBLE_DEVICES': 'GPU-1314610d-0066-5daa-2856-78e48d9c6b8f,GPU-4408b4f1-7084-1763-3194-0190b4d7e396,GPU-4b5437d0-12bf-5148-3c67-1eeee3c02a2b,GPU-97e8a70d-1660-3d0c-f53e-f7d6e12d27ac', 'KUBERNETES_SERVICE_PORT_HTTPS': '443', 'TRAEFIK_WEB_SERVICE_PORT_80_TCP_PROTO': 'tcp', 'ZHANGZHIYING_IDE_PORT_8080_TCP_PROTO': 'tcp', 'NEXUS3_SERVICE_PORT_WEB': '8081', 'COLORTERM': 'truecolor', 'NV_NVML_DEV_VERSION': '11.2.67-1', 'ZHANGZHIYING_IDE_PORT_8080_TCP': 'tcp://10.10.114.16:8080', 'NV_CUDNN_PACKAGE_NAME': 'libcudnn8', 'KUBERNETES_SERVICE_PORT': '443', 'TERM_PROGRAM_VERSION': '1.85.1', 'POSTRESQL_POSTGRESQL_PORT_5432_TCP': 'tcp://10.10.70.210:5432', 'NV_LIBNCCL_DEV_PACKAGE': 'libnccl-dev=2.8.4-1+cuda11.2', 'CONDA_EXE': '/data/miniconda/bin/conda', 'TRAEFIK_WEB_SERVICE_PORT_80_TCP_ADDR': '10.10.172.32', '_CE_M': '', 'NV_LIBNCCL_DEV_PACKAGE_VERSION': '2.8.4-1', 'TRAEFIK_WEB_SERVICE_PORT_WEB': '80', 'HOSTNAME': 'xulian-ide-99bdfd69c-lrhk2', 'ZHANGZHIYING_IDE_PORT_8080_TCP_PORT': '8080', 'ZENTAO_SERVICE_PORT': 'tcp://10.10.215.153:80', 'NVIDIA_REQUIRE_CUDA': 'cuda>=11.2 brand=tesla,driver>=418,driver<419 brand=tesla,driver>=440,driver<441 driver>=450', 'XULIAN_IDE_PORT_8080_TCP_PORT': '8080', 'TRAEFIK_WEB_PORT_80_TCP_ADDR': '10.10.147.238', 'NV_LIBCUBLAS_DEV_PACKAGE': 'libcublas-dev-11-2=11.3.1.68-1', 'TRAEFIK_WEB_PORT': 'tcp://10.10.147.238:80', 'NV_NVTX_VERSION': '11.2.67-1', 'NV_ML_REPO_ENABLED': '1', 'ZENTAO_SERVICE_SERVICE_HOST': '10.10.215.153', 'NEXUS3_SERVICE_HOST': '10.10.17.156', 'NEXUS3_PORT': 'tcp://10.10.17.156:8081', 
'ZHANGZHIYING_IDE_SERVICE_HOST': '10.10.114.16', 'NV_CUDA_CUDART_DEV_VERSION': '11.2.72-1', 'NV_LIBCUSPARSE_VERSION': '11.3.1.68-1', 'NV_LIBNPP_VERSION': '11.2.1.68-1', 'POSTRESQL_POSTGRESQL_SERVICE_PORT': '5432', 'NCCL_VERSION': '2.8.4-1', 'VSCODE_PROXY_URI': 'https://xulian.test.bytebroad.com/proxy/{{port}}/', 'TRAEFIK_WEB_PORT_80_TCP_PORT': '80', 'ZHANGZHIYING_IDE_PORT_8080_TCP_ADDR': '10.10.114.16', 'XULIAN_IDE_PORT': 'tcp://10.10.231.36:8080', 'ZHOUPAN_IDE_PORT': 'tcp://10.10.95.128:8080', 'PWD': '/root/workspace/paddleocr_train', 'CONDA_ROOT': '/data/miniconda', 'XULIAN_IDE_PORT_8080_TCP': 'tcp://10.10.231.36:8080', 'CONDA_PREFIX': '/data/miniconda/envs/py38_ocr', 'NV_CUDNN_PACKAGE': 'libcudnn8=8.1.1.33-1+cuda11.2', 'ZENTAO_SERVICE_PORT_80_TCP_ADDR': '10.10.215.153', 'NVIDIA_DRIVER_CAPABILITIES': 'compute,utility', 'NV_LIBNPP_PACKAGE': 'libnpp-11-2=11.2.1.68-1', 'TRAEFIK_DASHBOARD_SERVICE_PORT_DASHBOARD': '8080', 'NV_LIBNCCL_DEV_PACKAGE_NAME': 'libnccl-dev', 'VSCODE_GIT_ASKPASS_NODE': '/root/code-server/lib/node', 'TRAEFIK_WEB_PORT_443_TCP_PROTO': 'tcp', 'NV_LIBCUBLAS_DEV_VERSION': '11.3.1.68-1', 'TRAEFIK_DASHBOARD_PORT_8080_TCP_PORT': '8080', 'TRAEFIK_DASHBOARD_PORT_8080_TCP': 'tcp://10.10.150.194:8080', 'NV_LIBCUBLAS_DEV_PACKAGE_NAME': 'libcublas-dev-11-2', 'POSTRESQL_POSTGRESQL_PORT_5432_TCP_PORT': '5432', 'NV_CUDA_CUDART_VERSION': '11.2.72-1', 'HOME': '/root', 'ZENTAO_SERVICE_PORT_80_TCP_PROTO': 'tcp', 'LANG': 'en_US.UTF-8', 'KUBERNETES_PORT_443_TCP': 'tcp://10.10.0.1:443', 'LS_COLORS': 
'rs=0:di=01;34:ln=01;36:mh=00:pi=40;33:so=01;35:do=01;35:bd=40;33;01:cd=40;33;01:or=40;31;01:mi=00:su=37;41:sg=30;43:ca=30;41:tw=30;42:ow=34;42:st=37;44:ex=01;32:.tar=01;31:.tgz=01;31:.arc=01;31:.arj=01;31:.taz=01;31:.lha=01;31:.lz4=01;31:.lzh=01;31:.lzma=01;31:.tlz=01;31:.txz=01;31:.tzo=01;31:.t7z=01;31:.zip=01;31:.z=01;31:.dz=01;31:.gz=01;31:.lrz=01;31:.lz=01;31:.lzo=01;31:.xz=01;31:.zst=01;31:.tzst=01;31:.bz2=01;31:.bz=01;31:.tbz=01;31:.tbz2=01;31:.tz=01;31:.deb=01;31:.rpm=01;31:.jar=01;31:.war=01;31:.ear=01;31:.sar=01;31:.rar=01;31:.alz=01;31:.ace=01;31:.zoo=01;31:.cpio=01;31:.7z=01;31:.rz=01;31:.cab=01;31:.wim=01;31:.swm=01;31:.dwm=01;31:.esd=01;31:.jpg=01;35:.jpeg=01;35:.mjpg=01;35:.mjpeg=01;35:.gif=01;35:.bmp=01;35:.pbm=01;35:.pgm=01;35:.ppm=01;35:.tga=01;35:.xbm=01;35:.xpm=01;35:.tif=01;35:.tiff=01;35:.png=01;35:.svg=01;35:.svgz=01;35:.mng=01;35:.pcx=01;35:.mov=01;35:.mpg=01;35:.mpeg=01;35:.m2v=01;35:.mkv=01;35:.webm=01;35:.ogm=01;35:.mp4=01;35:.m4v=01;35:.mp4v=01;35:.vob=01;35:.qt=01;35:.nuv=01;35:.wmv=01;35:.asf=01;35:.rm=01;35:.rmvb=01;35:.flc=01;35:.avi=01;35:.fli=01;35:.flv=01;35:.gl=01;35:.dl=01;35:.xcf=01;35:.xwd=01;35:.yuv=01;35:.cgm=01;35:.emf=01;35:.ogv=01;35:.ogx=01;35:.aac=00;36:.au=00;36:.flac=00;36:.m4a=00;36:.mid=00;36:.midi=00;36:.mka=00;36:.mp3=00;36:.mpc=00;36:.ogg=00;36:.ra=00;36:.wav=00;36:.oga=00;36:.opus=00;36:.spx=00;36:*.xspf=00;36:', 'NEXUS3_PORT_8081_TCP': 'tcp://10.10.17.156:8081', 'ZHOUPAN_IDE_SERVICE_PORT_HTTP': '8080', 'CUDA_VERSION': '11.2.0', 'NV_LIBCUBLAS_PACKAGE': 'libcublas-11-2=11.3.1.68-1', 'ZHOUPAN_IDE_PORT_8080_TCP_ADDR': '10.10.95.128', 'TRAEFIK_DASHBOARD_SERVICE_SERVICE_HOST': '10.10.210.90', 'CONDA_PROMPT_MODIFIER': '(py38_ocr) ', 'XULIAN_IDE_SERVICE_HOST': '10.10.231.36', 'GIT_ASKPASS': '/root/code-server/lib/vscode/extensions/git/dist/askpass.sh', 'ZHOUPAN_IDE_SERVICE_PORT': '8080', 'ZHANGZHIYING_IDE_PORT': 'tcp://10.10.114.16:8080', 'NV_LIBNPP_DEV_PACKAGE': 'libnpp-dev-11-2=11.2.1.68-1', 
'XULIAN_IDE_PORT_8080_TCP_ADDR': '10.10.231.36', 'TRAEFIK_WEB_PORT_80_TCP': 'tcp://10.10.147.238:80', 'TRAEFIK_DASHBOARD_SERVICE_PORT_8080_TCP_PORT': '8080', 'TRAEFIK_DASHBOARD_PORT': 'tcp://10.10.150.194:8080', 'NV_LIBCUBLAS_PACKAGE_NAME': 'libcublas-11-2', 'ZENTAO_SERVICE_PORT_80_TCP': 'tcp://10.10.215.153:80', 'TRAEFIK_WEB_SERVICE_HOST': '10.10.147.238', 'NV_LIBNPP_DEV_VERSION': '11.2.1.68-1', 'VSCODE_GIT_ASKPASS_EXTRA_ARGS': '', 'TRAEFIK_WEB_SERVICE_PORT_80_TCP': 'tcp://10.10.172.32:80', 'NEXUS3_SERVICE_PORT': '8081', 'TRAEFIK_DASHBOARD_PORT_8080_TCP_ADDR': '10.10.150.194', 'LESSCLOSE': '/usr/bin/lesspipe %s %s', 'TERM': 'xterm-256color', 'NV_ML_REPO_URL': 'https://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu2004/x86_64', 'NV_LIBCUSPARSE_DEV_VERSION': '11.3.1.68-1', '_CE_CONDA': '', 'TRAEFIK_WEB_SERVICE_SERVICE_PORT': '80', 'LESSOPEN': '| /usr/bin/lesspipe %s', 'TRAEFIK_WEB_SERVICE_PORT': '80', 'ZHOUPAN_IDE_PORT_8080_TCP_PORT': '8080', 'TRAEFIK_WEB_SERVICE_PORT_80_TCP_PORT': '80', 'LIBRARY_PATH': '/usr/local/cuda/lib64/stubs', 'NV_CUDNN_VERSION': '8.1.1.33', 'VSCODE_GIT_IPC_HANDLE': '/tmp/vscode-git-e94dce1104.sock', 'CONDA_SHLVL': '3', 'ZHOUPAN_IDE_PORT_8080_TCP_PROTO': 'tcp', 'POSTRESQL_POSTGRESQL_SERVICE_HOST': '10.10.70.210', 'POSTRESQL_POSTGRESQL_PORT_5432_TCP_PROTO': 'tcp', 'TRAEFIK_DASHBOARD_SERVICE_HOST': '10.10.150.194', 'SHLVL': '2', 'POSTRESQL_POSTGRESQL_PORT_5432_TCP_ADDR': '10.10.70.210', 'NV_CUDA_LIB_VERSION': '11.2.0-1', 'NVARCH': 'x86_64', 'TRAEFIK_DASHBOARD_SERVICE_PORT': 'tcp://10.10.210.90:8080', 'KUBERNETES_PORT_443_TCP_PROTO': 'tcp', 'ZHANGZHIYING_IDE_SERVICE_PORT_HTTP': '8080', 'NV_CUDNN_PACKAGE_DEV': 'libcudnn8-dev=8.1.1.33-1+cuda11.2', 'TRAEFIK_WEB_SERVICE_SERVICE_HOST': '10.10.172.32', 'KUBERNETES_PORT_443_TCP_ADDR': '10.10.0.1', 'NV_CUDA_COMPAT_PACKAGE': 'cuda-compat-11-2', 'ZHOUPAN_IDE_SERVICE_HOST': '10.10.95.128', 'ZENTAO_SERVICE_PORT_80_TCP_PORT': '80', 'CONDA_PYTHON_EXE': '/data/miniconda/bin/python', 
'NV_LIBNCCL_PACKAGE': 'libnccl2=2.8.4-1+cuda11.2', 'LD_LIBRARY_PATH': '/data/miniconda/envs/py38_ocr/lib/python3.8/site-packages/cv2/../../lib64:/usr/local/nvidia/lib:/usr/local/nvidia/lib64', 'POSTRESQL_POSTGRESQL_SERVICE_PORT_TCP_POSTGRESQL': '5432', 'ZENTAO_SERVICE_SERVICE_PORT': '80', 'TRAEFIK_WEB_PORT_443_TCP': 'tcp://10.10.147.238:443', 'CONDA_DEFAULT_ENV': 'py38_ocr', 'XULIAN_IDE_SERVICE_PORT_HTTP': '8080', 'NEXUS3_PORT_8081_TCP_PROTO': 'tcp', 'KUBERNETES_SERVICE_HOST': '10.10.0.1', 'TRAEFIK_WEB_PORT_80_TCP_PROTO': 'tcp', 'ZHANGZHIYING_IDE_SERVICE_PORT': '8080', 'KUBERNETES_PORT': 'tcp://10.10.0.1:443', 'KUBERNETES_PORT_443_TCP_PORT': '443', 'VSCODE_GIT_ASKPASS_MAIN': '/root/code-server/lib/vscode/extensions/git/dist/askpass-main.js', 'TRAEFIK_DASHBOARD_SERVICE_PORT_8080_TCP_ADDR': '10.10.210.90', 'TRAEFIK_WEB_SERVICE_PORT_WEBSECURE': '443', 'BROWSER': '/root/code-server/lib/vscode/bin/helpers/browser.sh', 'PATH': '/data/miniconda/envs/py38_ocr/bin:/data/miniconda/condabin:/data/miniconda/envs/py38/bin:/data/miniconda/condabin:/root/code-server/lib/vscode/bin/remote-cli:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/data/miniconda/envs/py38/bin:/data/miniconda/bin:/root/code-server/bin:/root/clangd_17.0.3/bin:/root/golang/bin', 'NODE_EXEC_PATH': '/root/code-server/lib/node', 'XULIAN_IDE_PORT_8080_TCP_PROTO': 'tcp', 'TRAEFIK_DASHBOARD_SERVICE_PORT_8080_TCP_PROTO': 'tcp', 'NV_LIBNCCL_PACKAGE_NAME': 'libnccl2', 'NV_LIBNCCL_PACKAGE_VERSION': '2.8.4-1', 'XULIAN_IDE_SERVICE_PORT': '8080', 'NEXUS3_PORT_8081_TCP_ADDR': '10.10.17.156', 'TRAEFIK_DASHBOARD_PORT_8080_TCP_PROTO': 'tcp', 'CONDA_PREFIX_1': '/data/miniconda', 'TRAEFIK_DASHBOARD_SERVICE_PORT_8080_TCP': 'tcp://10.10.210.90:8080', 'CONDA_PREFIX_2': '/data/miniconda/envs/py38', 'TRAEFIK_DASHBOARD_SERVICE_SERVICE_PORT': '8080', 'TRAEFIK_WEB_PORT_443_TCP_PORT': '443', 'POSTRESQL_POSTGRESQL_PORT': 'tcp://10.10.70.210:5432', 'NEXUS3_PORT_8081_TCP_PORT': 
'8081', 'OLDPWD': '/root/workspace', 'TRAEFIK_WEB_PORT_443_TCP_ADDR': '10.10.147.238', 'TERM_PROGRAM': 'vscode', 'ZHOUPAN_IDE_PORT_8080_TCP': 'tcp://10.10.95.128:8080', 'VSCODE_IPC_HOOKCLI': '/tmp/vscode-ipc-11f41773-ceee-4dec-8b14-b37683de6c18.sock', '': '/data/miniconda/envs/py38_ocr/bin/python3', 'LC_CTYPE': 'C.UTF-8', 'CUSTOM_DEVICE_ROOT': '', 'OMP_NUM_THREADS': '1', 'QT_QPA_PLATFORM_PLUGIN_PATH': '/data/miniconda/envs/py38_ocr/lib/python3.8/site-packages/cv2/qt/plugins', 'QT_QPA_FONTDIR': '/data/miniconda/envs/py38_ocr/lib/python3.8/site-packages/cv2/qt/fonts', 'POD_NAME': 'cvxshq', 'PADDLE_MASTER': '10.20.59.202:39009', 'PADDLE_GLOBAL_SIZE': '4', 'PADDLE_LOCAL_SIZE': '4', 'PADDLE_GLOBAL_RANK': '3', 'PADDLE_LOCAL_RANK': '3', 'PADDLE_NNODES': '1', 'PADDLE_CURRENT_ENDPOINT': '10.20.59.202:39013', 'PADDLE_TRAINER_ID': '3', 'PADDLE_TRAINERS_NUM': '4', 'PADDLE_RANK_IN_NODE': '3', 'PADDLE_TRAINER_ENDPOINTS': '10.20.59.202:39010,10.20.59.202:39011,10.20.59.202:39012,10.20.59.202:39013', 'FLAGS_selected_gpus': '3', 'PADDLE_LOG_DIR': '/root/workspace/paddleocr_train/log'}
LAUNCH INFO 2024-06-17 17:01:27,789 ------------------------- ERROR LOG DETAIL -------------------------
LAUNCH INFO 2024-06-17 17:01:35,863 Exit code -9

jingsongliujing commented 4 months ago

Go into '/root/workspace/paddleocr_train/log' and take a look; the error log should be in there.
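To check every rank's log in that directory at once, a short script can print the tail of each worker log. This is a minimal sketch: it assumes the files are named `workerlog.N`, as the `log/workerlog.3` entry in the launch output suggests, and `tail_worker_logs` is a hypothetical helper name.

```python
from collections import deque
from pathlib import Path


def tail_worker_logs(log_dir, lines=20):
    """Return {file name: last `lines` lines} for every workerlog.* file."""
    tails = {}
    for path in sorted(Path(log_dir).glob("workerlog.*")):
        # errors="replace" keeps the script from failing on odd bytes in a log.
        with path.open("r", errors="replace") as handle:
            tails[path.name] = list(deque(handle, maxlen=lines))
    return tails


if __name__ == "__main__":
    # Directory taken from PADDLE_LOG_DIR in the launch output above.
    log_dir = "/root/workspace/paddleocr_train/log"
    for name, tail in tail_worker_logs(log_dir).items():
        print(f"==== {name} ====")
        print("".join(tail), end="")
```

If a worker was killed abruptly, its log may simply stop mid-training with no traceback, so comparing the last lines across ranks can still show which rank stopped first.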

xla145 commented 4 months ago

The log contains no error; it only reports code -9.

jingsongliujing commented 4 months ago

Code -9 just means the process exited abnormally. To see the real log, you need to cd into /root/workspace/paddleocr_train/log and look there.
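As background on the number itself: when a child process is started through Python's `subprocess` module on POSIX, a negative return code -N means the child was terminated by signal N, so -9 corresponds to SIGKILL. The sketch below decodes such codes; it assumes the launcher reports the raw `subprocess` return code, and `describe_exit_code` is a hypothetical helper name.

```python
import signal


def describe_exit_code(code):
    """Describe a subprocess-style return code (hypothetical helper)."""
    if code < 0:
        try:
            # On POSIX, -N means "terminated by signal N".
            name = signal.Signals(-code).name
        except ValueError:
            # Signal number not defined on this platform.
            name = f"signal {-code}"
        return f"killed by {name}"
    if code == 0:
        return "exited normally"
    return f"exited with status {code}"


if __name__ == "__main__":
    for code in (0, 1, -9, -15):
        print(code, "->", describe_exit_code(code))
```

SIGKILL cannot be caught by the process, which would explain why the worker log has no traceback. One common sender is the kernel or container runtime when a memory limit is exceeded; that would be consistent with the workaround of lowering num_workers and batch_size_per_card, but it is not confirmed anywhere in this thread.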