PaddlePaddle / PaddleOCR

Awesome multilingual OCR toolkits based on PaddlePaddle (practical ultra lightweight OCR system, support 80+ languages recognition, provide data annotation and synthesis tools, support training and deployment among server, mobile, embedded and IoT devices)
https://paddlepaddle.github.io/PaddleOCR/
Apache License 2.0

SVTRv2 fine-tuning fails with an exception after running for a while #13864

Open kerry-weic opened 1 week ago

kerry-weic commented 1 week ago

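🐛 Bug (Problem description)

I am fine-tuning an OCR model with the latest SVTRv2 on two 4090 GPUs with CUDA 12.4. Training was set to 100 epochs in total, and an exception occurred at epoch 84. The error log is as follows: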

[2024/09/13 15:19:24] ppocr INFO: epoch: [84/100], global_step: 1637760, lr: 0.000005, CTCLoss: 0.004494, NRTRLoss: 1.212208, loss: 1.216570, avg_reader_cost: 0.00110 s, avg_batch_cost: 0.12387 s, avg_samples: 32.5, ips: 262.36459 samples/s, eta: 11:15:20, max_mem_reserved: 22274 MB, max_mem_allocated: 14044 MB
[2024/09/13 15:19:26] ppocr INFO: epoch: [84/100], global_step: 1637770, lr: 0.000005, CTCLoss: 0.004494, NRTRLoss: 1.212296, loss: 1.217111, avg_reader_cost: 0.00120 s, avg_batch_cost: 0.12292 s, avg_samples: 34.0, ips: 276.59449 samples/s, eta: 11:15:19, max_mem_reserved: 22274 MB, max_mem_allocated: 14044 MB
[2024/09/13 15:19:28] ppocr INFO: epoch: [84/100], global_step: 1637780, lr: 0.000005, CTCLoss: 0.004444, NRTRLoss: 1.212281, loss: 1.216786, avg_reader_cost: 0.00110 s, avg_batch_cost: 0.12285 s, avg_samples: 34.0, ips: 276.76621 samples/s, eta: 11:15:17, max_mem_reserved: 22274 MB, max_mem_allocated: 14044 MB
LAUNCH INFO 2024-09-13 15:24:17,583 Pod failed
LAUNCH ERROR 2024-09-13 15:24:17,589 Container failed !!!
Container rank 0 status failed cmd ['/usr/local/bin/python3', '-u', 'tools/train.py', '-c', 'configs/rec/SVTRv2/rec_svtrv2_ch.yml', '-o', 'Global.pretrained_model=./pretrained_model/openatom_rec_svtrv2_ch_train/best_accuracy'] code 1 log log/workerlog.0 
env {'HOSTNAME': 'yfb235', 'TERM': 'xterm', 'OLDPWD': '/', 'LS_COLORS': 'rs=0:di=01;34:ln=01;36:mh=00:pi=40;33:so=01;35:do=01;35:bd=40;33;01:cd=40;33;01:or=40;31;01:mi=01;05;37;41:su=37;41:sg=30;43:ca=30;41:tw=30;42:ow=34;42:st=37;44:ex=01;32:*.tar=01;31:*.tgz=01;31:*.arc=01;31:*.arj=01;31:*.taz=01;31:*.lha=01;31:*.lz4=01;31:*.lzh=01;31:*.lzma=01;31:*.tlz=01;31:*.txz=01;31:*.tzo=01;31:*.t7z=01;31:*.zip=01;31:*.z=01;31:*.Z=01;31:*.dz=01;31:*.gz=01;31:*.lrz=01;31:*.lz=01;31:*.lzo=01;31:*.xz=01;31:*.bz2=01;31:*.bz=01;31:*.tbz=01;31:*.tbz2=01;31:*.tz=01;31:*.deb=01;31:*.rpm=01;31:*.jar=01;31:*.war=01;31:*.ear=01;31:*.sar=01;31:*.rar=01;31:*.alz=01;31:*.ace=01;31:*.zoo=01;31:*.cpio=01;31:*.7z=01;31:*.rz=01;31:*.cab=01;31:*.jpg=01;35:*.jpeg=01;35:*.gif=01;35:*.bmp=01;35:*.pbm=01;35:*.pgm=01;35:*.ppm=01;35:*.tga=01;35:*.xbm=01;35:*.xpm=01;35:*.tif=01;35:*.tiff=01;35:*.png=01;35:*.svg=01;35:*.svgz=01;35:*.mng=01;35:*.pcx=01;35:*.mov=01;35:*.mpg=01;35:*.mpeg=01;35:*.m2v=01;35:*.mkv=01;35:*.webm=01;35:*.ogm=01;35:*.mp4=01;35:*.m4v=01;35:*.mp4v=01;35:*.vob=01;35:*.qt=01;35:*.nuv=01;35:*.wmv=01;35:*.asf=01;35:*.rm=01;35:*.rmvb=01;35:*.flc=01;35:*.avi=01;35:*.fli=01;35:*.flv=01;35:*.gl=01;35:*.dl=01;35:*.xcf=01;35:*.xwd=01;35:*.yuv=01;35:*.cgm=01;35:*.emf=01;35:*.axv=01;35:*.anx=01;35:*.ogv=01;35:*.ogx=01;35:*.aac=01;36:*.au=01;36:*.flac=01;36:*.mid=01;36:*.midi=01;36:*.mka=01;36:*.mp3=01;36:*.mpc=01;36:*.ogg=01;36:*.ra=01;36:*.wav=01;36:*.axa=01;36:*.oga=01;36:*.spx=01;36:*.xspf=01;36:', 'PATH': '/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin', 'PWD': '/home/ocr_v4/PaddleOCR', 'SHLVL': '1', 'HOME': '/root', 'LESSOPEN': '||/usr/bin/lesspipe.sh %s', '_': '/usr/bin/nohup', 'CUSTOM_DEVICE_ROOT': '', 'OMP_NUM_THREADS': '1', 'LD_LIBRARY_PATH': '/usr/local/lib/python3.8/site-packages/cv2/../../lib64:', 'POD_NAME': 'imphwz', 'PADDLE_MASTER': '192.168.1.235:40816', 'PADDLE_GLOBAL_SIZE': '2', 'PADDLE_LOCAL_SIZE': '2', 'PADDLE_GLOBAL_RANK': '0', 'PADDLE_LOCAL_RANK': '0', 'PADDLE_NNODES': '1', 'PADDLE_CURRENT_ENDPOINT': '192.168.1.235:40817', 'PADDLE_TRAINER_ID': '0', 'PADDLE_TRAINERS_NUM': '2', 'PADDLE_RANK_IN_NODE': '0', 'PADDLE_TRAINER_ENDPOINTS': '192.168.1.235:40817,192.168.1.235:40818', 'FLAGS_selected_gpus': '0', 'PADDLE_LOG_DIR': '/home/ocr_v4/PaddleOCR/log'}
LAUNCH INFO 2024-09-13 15:24:17,589 ------------------------- ERROR LOG DETAIL -------------------------
LAUNCH INFO 2024-09-13 15:24:22,105 Exit code 1
[2024/09/13 15:19:30] ppocr INFO: epoch: [84/100], global_step: 1637790, lr: 0.000005, CTCLoss: 0.004708, NRTRLoss: 1.212157, loss: 1.217187, avg_reader_cost: 0.00123 s, avg_batch_cost: 0.12193 s, avg_samples: 41.5, ips: 340.36747 samples/s, eta: 11:15:16, max_mem_reserved: 22274 MB, max_mem_allocated: 14044 MB
[2024/09/13 15:19:32] ppocr INFO: epoch: [84/100], global_step: 1637800, lr: 0.000005, CTCLoss: 0.003930, NRTRLoss: 1.212085, loss: 1.217250, avg_reader_cost: 0.00099 s, avg_batch_cost: 0.12466 s, avg_samples: 31.0, ips: 248.66681 samples/s, eta: 11:15:15, max_mem_reserved: 22274 MB, max_mem_allocated: 14044 MB
[2024/09/13 15:19:34] ppocr INFO: epoch: [84/100], global_step: 1637810, lr: 0.000005, CTCLoss: 0.002831, NRTRLoss: 1.212209, loss: 1.214916, avg_reader_cost: 0.00098 s, avg_batch_cost: 0.12473 s, avg_samples: 37.0, ips: 296.64451 samples/s, eta: 11:15:13, max_mem_reserved: 22274 MB, max_mem_allocated: 14044 MB
[2024/09/13 15:19:36] ppocr INFO: epoch: [84/100], global_step: 1637820, lr: 0.000005, CTCLoss: 0.002747, NRTRLoss: 1.211871, loss: 1.214335, avg_reader_cost: 0.00071 s, avg_batch_cost: 0.12376 s, avg_samples: 32.5, ips: 262.61120 samples/s, eta: 11:15:12, max_mem_reserved: 22274 MB, max_mem_allocated: 14044 MB
[2024/09/13 15:19:38] ppocr INFO: epoch: [84/100], global_step: 1637830, lr: 0.000005, CTCLoss: 0.002181, NRTRLoss: 1.211892, loss: 1.214002, avg_reader_cost: 0.00010 s, avg_batch_cost: 0.11974 s, avg_samples: 34.0, ips: 283.94729 samples/s, eta: 11:15:11, max_mem_reserved: 22274 MB, max_mem_allocated: 14044 MB
[2024/09/13 15:19:39] ppocr INFO: epoch: [84/100], global_step: 1637832, lr: 0.000005, CTCLoss: 0.002181, NRTRLoss: 1.211888, loss: 1.213867, avg_reader_cost: 0.00002 s, avg_batch_cost: 0.02332 s, avg_samples: 7.1, ips: 304.47423 samples/s, eta: 11:15:11, max_mem_reserved: 22274 MB, max_mem_allocated: 14044 MB
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/paddle/io/dataloader/dataloader_iter.py", line 826, in __next__
    self._reader.read_next_list()[0]
StopIteration

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "tools/train.py", line 255, in <module>
    main(config, device, logger, vdl_writer, seed)
  File "tools/train.py", line 208, in main
    program.train(
  File "/home/ocr_v4/PaddleOCR/tools/program.py", line 305, in train
    for idx, batch in enumerate(train_dataloader):
  File "/usr/local/lib/python3.8/site-packages/paddle/io/dataloader/dataloader_iter.py", line 852, in __next__
    self._try_shutdown_all()
  File "/usr/local/lib/python3.8/site-packages/paddle/io/dataloader/dataloader_iter.py", line 585, in _try_shutdown_all
    w.join(timeout)
  File "/usr/local/lib/python3.8/multiprocessing/process.py", line 149, in join
    res = self._popen.wait(timeout)
  File "/usr/local/lib/python3.8/multiprocessing/popen_fork.py", line 47, in wait
    return self.poll(os.WNOHANG if timeout == 0.0 else 0)
  File "/usr/local/lib/python3.8/multiprocessing/popen_fork.py", line 27, in poll
    pid, sts = os.waitpid(self.pid, flag)
  File "/usr/local/lib/python3.8/site-packages/paddle/io/multiprocess_utils.py", line 133, in __handler__
    core._throw_error_if_process_failed()
SystemError: (Fatal) DataLoader process (pid 31227) exited is killed by signal: Killed. (at ../paddle/fluid/imperative/data_loader.cc:189)

I0913 15:24:03.993690 63261 process_group_nccl.cc:132] ProcessGroupNCCL destruct 
I0913 15:24:14.507228 63309 tcp_store.cc:289] receive shutdown event and so quit from MasterDaemon run loop
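
For long worker logs like the one above, it can help to pull out just the fatal DataLoader line. The following is a minimal sketch, not part of PaddleOCR or Paddle: the function name `find_killed_workers` is hypothetical, and the regular expression is copied from the wording of the error message in this particular log, so other Paddle versions may phrase it differently. The signal text "Killed" is the standard description of SIGKILL on Linux.

```python
import re

# Pattern taken from the fatal message as it appears in this log
# (assumption: other Paddle versions may word the message differently).
_KILLED_WORKER = re.compile(
    r"DataLoader process \(pid (\d+)\) exited is killed by signal: (\w+)\."
)


def find_killed_workers(log_text):
    """Return (pid, signal_text) for each DataLoader worker reported as killed."""
    return [(int(pid), sig) for pid, sig in _KILLED_WORKER.findall(log_text)]


# Sample line copied verbatim from the log above.
sample = (
    "SystemError: (Fatal) DataLoader process (pid 31227) exited is killed "
    "by signal: Killed. (at ../paddle/fluid/imperative/data_loader.cc:189)"
)
killed = find_killed_workers(sample)
```

To scan a real run, the same function could be applied to the contents of the worker log named in the launcher output (`log/workerlog.0`), for example `find_killed_workers(open("log/workerlog.0").read())`.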

🏃‍♂️ Environment (Runtime environment)

PaddleOCR main branch code (Tue Aug 20 15:45:57 2024); last pulled git commit id: 1752c56cb75cec5ba12a80e028461efacaddc314
paddlepaddle-gpu       2.6.1.post117

🌰 Minimal Reproducible Example (Minimal demo that reproduces the problem)

python3 -m paddle.distributed.launch --gpus '0,1' tools/train.py -c configs/rec/SVTRv2/rec_svtrv2_ch.yml -o Global.pretrained_model=./pretrained_model/openatom_rec_svtrv2_ch_train/best_accuracy

jingsongliujing commented 1 week ago

I suggest trying single-GPU training. It works fine for me on a single GPU.

kerry-weic commented 1 week ago

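Dual-GPU training worked fine before with a few hundred thousand samples; this time the dataset is over one million. Still, I don't think the data volume is the cause. It may just be an intermittent problem.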

Topdu commented 1 week ago

This looks like an inter-GPU NCCL communication problem. For multi-GPU 4090 setups, you generally need to add the environment variable NCCL_P2P_DISABLE=1. The full command is:
NCCL_P2P_DISABLE=1 CUDA_VISIBLE_DEVICES=0,1 python3 -m paddle.distributed.launch --gpus '0,1' tools/train.py -c configs/rec/SVTRv2/rec_svtrv2_ch.yml -o Global.pretrained_model=./pretrained_model/openatom_rec_svtrv2_ch_train/best_accuracy
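
If the training job is started from a Python wrapper rather than a shell, the same two environment variables can be passed through `subprocess`. This is only a sketch under that assumption: `build_launch` is a hypothetical helper, not a PaddleOCR API, and it simply reassembles the command quoted in this comment. `NCCL_P2P_DISABLE` and `CUDA_VISIBLE_DEVICES` are the variables named above; the config and model paths are the ones from this issue.

```python
import os


def build_launch(gpus="0,1", base_env=None):
    """Return (cmd, env) for the suggested launch (hypothetical helper)."""
    # Copy the environment so the caller's os.environ is not modified.
    env = dict(os.environ if base_env is None else base_env)
    env["NCCL_P2P_DISABLE"] = "1"      # disable NCCL peer-to-peer transport
    env["CUDA_VISIBLE_DEVICES"] = gpus
    cmd = [
        "python3", "-m", "paddle.distributed.launch",
        "--gpus", gpus,
        "tools/train.py",
        "-c", "configs/rec/SVTRv2/rec_svtrv2_ch.yml",
        "-o",
        "Global.pretrained_model="
        "./pretrained_model/openatom_rec_svtrv2_ch_train/best_accuracy",
    ]
    return cmd, env


cmd, env = build_launch(base_env={})
# To actually start training from the PaddleOCR checkout:
#   import subprocess
#   subprocess.run(cmd, env=build_launch()[1], check=True)
```

Building the command as a list avoids shell quoting issues around `'0,1'`, and copying the environment keeps the setting scoped to the training process.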

kerry-weic commented 1 week ago

I'll give it a try, thanks!