Closed cqray1990 closed 1 year ago
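Please post the log from card 1 at the moment it exited.
@WenmuZhou Do you mean the log from a single card when I force-quit with Ctrl+C?
@WenmuZhou I don't quite follow. The single-card training log is already posted in this issue. Could you describe the exact steps?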
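One possible reading of the request is that it refers to the per-process log of the rank 1 trainer in the two-card run, not to a separate single-card run. The sketch below assumes the job was started with `paddle.distributed.launch` and that the launcher wrote one file per rank as `workerlog.<rank>` under a `log` directory. Both the directory and the file name are assumptions about this setup and depend on the `--log_dir` value actually used. The helper name `tail_worker_log` is hypothetical.

```python
from collections import deque
from pathlib import Path


def tail_worker_log(log_dir, rank, n=50):
    """Return the last n lines of <log_dir>/workerlog.<rank> as a list of strings."""
    # Assumed layout: one log file per trainer process, named workerlog.<rank>.
    path = Path(log_dir) / f"workerlog.{rank}"
    with path.open("r", encoding="utf-8", errors="replace") as f:
        # deque with maxlen keeps only the final n lines while streaming the file.
        return [line.rstrip("\n") for line in deque(f, maxlen=n)]


if __name__ == "__main__":
    log_path = Path("log") / "workerlog.1"  # assumed default location
    if log_path.exists():
        print("\n".join(tail_worker_log("log", rank=1)))
    else:
        print(f"{log_path} not found; check the --log_dir passed to the launcher")
```

The last lines of that file would normally hold the traceback or signal message for the aborted process, which the rank 0 output does not show.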
This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 7 days if no further activity occurs. Thank you for your contributions.
CUDA 10.0, cuDNN 7, 2080 Ti
Docker image: 2.0.1-gpu-cuda10.0-cudnn7

Two cards:
grep: warning: GREP_OPTIONS is deprecated; please use an alias or script
[2021/08/14 02:13:36] root INFO: Architecture :
[2021/08/14 02:13:36] root INFO: Backbone :
[2021/08/14 02:13:36] root INFO: disable_se : True
[2021/08/14 02:13:36] root INFO: model_name : small
[2021/08/14 02:13:36] root INFO: name : MobileNetV3
[2021/08/14 02:13:36] root INFO: scale : 1.0
[2021/08/14 02:13:36] root INFO: Head :
[2021/08/14 02:13:36] root INFO: hidden_size : 256
[2021/08/14 02:13:36] root INFO: l2_decay : 1e-05
[2021/08/14 02:13:36] root INFO: loc_type : 2
[2021/08/14 02:13:36] root INFO: name : TableAttentionHead
[2021/08/14 02:13:36] root INFO: algorithm : TableAttn
[2021/08/14 02:13:36] root INFO: model_type : table
[2021/08/14 02:13:36] root INFO: Eval :
[2021/08/14 02:13:36] root INFO: dataset :
[2021/08/14 02:13:36] root INFO: data_dir : /paddle/data/val
[2021/08/14 02:13:36] root INFO: label_file_path : /paddle/data/PubTabNet_val.jsonl
[2021/08/14 02:13:36] root INFO: name : PubTabDataSet
[2021/08/14 02:13:36] root INFO: transforms :
[2021/08/14 02:13:36] root INFO: DecodeImage :
[2021/08/14 02:13:36] root INFO: channel_first : False
[2021/08/14 02:13:36] root INFO: img_mode : BGR
[2021/08/14 02:13:36] root INFO: ResizeTableImage :
[2021/08/14 02:13:36] root INFO: max_len : 488
[2021/08/14 02:13:36] root INFO: TableLabelEncode : None
[2021/08/14 02:13:36] root INFO: NormalizeImage :
[2021/08/14 02:13:36] root INFO: mean : [0.485, 0.456, 0.406]
[2021/08/14 02:13:36] root INFO: order : hwc
[2021/08/14 02:13:36] root INFO: scale : 1./255.
[2021/08/14 02:13:36] root INFO: std : [0.229, 0.224, 0.225]
[2021/08/14 02:13:36] root INFO: PaddingTableImage : None
[2021/08/14 02:13:36] root INFO: ToCHWImage : None
[2021/08/14 02:13:36] root INFO: KeepKeys :
[2021/08/14 02:13:36] root INFO: keep_keys : ['image', 'structure', 'bbox_list', 'sp_tokens', 'bbox_list_mask']
[2021/08/14 02:13:36] root INFO: loader :
[2021/08/14 02:13:36] root INFO: batch_size_per_card : 4
[2021/08/14 02:13:36] root INFO: drop_last : False
[2021/08/14 02:13:36] root INFO: num_workers : 0
[2021/08/14 02:13:36] root INFO: shuffle : False
[2021/08/14 02:13:36] root INFO: use_shared_memory : True
[2021/08/14 02:13:36] root INFO: Global :
[2021/08/14 02:13:36] root INFO: cal_metric_during_train : True
[2021/08/14 02:13:36] root INFO: character_dict_path : /paddle/PaddleOCR/ppocr/utils/dict/table_structure_dict.txt
[2021/08/14 02:13:36] root INFO: character_type : en
[2021/08/14 02:13:36] root INFO: checkpoints : None
[2021/08/14 02:13:36] root INFO: debug : False
[2021/08/14 02:13:36] root INFO: distributed : True
[2021/08/14 02:13:36] root INFO: epoch_num : 50
[2021/08/14 02:13:36] root INFO: eval_batch_step : [0, 20000]
[2021/08/14 02:13:36] root INFO: infer_img : doc/imgs_words/ch/word_1.jpg
[2021/08/14 02:13:36] root INFO: infer_mode : False
[2021/08/14 02:13:36] root INFO: log_smooth_window : 20
[2021/08/14 02:13:36] root INFO: max_cell_num : 500
[2021/08/14 02:13:36] root INFO: max_elem_length : 500
[2021/08/14 02:13:36] root INFO: max_text_length : 100
[2021/08/14 02:13:36] root INFO: pretrained_model : None
[2021/08/14 02:13:36] root INFO: print_batch_step : 5
[2021/08/14 02:13:36] root INFO: process_cut_num : 0
[2021/08/14 02:13:36] root INFO: process_total_num : 0
[2021/08/14 02:13:36] root INFO: save_epoch_step : 1
[2021/08/14 02:13:36] root INFO: save_inference_dir : None
[2021/08/14 02:13:36] root INFO: save_model_dir : ./output/table_mv3/
[2021/08/14 02:13:36] root INFO: use_gpu : True
[2021/08/14 02:13:36] root INFO: use_visualdl : False
[2021/08/14 02:13:36] root INFO: Loss :
[2021/08/14 02:13:36] root INFO: loc_weight : 10000.0
[2021/08/14 02:13:36] root INFO: name : TableAttentionLoss
[2021/08/14 02:13:36] root INFO: structure_weight : 100.0
[2021/08/14 02:13:36] root INFO: Metric :
[2021/08/14 02:13:36] root INFO: main_indicator : acc
[2021/08/14 02:13:36] root INFO: name : TableMetric
[2021/08/14 02:13:36] root INFO: Optimizer :
[2021/08/14 02:13:36] root INFO: beta1 : 0.9
[2021/08/14 02:13:36] root INFO: beta2 : 0.999
[2021/08/14 02:13:36] root INFO: clip_norm : 5.0
[2021/08/14 02:13:36] root INFO: lr :
[2021/08/14 02:13:36] root INFO: learning_rate : 0.001
[2021/08/14 02:13:36] root INFO: name : Adam
[2021/08/14 02:13:36] root INFO: regularizer :
[2021/08/14 02:13:36] root INFO: factor : 0.0
[2021/08/14 02:13:36] root INFO: name : L2
[2021/08/14 02:13:36] root INFO: PostProcess :
[2021/08/14 02:13:36] root INFO: name : TableLabelDecode
[2021/08/14 02:13:36] root INFO: Train :
[2021/08/14 02:13:36] root INFO: dataset :
[2021/08/14 02:13:36] root INFO: data_dir : /paddle/data/train
[2021/08/14 02:13:36] root INFO: label_file_path : /paddle/data/PubTabNet_train.jsonl
[2021/08/14 02:13:36] root INFO: name : PubTabDataSet
[2021/08/14 02:13:36] root INFO: transforms :
[2021/08/14 02:13:36] root INFO: DecodeImage :
[2021/08/14 02:13:36] root INFO: channel_first : False
[2021/08/14 02:13:36] root INFO: img_mode : BGR
[2021/08/14 02:13:36] root INFO: ResizeTableImage :
[2021/08/14 02:13:36] root INFO: max_len : 488
[2021/08/14 02:13:36] root INFO: TableLabelEncode : None
[2021/08/14 02:13:36] root INFO: NormalizeImage :
[2021/08/14 02:13:36] root INFO: mean : [0.485, 0.456, 0.406]
[2021/08/14 02:13:36] root INFO: order : hwc
[2021/08/14 02:13:36] root INFO: scale : 1./255.
[2021/08/14 02:13:36] root INFO: std : [0.229, 0.224, 0.225]
[2021/08/14 02:13:36] root INFO: PaddingTableImage : None
[2021/08/14 02:13:36] root INFO: ToCHWImage : None
[2021/08/14 02:13:36] root INFO: KeepKeys :
[2021/08/14 02:13:36] root INFO: keep_keys : ['image', 'structure', 'bbox_list', 'sp_tokens', 'bbox_list_mask']
[2021/08/14 02:13:36] root INFO: loader :
[2021/08/14 02:13:36] root INFO: batch_size_per_card : 8
[2021/08/14 02:13:36] root INFO: drop_last : True
[2021/08/14 02:13:36] root INFO: num_workers : 0
[2021/08/14 02:13:36] root INFO: shuffle : True
[2021/08/14 02:13:36] root INFO: use_shared_memory : True
[2021/08/14 02:13:36] root INFO: train with paddle 2.0.0 and device CUDAPlace(0)
I0814 02:13:36.571619 23248 nccl_context.cc:189] init nccl context nranks: 2 local rank: 0 gpu id: 0 ring id: 0
W0814 02:13:36.804096 23248 device_context.cc:362] Please NOTE: device: 0, GPU Compute Capability: 7.5, Driver API Version: 10.2, Runtime API Version: 10.2
W0814 02:13:36.837823 23248 device_context.cc:372] device: 0, cuDNN Version: 7.6.
[2021/08/14 02:13:39] root INFO: Initialize indexs of datasets:/paddle/data/PubTabNet_train.jsonl
[2021/08/14 02:14:12] root INFO: Initialize indexs of datasets:/paddle/data/PubTabNet_val.jsonl
[2021/08/14 02:14:13] root INFO: train dataloader has 31298 iters
[2021/08/14 02:14:13] root INFO: valid dataloader has 2279 iters
[2021/08/14 02:14:13] root INFO: During the training process, after the 0th iteration, an evaluation is run every 20000 iterations
[2021/08/14 02:14:13] root INFO: Initialize indexs of datasets:/paddle/data/PubTabNet_train.jsonl
INFO 2021-08-14 02:14:55,411 launch_utils.py:307] terminate all the procs
ERROR 2021-08-14 02:14:55,420 launch_utils.py:545] ABORT!!! Out of all 2 trainers, the trainer process with rank=[1] was aborted. Please check its log.
INFO 2021-08-14 02:14:58,421 launch_utils.py:307] terminate all the procs

λ lll-MS-7B22 /paddle/PaddleOCR {release/2.2} df -h
Filesystem      Size  Used Avail Use% Mounted on
overlay         893G  748G  101G  89% /
tmpfs            64M     0   64M   0% /dev
tmpfs           7.8G     0  7.8G   0% /sys/fs/cgroup
shm              64G     0   64G   0% /dev/shm
/dev/nvme0n1p4  893G  748G  101G  89% /paddle
/dev/sda1       1.9T  1.8T   88G  96% /paddle/data
tmpfs           7.8G   12K  7.8G   1% /proc/driver/nvidia
/dev/nvme0n1p1   37G   26G  9.9G  72% /usr/bin/nvidia-smi
udev            7.8G     0  7.8G   0% /dev/nvidia0
tmpfs           7.8G     0  7.8G   0% /proc/asound
tmpfs           7.8G     0  7.8G   0% /proc/acpi
tmpfs           7.8G     0  7.8G   0% /proc/scsi
tmpfs           7.8G     0  7.8G   0% /sys/firmware
One card:
grep: warning: GREP_OPTIONS is deprecated; please use an alias or script
[2021/08/14 02:23:15] root INFO: Architecture :
[2021/08/14 02:23:15] root INFO: Backbone :
[2021/08/14 02:23:15] root INFO: disable_se : True
[2021/08/14 02:23:15] root INFO: model_name : small
[2021/08/14 02:23:15] root INFO: name : MobileNetV3
[2021/08/14 02:23:15] root INFO: scale : 1.0
[2021/08/14 02:23:15] root INFO: Head :
[2021/08/14 02:23:15] root INFO: hidden_size : 256
[2021/08/14 02:23:15] root INFO: l2_decay : 1e-05
[2021/08/14 02:23:15] root INFO: loc_type : 2
[2021/08/14 02:23:15] root INFO: name : TableAttentionHead
[2021/08/14 02:23:15] root INFO: algorithm : TableAttn
[2021/08/14 02:23:15] root INFO: model_type : table
[2021/08/14 02:23:15] root INFO: Eval :
[2021/08/14 02:23:15] root INFO: dataset :
[2021/08/14 02:23:15] root INFO: data_dir : /paddle/data/val
[2021/08/14 02:23:15] root INFO: label_file_path : /paddle/data/PubTabNet_val.jsonl
[2021/08/14 02:23:15] root INFO: name : PubTabDataSet
[2021/08/14 02:23:15] root INFO: transforms :
[2021/08/14 02:23:15] root INFO: DecodeImage :
[2021/08/14 02:23:15] root INFO: channel_first : False
[2021/08/14 02:23:15] root INFO: img_mode : BGR
[2021/08/14 02:23:15] root INFO: ResizeTableImage :
[2021/08/14 02:23:15] root INFO: max_len : 488
[2021/08/14 02:23:15] root INFO: TableLabelEncode : None
[2021/08/14 02:23:15] root INFO: NormalizeImage :
[2021/08/14 02:23:15] root INFO: mean : [0.485, 0.456, 0.406]
[2021/08/14 02:23:15] root INFO: order : hwc
[2021/08/14 02:23:15] root INFO: scale : 1./255.
[2021/08/14 02:23:15] root INFO: std : [0.229, 0.224, 0.225]
[2021/08/14 02:23:15] root INFO: PaddingTableImage : None
[2021/08/14 02:23:15] root INFO: ToCHWImage : None
[2021/08/14 02:23:15] root INFO: KeepKeys :
[2021/08/14 02:23:15] root INFO: keep_keys : ['image', 'structure', 'bbox_list', 'sp_tokens', 'bbox_list_mask']
[2021/08/14 02:23:15] root INFO: loader :
[2021/08/14 02:23:15] root INFO: batch_size_per_card : 4
[2021/08/14 02:23:15] root INFO: drop_last : False
[2021/08/14 02:23:15] root INFO: num_workers : 0
[2021/08/14 02:23:15] root INFO: shuffle : False
[2021/08/14 02:23:15] root INFO: use_shared_memory : True
[2021/08/14 02:23:15] root INFO: Global :
[2021/08/14 02:23:15] root INFO: cal_metric_during_train : True
[2021/08/14 02:23:15] root INFO: character_dict_path : /paddle/PaddleOCR/ppocr/utils/dict/table_structure_dict.txt
[2021/08/14 02:23:15] root INFO: character_type : en
[2021/08/14 02:23:15] root INFO: checkpoints : None
[2021/08/14 02:23:15] root INFO: debug : False
[2021/08/14 02:23:15] root INFO: distributed : False
[2021/08/14 02:23:15] root INFO: epoch_num : 50
[2021/08/14 02:23:15] root INFO: eval_batch_step : [0, 20000]
[2021/08/14 02:23:15] root INFO: infer_img : doc/imgs_words/ch/word_1.jpg
[2021/08/14 02:23:15] root INFO: infer_mode : False
[2021/08/14 02:23:15] root INFO: log_smooth_window : 20
[2021/08/14 02:23:15] root INFO: max_cell_num : 500
[2021/08/14 02:23:15] root INFO: max_elem_length : 500
[2021/08/14 02:23:15] root INFO: max_text_length : 100
[2021/08/14 02:23:15] root INFO: pretrained_model : None
[2021/08/14 02:23:15] root INFO: print_batch_step : 5
[2021/08/14 02:23:15] root INFO: process_cut_num : 0
[2021/08/14 02:23:15] root INFO: process_total_num : 0
[2021/08/14 02:23:15] root INFO: save_epoch_step : 1
[2021/08/14 02:23:15] root INFO: save_inference_dir : None
[2021/08/14 02:23:15] root INFO: save_model_dir : ./output/table_mv3/
[2021/08/14 02:23:15] root INFO: use_gpu : True
[2021/08/14 02:23:15] root INFO: use_visualdl : False
[2021/08/14 02:23:15] root INFO: Loss :
[2021/08/14 02:23:15] root INFO: loc_weight : 10000.0
[2021/08/14 02:23:15] root INFO: name : TableAttentionLoss
[2021/08/14 02:23:15] root INFO: structure_weight : 100.0
[2021/08/14 02:23:15] root INFO: Metric :
[2021/08/14 02:23:15] root INFO: main_indicator : acc
[2021/08/14 02:23:15] root INFO: name : TableMetric
[2021/08/14 02:23:15] root INFO: Optimizer :
[2021/08/14 02:23:15] root INFO: beta1 : 0.9
[2021/08/14 02:23:15] root INFO: beta2 : 0.999
[2021/08/14 02:23:15] root INFO: clip_norm : 5.0
[2021/08/14 02:23:15] root INFO: lr :
[2021/08/14 02:23:15] root INFO: learning_rate : 0.001
[2021/08/14 02:23:15] root INFO: name : Adam
[2021/08/14 02:23:15] root INFO: regularizer :
[2021/08/14 02:23:15] root INFO: factor : 0.0
[2021/08/14 02:23:15] root INFO: name : L2
[2021/08/14 02:23:15] root INFO: PostProcess :
[2021/08/14 02:23:15] root INFO: name : TableLabelDecode
[2021/08/14 02:23:15] root INFO: Train :
[2021/08/14 02:23:15] root INFO: dataset :
[2021/08/14 02:23:15] root INFO: data_dir : /paddle/data/train
[2021/08/14 02:23:15] root INFO: label_file_path : /paddle/data/PubTabNet_train.jsonl
[2021/08/14 02:23:15] root INFO: name : PubTabDataSet
[2021/08/14 02:23:15] root INFO: transforms :
[2021/08/14 02:23:15] root INFO: DecodeImage :
[2021/08/14 02:23:15] root INFO: channel_first : False
[2021/08/14 02:23:15] root INFO: img_mode : BGR
[2021/08/14 02:23:15] root INFO: ResizeTableImage :
[2021/08/14 02:23:15] root INFO: max_len : 488
[2021/08/14 02:23:15] root INFO: TableLabelEncode : None
[2021/08/14 02:23:15] root INFO: NormalizeImage :
[2021/08/14 02:23:15] root INFO: mean : [0.485, 0.456, 0.406]
[2021/08/14 02:23:15] root INFO: order : hwc
[2021/08/14 02:23:15] root INFO: scale : 1./255.
[2021/08/14 02:23:15] root INFO: std : [0.229, 0.224, 0.225]
[2021/08/14 02:23:15] root INFO: PaddingTableImage : None
[2021/08/14 02:23:15] root INFO: ToCHWImage : None
[2021/08/14 02:23:15] root INFO: KeepKeys :
[2021/08/14 02:23:15] root INFO: keep_keys : ['image', 'structure', 'bbox_list', 'sp_tokens', 'bbox_list_mask']
[2021/08/14 02:23:15] root INFO: loader :
[2021/08/14 02:23:15] root INFO: batch_size_per_card : 8
[2021/08/14 02:23:15] root INFO: drop_last : True
[2021/08/14 02:23:15] root INFO: num_workers : 0
[2021/08/14 02:23:15] root INFO: shuffle : True
[2021/08/14 02:23:15] root INFO: use_shared_memory : True
[2021/08/14 02:23:15] root INFO: train with paddle 2.0.0 and device CUDAPlace(0)
[2021/08/14 02:23:15] root INFO: Initialize indexs of datasets:/paddle/data/PubTabNet_train.jsonl
[2021/08/14 02:23:42] root INFO: Initialize indexs of datasets:/paddle/data/PubTabNet_val.jsonl
W0814 02:23:43.277877 23439 device_context.cc:362] Please NOTE: device: 0, GPU Compute Capability: 7.5, Driver API Version: 10.2, Runtime API Version: 10.2
W0814 02:23:43.315845 23439 device_context.cc:372] device: 0, cuDNN Version: 7.6.
[2021/08/14 02:23:46] root INFO: train dataloader has 62597 iters
[2021/08/14 02:23:46] root INFO: valid dataloader has 2279 iters
[2021/08/14 02:23:46] root INFO: During the training process, after the 0th iteration, an evaluation is run every 20000 iterations
[2021/08/14 02:23:46] root INFO: Initialize indexs of datasets:/paddle/data/PubTabNet_train.jsonl
[2021/08/14 02:24:13] root INFO: epoch: [1/50], iter: 5, lr: 0.001000, loss: 328.004364, structure_loss: 204.734497, loc_loss: 113.760284, acc: 0.000000, reader_cost: 0.14664 s, batch_cost: 1.50999 s, samples: 48, ips: 6.35764
[2021/08/14 02:24:18] root INFO: epoch: [1/50], iter: 10, lr: 0.001000, loss: 247.173004, structure_loss: 141.424469, loc_loss: 109.553673, acc: 0.000000, reader_cost: 0.00007 s, batch_cost: 1.06152 s, samples: 40, ips: 7.53636
[2021/08/14 02:24:24] root INFO: epoch: [1/50], iter: 15, lr: 0.001000, loss: 225.967239, structure_loss: 120.528061, loc_loss: 100.567551, acc: 0.000000, reader_cost: 0.00009 s, batch_cost: 1.08256 s, samples: 40, ips: 7.38989
[2021/08/14 02:24:29] root INFO: epoch: [1/50], iter: 20, lr: 0.001000, loss: 196.118774, structure_loss: 110.934448, loc_loss: 80.627327, acc: 0.000000, reader_cost: 0.00008 s, batch_cost: 1.04725 s, samples: 40, ips: 7.63903
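
As a sanity check on the two logs, the dataloader sizes can be reproduced with a little arithmetic, which suggests that both runs indexed the same data and that the two-card run sharded it as expected before aborting. This is only a sketch: the sample counts (500,777 train, 9,115 val) are the published PubTabNet split sizes and are not stated anywhere in this thread, and the sharding rule (each card receives `ceil(N / num_cards)` samples) is an assumption about how the distributed sampler divides the data. The helper name `iters_per_epoch` is hypothetical.

```python
import math


def iters_per_epoch(num_samples, batch_size, num_cards=1, drop_last=True):
    """Estimate per-card iterations per epoch under the assumed sharding rule."""
    # Assumption: every card gets ceil(num_samples / num_cards) samples.
    per_card = math.ceil(num_samples / num_cards)
    if drop_last:
        # An incomplete final batch is discarded.
        return per_card // batch_size
    return math.ceil(per_card / batch_size)


TRAIN_SAMPLES = 500_777  # assumed PubTabNet train size, not taken from the logs
VAL_SAMPLES = 9_115      # assumed PubTabNet val size, not taken from the logs

# Train loader: batch_size_per_card 8, drop_last True (from the logged config).
print(iters_per_epoch(TRAIN_SAMPLES, 8, num_cards=1))  # 62597, matches the one-card log
print(iters_per_epoch(TRAIN_SAMPLES, 8, num_cards=2))  # 31298, matches the two-card log

# Eval loader: batch_size_per_card 4, drop_last False; both logs report 2279.
print(iters_per_epoch(VAL_SAMPLES, 4, drop_last=False))  # 2279
```

Under these assumptions the numbers agree with both logs, so the abort of rank 1 is unlikely to come from a mismatch in how the dataset was read or split.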