Closed lhangtk closed 3 years ago
With batch_size=8 the usage is already 80%+, so batch_size=16 will surely hit CUDA out of memory, right?
Running out of GPU memory raises an error; if utilization is maxed out, it just stays at 100%.
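The two numbers discussed above (GPU memory in use versus compute utilization) can be read separately. The following is a minimal sketch, assuming `nvidia-smi` is available on the PATH; the helper names `parse_gpu_stats` and `query_gpu_stats` are hypothetical and not part of PaddleOCR.

```python
import subprocess

# Fields requested from nvidia-smi: memory in MiB and compute utilization in percent.
QUERY = "memory.used,memory.total,utilization.gpu"


def parse_gpu_stats(csv_text):
    """Parse `nvidia-smi --format=csv,noheader,nounits` output, one dict per GPU."""
    stats = []
    for line in csv_text.strip().splitlines():
        used, total, util = (float(field) for field in line.split(","))
        stats.append({
            "memory_used_mib": used,
            "memory_total_mib": total,
            "memory_fraction": used / total,  # this is what runs out in an OOM
            "utilization_pct": util,          # this can sit at 100% without any error
        })
    return stats


def query_gpu_stats():
    """Run nvidia-smi once and return the parsed statistics."""
    result = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return parse_gpu_stats(result.stdout)
```

Calling `query_gpu_stats()` while training runs at batch size 8 would show whether the 80%+ figure refers to memory or to compute utilization.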
Quick question: have you solved this problem yet? I have run into the same situation.
Since you haven't replied for more than 3 months, we have closed this issue/pr. If the problem is not solved or there is a follow-up one, please reopen it at any time and we will continue to follow up. It is recommended to pull and try the latest code first.
batch_size_per_card can only be raised to 8; changing it to 16 causes an error. This is the GPU usage at 8.
The error is as follows:
```
def convert_to_list(value, n, name, dtype=np.int):
[2021/03/30 06:46:35] root INFO: Architecture :
[2021/03/30 06:46:35] root INFO: Backbone :
[2021/03/30 06:46:35] root INFO: layers : 34
[2021/03/30 06:46:35] root INFO: name : ResNet
[2021/03/30 06:46:35] root INFO: Head :
[2021/03/30 06:46:35] root INFO: fc_decay : 4e-05
[2021/03/30 06:46:35] root INFO: name : CTCHead
[2021/03/30 06:46:35] root INFO: Neck :
[2021/03/30 06:46:35] root INFO: encoder_type : rnn
[2021/03/30 06:46:35] root INFO: hidden_size : 128
[2021/03/30 06:46:35] root INFO: name : SequenceEncoder
[2021/03/30 06:46:35] root INFO: Transform : None
[2021/03/30 06:46:35] root INFO: algorithm : CRNN
[2021/03/30 06:46:35] root INFO: model_type : rec
[2021/03/30 06:46:35] root INFO: Eval :
[2021/03/30 06:46:35] root INFO: dataset :
[2021/03/30 06:46:35] root INFO: data_dir : ../dataset/rec/eval/
[2021/03/30 06:46:35] root INFO: label_file_list : ['../dataset/rec/eval/label.txt']
[2021/03/30 06:46:35] root INFO: label_name : label.txt
[2021/03/30 06:46:35] root INFO: name : SimpleDataSet
[2021/03/30 06:46:35] root INFO: transforms :
[2021/03/30 06:46:35] root INFO: DecodeImage :
[2021/03/30 06:46:35] root INFO: channel_first : False
[2021/03/30 06:46:35] root INFO: img_mode : BGR
[2021/03/30 06:46:35] root INFO: CTCLabelEncode : None
[2021/03/30 06:46:35] root INFO: RecResizeImg :
[2021/03/30 06:46:35] root INFO: image_shape : [3, 32, 320]
[2021/03/30 06:46:35] root INFO: KeepKeys :
[2021/03/30 06:46:35] root INFO: keep_keys : ['image', 'label', 'length']
[2021/03/30 06:46:35] root INFO: loader :
[2021/03/30 06:46:35] root INFO: batch_size_per_card : 8
[2021/03/30 06:46:35] root INFO: drop_last : False
[2021/03/30 06:46:35] root INFO: num_workers : 8
[2021/03/30 06:46:35] root INFO: shuffle : False
[2021/03/30 06:46:35] root INFO: Global :
[2021/03/30 06:46:35] root INFO: cal_metric_during_train : True
[2021/03/30 06:46:35] root INFO: character_dict_path : ../dataset/rec_set/dict.txt
[2021/03/30 06:46:35] root INFO: character_type : ch
[2021/03/30 06:46:35] root INFO: checkpoints : None
[2021/03/30 06:46:35] root INFO: debug : False
[2021/03/30 06:46:35] root INFO: distributed : False
[2021/03/30 06:46:35] root INFO: epoch_num : 100
[2021/03/30 06:46:35] root INFO: eval_batch_step : [0, 200000]
[2021/03/30 06:46:35] root INFO: infer_img : doc/imgs_words/ch/word_1.jpg
[2021/03/30 06:46:35] root INFO: infer_mode : False
[2021/03/30 06:46:35] root INFO: log_smooth_window : 20
[2021/03/30 06:46:35] root INFO: max_text_length : 25
[2021/03/30 06:46:35] root INFO: pretrained_model : None
[2021/03/30 06:46:35] root INFO: print_batch_step : 10
[2021/03/30 06:46:35] root INFO: save_epoch_step : 3
[2021/03/30 06:46:35] root INFO: save_inference_dir : None
[2021/03/30 06:46:35] root INFO: save_model_dir : ./output/rec_chinese_common_v2.0
[2021/03/30 06:46:35] root INFO: use_gpu : True
[2021/03/30 06:46:35] root INFO: use_space_char : True
[2021/03/30 06:46:35] root INFO: use_visualdl : False
[2021/03/30 06:46:35] root INFO: Loss :
[2021/03/30 06:46:35] root INFO: name : CTCLoss
[2021/03/30 06:46:35] root INFO: Metric :
[2021/03/30 06:46:35] root INFO: main_indicator : acc
[2021/03/30 06:46:35] root INFO: name : RecMetric
[2021/03/30 06:46:35] root INFO: Optimizer :
[2021/03/30 06:46:35] root INFO: beta1 : 0.9
[2021/03/30 06:46:35] root INFO: beta2 : 0.999
[2021/03/30 06:46:35] root INFO: lr :
[2021/03/30 06:46:35] root INFO: learning_rate : 0.001
[2021/03/30 06:46:35] root INFO: name : Cosine
[2021/03/30 06:46:35] root INFO: name : Adam
[2021/03/30 06:46:35] root INFO: regularizer :
[2021/03/30 06:46:35] root INFO: factor : 4e-05
[2021/03/30 06:46:35] root INFO: name : L2
[2021/03/30 06:46:35] root INFO: PostProcess :
[2021/03/30 06:46:35] root INFO: name : CTCLabelDecode
[2021/03/30 06:46:35] root INFO: Train :
[2021/03/30 06:46:35] root INFO: dataset :
[2021/03/30 06:46:35] root INFO: data_dir : ../dataset/rec/train/
[2021/03/30 06:46:35] root INFO: label_file_list : ['../dataset/rec/train/label.txt']
[2021/03/30 06:46:35] root INFO: label_name : label.txt
[2021/03/30 06:46:35] root INFO: name : SimpleDataSet
[2021/03/30 06:46:35] root INFO: transforms :
[2021/03/30 06:46:35] root INFO: DecodeImage :
[2021/03/30 06:46:35] root INFO: channel_first : False
[2021/03/30 06:46:35] root INFO: img_mode : BGR
[2021/03/30 06:46:35] root INFO: RecAug : None
[2021/03/30 06:46:35] root INFO: CTCLabelEncode : None
[2021/03/30 06:46:35] root INFO: RecResizeImg :
[2021/03/30 06:46:35] root INFO: image_shape : [3, 32, 320]
[2021/03/30 06:46:35] root INFO: KeepKeys :
[2021/03/30 06:46:35] root INFO: keep_keys : ['image', 'label', 'length']
[2021/03/30 06:46:35] root INFO: loader :
[2021/03/30 06:46:35] root INFO: batch_size_per_card : 16
[2021/03/30 06:46:35] root INFO: drop_last : True
[2021/03/30 06:46:35] root INFO: num_workers : 8
[2021/03/30 06:46:35] root INFO: shuffle : True
[2021/03/30 06:46:35] root INFO: train with paddle 2.0.0 and device CUDAPlace(0)
[2021/03/30 06:46:35] root INFO: Initialize indexs of datasets:['../dataset/rec/train/label.txt']
[2021/03/30 06:46:43] root INFO: Initialize indexs of datasets:['../dataset/rec/eval/label.txt']
3
W0330 06:46:43.302564 184 device_context.cc:362] Please NOTE: device: 0, GPU Compute Capability: 7.5, Driver API Version: 11.2, Runtime API Version: 10.2
W0330 06:46:43.305191 184 device_context.cc:372] device: 0, cuDNN Version: 8.0.
[2021/03/30 06:46:45] root INFO: train from scratch
[2021/03/30 06:46:45] root INFO: train dataloader has 277816 iters, valid dataloader has 11233 iters
[2021/03/30 06:46:45] root INFO: During the training process, after the 0th iteration, an evaluation is run every 200000 iterations
[2021/03/30 06:46:45] root INFO: Initialize indexs of datasets:['../dataset/rec/train/label.txt']
C++ Traceback (most recent call last):
0 paddle::framework::SignalHandle(char const*, int)
1 paddle::platform::GetCurrentTraceBackString[abi:cxx11]()
Error Message Summary:
FatalError: Segmentation fault is detected by the operating system.
[TimeInfo: Aborted at 1617086816 (unix time) try "date -d @1617086816" if you are using GNU date ]
[SignalInfo: SIGSEGV (@0x0) received by PID 184 (TID 0x7f93c3e3c740) from PID 0 ]
Segmentation fault (core dumped)
```
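The `[TimeInfo: ...]` line gives the abort time as a Unix timestamp. As a small sketch, it can be decoded in Python instead of with GNU `date`; comparing it with the last log line assumes that the log timestamps are in UTC, which the log itself does not state.

```python
from datetime import datetime, timezone

# Unix time copied from the [TimeInfo: ...] line of the crash report.
abort_unix_time = 1617086816
abort_time = datetime.fromtimestamp(abort_unix_time, tz=timezone.utc)
print(abort_time.isoformat())  # 2021-03-30T06:46:56+00:00

# Last "root INFO" line before the crash; treating it as UTC is an assumption.
last_log_time = datetime(2021, 3, 30, 6, 46, 45, tzinfo=timezone.utc)
gap_seconds = (abort_time - last_log_time).total_seconds()
print(gap_seconds)  # 11.0
```

Under that assumption, the segmentation fault came about 11 seconds after the final "Initialize indexs of datasets" message, before any training iteration was logged.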