PaddlePaddle / PaddleOCR

Awesome multilingual OCR toolkits based on PaddlePaddle (practical ultra lightweight OCR system, support 80+ languages recognition, provide data annotation and synthesis tools, support training and deployment among server, mobile, embedded and IoT devices)
Apache License 2.0
42.78k stars 7.69k forks

Floating point exception during training #8527

Closed sunzhaoyang closed 1 year ago

sunzhaoyang commented 1 year ago

Please provide the following information to quickly locate the problem


  use_gpu: True
  epoch_num: &epoch_num 130
  log_smooth_window: 10
  print_batch_step: 10
  save_model_dir: ./output/re_vi_layoutxlm_xfund_zh
  save_epoch_step: 2000
  # evaluation is run every 10 iterations after the 0th iteration
  eval_batch_step: [ 1, 19 ]
  cal_metric_during_train: False
  use_visualdl: False
  seed: 2022
  infer_img: ppstructure/docs/kie/input/zh_val_21.jpg
  save_res_path: ./output/re/xfund_zh/with_gt

  model_type: kie
  algorithm: &algorithm "LayoutXLM"
    name: LayoutXLMForRe
    pretrained: True
    mode: vi

  name: LossFromOutput
  key: loss
  reduction: mean

  name: AdamW
  beta1: 0.9
  beta2: 0.999
  clip_norm: 10
    learning_rate: 0.00005
    warmup_epoch: 10
    name: L2
    factor: 0.00000

  name: VQAReTokenLayoutLMPostProcess

  name: VQAReTokenMetric
  main_indicator: hmean

    name: SimpleDataSet
    data_dir: train_data/dataset1
      - train_data/dataset1/train.json
    ratio_list: [ 1.0 ]
      - DecodeImage: # load image
          img_mode: RGB
          channel_first: False
      - VQATokenLabelEncode: # Class handling label
          contains_re: True
          algorithm: *algorithm
          class_path: &class_path train_data/dataset1/predefined_classes.txt
          use_textline_bbox_info: &use_textline_bbox_info True
          order_method: &order_method "tb-yx"
      - VQATokenPad:
          max_seq_len: &max_seq_len 512
          return_attention_mask: True
      - VQAReTokenRelation:
      - VQAReTokenChunk:
          max_seq_len: *max_seq_len
      - TensorizeEntitiesRelations:
      - Resize:
          size: [200,200]
      - NormalizeImage:
          scale: 1
          mean: [ 123.675, 116.28, 103.53 ]
          std: [ 58.395, 57.12, 57.375 ]
          order: 'hwc'
      - ToCHWImage:
      - KeepKeys:
          keep_keys: [ 'input_ids', 'bbox','attention_mask', 'token_type_ids', 'entities', 'relations'] # dataloader will return list in this order
    shuffle: True
    drop_last: False
    batch_size_per_card: 1
    num_workers: 1

    name: SimpleDataSet
    data_dir: train_data/dataset1
      - train_data/dataset1/val.json
      - DecodeImage: # load image
          img_mode: RGB
          channel_first: False
      - VQATokenLabelEncode: # Class handling label
          contains_re: True
          algorithm: *algorithm
          class_path: *class_path
          use_textline_bbox_info: *use_textline_bbox_info
          order_method: *order_method
      - VQATokenPad:
          max_seq_len: *max_seq_len
          return_attention_mask: True
      - VQAReTokenRelation:
      - VQAReTokenChunk:
          max_seq_len: *max_seq_len
      - TensorizeEntitiesRelations:
      - Resize:
          size: [200,200]
      - NormalizeImage:
          scale: 1
          mean: [ 123.675, 116.28, 103.53 ]
          std: [ 58.395, 57.12, 57.375 ]
          order: 'hwc'
      - ToCHWImage:
      - KeepKeys:
          keep_keys: [ 'input_ids', 'bbox', 'attention_mask', 'token_type_ids', 'entities', 'relations'] # dataloader will return list in this order
    shuffle: False
    drop_last: False
    batch_size_per_card: 1
    num_workers: 1


W1203 18:02:34.348244 36846] Please NOTE: device: 0, GPU Compute Capability: 7.5, Driver API Version: 11.4, Runtime API Version: 10.2
W1203 18:02:34.350412 36846] device: 0, cuDNN Version: 7.6.
[2022/12/03 18:02:36] ppocr INFO: train dataloader has 70 iters
[2022/12/03 18:02:36] ppocr INFO: valid dataloader has 10 iters
[2022/12/03 18:02:36] ppocr INFO: During the training process, after the 1th iteration, an evaluation is run every 19 iterations
Floating point exception

If use_gpu is changed to False:

Traceback (most recent call last):
  File "tools/", line 208, in <module>
    main(config, device, logger, vdl_writer)
  File "tools/", line 183, in main
    amp_level, amp_custom_black_list)
  File "/srv/PaddleOCR/tools/", line 290, in train
    preds = model(batch)
  File "/root/miniconda3/envs/paddle/lib/python3.7/site-packages/paddle/fluid/dygraph/", line 948, in __call__
    return self.forward(*inputs, **kwargs)
  File "/srv/PaddleOCR/ppocr/modeling/architectures/", line 86, in forward
    x = self.backbone(x)
  File "/root/miniconda3/envs/paddle/lib/python3.7/site-packages/paddle/fluid/dygraph/", line 948, in __call__
    return self.forward(*inputs, **kwargs)
  File "/srv/PaddleOCR/ppocr/modeling/backbones/", line 237, in forward
  File "/root/miniconda3/envs/paddle/lib/python3.7/site-packages/paddle/fluid/dygraph/", line 948, in __call__
    return self.forward(*inputs, **kwargs)
  File "/root/miniconda3/envs/paddle/lib/python3.7/site-packages/paddlenlp/transformers/layoutxlm/", line 1559, in forward
  File "/root/miniconda3/envs/paddle/lib/python3.7/site-packages/paddle/fluid/dygraph/", line 948, in __call__
    return self.forward(*inputs, **kwargs)
  File "/root/miniconda3/envs/paddle/lib/python3.7/site-packages/paddlenlp/transformers/layoutxlm/", line 1425, in forward
    relations, entities = self.build_relation(relations, entities)
  File "/root/miniconda3/envs/paddle/lib/python3.7/site-packages/paddlenlp/transformers/layoutxlm/", line 1374, in build_relation
    axis=1).tile([1, len(positive_relations), 1])
  File "/root/miniconda3/envs/paddle/lib/python3.7/site-packages/paddle/tensor/", line 3147, in tile
    return _C_ops.tile(x, repeat_times)
ValueError: (InvalidArgument) Every element of the input 'repeat_times' for tile op must be greater than 0, but the value given is 0.
  [Hint: Expected repeat_times_data[i] > 0, but received repeat_times_data[i]:0 <= 0:0.] (at /paddle/paddle/phi/infermeta/


train/IMG_8500.JPG  [{"transcription": "OD", "points": [[85, 277], [192, 277], [192, 331], [85, 331]], "difficult": true, "id": 100, "linking": [[1, 100], [2, 100], [3, 100]], "label": "QUESTION"}, {"transcription": "22.35 mm", "points": [[160, 492], [276, 492], [276, 522], [160, 522]], "difficult": true, "id": 1, "linking": [[100, 1]], "label": "ANSWER"}, {"transcription": "41.51/43.55 D", "points": [[149, 891], [315, 891], [315, 916], [149, 916]], "difficult": true, "id": 2, "linking": [[100, 2]], "label": "ANSWER"}, {"transcription": "K:-2.09DX179", "points": [[61, 973], [261, 973], [261, 1006], [61, 1006]], "difficult": true, "id": 3, "linking": [[100, 3]], "label": "ANSWER"}, {"transcription": "os", "points": [[1137, 293], [1249, 293], [1249, 351], [1137, 351]], "difficult": true, "id": 4, "linking": [[4, 5], [4, 6], [4, 7]], "label": "QUESTION"}, {"transcription": "22.25 mm", "points": [[790, 498], [903, 498], [903, 528], [790, 528]], "difficult": true, "id": 5, "linking": [[4, 5]], "label": "ANSWER"}, {"transcription": "41.82/44.23 D", "points": [[763, 889], [914, 889], [914, 916], [763, 916]], "difficult": true, "id": 6, "linking": [[4, 6]], "label": "ANSWER"}, {"transcription": "K:-2.41DX11", "points": [[685, 971], [867, 971], [867, 1003], [685, 1003]], "difficult": true, "id": 7, "linking": [[4, 7]], "label": "ANSWER"}]
train/IMG_8502.JPG  [{"transcription": "OD", "points": [[165, 22], [277, 22], [277, 86], [165, 86]], "difficult": true, "id": 100, "linking": [[2, 100], [3, 100], [4, 100]], "label": "QUESTION"}, {"transcription": "os", "points": [[1080, 58], [1185, 58], [1185, 106], [1080, 106]], "difficult": true, "id": 1, "linking": [[1, 5], [1, 6], [1, 7]], "label": "QUESTION"}, {"transcription": "22.96 mm", "points": [[254, 210], [363, 210], [363, 241], [254, 241]], "difficult": true, "id": 2, "linking": [[100, 2]], "label": "ANSWER"}, {"transcription": "45.06/45.92 D", "points": [[178, 567], [337, 567], [337, 595], [178, 595]], "difficult": true, "id": 3, "linking": [[100, 3]], "label": "ANSWER"}, {"transcription": "D-0.92D@20", "points": [[88, 647], [295, 647], [295, 678], [88, 678]], "difficult": true, "id": 4, "linking": [[100, 4]], "label": "ANSWER"}, {"transcription": "23.16 mm", "points": [[804, 232], [910, 232], [910, 257], [804, 257]], "difficult": true, "id": 5, "linking": [[1, 5]], "label": "ANSWER"}, {"transcription": "44.64/45.86 D", "points": [[767, 571], [911, 571], [911, 595], [767, 595]], "difficult": true, "id": 6, "linking": [[1, 6]], "label": "ANSWER"}, {"transcription": "D:-1.22D@179", "points": [[692, 649], [880, 649], [880, 675], [692, 675]], "difficult": true, "id": 7, "linking": [[1, 7]], "label": "ANSWER"}]
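
The CPU traceback above fails in `build_relation` because `len(positive_relations)` is 0 when it is passed to `tile`, which suggests that at least one sample reaches the model without any usable relation. One thing that could be checked is whether every line of the label file contains at least one link between a QUESTION entity and an ANSWER entity. The following is a minimal sketch of such a check, not code from PaddleOCR or PaddleNLP. It assumes the label format shown above (image path, a tab, then a JSON list of entities), and the helper names `count_qa_links` and `samples_without_links` are hypothetical. Whether this matches the pipeline's own definition of a positive relation is an assumption; for instance, entities could also be dropped by other preprocessing steps.

```python
import json


def count_qa_links(entities):
    """Count distinct linking pairs that join a QUESTION entity to an ANSWER entity.

    Assumption: a usable relation is a pair of ids that both exist in the
    sample, one labelled QUESTION and the other ANSWER.
    """
    labels = {e["id"]: e["label"].upper() for e in entities}
    pairs = set()
    for e in entities:
        for a, b in e.get("linking", []):
            if a not in labels or b not in labels:
                continue  # link refers to an id that is not in this sample
            if {labels[a], labels[b]} == {"QUESTION", "ANSWER"}:
                pairs.add(tuple(sorted((a, b))))
    return len(pairs)


def samples_without_links(lines):
    """Return the image paths of label lines with no QUESTION-ANSWER link."""
    missing = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # Assumption: path and JSON payload are separated by a single tab.
        path, _, payload = line.partition("\t")
        if count_qa_links(json.loads(payload)) == 0:
            missing.append(path)
    return missing
```

To run it against a label file, open the file (for example `train_data/dataset1/train.json` from the config above) with `encoding="utf-8"` and pass the file object to `samples_without_links`; any path it returns is a sample worth inspecting by hand.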
MissPenguin commented 1 year ago


sunzhaoyang commented 1 year ago


Training with the XFUND dataset also fails, so is this an environment problem?

W1206 21:35:40.856792  1137] Please NOTE: device: 0, GPU Compute Capability: 7.5, Driver API Version: 11.4, Runtime API Version: 10.2
W1206 21:35:40.974639  1137] device: 0, cuDNN Version: 7.6.
[2022/12/06 21:35:49] ppocr INFO: train dataloader has 75 iters
[2022/12/06 21:35:49] ppocr INFO: valid dataloader has 7 iters
[2022/12/06 21:35:49] ppocr INFO: During the training process, after the 0th iteration, an evaluation is run every 19 iterations
MissPenguin commented 1 year ago


sunzhaoyang commented 1 year ago

@MissPenguin I have tried all sorts of things and really cannot find the problem... The versions of all the components look correct...

[2022/12/08 15:54:24] ppocr INFO: train with paddle 2.4.0 and device Place(gpu:0)
[2022/12/08 15:54:24] ppocr INFO: Initialize indexs of datasets:['train_data/XFUND/zh_train/train.json']
[2022-12-08 15:54:25,492] [    INFO] - Already cached /root/.paddlenlp/models/layoutxlm-base-uncased/sentencepiece.bpe.model
[2022-12-08 15:54:25,895] [    INFO] - tokenizer config file saved in /root/.paddlenlp/models/layoutxlm-base-uncased/tokenizer_config.json
[2022-12-08 15:54:25,895] [    INFO] - Special tokens file saved in /root/.paddlenlp/models/layoutxlm-base-uncased/special_tokens_map.json
[2022/12/08 15:54:25] ppocr INFO: Initialize indexs of datasets:['train_data/XFUND/zh_val/val.json']
[2022-12-08 15:54:25,896] [    INFO] - Already cached /root/.paddlenlp/models/layoutxlm-base-uncased/sentencepiece.bpe.model
[2022-12-08 15:54:26,284] [    INFO] - tokenizer config file saved in /root/.paddlenlp/models/layoutxlm-base-uncased/tokenizer_config.json
[2022-12-08 15:54:26,284] [    INFO] - Special tokens file saved in /root/.paddlenlp/models/layoutxlm-base-uncased/special_tokens_map.json
[2022-12-08 15:54:26,286] [    INFO] - Already cached /root/.paddlenlp/models/vi-layoutxlm-base-uncased/model_state.pdparams
W1208 15:54:26.287709 76091] Please NOTE: device: 0, GPU Compute Capability: 7.5, Driver API Version: 11.2, Runtime API Version: 11.2
W1208 15:54:26.289454 76091] device: 0, cuDNN Version: 8.1.

cuda: 11.2 driver: 460.80 (corresponds to CUDA 11.2) cudnn: cudnn-11.2-linux-x64-v8.1.0.77.tgz

Running verify PaddlePaddle program ...
W1208 15:58:59.182808 76181] Please NOTE: device: 0, GPU Compute Capability: 7.5, Driver API Version: 11.2, Runtime API Version: 11.2
W1208 15:58:59.185256 76181] device: 0, cuDNN Version: 8.1.
PaddlePaddle works well on 1 GPU.
PaddlePaddle works well on 1 GPUs.
PaddlePaddle is installed successfully! Let's start deep learning with PaddlePaddle now.
root@2070:/opt# nvidia-smi
Thu Dec  8 15:58:51 2022
| NVIDIA-SMI 460.80       Driver Version: 460.80       CUDA Version: 11.2     |
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  GeForce RTX 2070    Off  | 00000000:01:00.0 Off |                  N/A |
| 28%   49C    P0     1W / 175W |      0MiB /  7981MiB |      0%      Default |
|                               |                      |                  N/A |

| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|  No running processes found                                                 |
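
The logs in this thread come from two different setups: the earlier runs report Runtime API Version 10.2 with cuDNN 7.6, and the later one reports 11.2 with cuDNN 8.1. When comparing environments across runs, it may help to pull those values out of the log text automatically. The following is only a sketch that parses the two device lines as they appear above; `parse_versions` is a hypothetical helper, not part of PaddlePaddle, and it makes no judgment about which version combinations are supported.

```python
import re

# Patterns match the device lines as they appear in the logs above.
DEVICE_RE = re.compile(
    r"Driver API Version: (?P<driver>[\d.]+), "
    r"Runtime API Version: (?P<runtime>[\d.]+)"
)
CUDNN_RE = re.compile(r"cuDNN Version: (?P<cudnn>[\d.]+)")


def parse_versions(log_text):
    """Extract driver/runtime CUDA versions and the cuDNN version from log text.

    Returns a dict containing only the keys that were found.
    """
    info = {}
    match = DEVICE_RE.search(log_text)
    if match:
        info["driver_api"] = match.group("driver").rstrip(".")
        info["runtime_api"] = match.group("runtime").rstrip(".")
    match = CUDNN_RE.search(log_text)
    if match:
        # The log line ends with a period, so strip it from the version.
        info["cudnn"] = match.group("cudnn").rstrip(".")
    return info
```

Running this on the log of each attempt gives a small dictionary per run, which makes it easier to confirm that a reinstall actually changed the CUDA runtime and cuDNN that the process sees.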
ariefwijaya commented 1 year ago

@sunzhaoyang How did you solve the issue?

ghost commented 1 year ago

I am also getting the same output when I attempt to run RE training on Nvidia T4 on Ubuntu 20.04. Is there a workaround for this issue?