Floating point exception during training

sunzhaoyang commented 1 year ago

Please provide the following information to quickly locate the problem

System Environment: Debian
Version: Paddle: PaddleOCR: 2.6
Related components:
Command Code: python3 tools/train.py -c configs/kie/vi_layoutxlm/test.yml
Complete Error Message:

test.yml

Global:
  use_gpu: True
  epoch_num: &epoch_num 130
  log_smooth_window: 10
  print_batch_step: 10
  save_model_dir: ./output/re_vi_layoutxlm_xfund_zh
  save_epoch_step: 2000
  # evaluation is run every 19 iterations after the 1st iteration
  eval_batch_step: [ 1, 19 ]
  cal_metric_during_train: False
  save_inference_dir:
  use_visualdl: False
  seed: 2022
  infer_img: ppstructure/docs/kie/input/zh_val_21.jpg
  save_res_path: ./output/re/xfund_zh/with_gt
  kie_rec_model_dir:
  kie_det_model_dir:

Architecture:
  model_type: kie
  algorithm: &algorithm "LayoutXLM"
  Transform:
  Backbone:
    name: LayoutXLMForRe
    pretrained: True
    mode: vi
    checkpoints:

Loss:
  name: LossFromOutput
  key: loss
  reduction: mean

Optimizer:
  name: AdamW
  beta1: 0.9
  beta2: 0.999
  clip_norm: 10
  lr:
    learning_rate: 0.00005
    warmup_epoch: 10
  regularizer:
    name: L2
    factor: 0.00000

PostProcess:
  name: VQAReTokenLayoutLMPostProcess

Metric:
  name: VQAReTokenMetric
  main_indicator: hmean

Train:
  dataset:
    name: SimpleDataSet
    data_dir: train_data/dataset1
    label_file_list:
      - train_data/dataset1/train.json
    ratio_list: [ 1.0 ]
    transforms:
      - DecodeImage: # load image
          img_mode: RGB
          channel_first: False
      - VQATokenLabelEncode: # Class handling label
          contains_re: True
          algorithm: *algorithm
          class_path: &class_path train_data/dataset1/predefined_classes.txt
          use_textline_bbox_info: &use_textline_bbox_info True
          order_method: &order_method "tb-yx"
      - VQATokenPad:
          max_seq_len: &max_seq_len 512
          return_attention_mask: True
      - VQAReTokenRelation:
      - VQAReTokenChunk:
          max_seq_len: *max_seq_len
      - TensorizeEntitiesRelations:
      - Resize:
          size: [200,200]
      - NormalizeImage:
          scale: 1
          mean: [ 123.675, 116.28, 103.53 ]
          std: [ 58.395, 57.12, 57.375 ]
          order: 'hwc'
      - ToCHWImage:
      - KeepKeys:
          keep_keys: [ 'input_ids', 'bbox','attention_mask', 'token_type_ids', 'entities', 'relations'] # dataloader will return list in this order
  loader:
    shuffle: True
    drop_last: False
    batch_size_per_card: 1
    num_workers: 1

Eval:
  dataset:
    name: SimpleDataSet
    data_dir: train_data/dataset1
    label_file_list:
      - train_data/dataset1/val.json
    transforms:
      - DecodeImage: # load image
          img_mode: RGB
          channel_first: False
      - VQATokenLabelEncode: # Class handling label
          contains_re: True
          algorithm: *algorithm
          class_path: *class_path
          use_textline_bbox_info: *use_textline_bbox_info
          order_method: *order_method
      - VQATokenPad:
          max_seq_len: *max_seq_len
          return_attention_mask: True
      - VQAReTokenRelation:
      - VQAReTokenChunk:
          max_seq_len: *max_seq_len
      - TensorizeEntitiesRelations:
      - Resize:
          size: [200,200]
      - NormalizeImage:
          scale: 1
          mean: [ 123.675, 116.28, 103.53 ]
          std: [ 58.395, 57.12, 57.375 ]
          order: 'hwc'
      - ToCHWImage:
      - KeepKeys:
          keep_keys: [ 'input_ids', 'bbox', 'attention_mask', 'token_type_ids', 'entities', 'relations'] # dataloader will return list in this order
  loader:
    shuffle: False
    drop_last: False
    batch_size_per_card: 1
    num_workers: 1

Shortly after starting, the run fails with:

W1203 18:02:34.348244 36846 gpu_resources.cc:61] Please NOTE: device: 0, GPU Compute Capability: 7.5, Driver API Version: 11.4, Runtime API Version: 10.2
W1203 18:02:34.350412 36846 gpu_resources.cc:91] device: 0, cuDNN Version: 7.6.
[2022/12/03 18:02:36] ppocr INFO: train dataloader has 70 iters
[2022/12/03 18:02:36] ppocr INFO: valid dataloader has 10 iters
[2022/12/03 18:02:36] ppocr INFO: During the training process, after the 1th iteration, an evaluation is run every 19 iterations
Floating point exception

If I set use_gpu to False, I get this instead:

Traceback (most recent call last):
  File "tools/train.py", line 208, in <module>
    main(config, device, logger, vdl_writer)
  File "tools/train.py", line 183, in main
    amp_level, amp_custom_black_list)
  File "/srv/PaddleOCR/tools/program.py", line 290, in train
    preds = model(batch)
  File "/root/miniconda3/envs/paddle/lib/python3.7/site-packages/paddle/fluid/dygraph/layers.py", line 948, in __call__
    return self.forward(*inputs, **kwargs)
  File "/srv/PaddleOCR/ppocr/modeling/architectures/base_model.py", line 86, in forward
    x = self.backbone(x)
  File "/root/miniconda3/envs/paddle/lib/python3.7/site-packages/paddle/fluid/dygraph/layers.py", line 948, in __call__
    return self.forward(*inputs, **kwargs)
  File "/srv/PaddleOCR/ppocr/modeling/backbones/vqa_layoutlm.py", line 237, in forward
    relations=relations)
  File "/root/miniconda3/envs/paddle/lib/python3.7/site-packages/paddle/fluid/dygraph/layers.py", line 948, in __call__
    return self.forward(*inputs, **kwargs)
  File "/root/miniconda3/envs/paddle/lib/python3.7/site-packages/paddlenlp/transformers/layoutxlm/modeling.py", line 1559, in forward
    relations)
  File "/root/miniconda3/envs/paddle/lib/python3.7/site-packages/paddle/fluid/dygraph/layers.py", line 948, in __call__
    return self.forward(*inputs, **kwargs)
  File "/root/miniconda3/envs/paddle/lib/python3.7/site-packages/paddlenlp/transformers/layoutxlm/modeling.py", line 1425, in forward
    relations, entities = self.build_relation(relations, entities)
  File "/root/miniconda3/envs/paddle/lib/python3.7/site-packages/paddlenlp/transformers/layoutxlm/modeling.py", line 1374, in build_relation
    axis=1).tile([1, len(positive_relations), 1])
  File "/root/miniconda3/envs/paddle/lib/python3.7/site-packages/paddle/tensor/manipulation.py", line 3147, in tile
    return _C_ops.tile(x, repeat_times)
ValueError: (InvalidArgument) Every element of the input 'repeat_times' for tile op must be greater than 0, but the value given is 0.
  [Hint: Expected repeat_times_data[i] > 0, but received repeat_times_data[i]:0 <= 0:0.] (at /paddle/paddle/phi/infermeta/unary.cc:3743)

I labeled the dataset myself; I believe the format is correct.

train/IMG_8500.JPG  [{"transcription": "OD", "points": [[85, 277], [192, 277], [192, 331], [85, 331]], "difficult": true, "id": 100, "linking": [[1, 100], [2, 100], [3, 100]], "label": "QUESTION"}, {"transcription": "22.35 mm", "points": [[160, 492], [276, 492], [276, 522], [160, 522]], "difficult": true, "id": 1, "linking": [[100, 1]], "label": "ANSWER"}, {"transcription": "41.51/43.55 D", "points": [[149, 891], [315, 891], [315, 916], [149, 916]], "difficult": true, "id": 2, "linking": [[100, 2]], "label": "ANSWER"}, {"transcription": "K:-2.09DX179", "points": [[61, 973], [261, 973], [261, 1006], [61, 1006]], "difficult": true, "id": 3, "linking": [[100, 3]], "label": "ANSWER"}, {"transcription": "os", "points": [[1137, 293], [1249, 293], [1249, 351], [1137, 351]], "difficult": true, "id": 4, "linking": [[4, 5], [4, 6], [4, 7]], "label": "QUESTION"}, {"transcription": "22.25 mm", "points": [[790, 498], [903, 498], [903, 528], [790, 528]], "difficult": true, "id": 5, "linking": [[4, 5]], "label": "ANSWER"}, {"transcription": "41.82/44.23 D", "points": [[763, 889], [914, 889], [914, 916], [763, 916]], "difficult": true, "id": 6, "linking": [[4, 6]], "label": "ANSWER"}, {"transcription": "K:-2.41DX11", "points": [[685, 971], [867, 971], [867, 1003], [685, 1003]], "difficult": true, "id": 7, "linking": [[4, 7]], "label": "ANSWER"}]
train/IMG_8502.JPG  [{"transcription": "OD", "points": [[165, 22], [277, 22], [277, 86], [165, 86]], "difficult": true, "id": 100, "linking": [[2, 100], [3, 100], [4, 100]], "label": "QUESTION"}, {"transcription": "os", "points": [[1080, 58], [1185, 58], [1185, 106], [1080, 106]], "difficult": true, "id": 1, "linking": [[1, 5], [1, 6], [1, 7]], "label": "QUESTION"}, {"transcription": "22.96 mm", "points": [[254, 210], [363, 210], [363, 241], [254, 241]], "difficult": true, "id": 2, "linking": [[100, 2]], "label": "ANSWER"}, {"transcription": "45.06/45.92 D", "points": [[178, 567], [337, 567], [337, 595], [178, 595]], "difficult": true, "id": 3, "linking": [[100, 3]], "label": "ANSWER"}, {"transcription": "D-0.92D@20", "points": [[88, 647], [295, 647], [295, 678], [88, 678]], "difficult": true, "id": 4, "linking": [[100, 4]], "label": "ANSWER"}, {"transcription": "23.16 mm", "points": [[804, 232], [910, 232], [910, 257], [804, 257]], "difficult": true, "id": 5, "linking": [[1, 5]], "label": "ANSWER"}, {"transcription": "44.64/45.86 D", "points": [[767, 571], [911, 571], [911, 595], [767, 595]], "difficult": true, "id": 6, "linking": [[1, 6]], "label": "ANSWER"}, {"transcription": "D:-1.22D@179", "points": [[692, 649], [880, 649], [880, 675], [692, 675]], "difficult": true, "id": 7, "linking": [[1, 7]], "label": "ANSWER"}]
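
In the CPU traceback above, the only element of `repeat_times` that can be 0 is `len(positive_relations)`, so the failing batch apparently reached the model without any positive relation. One way to narrow this down is to scan the label file for images that have no QUESTION-ANSWER link at all. The following is a minimal sketch, assuming each line has the form `<image path><TAB><JSON list>` as in the samples above (the tab separator is an assumption). `count_qa_links` and `samples_without_links` are hypothetical helper names, not part of PaddleOCR.

```python
import json


def count_qa_links(entities):
    """Count distinct links that join a QUESTION entity to an ANSWER entity."""
    labels = {e["id"]: e["label"].upper() for e in entities}
    links = set()
    for e in entities:
        for a, b in e.get("linking", []):
            # Skip links that point at ids missing from this image.
            if a not in labels or b not in labels:
                continue
            if {labels[a], labels[b]} == {"QUESTION", "ANSWER"}:
                # Direction is ignored, so [1, 100] and [100, 1] count once.
                links.add(frozenset((a, b)))
    return len(links)


def samples_without_links(lines):
    """Return image paths whose annotation has no QUESTION-ANSWER link."""
    empty = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        # Assumption: path and JSON payload are separated by a single tab.
        path, _, payload = line.partition("\t")
        if count_qa_links(json.loads(payload)) == 0:
            empty.append(path)
    return empty
```

For example, `samples_without_links(open("train_data/dataset1/train.json", encoding="utf-8"))` would list the images to look at first. This only inspects the raw label file; later preprocessing steps may still drop relations, so an empty result here does not rule out an empty `positive_relations` at training time.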

MissPenguin commented 1 year ago

Does it run with the original data and config? What changes did you make?

sunzhaoyang commented 1 year ago

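I only switched to my own labeled dataset and changed the dataset paths in the config file; nothing else was changed.

It also fails with the XFUND dataset, so is this an environment problem?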

W1206 21:35:40.856792  1137 gpu_resources.cc:61] Please NOTE: device: 0, GPU Compute Capability: 7.5, Driver API Version: 11.4, Runtime API Version: 10.2
W1206 21:35:40.974639  1137 gpu_resources.cc:91] device: 0, cuDNN Version: 7.6.
[2022/12/06 21:35:49] ppocr INFO: train dataloader has 75 iters
[2022/12/06 21:35:49] ppocr INFO: valid dataloader has 7 iters
[2022/12/06 21:35:49] ppocr INFO: During the training process, after the 0th iteration, an evaluation is run every 19 iterations
Aborted
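
The GPU runs above end with only `Floating point exception` or `Aborted` and no Python traceback. One way to get more context is CPython's standard `faulthandler` module, which dumps the Python stack when the process receives a fatal signal such as SIGFPE or SIGABRT. A minimal sketch follows; `enable_crash_traces` is a hypothetical helper name, and calling it near the top of `tools/train.py` is an assumption about where it would be useful.

```python
import faulthandler
import sys


def enable_crash_traces(stream=None):
    """Dump the Python stack of all threads on fatal signals.

    `stream` must be a real file with a file descriptor; it defaults to stderr.
    Returns True once the handler is installed.
    """
    if stream is None:
        stream = sys.stderr
    faulthandler.enable(file=stream, all_threads=True)
    return faulthandler.is_enabled()
```

The same effect is available without editing any code by running `python3 -X faulthandler tools/train.py -c configs/kie/vi_layoutxlm/test.yml`. The dump shows which Python call was active when the native crash happened; it does not show the native stack itself.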

MissPenguin commented 1 year ago

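Yes, if it also fails with XFUND, it is most likely an environment problem. Please refer to the official Paddle website and check whether the Paddle, CUDA, and cuDNN versions match, and use check_install to verify that Paddle is installed correctly.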
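
The version check suggested above can be sketched as follows. This assumes a Paddle 2.x install, where `paddle.version.cuda()`, `paddle.version.cudnn()`, and `paddle.utils.run_check()` are available. `collect_paddle_env` and `verify_install` are hypothetical helper names.

```python
import importlib


def collect_paddle_env():
    """Return the versions the installed Paddle wheel was built against."""
    try:
        paddle = importlib.import_module("paddle")
    except ImportError:
        # Paddle is not importable in this interpreter.
        return {"paddle": None}
    return {
        "paddle": paddle.__version__,
        "compiled_with_cuda": paddle.is_compiled_with_cuda(),
        "cuda": paddle.version.cuda(),
        "cudnn": paddle.version.cudnn(),
    }


def verify_install():
    """Run Paddle's built-in installation check (assumes Paddle 2.x)."""
    paddle = importlib.import_module("paddle")
    paddle.utils.run_check()
```

Comparing the output of `collect_paddle_env()` with the driver and runtime versions printed in the `gpu_resources.cc` warnings shows whether the wheel matches the CUDA and cuDNN libraries actually loaded. Note that `run_check()` only exercises a small test program, so passing it does not guarantee that every operator used in training works.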

sunzhaoyang commented 1 year ago

@MissPenguin I have tried all sorts of things and still cannot find the problem... The versions of all the components look correct...

[2022/12/08 15:54:24] ppocr INFO: train with paddle 2.4.0 and device Place(gpu:0)
[2022/12/08 15:54:24] ppocr INFO: Initialize indexs of datasets:['train_data/XFUND/zh_train/train.json']
[2022-12-08 15:54:25,492] [    INFO] - Already cached /root/.paddlenlp/models/layoutxlm-base-uncased/sentencepiece.bpe.model
[2022-12-08 15:54:25,895] [    INFO] - tokenizer config file saved in /root/.paddlenlp/models/layoutxlm-base-uncased/tokenizer_config.json
[2022-12-08 15:54:25,895] [    INFO] - Special tokens file saved in /root/.paddlenlp/models/layoutxlm-base-uncased/special_tokens_map.json
[2022/12/08 15:54:25] ppocr INFO: Initialize indexs of datasets:['train_data/XFUND/zh_val/val.json']
[2022-12-08 15:54:25,896] [    INFO] - Already cached /root/.paddlenlp/models/layoutxlm-base-uncased/sentencepiece.bpe.model
[2022-12-08 15:54:26,284] [    INFO] - tokenizer config file saved in /root/.paddlenlp/models/layoutxlm-base-uncased/tokenizer_config.json
[2022-12-08 15:54:26,284] [    INFO] - Special tokens file saved in /root/.paddlenlp/models/layoutxlm-base-uncased/special_tokens_map.json
[2022-12-08 15:54:26,286] [    INFO] - Already cached /root/.paddlenlp/models/vi-layoutxlm-base-uncased/model_state.pdparams
W1208 15:54:26.287709 76091 gpu_resources.cc:61] Please NOTE: device: 0, GPU Compute Capability: 7.5, Driver API Version: 11.2, Runtime API Version: 11.2
W1208 15:54:26.289454 76091 gpu_resources.cc:91] device: 0, cuDNN Version: 8.1.
Aborted

cuda: 11.2 driver: 460.80 (matches cuda 11.2) cudnn: cudnn-11.2-linux-x64-v8.1.0.77.tgz

Running verify PaddlePaddle program ...
W1208 15:58:59.182808 76181 gpu_resources.cc:61] Please NOTE: device: 0, GPU Compute Capability: 7.5, Driver API Version: 11.2, Runtime API Version: 11.2
W1208 15:58:59.185256 76181 gpu_resources.cc:91] device: 0, cuDNN Version: 8.1.
PaddlePaddle works well on 1 GPU.
PaddlePaddle works well on 1 GPUs.
PaddlePaddle is installed successfully! Let's start deep learning with PaddlePaddle now.

root@2070:/opt# nvidia-smi
Thu Dec  8 15:58:51 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.80       Driver Version: 460.80       CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce RTX 2070    Off  | 00000000:01:00.0 Off |                  N/A |
| 28%   49C    P0     1W / 175W |      0MiB /  7981MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

ariefwijaya commented 1 year ago

@sunzhaoyang How to solve the issue?

ghost commented 1 year ago

I am also getting the same output when I attempt to run RE training on Nvidia T4 on Ubuntu 20.04. Is there a workaround for this issue?

PaddlePaddle / PaddleOCR

Floating point exception during training #8527