PaddlePaddle / PaddleDetection

Object Detection toolkit based on PaddlePaddle. It supports object detection, instance segmentation, multiple object tracking and real-time multi-person keypoint detection.
Apache License 2.0
12.75k stars 2.88k forks

Error: /paddle/paddle/phi/kernels/gpu/ Assertion `(x >= static_cast<T>(0)) && (x <= one)` failed. Input is expected to be within the interval [0, 1], but received nan. #7622

Open lifw555 opened 1 year ago

lifw555 commented 1 year ago

_BASE_: [

num_classes: 33

    image_dir: train
    anno_path: annotations/train.json
    dataset_dir: /data/work/dataset
    data_fields: ['image', 'gt_bbox', 'gt_class', 'is_crowd']

    image_dir: val
    anno_path: annotations/val.json
    dataset_dir: /data/work/dataset

    anno_path: annotations/val.json
    dataset_dir: /data/work/dataset

  batch_size: 8

  batch_size: 2

log_iter: 50 #100
save_dir: /data/work/output
snapshot_epoch: 5

epoch: 70 #80

  base_lr: 0.0000625 #0.0000125 #0.001

weights: /data/work/output/ppyoloe_plus_crn_m_80e_coco/model_final


depth_mult: 0.67
width_mult: 0.75
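
The config above trains on `annotations/train.json` with `num_classes: 33`. Before suspecting the framework, it can be worth ruling out degenerate labels, since zero-size boxes or unexpected category ids are a common source of NaN losses in detectors generally. Whether that applies to this issue is not known. The sketch below is not part of PaddleDetection: `find_suspect_annotations` is a hypothetical helper, and it assumes a standard COCO-style JSON file with `[x, y, width, height]` boxes.

```python
import json


def find_suspect_annotations(coco, num_classes):
    """Return (annotation_id, reason) pairs for labels worth inspecting by hand.

    `coco` is a dict loaded from a COCO-style annotation file (assumed layout).
    """
    images = {img["id"]: img for img in coco.get("images", [])}
    category_ids = {cat["id"] for cat in coco.get("categories", [])}
    problems = []

    # The number of categories is expected to match num_classes in the config.
    if len(category_ids) != num_classes:
        problems.append(
            (None, f"{len(category_ids)} categories, config says {num_classes}")
        )

    for ann in coco.get("annotations", []):
        x, y, w, h = ann["bbox"]  # COCO boxes are [x, y, width, height]
        img = images.get(ann["image_id"])
        if ann["category_id"] not in category_ids:
            problems.append((ann["id"], "unknown category_id"))
        if w <= 0 or h <= 0:
            problems.append((ann["id"], "non-positive width or height"))
        elif img is not None and (
            x < 0 or y < 0 or x + w > img["width"] or y + h > img["height"]
        ):
            problems.append((ann["id"], "box extends outside the image"))
    return problems


def check_file(path, num_classes):
    """Load an annotation file from disk and run the checks on it."""
    with open(path, "r", encoding="utf-8") as f:
        return find_suspect_annotations(json.load(f), num_classes)
```

With the paths from the config, a call would look like `check_file("/data/work/dataset/annotations/train.json", 33)`. An empty result does not prove the data is clean; it only rules out these specific problems.
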


python tools/ -c configs/ppyoloe_plus_crn_m_80e_coco.yml --amp --eval --use_vdl=true --vdl_log_dir=/data/work/option-number/logs


Error: /paddle/paddle/phi/kernels/gpu/ Assertion `(x >= static_cast<T>(0)) && (x <= one)` failed. Input is expected to be within the interval [0, 1], but received nan.
Traceback (most recent call last):
  File "tools/", line 172, in <module>
  File "tools/", line 168, in main
    run(FLAGS, cfg)
  File "tools/", line 132, in run
  File "/data/PaddleDetection/ppdet/engine/", line 485, in train
    outputs = model(data)
  File "/usr/local/lib/python3.7/dist-packages/paddle/fluid/dygraph/", line 948, in __call__
    return self.forward(*inputs, **kwargs)
  File "/data/PaddleDetection/ppdet/modeling/architectures/", line 59, in forward
    out = self.get_loss()
  File "/data/PaddleDetection/ppdet/modeling/architectures/", line 124, in get_loss
    return self._forward()
  File "/data/PaddleDetection/ppdet/modeling/architectures/", line 88, in _forward
    yolo_losses = self.yolo_head(neck_feats, self.inputs)
  File "/usr/local/lib/python3.7/dist-packages/paddle/fluid/dygraph/", line 948, in __call__
    return self.forward(*inputs, **kwargs)
  File "/data/PaddleDetection/ppdet/modeling/heads/", line 219, in forward
    return self.forward_train(feats, targets)
  File "/data/PaddleDetection/ppdet/modeling/heads/", line 164, in forward_train
    ], targets)
  File "/data/PaddleDetection/ppdet/modeling/heads/", line 356, in get_loss
  File "/data/PaddleDetection/ppdet/modeling/heads/", line 269, in _bbox_loss
    if num_pos > 0:
  File "/usr/local/lib/python3.7/dist-packages/paddle/fluid/dygraph/", line 680, in __bool__
    return self.__nonzero__()
  File "/usr/local/lib/python3.7/dist-packages/paddle/fluid/dygraph/", line 673, in __nonzero__
    return bool(np.all(self.numpy() > 0))
OSError: (External) CUDA error(719), unspecified launch failure. 
  [Hint: Please search for the error code(719) on website ( to get Nvidia's official solution and advice about CUDA Error.] (at /paddle/paddle/phi/backends/gpu/cuda/
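
A note on reading the traceback above: CUDA kernels are launched asynchronously, so error 719 is reported at the next point where the host waits for the GPU (here the `if num_pos > 0:` check, which copies a tensor to the host), not necessarily at the kernel that actually failed. Setting NVIDIA's `CUDA_LAUNCH_BLOCKING=1` environment variable makes launches synchronous, which usually moves the Python traceback closer to the failing operator at the cost of speed. A minimal sketch follows; `debug_env` is a hypothetical helper name, not a Paddle API.

```python
import os


def debug_env(base=None):
    """Return a copy of an environment mapping with synchronous CUDA launches.

    The result is meant to be passed as `env=` to subprocess.run() when
    starting the training command, so the current process is left untouched.
    """
    env = dict(os.environ if base is None else base)
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    return env
```

For example, `subprocess.run(train_cmd, env=debug_env())`, where `train_cmd` is the training command from this issue as a list of arguments.
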


export FLAGS_check_nan_inf=1


[01/16 15:08:41] ppdet.engine INFO: Epoch: [4] [100/827] learning_rate: 0.000052 loss: 1.833632 loss_cls: 0.953054 loss_iou: 0.158906 loss_dfl: 0.899485 loss_l1: 0.325665 eta: 3:05:50 batch_cost: 0.1956 data_cost: 0.0002 ips: 40.9046 images/s
[01/16 15:08:52] ppdet.engine INFO: Epoch: [4] [150/827] learning_rate: 0.000052 loss: 1.754388 loss_cls: 0.963562 loss_iou: 0.153431 loss_dfl: 0.845342 loss_l1: 0.293998 eta: 3:05:34 batch_cost: 0.1968 data_cost: 0.0002 ips: 40.6550 images/s
numel:648 idx:544 value:23.359375
numel:648 idx:545 value:-18.828125
numel:648 idx:546 value:-25.531250
numel:648 idx:27 value:-inf
numel:648 idx:28 value:-inf
numel:648 idx:351 value:-inf
In block 0, there has 0,54,594 nan,inf,num
Error: /paddle/paddle/fluid/framework/details/ Assertion `false` failed. ===ERROR: in [op=conv2d_grad] [tensor=] find nan or inf===
Traceback (most recent call last):
  File "tools/", line 172, in <module>
  File "tools/", line 168, in main
    run(FLAGS, cfg)
  File "tools/", line 132, in run
  File "/data/PaddleDetection/ppdet/engine/", line 491, in train
    scaler.minimize(self.optimizer, scaled_loss)
  File "/usr/local/lib/python3.7/dist-packages/paddle/amp/", line 157, in minimize
    return super(GradScaler, self).minimize(optimizer, *args, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/paddle/fluid/dygraph/amp/", line 222, in minimize
  File "/usr/local/lib/python3.7/dist-packages/paddle/fluid/dygraph/amp/", line 310, in _unscale
    self._found_inf = self._temp_found_inf_fp16 or self._temp_found_inf_fp32
  File "/usr/local/lib/python3.7/dist-packages/paddle/fluid/dygraph/", line 680, in __bool__
    return self.__nonzero__()
  File "/usr/local/lib/python3.7/dist-packages/paddle/fluid/dygraph/", line 673, in __nonzero__
    return bool(np.all(self.numpy() > 0))
OSError: (External) CUDA error(719), unspecified launch failure. 
  [Hint: Please search for the error code(719) on website ( to get Nvidia's official solution and advice about CUDA Error.] (at /paddle/paddle/phi/backends/gpu/cuda/
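
The `FLAGS_check_nan_inf` output above can be summarized mechanically. The sketch below collects the non-finite entries and parses the summary line. Reading `0,54,594 nan,inf,num` as 0 NaN, 54 inf, and 594 finite values is an assumption, although 0 + 54 + 594 = 648 matches the reported `numel`. The function name and regular expressions are illustrative and only follow the log format seen in this thread.

```python
import math
import re

# Patterns follow the log lines quoted in this issue (assumed stable format).
VALUE_RE = re.compile(r"numel:(\d+) idx:(\d+) value:(\S+)")
SUMMARY_RE = re.compile(r"there has (\d+),(\d+),(\d+) nan,inf,num")


def summarize_nan_inf_log(log_text):
    """Return (non_finite_entries, counts) parsed from check_nan_inf output.

    non_finite_entries is a list of (idx, value) pairs for printed values
    that are NaN or infinite; counts is the (nan, inf, num) tuple from the
    summary line, or None if no summary line is present.
    """
    non_finite = []
    for _numel, idx, value in VALUE_RE.findall(log_text):
        v = float(value)  # float() accepts "inf", "-inf" and "nan"
        if not math.isfinite(v):
            non_finite.append((int(idx), v))
    match = SUMMARY_RE.search(log_text)
    counts = tuple(int(g) for g in match.groups()) if match else None
    return non_finite, counts
```

Note that this run used `--amp`. With mixed precision, occasional overflowed gradients are expected and the loss scaler normally skips those steps, so an inf in `conv2d_grad` may not by itself identify where the NaN first appeared.
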

OS: Ubuntu 20.04. Docker image: paddle:2.4.1-gpu-cuda11.7-cudnn8.4-trt8.4. Single GPU: NVIDIA GeForce RTX 2080 Ti with 11 GB of memory. paddlepaddle: 2.4.1. PaddleDetection: 2.5.0.

ghostxsl commented 1 year ago


HBUT-CV commented 1 year ago


lifw555 commented 1 year ago



lifw555 commented 1 year ago


@ghostxsl, as you suggested, I removed amp, but I still get the same error.

ghostxsl commented 1 year ago

Then it is probably a bug in a Paddle framework operator. Try a different combination of Paddle and Python versions.

ghostxsl commented 1 year ago

It may be a compatibility problem between the Paddle framework and certain platforms. See #6723 for reference.

lifw555 commented 1 year ago


ghostxsl commented 1 year ago

First try the unit test case here and see whether a similar bug also shows up in your environment.

lifw555 commented 1 year ago

#6723 (comment) First try the unit test case here and see whether a similar bug also shows up in your environment.


Tensor(shape=[3], dtype=int64, place=Place(gpu:0), stop_gradient=True,
       [0, 1, 2])


lifw555 commented 1 year ago


The default is: pretrain_weights:

I am using: pretrain_weights:

lifw555 commented 1 year ago


The CUDA version on AI Studio is 11.2. I suspect a compatibility problem between paddlepaddle and CUDA 11.7.


lifw555 commented 1 year ago




aistudio@jupyter-2276827-4958141:~$ nvidia-smi 
Wed Jan 18 09:09:08 2023       
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla V100-SXM2...  On   | 00000000:04:00.0 Off |                    0 |
| N/A   37C    P0    53W / 300W |    763MiB / 16160MiB |      0%      Default |
|                               |                      |                  N/A |

| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
aistudio@jupyter-2276827-4958141:~$ nvidia-smi -L
GPU 0: Tesla V100-SXM2-16GB (UUID: GPU-bf77909a-5ace-6815-3a98-7b575241c3bf)
aistudio@jupyter-2276827-4958141:~$ cat /etc/*release
VERSION="16.04.6 LTS (Xenial Xerus)"
PRETTY_NAME="Ubuntu 16.04.6 LTS"
aistudio@jupyter-2276827-4958141:~$ nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Sun_Feb_14_21:12:58_PST_2021
Cuda compilation tools, release 11.2, V11.2.152
Build cuda_11.2.r11.2/compiler.29618528_0
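
To compare environments like the two in this thread (CUDA 11.2 on AI Studio versus the `cuda11.7` Docker image), the version numbers can be pulled out of the strings shown here. This is a small sketch with hypothetical helper names. Keep in mind that `nvcc -V` reports the installed toolkit and `nvidia-smi` reports the highest CUDA version the driver supports; neither is necessarily the CUDA version the Paddle wheel was built against.

```python
import re


def cuda_release_from_nvcc(nvcc_output):
    """Extract (major, minor) from `nvcc -V` text, or None if not found."""
    match = re.search(r"release (\d+)\.(\d+)", nvcc_output)
    return (int(match.group(1)), int(match.group(2))) if match else None


def cuda_version_from_image_tag(tag):
    """Extract (major, minor) from an image tag containing `cudaX.Y`."""
    match = re.search(r"cuda(\d+)\.(\d+)", tag)
    return (int(match.group(1)), int(match.group(2))) if match else None


if __name__ == "__main__":
    # Strings copied from this issue thread.
    toolkit = cuda_release_from_nvcc(
        "Cuda compilation tools, release 11.2, V11.2.152"
    )
    image = cuda_version_from_image_tag(
        "paddle:2.4.1-gpu-cuda11.7-cudnn8.4-trt8.4"
    )
    print("toolkit:", toolkit, "image:", image, "same:", toolkit == image)
```
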
lifw555 commented 1 year ago

pip list

aistudio@jupyter-2276827-4958141:~$ pip list
[notice] A new release of pip available: 22.1.2 -> 22.3.1
[notice] To update, run: pip install --upgrade pip
lifw555 commented 1 year ago

I switched my local CUDA to 11.2, and it still fails with the same error.

Error: /paddle/paddle/phi/kernels/gpu/ Assertion `(x >= static_cast<T>(0)) && (x <= one)` failed. Input is expected to be within the interval [0, 1], but received nan.
Error: /paddle/paddle/phi/kernels/gpu/ Assertion `(x >= static_cast<T>(0)) && (x <= one)` failed. Input is expected to be within the interval [0, 1], but received nan.
Traceback (most recent call last):
  File "tools/", line 172, in <module>
  File "tools/", line 168, in main
    run(FLAGS, cfg)
  File "tools/", line 132, in run
  File "/data/PaddleDetection/ppdet/engine/", line 485, in train
    outputs = model(data)
  File "/usr/local/lib/python3.7/dist-packages/paddle/fluid/dygraph/", line 948, in __call__
    return self.forward(*inputs, **kwargs)
  File "/data/PaddleDetection/ppdet/modeling/architectures/", line 59, in forward
    out = self.get_loss()
  File "/data/PaddleDetection/ppdet/modeling/architectures/", line 124, in get_loss
    return self._forward()
  File "/data/PaddleDetection/ppdet/modeling/architectures/", line 88, in _forward
    yolo_losses = self.yolo_head(neck_feats, self.inputs)
  File "/usr/local/lib/python3.7/dist-packages/paddle/fluid/dygraph/", line 948, in __call__
    return self.forward(*inputs, **kwargs)
  File "/data/PaddleDetection/ppdet/modeling/heads/", line 219, in forward
    return self.forward_train(feats, targets)
  File "/data/PaddleDetection/ppdet/modeling/heads/", line 164, in forward_train
    ], targets)
  File "/data/PaddleDetection/ppdet/modeling/heads/", line 356, in get_loss
  File "/data/PaddleDetection/ppdet/modeling/heads/", line 269, in _bbox_loss
    if num_pos > 0:
  File "/usr/local/lib/python3.7/dist-packages/paddle/fluid/dygraph/", line 680, in __bool__
    return self.__nonzero__()
  File "/usr/local/lib/python3.7/dist-packages/paddle/fluid/dygraph/", line 673, in __nonzero__
    return bool(np.all(self.numpy() > 0))
OSError: (External) CUDA error(719), unspecified launch failure. 
  [Hint: Please search for the error code(719) on website ( to get Nvidia's official solution and advice about CUDA Error.] (at /paddle/paddle/phi/backends/gpu/cuda/