PaddlePaddle / PaddleDetection

Object Detection toolkit based on PaddlePaddle. It supports object detection, instance segmentation, multiple object tracking and real-time multi-person keypoint detection.
Apache License 2.0

Error: /paddle/paddle/phi/kernels/gpu/bce_loss_kernel.cu:42 Assertion `(x >= static_cast<T>(0)) && (x <= one)` failed. Input is expected to be within the interval [0, 1], but received nan. #7622

Open lifw555 opened 1 year ago

lifw555 commented 1 year ago

Search before asking

Bug Component

Training

Describe the Bug

Config file contents: configs/ppyoloe_plus_crn_m_80e_coco.yml

_BASE_: [
  '/data/PaddleDetection/configs/datasets/coco_detection.yml',
  '/data/PaddleDetection/configs/runtime.yml',
  '/data/PaddleDetection/configs/ppyoloe/_base_/optimizer_80e.yml',
  '/data/PaddleDetection/configs/ppyoloe/_base_/ppyoloe_plus_crn.yml',
  '/data/PaddleDetection/configs/ppyoloe/_base_/ppyoloe_plus_reader.yml',
]

num_classes: 33

TrainDataset:
  !COCODataSet
    image_dir: train
    anno_path: annotations/train.json
    dataset_dir: /data/work/dataset
    data_fields: ['image', 'gt_bbox', 'gt_class', 'is_crowd']

EvalDataset:
  !COCODataSet
    image_dir: val
    anno_path: annotations/val.json
    dataset_dir: /data/work/dataset

TestDataset:
  !ImageFolder
    anno_path: annotations/val.json
    dataset_dir: /data/work/dataset

TrainReader:
  batch_size: 8

EvalReader:
  batch_size: 2

log_iter: 50 #100
save_dir: /data/work/output
snapshot_epoch: 5

epoch: 70 #80

LearningRate:
  base_lr: 0.0000625 #0.0000125 #0.001

weights: /data/work/output/ppyoloe_plus_crn_m_80e_coco/model_final

pretrain_weights: https://paddledet.bj.bcebos.com/models/ppyoloe_plus_crn_m_80e_coco.pdparams

depth_mult: 0.67
width_mult: 0.75

Command run:

export CUDA_VISIBLE_DEVICES=0
python tools/train.py -c configs/ppyoloe_plus_crn_m_80e_coco.yml --amp --eval --use_vdl=true --vdl_log_dir=/data/work/option-number/logs

The error output is as follows:

Error: /paddle/paddle/phi/kernels/gpu/bce_loss_kernel.cu:42 Assertion `(x >= static_cast<T>(0)) && (x <= one)` failed. Input is expected to be within the interval [0, 1], but received nan.
[... the same assertion line repeats many more times ...]

Traceback (most recent call last):
  File "tools/train.py", line 172, in <module>
    main()
  File "tools/train.py", line 168, in main
    run(FLAGS, cfg)
  File "tools/train.py", line 132, in run
    trainer.train(FLAGS.eval)
  File "/data/PaddleDetection/ppdet/engine/trainer.py", line 485, in train
    outputs = model(data)
  File "/usr/local/lib/python3.7/dist-packages/paddle/fluid/dygraph/layers.py", line 948, in __call__
    return self.forward(*inputs, **kwargs)
  File "/data/PaddleDetection/ppdet/modeling/architectures/meta_arch.py", line 59, in forward
    out = self.get_loss()
  File "/data/PaddleDetection/ppdet/modeling/architectures/yolo.py", line 124, in get_loss
    return self._forward()
  File "/data/PaddleDetection/ppdet/modeling/architectures/yolo.py", line 88, in _forward
    yolo_losses = self.yolo_head(neck_feats, self.inputs)
  File "/usr/local/lib/python3.7/dist-packages/paddle/fluid/dygraph/layers.py", line 948, in __call__
    return self.forward(*inputs, **kwargs)
  File "/data/PaddleDetection/ppdet/modeling/heads/ppyoloe_head.py", line 219, in forward
    return self.forward_train(feats, targets)
  File "/data/PaddleDetection/ppdet/modeling/heads/ppyoloe_head.py", line 164, in forward_train
    ], targets)
  File "/data/PaddleDetection/ppdet/modeling/heads/ppyoloe_head.py", line 356, in get_loss
    assigned_scores_sum)
  File "/data/PaddleDetection/ppdet/modeling/heads/ppyoloe_head.py", line 269, in _bbox_loss
    if num_pos > 0:
  File "/usr/local/lib/python3.7/dist-packages/paddle/fluid/dygraph/varbase_patch_methods.py", line 680, in __bool__
    return self.__nonzero__()
  File "/usr/local/lib/python3.7/dist-packages/paddle/fluid/dygraph/varbase_patch_methods.py", line 673, in __nonzero__
    return bool(np.all(self.numpy() > 0))
OSError: (External) CUDA error(719), unspecified launch failure. 
  [Hint: Please search for the error code(719) on website (https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html#group__CUDART__TYPES_1g3f51e3575c2178246db0a94a430e0038) to get Nvidia's official solution and advice about CUDA Error.] (at /paddle/paddle/phi/backends/gpu/cuda/cuda_info.cc:259)
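
For readers unfamiliar with the assertion text: binary cross entropy takes `log(x)` and `log(1 - x)`, so the kernel checks that every input lies in [0, 1]. A NaN fails that check because any comparison with NaN is false. The following is a minimal pure-Python sketch of the same range check, not Paddle's kernel code; the function names and the `eps` clamp are assumptions made for illustration.

```python
import math


def bce_input_ok(x: float) -> bool:
    """Mirror the range check from the log: True only if 0 <= x <= 1."""
    return (x >= 0.0) and (x <= 1.0)


def bce(x: float, label: float) -> float:
    """Elementwise binary cross entropy for a probability x in [0, 1]."""
    if not bce_input_ok(x):
        raise ValueError(
            f"Input is expected to be within the interval [0, 1], but received {x}"
        )
    eps = 1e-12  # clamp is an assumption of this sketch, avoids log(0)
    x = min(max(x, eps), 1.0 - eps)
    return -(label * math.log(x) + (1.0 - label) * math.log(1.0 - x))


print(bce_input_ok(0.3))           # True
print(bce_input_ok(float("nan")))  # False: every comparison with NaN is False
```

The Python traceback ends at `if num_pos > 0`, which is probably just the first point where the host reads a value back from the GPU. CUDA kernels run asynchronously, so a failure in an earlier kernel is typically reported at a later synchronization point rather than at the line that launched it.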

After setting

export FLAGS_check_nan_inf=1

the output was:

[01/16 15:08:41] ppdet.engine INFO: Epoch: [4] [100/827] learning_rate: 0.000052 loss: 1.833632 loss_cls: 0.953054 loss_iou: 0.158906 loss_dfl: 0.899485 loss_l1: 0.325665 eta: 3:05:50 batch_cost: 0.1956 data_cost: 0.0002 ips: 40.9046 images/s
[01/16 15:08:52] ppdet.engine INFO: Epoch: [4] [150/827] learning_rate: 0.000052 loss: 1.754388 loss_cls: 0.963562 loss_iou: 0.153431 loss_dfl: 0.845342 loss_l1: 0.293998 eta: 3:05:34 batch_cost: 0.1968 data_cost: 0.0002 ips: 40.6550 images/s
numel:648 idx:544 value:23.359375
numel:648 idx:545 value:-18.828125
numel:648 idx:546 value:-25.531250
numel:648 idx:27 value:-inf
numel:648 idx:28 value:-inf
numel:648 idx:351 value:-inf
In block 0, there has 0,54,594 nan,inf,num
Error: /paddle/paddle/fluid/framework/details/nan_inf_utils_detail.cu:105 Assertion `false` failed. ===ERROR: in [op=conv2d_grad] [tensor=] find nan or inf===
Traceback (most recent call last):
  File "tools/train.py", line 172, in <module>
    main()
  File "tools/train.py", line 168, in main
    run(FLAGS, cfg)
  File "tools/train.py", line 132, in run
    trainer.train(FLAGS.eval)
  File "/data/PaddleDetection/ppdet/engine/trainer.py", line 491, in train
    scaler.minimize(self.optimizer, scaled_loss)
  File "/usr/local/lib/python3.7/dist-packages/paddle/amp/grad_scaler.py", line 157, in minimize
    return super(GradScaler, self).minimize(optimizer, *args, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/paddle/fluid/dygraph/amp/loss_scaler.py", line 222, in minimize
    self._unscale(optimizer)
  File "/usr/local/lib/python3.7/dist-packages/paddle/fluid/dygraph/amp/loss_scaler.py", line 310, in _unscale
    self._found_inf = self._temp_found_inf_fp16 or self._temp_found_inf_fp32
  File "/usr/local/lib/python3.7/dist-packages/paddle/fluid/dygraph/varbase_patch_methods.py", line 680, in __bool__
    return self.__nonzero__()
  File "/usr/local/lib/python3.7/dist-packages/paddle/fluid/dygraph/varbase_patch_methods.py", line 673, in __nonzero__
    return bool(np.all(self.numpy() > 0))
OSError: (External) CUDA error(719), unspecified launch failure. 
  [Hint: Please search for the error code(719) on website (https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html#group__CUDART__TYPES_1g3f51e3575c2178246db0a94a430e0038) to get Nvidia's official solution and advice about CUDA Error.] (at /paddle/paddle/phi/backends/gpu/cuda/cuda_info.cc:259)
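
The summary line `In block 0, there has 0,54,594 nan,inf,num` appears to report 0 NaN values, 54 infinite values and 594 ordinary numbers, which adds up to the `numel:648` shown above it. That reading is an interpretation of the log, not something confirmed in this thread. A small sketch of the same kind of tally in plain Python (the helper name is hypothetical, and this is not Paddle's checker):

```python
import math


def count_nan_inf(values):
    """Return (num_nan, num_inf, num_finite) for an iterable of floats."""
    num_nan = num_inf = num_finite = 0
    for v in values:
        if math.isnan(v):
            num_nan += 1
        elif math.isinf(v):
            num_inf += 1
        else:
            num_finite += 1
    return num_nan, num_inf, num_finite


# The six sample values printed in the log above.
sample = [23.359375, -18.828125, -25.531250,
          float("-inf"), float("-inf"), float("-inf")]
print(count_nan_inf(sample))  # (0, 3, 3)
```

Note that this run used `--amp`. With dynamic loss scaling, an occasional inf in a scaled gradient may be an fp16 overflow that the scaler would normally handle by skipping the step, so a hit in `conv2d_grad` under this flag does not necessarily locate the original source of the NaN.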

Environment

OS: Ubuntu 20.04; Docker image: paddle:2.4.1-gpu-cuda11.7-cudnn8.4-trt8.4; single GPU, NVIDIA GeForce RTX 2080 Ti, 11 GB memory; paddlepaddle: 2.4.1; PaddleDetection: 2.5.0

Bug description confirmation

Are you willing to submit a PR?

ghostxsl commented 1 year ago

Try training without amp first and see what happens.
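
Some background on why AMP is a reasonable first suspect: fp16 has a narrow range, so mixed-precision training usually multiplies the loss by a scale factor and watches the gradients for inf or NaN. The traceback above passes through `GradScaler.minimize` and `_unscale`, which is where that check happens. The class below is a toy illustration of the general dynamic loss scaling idea under assumed parameters; it is not Paddle's `GradScaler` implementation, and `ToyLossScaler` and its arguments are hypothetical.

```python
import math


class ToyLossScaler:
    """Toy dynamic loss scaler (illustrative only, not Paddle's GradScaler)."""

    def __init__(self, init_scale=2.0 ** 15, growth_interval=1000):
        self.scale = init_scale
        self.growth_interval = growth_interval
        self._good_steps = 0

    def step(self, scaled_grads):
        """Return unscaled gradients, or None when the step must be skipped."""
        found_inf = any(math.isinf(g) or math.isnan(g) for g in scaled_grads)
        if found_inf:
            # Overflow: shrink the scale and skip this optimizer update.
            self.scale /= 2.0
            self._good_steps = 0
            return None
        unscaled = [g / self.scale for g in scaled_grads]
        self._good_steps += 1
        if self._good_steps >= self.growth_interval:
            # Enough clean steps in a row: try a larger scale again.
            self.scale *= 2.0
            self._good_steps = 0
        return unscaled


scaler = ToyLossScaler(init_scale=8.0, growth_interval=2)
print(scaler.step([8.0, 16.0]))      # [1.0, 2.0]
print(scaler.step([float("-inf")]))  # None, and the scale drops to 4.0
```

If the NaN persists with AMP switched off, overflow in fp16 is unlikely to be the whole story, and the loss inputs themselves are worth inspecting.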

HBUT-CV commented 1 year ago

Adjusting the learning rate according to the number of GPUs and the batch size can resolve this.
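
A minimal sketch of the usual linear scaling rule behind this suggestion: scale the reference learning rate by the ratio of total batch sizes. The reference setup used in the example (8 GPUs with 8 images each at `base_lr` 0.001) is an assumption for illustration; check the model's README for the values that actually apply. The function name is hypothetical.

```python
def scale_lr(ref_lr, ref_gpus, ref_batch_per_gpu, gpus, batch_per_gpu):
    """Linear scaling rule: lr grows in proportion to the total batch size."""
    ref_total = ref_gpus * ref_batch_per_gpu
    total = gpus * batch_per_gpu
    return ref_lr * total / ref_total


# Assumed reference: 8 GPUs x 8 images at lr 0.001; target: 1 GPU x 8 images.
print(scale_lr(0.001, 8, 8, 1, 8))  # 0.000125
```

Under that assumption the rule gives 0.000125 for a single GPU with `batch_size: 8`; the config posted above uses 0.0000625, which is lower still.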

lifw555 commented 1 year ago

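Adjusting the learning rate according to the number of GPUs and the batch size can resolve this.

The config I posted has already been adjusted that way.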

lifw555 commented 1 year ago

Try training without amp first and see what happens.

@ghostxsl, I removed amp as you suggested, and it still reports the same error.

ghostxsl commented 1 year ago

Then it is probably a bug in a Paddle framework operator. Try a different combination of paddle and python versions.

ghostxsl commented 1 year ago

It may be a compatibility problem between the Paddle framework and different platforms; see #6723.

lifw555 commented 1 year ago

No luck. I switched to python3.9 and it reports a similar error.

ghostxsl commented 1 year ago

https://github.com/PaddlePaddle/PaddleDetection/issues/6723#issuecomment-1326083748 Try the unit test case there first and see whether a similar bug also shows up in your environment.

lifw555 commented 1 year ago

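#6723 (comment) Try the unit test case there first and see whether a similar bug also shows up in your environment.

I ran that code and did not get the same output. Across several runs it only returns the following: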

Tensor(shape=[3], dtype=int64, place=Place(gpu:0), stop_gradient=True,
       [0, 1, 2])

Judging by the error messages, the two problems do not seem to originate from the same place.

lifw555 commented 1 year ago

Could it be related to this parameter?

The default is: pretrain_weights: https://bj.bcebos.com/v1/paddledet/models/pretrained/ppyoloe_crn_s_obj365_pretrained.pdparams

I am using: pretrain_weights: https://paddledet.bj.bcebos.com/models/ppyoloe_plus_crn_m_80e_coco.pdparams

lifw555 commented 1 year ago

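I tried running it on aistudio, and so far there has been no error.

The CUDA version on aistudio is 11.2. My guess is a compatibility problem between paddlepaddle and CUDA 11.7.

I will wait for the aistudio run to finish and see whether it still fails. If it does not, I will try downgrading my own environment.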
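
When comparing two machines like this, it helps to record the same facts on both. The sketch below collects the Python and Paddle versions plus the CUDA and cuDNN versions Paddle was built against; the helper name `collect_env` is hypothetical, and the import is guarded so the script also runs where Paddle is missing. Keep in mind that the two environments in this thread differ in more than CUDA: the local setup reports paddlepaddle 2.4.1, while the aistudio `pip list` shows paddlepaddle-gpu 2.3.2.post112.

```python
import platform
import sys


def collect_env():
    """Gather version facts worth diffing between two training machines."""
    info = {"python": platform.python_version(), "platform": sys.platform}
    try:
        import paddle
    except ImportError:
        info["paddle"] = None  # Paddle is not installed in this interpreter
        return info
    info["paddle"] = paddle.__version__
    info["cuda"] = paddle.version.cuda()    # CUDA version Paddle was built with
    info["cudnn"] = paddle.version.cudnn()  # cuDNN version Paddle was built with
    return info


for key, value in collect_env().items():
    print(f"{key}: {value}")
```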

lifw555 commented 1 year ago

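I tested with the same dataset and the same config parameters.

On aistudio the training ran to completion without any problem.

aistudio environment: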

aistudio@jupyter-2276827-4958141:~$ nvidia-smi 
Wed Jan 18 09:09:08 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  On   | 00000000:04:00.0 Off |                    0 |
| N/A   37C    P0    53W / 300W |    763MiB / 16160MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+
aistudio@jupyter-2276827-4958141:~$ nvidia-smi -L
GPU 0: Tesla V100-SXM2-16GB (UUID: GPU-bf77909a-5ace-6815-3a98-7b575241c3bf)
aistudio@jupyter-2276827-4958141:~$ cat /etc/*release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=16.04
DISTRIB_CODENAME=xenial
DISTRIB_DESCRIPTION="Ubuntu 16.04.6 LTS"
NAME="Ubuntu"
VERSION="16.04.6 LTS (Xenial Xerus)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 16.04.6 LTS"
VERSION_ID="16.04"
HOME_URL="http://www.ubuntu.com/"
SUPPORT_URL="http://help.ubuntu.com/"
BUG_REPORT_URL="http://bugs.launchpad.net/ubuntu/"
VERSION_CODENAME=xenial
UBUNTU_CODENAME=xenial
aistudio@jupyter-2276827-4958141:~$ nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Sun_Feb_14_21:12:58_PST_2021
Cuda compilation tools, release 11.2, V11.2.152
Build cuda_11.2.r11.2/compiler.29618528_0

lifw555 commented 1 year ago

pip list

aistudio@jupyter-2276827-4958141:~$ pip list
Package                        Version
------------------------------ ---------------
absl-py                        0.8.1
alembic                        1.8.1
altair                         4.2.0
anyio                          3.6.1
argon2-cffi                    21.3.0
argon2-cffi-bindings           21.2.0
aspy.yaml                      1.3.0
astor                          0.8.1
astroid                        2.4.1
async-generator                1.10
attrs                          22.1.0
audioread                      2.1.8
autopep8                       1.6.0
Babel                          2.8.0
backcall                       0.1.0
backports.zoneinfo             0.2.1
bce-python-sdk                 0.8.53
beautifulsoup4                 4.11.1
bleach                         5.0.1
blinker                        1.5
cachetools                     4.0.0
certifi                        2019.9.11
certipy                        0.1.3
cffi                           1.15.1
cfgv                           2.0.1
chardet                        3.0.4
click                          8.0.4
cloudpickle                    1.6.0
cma                            2.7.0
colorama                       0.4.4
colorlog                       4.1.0
commonmark                     0.9.1
cryptography                   38.0.1
cycler                         0.10.0
Cython                         0.29
debugpy                        1.6.0
decorator                      4.4.2
defusedxml                     0.7.1
dill                           0.3.3
easydict                       1.9
entrypoints                    0.4
et-xmlfile                     1.0.1
fastjsonschema                 2.16.1
filelock                       3.0.12
filterpy                       1.4.5
fire                           0.5.0
flake8                         4.0.1
Flask                          1.1.1
Flask-Babel                    1.0.0
Flask-Cors                     3.0.8
forbiddenfruit                 0.1.3
funcsigs                       1.0.2
future                         0.18.0
gast                           0.3.3
gitdb                          4.0.5
GitPython                      3.1.14
google-auth                    1.10.0
google-auth-oauthlib           0.4.1
graphviz                       0.13
greenlet                       1.1.3
grpcio                         1.35.0
gunicorn                       20.0.4
gym                            0.12.1
h5py                           2.9.0
identify                       1.4.10
idna                           2.8
imageio                        2.6.1
imageio-ffmpeg                 0.3.0
importlib-metadata             4.2.0
importlib-resources            5.9.0
ipykernel                      6.9.1
ipython                        7.34.0
ipython-genutils               0.2.0
ipywidgets                     7.6.5
isort                          4.3.21
itsdangerous                   1.1.0
jdcal                          1.4.1
jedi                           0.17.2
jieba                          0.42.1
Jinja2                         3.0.0
joblib                         0.14.1
JPype1                         0.7.2
json5                          0.9.5
jsonschema                     4.16.0
jupyter-archive                3.2.1
jupyter_client                 7.3.5
jupyter-core                   4.11.1
jupyter-lsp                    1.5.1
jupyter-server                 1.16.0
jupyter-telemetry              0.1.0
jupyterhub                     1.3.0
jupyterlab                     3.4.5
jupyterlab-language-pack-zh-CN 3.4.post1
jupyterlab-pygments            0.2.2
jupyterlab-server              2.10.3
jupyterlab-widgets             3.0.3
kiwisolver                     1.1.0
lap                            0.4.0
lazy-object-proxy              1.4.3
librosa                        0.7.2
lightgbm                       3.1.1
llvmlite                       0.31.0
lxml                           4.9.1
Mako                           1.2.2
Markdown                       3.1.1
MarkupSafe                     2.0.1
matplotlib                     2.2.3
matplotlib-inline              0.1.6
mccabe                         0.6.1
mistune                        0.8.4
more-itertools                 7.2.0
motmetrics                     1.4.0
moviepy                        1.0.1
multiprocess                   0.70.11.1
nbclassic                      0.3.1
nbclient                       0.5.13
nbconvert                      6.4.4
nbformat                       5.5.0
nest-asyncio                   1.5.5
netifaces                      0.10.9
networkx                       2.4
nltk                           3.4.5
nodeenv                        1.3.4
notebook                       5.7.0
numba                          0.48.0
numpy                          1.19.5
oauthlib                       3.1.0
objgraph                       3.4.1
opencv-python                  4.6.0.66
openpyxl                       3.0.5
opt-einsum                     3.3.0
packaging                      21.3
paddle-bfloat                  0.1.7
paddle2onnx                    1.0.0
paddledet                      2.5.0
paddlefsl                      1.0.0
paddlehub                      2.3.0
paddlenlp                      2.1.1
paddlepaddle-gpu               2.3.2.post112
pamela                         1.0.0
pandas                         1.1.5
pandocfilters                  1.5.0
parl                           1.4.1
parso                          0.7.1
pathlib                        1.0.1
pexpect                        4.7.0
pickleshare                    0.7.5
Pillow                         8.2.0
pip                            22.1.2
pkgutil_resolve_name           1.3.10
plotly                         5.8.0
pluggy                         1.0.0
pre-commit                     1.21.0
prettytable                    0.7.2
proglog                        0.1.9
prometheus-client              0.14.1
prompt-toolkit                 2.0.10
protobuf                       3.20.0
psutil                         5.7.2
ptyprocess                     0.7.0
py4j                           0.10.9.2
pyarrow                        10.0.1
pyasn1                         0.4.8
pyasn1-modules                 0.2.7
pybboxes                       0.1.1
pyclipper                      1.3.0.post4
pycocotools                    2.0.6
pycodestyle                    2.8.0
pycparser                      2.21
pycryptodome                   3.9.9
pydeck                         0.8.0
pydocstyle                     5.0.2
pyflakes                       2.4.0
pyglet                         1.4.5
Pygments                       2.13.0
pyhumps                        3.8.0
pylint                         2.5.2
Pympler                        1.0.1
pynvml                         8.0.4
pyOpenSSL                      22.0.0
pyparsing                      3.0.9
pypmml                         0.9.11
pyrsistent                     0.18.1
python-dateutil                2.8.2
python-json-logger             2.0.4
python-jsonrpc-server          0.3.4
python-language-server         0.33.0
python-lsp-jsonrpc             1.0.0
python-lsp-server              1.5.0
pytz                           2019.3
pytz-deprecation-shim          0.1.0.post0
PyYAML                         5.1.2
pyzmq                          23.2.1
rarfile                        3.1
recordio                       0.1.7
requests                       2.24.0
requests-oauthlib              1.3.0
resampy                        0.2.2
rich                           12.6.0
rope                           0.17.0
rsa                            4.0
ruamel.yaml                    0.17.21
ruamel.yaml.clib               0.2.6
sahi                           0.10.1
scikit-learn                   0.24.2
scipy                          1.6.3
seaborn                        0.10.0
semver                         2.13.0
Send2Trash                     1.8.0
sentencepiece                  0.1.96
seqeval                        1.2.2
setuptools                     56.2.0
shapely                        2.0.0
shellcheck-py                  0.7.1.1
simplegeneric                  0.8.1
six                            1.16.0
sklearn                        0.0
smmap                          3.0.5
sniffio                        1.3.0
snowballstemmer                2.0.0
SoundFile                      0.10.3.post1
soupsieve                      2.3.2.post1
SQLAlchemy                     1.4.41
streamlit                      1.13.0
streamlit-image-comparison     0.0.3
tabulate                       0.8.3
tb-nightly                     1.15.0a20190801
tb-paddle                      0.3.6
tenacity                       8.0.1
tensorboard                    2.1.0
tensorboardX                   1.8
termcolor                      1.1.0
terminado                      0.15.0
terminaltables                 3.1.10
testpath                       0.4.2
threadpoolctl                  2.1.0
tinycss2                       1.1.1
toml                           0.10.0
toolz                          0.12.0
tornado                        6.2
tqdm                           4.64.1
traitlets                      5.4.0
typed-ast                      1.4.1
typeguard                      3.0.0b2
typing_extensions              4.3.0
tzdata                         2022.7
tzlocal                        4.2
ujson                          1.35
urllib3                        1.25.6
validators                     0.20.0
virtualenv                     16.7.9
visualdl                       2.4.0
watchdog                       2.2.0
wcwidth                        0.1.7
webencodings                   0.5.1
websocket-client               1.4.1
Werkzeug                       0.16.0
whatthepatch                   1.0.2
wheel                          0.36.2
widgetsnbextension             3.5.2
wrapt                          1.12.1
xarray                         0.16.2
xgboost                        1.3.3
xlrd                           1.2.0
xmltodict                      0.13.0
yapf                           0.26.0
zipp                           3.8.1

[notice] A new release of pip available: 22.1.2 -> 22.3.1
[notice] To update, run: pip install --upgrade pip

lifw555 commented 1 year ago

I switched my local CUDA to 11.2, and it still fails:

Error: /paddle/paddle/phi/kernels/gpu/bce_loss_kernel.cu:42 Assertion `(x >= static_cast<T>(0)) && (x <= one)` failed. Input is expected to be within the interval [0, 1], but received nan.
Error: /paddle/paddle/phi/kernels/gpu/bce_loss_kernel.cu:42 Assertion `(x >= static_cast<T>(0)) && (x <= one)` failed. Input is expected to be within the interval [0, 1], but received nan.
Traceback (most recent call last):
  File "tools/train.py", line 172, in <module>
    main()
  File "tools/train.py", line 168, in main
    run(FLAGS, cfg)
  File "tools/train.py", line 132, in run
    trainer.train(FLAGS.eval)
  File "/data/PaddleDetection/ppdet/engine/trainer.py", line 485, in train
    outputs = model(data)
  File "/usr/local/lib/python3.7/dist-packages/paddle/fluid/dygraph/layers.py", line 948, in __call__
    return self.forward(*inputs, **kwargs)
  File "/data/PaddleDetection/ppdet/modeling/architectures/meta_arch.py", line 59, in forward
    out = self.get_loss()
  File "/data/PaddleDetection/ppdet/modeling/architectures/yolo.py", line 124, in get_loss
    return self._forward()
  File "/data/PaddleDetection/ppdet/modeling/architectures/yolo.py", line 88, in _forward
    yolo_losses = self.yolo_head(neck_feats, self.inputs)
  File "/usr/local/lib/python3.7/dist-packages/paddle/fluid/dygraph/layers.py", line 948, in __call__
    return self.forward(*inputs, **kwargs)
  File "/data/PaddleDetection/ppdet/modeling/heads/ppyoloe_head.py", line 219, in forward
    return self.forward_train(feats, targets)
  File "/data/PaddleDetection/ppdet/modeling/heads/ppyoloe_head.py", line 164, in forward_train
    ], targets)
  File "/data/PaddleDetection/ppdet/modeling/heads/ppyoloe_head.py", line 356, in get_loss
    assigned_scores_sum)
  File "/data/PaddleDetection/ppdet/modeling/heads/ppyoloe_head.py", line 269, in _bbox_loss
    if num_pos > 0:
  File "/usr/local/lib/python3.7/dist-packages/paddle/fluid/dygraph/varbase_patch_methods.py", line 680, in __bool__
    return self.__nonzero__()
  File "/usr/local/lib/python3.7/dist-packages/paddle/fluid/dygraph/varbase_patch_methods.py", line 673, in __nonzero__
    return bool(np.all(self.numpy() > 0))
OSError: (External) CUDA error(719), unspecified launch failure. 
  [Hint: Please search for the error code(719) on website (https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html#group__CUDART__TYPES_1g3f51e3575c2178246db0a94a430e0038) to get Nvidia's official solution and advice about CUDA Error.] (at /paddle/paddle/phi/backends/gpu/cuda/cuda_info.cc:259)